  1. Spark - repartition() vs coalesce() - Stack Overflow

    Jul 24, 2015 · Is coalesce or repartition faster? coalesce may run faster than repartition, but unequal-sized partitions are generally slower to work with than equal-sized partitions. You'll …
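The trade-off in result 1 can be seen in a pure-Python sketch (no Spark needed, and not Spark's actual implementation): coalesce only merges existing partitions, so skew survives, while a full repartition shuffle redistributes every row.

```python
# Toy model of coalesce(n) vs repartition(n). Partition contents
# below are made up for illustration.

def coalesce(partitions, n):
    """Merge existing partitions into n groups without a shuffle:
    adjacent partitions are concatenated, so skewed sizes persist."""
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i * n // len(partitions)].extend(part)
    return groups

def repartition(partitions, n):
    """Full shuffle: every row is redistributed (round-robin here),
    producing near-equal partition sizes at the cost of moving data."""
    all_rows = [row for part in partitions for row in part]
    groups = [[] for _ in range(n)]
    for i, row in enumerate(all_rows):
        groups[i % n].append(row)
    return groups

skewed = [list(range(100)), list(range(10)), list(range(10)), list(range(10))]
print([len(p) for p in coalesce(skewed, 2)])     # [110, 20] - still skewed
print([len(p) for p in repartition(skewed, 2)])  # [65, 65] - balanced
```

This is why coalesce can be faster (no shuffle) yet leave downstream tasks working on unequal chunks.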

  2. pyspark - Spark: What is the difference between repartition and ...

    Jan 20, 2021 · It says: for repartition, the resulting DataFrame is hash partitioned; for repartitionByRange, the resulting DataFrame is range partitioned. And a previous question also …
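The hash-vs-range distinction from result 2 can be sketched in plain Python (Spark actually uses Murmur3 hashing and sampled range boundaries; `hash()` and fixed boundaries stand in here):

```python
import bisect

def hash_partition(keys, n):
    """Hash partitioning: the same key always lands in the same
    partition, but partitions hold no contiguous key ranges."""
    parts = [[] for _ in range(n)]
    for k in keys:
        parts[hash(k) % n].append(k)
    return parts

def range_partition(keys, boundaries):
    """Range partitioning: keys are bucketed by sorted boundaries,
    so each partition covers a contiguous key range."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        parts[bisect.bisect_right(boundaries, k)].append(k)
    return parts

keys = [3, 8, 1, 15, 7, 22]
print(range_partition(keys, [5, 10]))  # [[3, 1], [8, 7], [15, 22]]
```

Range partitioning is what makes `repartitionByRange` useful before sorted or ordered output, since each partition owns a disjoint slice of the key space.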

  3. apache spark sql - Difference between df.repartition and ...

    Mar 4, 2021 · What is the difference between DataFrame repartition() and DataFrameWriter partitionBy() methods? I hope both are used to "partition data based on dataframe column"? …

  4. apache spark - repartition in memory vs file - Stack Overflow

    Jul 13, 2023 · repartition() creates partitions in memory and is used as a read operation. partitionBy() creates partitions on disk and is used as a write operation. How can we confirm …
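The in-memory/on-disk contrast in results 3 and 4 can be illustrated with a toy model of the write side (the Hive-style `column=value` paths mimic Spark's `DataFrameWriter.partitionBy` layout, but this is illustrative, not Spark code):

```python
from collections import defaultdict

def partition_by_write(rows, column, base="out"):
    """Group rows by a column value into Hive-style directories,
    the way partitionBy lays files out on disk at write time.
    repartition(), by contrast, only changes how many in-memory
    partitions exist before the write."""
    layout = defaultdict(list)
    for row in rows:
        layout[f"{base}/{column}={row[column]}"].append(row)
    return dict(layout)

rows = [{"country": "US", "v": 1}, {"country": "DE", "v": 2},
        {"country": "US", "v": 3}]
print(sorted(partition_by_write(rows, "country")))
# ['out/country=DE', 'out/country=US']
```

One way to confirm the difference in real Spark: `df.rdd.getNumPartitions()` reflects repartition(), while the output directory tree reflects partitionBy().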

  5. Why is repartition faster than partitionBy in Spark?

    Nov 15, 2021 · Even though partitionBy is faster than repartition, depending on the number of dataframe partitions and distribution of data inside those partitions, just using partitionBy alone …

  6. Difference between repartition(1) and coalesce(1) - Stack Overflow

    Sep 12, 2021 · The repartition function avoids this issue by shuffling the data. In any scenario where you're reducing the data down to a single partition (or really, less than half your number …

  7. apache spark - How can I enforce repartitioning of a pyspark …

    Jun 28, 2025 · In my output after using repartition(), Bob & Amy are in the same partition. I have a table with around 14000000+ rows that I need to write in excel files such that each excel file …

  8. Spark repartitioning by column with dynamic number of partitions …

    Oct 8, 2019 · Spark takes the columns you specified in repartition, hashes that value into a 64-bit long, and then takes that value modulo the number of partitions. This way the number of partitions …
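The hash-then-modulo step described in result 8 also explains the behavior in result 7: distinct column values can collide into the same partition. A minimal sketch (Spark uses Murmur3 and a non-negative `pmod`; Python's built-in `hash()` stands in):

```python
def partition_for(value, num_partitions):
    # Spark computes roughly pmod(hash(value), numPartitions);
    # Python's % already yields a non-negative result here.
    return hash(value) % num_partitions

names = ["Bob", "Amy", "Carol", "Dan"]
buckets = {}
for name in names:
    buckets.setdefault(partition_for(name, 2), []).append(name)
# With 4 distinct values and only 2 partitions, at least two names
# must share a partition (pigeonhole) - e.g. Bob and Amy colliding.
print(buckets)
```

So `repartition(n, col)` guarantees same-value rows stay together, but never guarantees one partition per distinct value.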

  9. Spark parquet partitioning : Large number of files

    Jun 28, 2017 · The solution is to extend the approach using repartition(..., rand) and dynamically scale the range of rand by the desired number of output files for that data partition.
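The salting trick in result 9 can be sketched as follows: pair each row's partition value with a random salt bounded by the desired file count, so repartitioning on (value, salt) splits a hot partition into that many pieces. `random.randrange` stands in for Spark's `rand()`, and `target_files_for` is a hypothetical per-partition file-count function, not a Spark API:

```python
import random

def salted_key(row, target_files_for):
    part = row["date"]
    # Hypothetical helper: maps a partition value to how many
    # output files that partition should produce.
    return (part, random.randrange(target_files_for(part)))

random.seed(0)  # fixed seed only so the sketch is reproducible
rows = [{"date": "2019-10-08", "v": i} for i in range(1000)]
salts = {salted_key(r, lambda p: 4)[1] for r in rows}
print(sorted(salts))  # salts drawn from {0, 1, 2, 3}
```

Repartitioning on the salted key then yields roughly `target_files_for(part)` output files per data partition instead of one file per shuffle partition.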

  10. bigdata - What is the difference between spark.shuffle.partition …

    Dec 10, 2022 · TLDR - Repartition is invoked as per the developer's need, but a shuffle is done when there is a logical demand. I assume you're talking about the config property …