'Difference between repartition(1) and coalesce(1)

In our project, we are using repartition(1) to write data into table, I am interested to know why coalesce(1) cannot be used here because repartition is a costly operation compared to coalesce.

I know repartition distributes data evenly across partitions, but when the output file is of single part file, why can't we use coalesce(1)?



Solution 1:[1]

coalesce has an issue where if you're calling it using a number smaller than your current number of executors, the number of executors used to process that step will be limited by the number you passed in to the coalesce function.

The repartition function avoids this issue by shuffling the data. In any scenario where you're reducing the data down to a single partition (or really, less than half your number of executors), you should almost always use repartition over coalesce because of this. The shuffle caused by repartition is a small price to pay compared to the single-threaded operation of a call to coalesce(1)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Nolan Barth