How to have a single CSV file after applying partitionBy in PySpark

I have to first partition by a customer_group column, but I also want to make sure that there is a single CSV file per customer_group. This is because it is time-series data that is needed for inference, and it can't be spread across multiple files.

I tried:

datasink2 = spark_df1.write.format("csv").partitionBy('customer_group').option("compression", "gzip").save(destination_path + '/traintestcsvzippartitionocalesce')

but it creates multiple smaller files inside each customer_group/ path, with names in the format csv.gz0000_part_00.gz, csv.gz0000_part_01.gz, and so on.

I also tried:

datasink2 = spark_df1.write.format("csv").partitionBy('customer_group').coalesce(1).option("compression", "gzip").save(destination_path + '/traintestcsvzippartitionocalesce')

but it throws the following error:

AttributeError: 'DataFrameWriter' object has no attribute 'coalesce'

Is there a solution?

I cannot use repartition(1) or coalesce(1) directly, without the partitionBy, as that creates only one file for everything and only one worker node writes at a time (serially), which is computationally very expensive.
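Aside: the AttributeError above occurs because coalesce is a method on DataFrame, not on the DataFrameWriter returned by .write, so it would have to be called before .write. A minimal sketch with the same names as above; it compiles, and partitionBy still creates one folder per group, but all rows pass through a single task, which is exactly the serial bottleneck described above:

# coalesce belongs on the DataFrame, before .write
datasink2 = spark_df1.coalesce(1) \
    .write.format("csv") \
    .partitionBy('customer_group') \
    .option("compression", "gzip") \
    .save(destination_path + '/traintestcsvzippartitionocalesce')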



Solution 1:[1]

The repartition function also accepts column names as arguments, not only a number of partitions. Repartitioning by the write partition column will make Spark save one file per folder, because all rows of a given customer_group end up in the same in-memory partition and are therefore written by a single task.

Please note that if your data is skewed and one customer group holds the majority of the rows, the single task writing that group becomes a straggler and you might run into performance issues.
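A quick way to gauge that skew up front is to count rows per group (a sketch using the asker's spark_df1; what counts as "skewed" depends on your cluster):

# Rows per group: one dominant count signals a straggler task on write
spark_df1.groupBy("customer_group") \
    .count() \
    .orderBy("count", ascending=False) \
    .show(20)

With that checked, the write looks like: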

spark_df1 \
    .repartition("customer_group") \
    .write \
    .partitionBy("customer_group") \
    ...
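Filled out with the options from the question (spark_df1, destination_path, the CSV format, and the gzip compression are the asker's; the path suffix is kept from the original attempt), a complete version of the write would look like this sketch:

# Shuffle so each customer_group lands in exactly one partition, then write:
# every customer_group=<value>/ folder receives a single gzipped CSV part file
spark_df1 \
    .repartition("customer_group") \
    .write \
    .format("csv") \
    .partitionBy("customer_group") \
    .option("compression", "gzip") \
    .save(destination_path + '/traintestcsvzippartitionocalesce')

Unlike coalesce(1), the repartition spreads the groups across the cluster, so different customer groups are written in parallel; only rows belonging to the same group are funneled to a single task.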

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

[1] Solution 1: answer by walking on Stack Overflow