'pyspark dataframe writing csv files twice in s3

I have created a pyspark dataframe and trying to write the file in s3 bucket in csv format. here the file is writing in csv but the issue is it's writing the file twice(i.e., with actual data and another is with empty data). I have checked the data frame by printing fine only. please suggest any way to prevent that empty wouldn't create.

code snippet:

df = spark.createDataFrame(data=dt1, schema = op_df.columns)
df.write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)

pyspark apache-spark-sql

Solution 1:^[1]

One possible solution to make sure that the output will include only one file is to do repartition(1) or coalesce(1) before writing. So something like this:

df.repartition(1).write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)

Note that having one partition doesn't not necessarily mean that it will result in one file as this can depend on the spark.sql.files.maxRecordsPerFile configuration as well. Assuming this config is set to 0 (the default) you should get only 1 file in the output.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Guy

'pyspark dataframe writing csv files twice in s3

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]