How to EFFICIENTLY upload a PySpark DataFrame as a zipped CSV or Parquet file (similar to .gz format)
I have a 130 GB csv.gz dataset in S3 that was loaded using a parallel unload from Redshift to S3. Since it consists of multiple files, I wanted to reduce the number of files so that it's easier to read for my ML model (using sklearn).
I have managed to read the multiple files from S3 into a Spark DataFrame (called spark_df1) using:
spark_df1=spark.read.csv(path,header=False,schema=schema)
spark_df1 contains hundreds of columns (features) and is my time-series inference data for millions of customer IDs. Since it is time-series data, I want to make sure that all the data points for a given 'customerID' end up in the same output file, because I will be reading each partition file as a chunk. I want to unload this data back into S3. I don't mind smaller partitions of data, but each partitioned file SHOULD contain the entire time series of a single customer; in other words, one customer's data cannot be split across two files.
Current code:
datasink3=spark_df1.repartition(1).write.format("parquet").save(destination_path)
However, this takes forever to run, and the output is a single file that is not even zipped. I also tried using ".coalesce(1)" instead of ".repartition(1)", but it was slower in my case.
Solution 1:[1]
This code worked, and the run time dropped to about 1/5 of the original. The only thing to note is to make sure the load is split roughly equally across the nodes (in my case I had to make sure that each customer ID had about the same number of rows).
spark_df1.repartition("customerID").write.partitionBy("customerID").format("csv").option("compression","gzip").save(destination_path)
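As a usage sketch for the reading side (assuming the partitioned output above has been synced locally and pandas is installed; the paths and column handling are illustrative, not part of the original answer), each customerID directory can then be consumed as one chunk. Spark writes one sub-directory per partition value and encodes the customerID in the directory name rather than in the files:
import glob
import os
import pandas as pd

# Hypothetical layout produced by write.partitionBy("customerID"):
#   destination_path/customerID=<id>/part-*.csv.gz
for customer_dir in sorted(glob.glob("destination_path/customerID=*")):
    # The partition column lives in the directory name, not inside the files.
    customer_id = os.path.basename(customer_dir).split("=", 1)[1]
    parts = glob.glob(os.path.join(customer_dir, "part-*.csv.gz"))
    # All rows for this customer live in this one directory, so the
    # concatenated frame is that customer's complete time series.
    chunk = pd.concat(pd.read_csv(p, compression="gzip", header=None) for p in parts)
    # ... feed chunk (plus customer_id) to the sklearn model here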
Solution 2:[2]
You can partition the output by customerID (note that partitionBy is a method of the DataFrameWriter, i.e. it comes after .write):
spark_df1.write.partitionBy("customerID") \
    .format("parquet") \
    .save(destination_path)
You can read more about it here: https://sparkbyexamples.com/pyspark/pyspark-repartition-vs-partitionby/
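For reference, a minimal sketch of the distinction the linked article covers (names reused from the question): repartition("customerID") shuffles rows so each customer's data ends up in a single in-memory partition, while write.partitionBy("customerID") only controls the on-disk directory layout; combining the two is what yields one file per customer directory.
# Shuffle so that all rows sharing a customerID land in the same
# Spark partition (and therefore the same write task).
df_shuffled = spark_df1.repartition("customerID")

# partitionBy() controls the output layout: one sub-directory per
# customerID value. Without the repartition() above, several tasks
# could each write their own part file into the same directory.
df_shuffled.write.partitionBy("customerID").format("parquet").save(destination_path)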
Solution 3:[3]
Adding to manks' answer, you need to repartition the DataFrame by customerID and then write.partitionBy("customerID") to get one file per customer. You can see a similar issue here.
Also, regarding your comment that the Parquet files are not zipped: the default compression is snappy, which has some pros and cons compared to gzip compression, but it is still much better than writing uncompressed.
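If gzipped Parquet is specifically what you want, the codec can be set explicitly on the writer. This is a sketch combining it with the per-customer layout from the other answers (snappy remains the default if the option is omitted):
# Explicitly pick the Parquet compression codec ("snappy" is the default;
# "gzip" trades slower reads for smaller files).
spark_df1.repartition("customerID") \
    .write.partitionBy("customerID") \
    .option("compression", "gzip") \
    .format("parquet") \
    .save(destination_path)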
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Aadil Rafeeque |
| Solution 2 | |
| Solution 3 | walking |