How to save a PySpark dataframe into 1000 parts by one of the columns?
I am using PySpark and I want to save a dataframe split into 1000 parts by one of its columns. The dataframe I want to save:
df = spark.sql("SELECT * FROM table WHERE first_date='2021-10-31' and second_date >= '2013-01-01' and type in ('sh', 'la')")
I tried to use bucketBy:
df.write.bucketBy(1000, 'par').saveAsTable('temp', format='parquet', path='s3://mybucket')
but it didn't work (after a lot of reading, I understand why).
I've also tried:
df.repartition(10, 'par').write.format("parquet").bucketBy(10, 'par').saveAsTable("my_table5", path='s3://mybucket/')
and it didn't work.
Is there a way to save a PySpark dataframe into a fixed number of parts, divided by one of the dataframe's columns?
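For reference, a minimal sketch of one common workaround, assuming the goal is simply a fixed number of output files keyed by the 'par' column (the S3 output path below is a placeholder): repartition with an explicit partition count plus the column, then write plain Parquet without bucketBy:

# Hash-partition rows into 1000 partitions by the 'par' column,
# so each partition is written out as its own Parquet file.
df.repartition(1000, 'par') \
  .write \
  .mode('overwrite') \
  .parquet('s3://mybucket/output/')

Note that this gives a fixed number of files, whereas df.write.partitionBy('par') would instead produce one directory per distinct value of 'par', which is a different layout.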
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow