How to write the data in a DataFrame into a single .parquet file (both data and metadata in one file) in HDFS?
df.show() --> 2 rows
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+
df.rdd.getNumPartitions() - it has 1 partition
>>> df.rdd.getNumPartitions()
1
df.write.save("/user/hduser/data_check/test.parquet", format="parquet")
If I use the above command to create a parquet file in HDFS, it creates the directory "test.parquet" in HDFS, and inside that directory multiple files get saved: a .parquet part file plus metadata files.
Found 4 items
-rw-r--r-- 3 bimodjoul biusers 0 2017-03-15 06:47
/user/hduser/data_check/test.parquet/_SUCCESS
-rw-r--r-- 3 bimodjoul biusers 494 2017-03-15 06:47
/user/hduser/data_check/test.parquet/_common_metadata
-rw-r--r-- 3 bimodjoul biusers 862 2017-03-15 06:47
/user/hduser/data_check/test.parquet/_metadata
-rw-r--r-- 3 bimodjoul biusers 885 2017-03-15 06:47
/user/hduser/data_check/test.parquet/part-r-00000-f83a2ffd-38bb-4c76-9f4c-357e43d9708b.gz.parquet
How can I write the data in the DataFrame into a single .parquet file (both data and metadata in one file) in HDFS, rather than a folder with multiple files?
Help would be much appreciated.
Solution 1:[1]
This should solve the problem. coalesce(1) collapses the DataFrame into a single partition, so Spark writes only one part file (note that it still creates a directory, but that directory contains just one .parquet part file):
df.coalesce(1).write.parquet(parquet_file_path)
To append to an existing Parquet output instead of overwriting it:
df.write.mode('append').parquet("/tmp/output/people.parquet")
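Since every Parquet part file carries its own footer metadata, the single part file that coalesce(1) produces is a complete Parquet file on its own; it can simply be moved out of Spark's output directory and renamed. A minimal sketch of that promotion step, using the local filesystem to stand in for HDFS (on HDFS the same move would be `hdfs dfs -mv`); all paths and the placeholder file contents here are illustrative:

```python
import glob, os, shutil, tempfile

# Simulate Spark's output directory: one _SUCCESS marker plus one part file.
base = tempfile.mkdtemp()
out_dir = os.path.join(base, "test.parquet")   # stands in for the HDFS dir
os.mkdir(out_dir)
open(os.path.join(out_dir, "_SUCCESS"), "w").close()
part = os.path.join(out_dir, "part-r-00000-abc.gz.parquet")
with open(part, "wb") as f:
    f.write(b"PAR1...PAR1")  # placeholder bytes, not a real Parquet payload

# Find the lone part file, promote it to a plain file, drop the directory.
single = glob.glob(os.path.join(out_dir, "part-*.parquet"))[0]
target = os.path.join(base, "test_single.parquet")
shutil.move(single, target)
shutil.rmtree(out_dir)

print(os.path.isfile(target))  # True
```

The same pattern (coalesce, write, then move the single part file) is a common workaround when a downstream consumer insists on one file instead of a directory.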
Solution 2:[2]
Call coalesce(1) before the write; it will solve your issue:
df.coalesce(1).write.parquet(parquet_file_path)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | SRIDHARAN |
| Solution 2 | |
