Preserve row order on parquet file
It appears that Parquet files do not preserve the order of rows. For instance, I am trying out the following code:
from pyspark.sql import SparkSession, Row
spark = SparkSession.builder.getOrCreate()
Header = Row("ID","Name","Age","Area of Interest")
u1 = Header("1","Jack",22,"Data Science")
u2 = Header("2","Luke",21,"Data Analytics")
u3 = Header("3","Leo",24,"Micro Services")
u4 = Header("4","Mark",21,"Data Analytics")
data = [u1,u2,u3,u4]
df = spark.createDataFrame(data)
age = df.describe("Age")
age.write.parquet("Age")
age.show()
sort = df.select("ID","Name","Age").orderBy("Name",ascending=False)
sort.write.parquet("NameSorted")
sort.show()
This shows the following result:
+-------+------------------+
|summary|               Age|
+-------+------------------+
|  count|                 4|
|   mean|              22.0|
| stddev|1.4142135623730951|
|    min|                21|
|    max|                24|
+-------+------------------+
+---+----+---+
| ID|Name|Age|
+---+----+---+
|  4|Mark| 21|
|  2|Luke| 21|
|  3| Leo| 24|
|  1|Jack| 22|
+---+----+---+
However, when the saved Parquet file is read back, the order is no longer preserved:
df = spark.read.parquet("Age")
df.show()
+-------+------------------+
|summary|               Age|
+-------+------------------+
| stddev|1.4142135623730951|
|    min|                21|
|    max|                24|
|  count|                 4|
|   mean|              22.0|
+-------+------------------+
df = spark.read.parquet("NameSorted")
df.show()
+---+----+---+
| ID|Name|Age|
+---+----+---+
|  4|Mark| 21|
|  2|Luke| 21|
|  3| Leo| 24|
|  1|Jack| 22|
+---+----+---+
What would be a way to preserve the order?
Solution 1:[1]
I figured out that the order can be preserved by coalescing to a single partition before writing, so that only one output file is produced:
age = age.coalesce(1)
age.write.parquet("Age")
sort = sort.coalesce(1)
sort.write.parquet("NameSorted")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | thebluephantom |
