Preserve row order on parquet file

It appears that parquet files do not preserve the order of rows. For instance, I am working through a hands-on exercise with this code

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

Header = Row("ID","Name","Age","Area of Interest")
u1 = Header("1","Jack",22,"Data Science")
u2 = Header("2","Luke",21,"Data Analytics")
u3 = Header("3","Leo",24,"Micro Services")
u4 = Header("4","Mark",21,"Data Analytics")
data = [u1,u2,u3,u4]
df = spark.createDataFrame(data)

age = df.describe("Age")
age.write.parquet("Age")
age.show()

sort = df.select("ID","Name","Age").orderBy("Name",ascending=False)
sort.write.parquet("NameSorted")
sort.show()

This shows the result

+-------+------------------+
|summary|               Age|
+-------+------------------+
|  count|                 4|
|   mean|              22.0|
| stddev|1.4142135623730951|
|    min|                21|
|    max|                24|
+-------+------------------+

+---+----+---+
| ID|Name|Age|
+---+----+---+
|  4|Mark| 21|
|  2|Luke| 21|
|  3| Leo| 24|
|  1|Jack| 22|
+---+----+---+

However, when the saved parquet file is read back, the order is not always kept

df = spark.read.parquet("Age")
df.show()

+-------+------------------+
|summary|               Age|
+-------+------------------+
| stddev|1.4142135623730951|
|    min|                21|
|    max|                24|
|  count|                 4|
|   mean|              22.0|
+-------+------------------+

df = spark.read.parquet("NameSorted")
df.show()

+---+----+---+
| ID|Name|Age|
+---+----+---+
|  4|Mark| 21|
|  2|Luke| 21|
|  3| Leo| 24|
|  1|Jack| 22|
+---+----+---+

What would be a way to preserve the order?

Solution 1:[1]

I figured out that the order is kept when the DataFrame is coalesced into a single partition before writing, so that only one part file is produced:

age = age.coalesce(1)
age.write.parquet("Age")

sort = sort.coalesce(1)
sort.write.parquet("NameSorted")

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	thebluephantom
