AWS Glue does not give coherent results for PySpark orderBy
When I run PySpark locally, I get correct results, with the list ordered by BOOK_ID. But when I deploy the AWS Glue job, the books do not seem to be ordered.
root
|-- AUTHORID: integer
|-- NAME: string
|-- BOOK_LIST: array
| |-- BOOK_ID: integer
| |-- BOOK_NAME: string
from pyspark.sql import functions as F

result = (
    df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
    .orderBy(F.col("BOOK_ID").desc())
    .groupBy("AUTHOR_ID", "NAME")
    .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")))
)
Note: I'm using PySpark 3.2.1 and Glue 2.0.
Any suggestions, please?
Solution 1:[1]
I'm trying to simplify the issue, so bear with me.
Let's create a sample DataFrame:
df = spark.createDataFrame([
    {"book_id": 1, "author_id": 1, "name": "David", "book_name": "Kill Bill"},
    {"book_id": 2, "author_id": 2, "name": "Roman", "book_name": "Dying is Hard"},
    {"book_id": 3, "author_id": 3, "name": "Moshe", "book_name": "Apache Kafka The Easy Way"},
    {"book_id": 4, "author_id": 1, "name": "David", "book_name": "Pyspark Is Awesome"},
    {"book_id": 5, "author_id": 2, "name": "Roman", "book_name": "Playing a Piano"},
    {"book_id": 6, "author_id": 3, "name": "Moshe", "book_name": "Awesome Scala"},
])
Now, doing this:
(
    df
    .groupBy("author_id", "name")
    .agg(
        F.collect_list(F.struct("book_id", "book_name")).alias("data"),
        F.sum("book_id").alias("sorted_key"),
    )
    .orderBy(F.col("sorted_key").desc())
    .drop("sorted_key")
    .show(10, False)
)
I get what you appear to be asking for:
+---------+-----+----------------------------------------------------+
|author_id|name |data                                                |
+---------+-----+----------------------------------------------------+
|3        |Moshe|[{3, Apache Kafka The Easy Way}, {6, Awesome Scala}]|
|2        |Roman|[{2, Dying is Hard}, {5, Playing a Piano}]          |
|1        |David|[{1, Kill Bill}, {4, Pyspark Is Awesome}]           |
+---------+-----+----------------------------------------------------+
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
