PySpark orderBy on 1.5B Records With 20 Distinct Values - Performance Issue
I have the following code:
```python
import pyspark.sql.functions as f

# Sort ascending by last_update and keep the 10M oldest rows.
cache_df = cache_df.orderBy(f.col('last_update').asc()).limit(10000000)
```
`cache_df` contains 350M records, and I want the 10M rows with the oldest `last_update` values.
It seems like the reduce phase of the orderBy pulls all the data onto a single executor to sort it, so the operation is not running in parallel.
Any idea how to solve this?

Solution 1:[1]
See this article, which explains the orderBy behavior: "orderBy() collects all the data into a single executor and then sorts them. This means that the order of the output data is guaranteed but this is probably a very costly operation."
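If the goal is just the 10M oldest rows rather than a fully sorted output, one way to sidestep the global sort is to exploit the small number of distinct `last_update` values mentioned in the title. Below is a minimal sketch of that idea, reusing the names from the question (`cache_df`, `last_update`, the 10M target); it is an illustrative workaround under those assumptions, not code from the original question or answer:

```python
import pyspark.sql.functions as f

TARGET = 10_000_000

# 1) Count rows per distinct last_update value. This aggregation runs
#    fully in parallel and yields only ~20 rows, so collecting it to
#    the driver is safe.
counts = (cache_df
          .groupBy('last_update')
          .agg(f.count('*').alias('cnt'))
          .orderBy(f.col('last_update').asc())
          .collect())

# 2) Walk the tiny sorted count list to find the cutoff value whose
#    cumulative count first reaches TARGET.
running = 0
for row in counts:
    if running + row['cnt'] >= TARGET:
        cutoff = row['last_update']
        need_from_cutoff = TARGET - running
        break
    running += row['cnt']

# 3) Take every row strictly older than the cutoff (a plain parallel
#    filter, no sort), plus just enough rows from the cutoff value
#    itself to reach exactly TARGET.
older = cache_df.filter(f.col('last_update') < cutoff)
at_cutoff = (cache_df
             .filter(f.col('last_update') == cutoff)
             .limit(need_from_cutoff))
oldest_10m = older.unionByName(at_cutoff)
```

The heavy steps here are a `groupBy` aggregation and plain filters, which stay distributed across executors; only the ~20-row count table is ever collected to the driver, so no single executor has to sort the full dataset.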
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
