PySpark Order By 1.5B Records With 20 Distinct Values - Performance Issue

I have the following code:

    from pyspark.sql import functions as f  # alias assumed by f.col below
    cache_df = cache_df.orderBy(f.col('last_update').asc()).limit(10000000)

cache_df contains 350M records, and I want the 10M rows with the oldest last_update values.
It seems that the reduce step of the orderBy sorts all the data on a single executor, so the operation is not running in parallel.
Any idea how to solve this?



Solution 1:[1]

See this article, which explains the behavior of orderBy: "orderBy() collects all the data into a single executor and then sorts them. This means that the order of the output data is guaranteed but this is probably a very costly operation."
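
The quote identifies the cost but not a workaround. One common alternative (not part of the original answer; a minimal sketch under assumptions) is to avoid the global sort entirely: use approxQuantile to estimate the last_update value below which roughly 10M of the 350M rows fall, then filter on that cutoff, which runs fully in parallel. This assumes last_update is numeric; a timestamp column would first need a cast (e.g. to a Unix timestamp):

    from pyspark.sql import functions as f

    # Fraction of rows wanted: 10M of 350M (numbers from the question).
    fraction = 10_000_000 / 350_000_000

    # Approximate the last_update value at that quantile; the last
    # argument is the allowed relative error of the estimate.
    cutoff = cache_df.approxQuantile('last_update', [fraction], 0.001)[0]

    # A plain filter is embarrassingly parallel; no global sort needed.
    oldest_df = cache_df.filter(f.col('last_update') <= cutoff)

Because the quantile is approximate, and because the title suggests only about 20 distinct last_update values (so many ties), the filtered set may contain more than 10M rows; a final orderBy(...).limit(10000000) on this much smaller DataFrame can then trim it exactly, at a fraction of the original cost.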

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow