How to calculate the number of rows obtained after a join, a filter or a write without using the count function - PySpark

I am using PySpark to join, filter and write a large dataframe to a CSV.

After each filter, join or write, I count the number of rows with df.count().

However, counting the number of rows means reloading the data and re-performing the various operations.

How could I count the number of rows after each of these operations without reloading and recomputing everything, as df.count() does?

I am aware that the cache function could avoid the reloading and recomputing, but I am looking for another solution, as caching is not always the best option.

Thank you in advance!



Solution 1:[1]

Why not look at the Spark UI to get a feel for what's happening instead of using count()? The Jobs/Tasks pages can help you find the bottlenecks, and the SQL tab shows the query plan so you can understand what's actually happening.

If you want something cheaper than count:

countApprox is cheaper (it's RDD-level tooling). And you should be caching if you are going to count the dataframe and then use it again afterwards. In fact, count() is sometimes used precisely to force caching.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Matt Andruff