Which is more appropriate: .count() or .agg(F.sum('count')).collect()[0][0]?

I am working in MS Azure with Databricks, writing my queries in Python, and I am wondering whether one of the following is more efficient or appropriate for obtaining the total count of events in a Spark DataFrame.

Either like this:

from pyspark.sql import functions as F
from pyspark.sql.functions import col

df_count = (
  df_main
  .select('name', 'id', 'point_of_origin')
  .where(col('point_of_origin') == 'option_1')
  .distinct()
  .groupBy('name')  # the grouping column was truncated in the original; 'name' is a placeholder
  .count()          # produces a per-group 'count' column
  .agg(F.sum('count')).collect()[0][0]
)

OR, like this:

df_count = (
      df_main
      .select('name', 'id', 'point_of_origin')
      .where(col('point_of_origin') == 'option_1')
      .distinct()
      .count()
    )

NOTE: I get the same value from both options, but the first option is slightly faster, although the speed varies with the load on the cluster in any given run.



Solution 1:[1]

.collect() retrieves all partitions of df_main from the executor nodes to the driver node. It can be faster on small DataFrames because the data is not spread around, but if the DataFrame is larger than the driver's memory, you will probably hit an OutOfMemory error.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Luiz Viola