Is there a way to optimize PySpark operations after unioning multiple DataFrames?

I will try to explain with an example. Let's say I have a PySpark DataFrame df_input and a function some_func that transforms df_input based on a date taken from a list of dates:

from functools import reduce
from pyspark.sql import DataFrame

dates = ['2006-01', ..., '2008-01']  # 24 monthly dates from 2006-01 to 2008-01; I didn't want to write out the full list here

def some_func(df_input, date):
    df_output = do_something_based_on_date(df_input, date)  # the actual per-date transformation
    return df_output

list_dfs = []
for date in dates:
    df_output = some_func(df_input, date)
    list_dfs.append(df_output)

final_df = reduce(DataFrame.union, list_dfs)
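To make this concrete and reproducible, a toy version of the same pattern looks like this; the real some_func is more involved, so the body below (filtering on a hypothetical month column and tagging each slice with its date) is just a stand-in:

from functools import reduce
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy input; the real df_input is larger
df_input = spark.createDataFrame(
    [("2006-01", 1), ("2006-02", 2), ("2006-03", 3)],
    ["month", "value"],
)
dates = ["2006-01", "2006-02", "2006-03"]

def some_func(df, date):
    # Stand-in for the real per-date logic: filter to the month and tag it
    return df.filter(F.col("month") == date).withColumn("run_date", F.lit(date))

list_dfs = [some_func(df_input, date) for date in dates]
final_df = reduce(DataFrame.union, list_dfs)  # the plan now has one branch per date
final_df.count()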

Now the problem is that any operation on final_df takes far longer than the same operation on the individual data frames, e.g. just taking a simple count:

df_output.count() # Takes 30-35 sec
final_df.count() # Took 3 min 53 sec
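In case it is relevant, comparing the physical plans of the two frames shows the difference in work; presumably the unioned frame's plan repeats the per-date computation once per union branch:

df_output.explain()  # plan for a single date's output
final_df.explain()   # plan for the union, one branch per date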

I should add that the data sizes aren't huge: each individual data frame is around 15-30k rows, and the final one is around 300k rows. Why do operations on the unioned data frame take so much longer, and is there any way to optimize this? Please let me know if further information is required.



Sources

Source: Stack Overflow, licensed under CC BY-SA 3.0 in accordance with Stack Overflow's attribution requirements.