Rolling window with conditions using PySpark, and how to make it faster on a local machine?

I have a CSV file of approximately 500 MB and 1.5 million rows, from which I've built a PySpark DataFrame. I need to use a window function to count how many times condition #1 and condition #2 occur together within a rolling window of 3600 seconds, and then find the maximum number of such co-occurrences in any 3600-second window. My code is below:
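For reference, the intended computation can be sketched in plain Python without Spark, assuming the `date` column is already a Unix timestamp in seconds; the row values below are hypothetical stand-ins for the question's data:

```python
from bisect import bisect_left

def max_cooccurrences(rows, window=3600):
    """Max number of rows matching both conditions within any trailing window.

    `rows` is a list of (timestamp, column_1, column_2) tuples,
    sorted by timestamp.
    """
    # Keep only timestamps where both conditions hold at once.
    hits = [t for t, c1, c2 in rows
            if c1 == 'condition_1' and c2 == 'condition_2']
    best = 0
    for i, t in enumerate(hits):
        # Index of the first hit inside the trailing window [t - window, t],
        # inclusive on both ends, matching rangeBetween(-3600, 0).
        lo = bisect_left(hits, t - window)
        best = max(best, i - lo + 1)
    return best

rows = [
    (0,    'condition_1', 'condition_2'),
    (100,  'condition_1', 'other'),
    (200,  'condition_1', 'condition_2'),
    (5000, 'condition_1', 'condition_2'),
]
print(max_cooccurrences(rows))  # 2: the hits at t=0 and t=200 share one window
```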

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Flag rows where both conditions hold at once
df = df.withColumn('result', F.when((F.col('column_1') == 'condition_1') & (F.col('column_2') == 'condition_2'), 1).otherwise(0))

winSpec = Window.partitionBy('result').orderBy('date').rangeBetween(-3600, 0)

df = df.withColumn('rol_func', F.sum(F.col("result")).over(winSpec))

df.agg(F.max("rol_func")).show()

It works and gives the correct result, but it's very slow compared to equivalent Pandas code. I'm working on a local machine, so my question is: how can I improve PySpark's performance in this setup?
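For context, Spark's defaults are tuned for clusters, not single machines; in particular, the default of 200 shuffle partitions adds overhead for a 500 MB file. The sketch below shows the settings most commonly adjusted for local runs; the values and the "data.csv" path are illustrative assumptions, not recommendations from the question:

```python
from pyspark.sql import SparkSession

# Illustrative local-mode settings; good values depend on your machine.
spark = (
    SparkSession.builder
    .master("local[*]")                           # use all local cores
    .config("spark.driver.memory", "4g")          # raise from the 1g default
    .config("spark.sql.shuffle.partitions", "8")  # default 200 is cluster-sized
    .getOrCreate()
)

df = spark.read.csv("data.csv", header=True, inferSchema=True)
df = df.cache()  # avoid re-reading the CSV if df is reused across actions
```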



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
