How to make a new PySpark column using the previous row's value of this new column

So I am using a stock dataset and trying to calculate the 20-day EMA, which requires the previous day's EMA. However, I am not sure how to express that.

I added a mid column to keep track of which day each row is. The first 20 days are null, since those rows are needed for the 20-day SMA, which seeds the calculation of the first EMA. So the first EMA I can calculate is on the 21st day. Every day after that needs the previous day's EMA in the calculation. For simplicity, I have replaced the EMA formula with just the value that I need from the previous row.

from pyspark.sql.functions import col, lag, monotonically_increasing_id, when
from pyspark.sql.window import Window

window = Window().partitionBy("Name").orderBy("date")

EMA = SMA.withColumn("mid", monotonically_increasing_id()) \
         .withColumn("20_EMA",
             when(col("mid") < 20, None)
             .otherwise(when(col("mid") == 20, lag(col("20_SMA"), 1).over(window))
                 # this branch refers to "20_EMA", the very column being defined here
                 .otherwise(lag(col("20_EMA"), 1).over(window)))) \
         .drop("mid")
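For reference, this is the recurrence I am trying to reproduce: EMA_t = price_t * k + EMA_{t-1} * (1 - k), with k = 2 / (20 + 1), seeded by the 20-day SMA. A plain-Python sketch of that row-by-row calculation (outside Spark, just to show how each value depends on the previous one; the seeding convention is my assumption, conventions vary):

```python
def ema_20(prices):
    """20-day EMA: the first 20 entries are None; the 20-day SMA
    seeds the recurrence, which starts on the 21st day."""
    n = 20
    k = 2 / (n + 1)                      # standard EMA smoothing factor
    out = [None] * len(prices)
    if len(prices) <= n:
        return out
    prev = sum(prices[:n]) / n           # 20-day SMA as the seed
    for i in range(n, len(prices)):      # 21st day onward
        prev = prices[i] * k + prev * (1 - k)
        out[i] = prev
    return out
```

This is exactly the per-row dependency that makes a single Spark window expression awkward: each output value is a function of the immediately preceding output value, not of the input columns alone.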

I feel like I am just going about it wrong, as I am new to PySpark.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
