How to make a new PySpark column using the previous row's value of that new column
I am using a stock dataset and trying to calculate the 20-day EMA, which requires the previous day's EMA, and I am not sure how to get it.
I added a mid column to keep track of what day it is. The first 20 days are null, since those rows are needed for the 20-day SMA, which seeds the calculation of the first EMA. So the first EMA I can calculate is on the 21st day; every day after that needs the previous day's EMA. For simplicity, I have replaced the EMA formula with just the value I need from the previous row.
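As a point of reference, the recursion described above can be sketched in plain Python (outside of Spark). This is a minimal illustration, not the PySpark solution: it assumes the standard EMA definition, where the smoothing factor is k = 2 / (n + 1), the first n rows have no EMA, and the 20-day SMA of the preceding rows seeds the first EMA value.

```python
def ema(prices, n=20):
    """n-day EMA seeded by the n-day SMA of the preceding rows.

    Rows 0..n-1 get None; row n uses the SMA of rows 0..n-1 as the
    "previous day's EMA"; every later row uses the recursion
    EMA_t = price_t * k + EMA_{t-1} * (1 - k), with k = 2 / (n + 1).
    """
    k = 2 / (n + 1)
    out = [None] * len(prices)
    prev = None  # the previous day's EMA (or the SMA seed)
    for i in range(len(prices)):
        if i < n:
            continue  # not enough history yet
        if prev is None:
            prev = sum(prices[i - n:i]) / n  # seed: previous day's 20-day SMA
        prev = prices[i] * k + prev * (1 - k)
        out[i] = prev
    return out
```

The key property, and the source of the difficulty in Spark, is that each output value depends on the output value just computed, so the loop cannot be expressed as a window function over existing columns alone.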
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, monotonically_increasing_id, when

window = Window.partitionBy("Name").orderBy("date")

EMA = (SMA.withColumn("mid", monotonically_increasing_id())
    .withColumn(
        "20_EMA",
        when(col("mid") < 20, None)
        .otherwise(
            when(col("mid") == 20, lag(col("20_SMA"), 1).over(window))
            .otherwise(lag(col("20_EMA"), 1).over(window))))  # 20_EMA does not exist yet at this point
    .drop("mid"))
I feel like I am going about this the wrong way, as I am new to PySpark.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
