Mark only the first change in column values for a group in PySpark
I have a dataframe as follows:
+---------+-----+----------+
| tx_grp |offer|old_offer |
+---------+-----+----------+
|Company_B| 10| null|
|Company_B| 10| null|
|Company_B| 12| 10|
|Company_B| 12| 10|
|Company_A| 101| null|
|Company_A| 101| null|
|Company_A| 109| 101|
|Company_A| 109| 101|
+---------+-----+----------+
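For reference, a minimal sketch to reproduce this dataframe, assuming an active SparkSession bound to the name spark (the column types are inferred):

data = [('Company_B', 10, None), ('Company_B', 10, None),
        ('Company_B', 12, 10), ('Company_B', 12, 10),
        ('Company_A', 101, None), ('Company_A', 101, None),
        ('Company_A', 109, 101), ('Company_A', 109, 101)]
df = spark.createDataFrame(data, ['tx_grp', 'offer', 'old_offer'])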
I tried:
df = df.withColumn('isChanged', F.when(F.col('offer') != F.col('old_offer'), 'yes').otherwise(0))
(offer != old_offer evaluates to null when old_offer is null, so those rows fall through to otherwise(0)), but I get:
+---------+-----+----------+---------+
| tx_grp |offer|old_offer |isChanged|
+---------+-----+----------+---------+
|Company_B| 10| null| 0 |
|Company_B| 10| null| 0 |
|Company_B| 12| 10| yes |
|Company_B| 12| 10| yes |
|Company_A| 101| null| 0 |
|Company_A| 101| null| 0 |
|Company_A| 109| 101| yes |
|Company_A| 109| 101| yes |
+---------+-----+----------+---------+
I want to mark only the first event of the change. How can I achieve that? What I want to have is:
+---------+-----+----------+---------+
| tx_grp |offer|old_offer |isChanged|
+---------+-----+----------+---------+
|Company_B| 10| null| 0 |
|Company_B| 10| null| 0 |
|Company_B| 12| 10| yes |
|Company_B| 12| 10| 0 |
|Company_A| 101| null| 0 |
|Company_A| 101| null| 0 |
|Company_A| 109| 101| yes |
|Company_A| 109| 101| 0 |
+---------+-----+----------+---------+
Solution 1:[1]
Use window functions:

from pyspark.sql import Window, functions as F

w = Window.partitionBy('tx_grp').orderBy(F.desc('tx_grp'))  # lag() needs an order; the partition key is constant, so this relies on the input row order
(df.withColumn('ischanged', F.lag('old_offer').over(w))  # previous row's old_offer
 .na.fill(0)  # nulls -> 0 (note: this also fills old_offer itself)
 .withColumn('ischanged', F.when(F.col('ischanged') == F.col('old_offer'), '0').otherwise('yes'))
 ).show()
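Because na.fill(0) also overwrites the nulls in old_offer itself, the printed result shows 0 where the input had null. A variant that keeps those nulls intact (my own adjustment, not part of the original answer) is to compare with eqNullSafe, which treats two nulls as equal:

from pyspark.sql import Window, functions as F

w = Window.partitionBy('tx_grp').orderBy(F.desc('tx_grp'))
(df.withColumn('prev', F.lag('old_offer').over(w))  # helper column: previous row's old_offer, null on the first row
 .withColumn('ischanged', F.when(F.col('prev').eqNullSafe(F.col('old_offer')), '0').otherwise('yes'))
 .drop('prev')
 ).show()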
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | wwnde |
