Mark only the first change in column values for a group in PySpark
I have a dataframe as follows:
+---------+-----+----------+
| tx_grp |offer|old_offer |
+---------+-----+----------+
|Company_B| 10| null|
|Company_B| 10| null|
|Company_B| 12| 10|
|Company_B| 12| 10|
|Company_A| 101| null|
|Company_A| 101| null|
|Company_A| 109| 101|
|Company_A| 109| 101|
+---------+-----+----------+
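For reference, a minimal sketch to reproduce this dataframe, assuming an active SparkSession bound to the name spark (the column types are inferred):

data = [('Company_B', 10, None), ('Company_B', 10, None),
        ('Company_B', 12, 10), ('Company_B', 12, 10),
        ('Company_A', 101, None), ('Company_A', 101, None),
        ('Company_A', 109, 101), ('Company_A', 109, 101)]
df = spark.createDataFrame(data, ['tx_grp', 'offer', 'old_offer'])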
I tried:
df = df.withColumn('isChanged', F.when(F.col('offer') != F.col('old_offer'), 'yes').otherwise(0))
(offer != old_offer evaluates to null when old_offer is null, so those rows fall through to otherwise(0)), but I get:
+---------+-----+----------+---------+
| tx_grp |offer|old_offer |isChanged|
+---------+-----+----------+---------+
|Company_B| 10| null| 0 |
|Company_B| 10| null| 0 |
|Company_B| 12| 10| yes |
|Company_B| 12| 10| yes |
|Company_A| 101| null| 0 |
|Company_A| 101| null| 0 |
|Company_A| 109| 101| yes |
|Company_A| 109| 101| yes |
+---------+-----+----------+---------+
I want to mark only the first event of the change. How can I achieve that? What I want to have is:
+---------+-----+----------+---------+
| tx_grp |offer|old_offer |isChanged|
+---------+-----+----------+---------+
|Company_B| 10| null| 0 |
|Company_B| 10| null| 0 |
|Company_B| 12| 10| yes |
|Company_B| 12| 10| 0 |
|Company_A| 101| null| 0 |
|Company_A| 101| null| 0 |
|Company_A| 109| 101| yes |
|Company_A| 109| 101| 0 |
+---------+-----+----------+---------+
Solution 1:[1]
Use window functions:

from pyspark.sql import Window, functions as F

w = Window.partitionBy('tx_grp').orderBy(F.desc('tx_grp'))  # lag() needs an order; the partition key is constant, so this relies on the input row order
(df.withColumn('ischanged', F.lag('old_offer').over(w))  # previous row's old_offer
 .na.fill(0)  # nulls -> 0 (note: this also fills old_offer itself)
 .withColumn('ischanged', F.when(F.col('ischanged') == F.col('old_offer'), '0').otherwise('yes'))
 ).show()
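Because na.fill(0) also overwrites the nulls in old_offer itself, the printed result shows 0 where the input had null. A variant that keeps those nulls intact (my own adjustment, not part of the original answer) is to compare with eqNullSafe, which treats two nulls as equal:

from pyspark.sql import Window, functions as F

w = Window.partitionBy('tx_grp').orderBy(F.desc('tx_grp'))
(df.withColumn('prev', F.lag('old_offer').over(w))  # helper column: previous row's old_offer, null on the first row
 .withColumn('ischanged', F.when(F.col('prev').eqNullSafe(F.col('old_offer')), '0').otherwise('yes'))
 .drop('prev')
 ).show()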
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | wwnde |
