How to find consecutive values in a PySpark DataFrame column and replace the value

I have the DataFrame below:

ID    Name     Dept
1     John     ABC
2     Rio      BCD
3     Marry    BCD
4     Andy     BCD
5     Smith    PQR
6     Rich     XYZ
7     Lisa     LMN
8     Steve    LMN
9     Ali      STU

In the Dept column, BCD appears in 3 consecutive rows and LMN appears in 2 consecutive rows.

I need to create a new column, Dept_Updated, based on consecutive values in Dept. If the same value appears in consecutive rows, append an underscore followed by the row's position within that run (1, 2, 3, ...). If a value is not repeated consecutively, leave it as it is.

I need the output in the format below:

ID    Name     Dept   Dept_Updated
1     John     ABC        ABC
2     Rio      BCD        BCD_1
3     Marry    BCD        BCD_2
4     Andy     BCD        BCD_3
5     Smith    PQR        PQR
6     Rich     XYZ        XYZ
7     Lisa     LMN        LMN_1
8     Steve    LMN        LMN_2
9     Ali      STU        STU
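
To make the rule concrete before involving Spark, here is a minimal plain-Python sketch of the same labelling applied to the Dept values in ID order. The helper name `label_runs` is hypothetical and used only for illustration. It relies on `itertools.groupby`, which groups adjacent equal items.

```python
from itertools import groupby

# Dept values from the table, in ID order.
depts = ["ABC", "BCD", "BCD", "BCD", "PQR", "XYZ", "LMN", "LMN", "STU"]


def label_runs(values):
    """Suffix each value in a run of 2+ adjacent equal values with _1, _2, ..."""
    labeled = []
    # groupby yields one group per run of adjacent equal values.
    for value, run in groupby(values):
        run_length = len(list(run))
        if run_length == 1:
            # Not repeated consecutively: keep the value unchanged.
            labeled.append(value)
        else:
            # Repeated consecutively: number each row within the run, starting at 1.
            labeled.extend(f"{value}_{i}" for i in range(1, run_length + 1))
    return labeled


dept_updated = label_runs(depts)
print(dept_updated)
```

Note that a value which reappears later but not adjacently (for example A, B, A) is treated as three separate runs of length one and is left unchanged.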

I am very new to PySpark. Is there any way to achieve the above output? Any help would be appreciated.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
