How to find consecutive values in a PySpark DataFrame column and replace the value
I have the DataFrame below:
ID  Name   Dept
1   John   ABC
2   Rio    BCD
3   Marry  BCD
4   Andy   BCD
5   Smith  PQR
6   Rich   XYZ
7   Lisa   LMN
8   Steve  LMN
9   Ali    STU
In the Dept column, BCD appears in 3 consecutive rows and LMN appears in 2 consecutive rows.
I need to create a new column, Dept_Updated, that checks for consecutive values. If a value is repeated in consecutive rows, append an underscore followed by its position within the run (1, 2, 3, ...). If a value is not repeated consecutively, leave it as it is.
I need the output in the format below:
ID  Name   Dept  Dept_Updated
1   John   ABC   ABC
2   Rio    BCD   BCD_1
3   Marry  BCD   BCD_2
4   Andy   BCD   BCD_3
5   Smith  PQR   PQR
6   Rich   XYZ   XYZ
7   Lisa   LMN   LMN_1
8   Steve  LMN   LMN_2
9   Ali    STU   STU
I am very new to PySpark. Is there any way to achieve the above output? Any help would be much appreciated.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
