Spark dataframe - transform rows with the same ID to columns
I want to transform the source dataframe below (using PySpark):
| Key | ID | segment |
|---|---|---|
| 1 | A | m1 |
| 2 | A | m1 |
| 3 | B | m1 |
| 4 | C | m2 |
| 1 | D | m1 |
| 2 | E | m1 |
| 3 | F | m1 |
| 4 | G | m2 |
| 1 | J | m1 |
| 2 | J | m1 |
| 3 | J | m1 |
| 4 | K | m2 |
Into the result dataframe below:
| ID | key1 | key2 |
|---|---|---|
| A | 1 | 2 |
| B | 3 | - |
| C | 4 | - |
| D | 1 | - |
| E | 2 | - |
| F | 3 | - |
| G | 4 | - |
| J | 1 | 2 |
| J | 1 | 3 |
| J | 2 | 3 |
| K | 4 | - |
In other words: I want to highlight the "pairs" in the dataframe - if there is more than one key for the same ID, I would like each relation listed on a separate line.
Thank you for your help
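For reproducibility, the source dataframe can be rebuilt with a minimal sketch like this (assuming an active SparkSession; the variable names are illustrative):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rows = [(1, 'A', 'm1'), (2, 'A', 'm1'), (3, 'B', 'm1'), (4, 'C', 'm2'),
        (1, 'D', 'm1'), (2, 'E', 'm1'), (3, 'F', 'm1'), (4, 'G', 'm2'),
        (1, 'J', 'm1'), (2, 'J', 'm1'), (3, 'J', 'm1'), (4, 'K', 'm2')]
df = spark.createDataFrame(rows, ['Key', 'ID', 'segment'])
```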
Solution 1:[1]
Use window functions. I assume - means a one-member group. If not, you can use a when/otherwise condition to blank the 1s out.
```python
import sys
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('ID').orderBy(F.desc('Key'))
df = (df.withColumn('key2', F.lag('segment').over(w))   # segment of the preceding row within each ID
        .withColumn('key2', F.col('key2').isNotNull())  # boolean: does a preceding row exist?
        .withColumn('key2', F.sum(F.col('key2').cast('integer'))
                             .over(w.rowsBetween(Window.currentRow, sys.maxsize)) + 1)  # cumulative group size
        .orderBy('ID', 'Key'))  # reorder the frame
df.show()
```
```
+---+---+-------+----+
|Key| ID|segment|key2|
+---+---+-------+----+
|  1|  A|     m1|   2|
|  2|  A|     m1|   2|
|  3|  B|     m1|   1|
|  4|  C|     m2|   1|
|  1|  D|     m1|   1|
|  2|  E|     m1|   1|
|  3|  F|     m1|   1|
|  4|  G|     m2|   1|
|  1|  J|     m1|   2|
|  2|  J|     m1|   3|
|  3|  J|     m1|   3|
|  4|  K|     m2|   1|
+---+---+-------+----+
```
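Note that this appends a key2 group-size column to the original frame rather than producing the exact pairs layout shown in the question. One way to get that layout directly (a sketch, not part of the original answer) is a self-join on ID that keeps ordered pairs with key1 < key2, unioned with the single-key IDs; pairs, singles, and result are illustrative names, and df is the source frame built above:
```python
left = df.select('ID', F.col('Key').alias('key1'))
right = df.select('ID', F.col('Key').alias('key2'))

# All ordered pairs of distinct keys within the same ID
pairs = left.join(right, 'ID').where(F.col('key1') < F.col('key2'))

# IDs with exactly one key keep that key and get an empty partner
singles = (df.groupBy('ID')
             .agg(F.count('*').alias('n'), F.min('Key').alias('key1'))
             .where(F.col('n') == 1)
             .drop('n')
             .withColumn('key2', F.lit(None).cast('long')))

result = pairs.unionByName(singles).orderBy('ID', 'key1', 'key2')
result.show()
```
Spark renders the missing partner as null; showing the literal - from the question would require casting the key columns to string first.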
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | wwnde |
