Spark dataframe - transform rows with the same ID to columns
I want to transform the source dataframe below (using PySpark):
| Key | ID | segment |
|---|---|---|
| 1 | A | m1 |
| 2 | A | m1 |
| 3 | B | m1 |
| 4 | C | m2 |
| 1 | D | m1 |
| 2 | E | m1 |
| 3 | F | m1 |
| 4 | G | m2 |
| 1 | J | m1 |
| 2 | J | m1 |
| 3 | J | m1 |
| 4 | K | m2 |
Into the result dataframe below:
| ID | key1 | key2 |
|---|---|---|
| A | 1 | 2 |
| B | 3 | - |
| C | 4 | - |
| D | 1 | - |
| E | 2 | - |
| F | 3 | - |
| G | 4 | - |
| J | 1 | 2 |
| J | 1 | 3 |
| J | 2 | 3 |
| K | 4 | - |
In other words: I want to highlight the "pairs" in the dataframe - if there is more than one key for the same ID, I would like each relation listed on a separate line.
Thank you for your help
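For reproducibility, the source dataframe can be rebuilt with a minimal sketch like this (assuming an active SparkSession; the variable names are illustrative):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rows = [(1, 'A', 'm1'), (2, 'A', 'm1'), (3, 'B', 'm1'), (4, 'C', 'm2'),
        (1, 'D', 'm1'), (2, 'E', 'm1'), (3, 'F', 'm1'), (4, 'G', 'm2'),
        (1, 'J', 'm1'), (2, 'J', 'm1'), (3, 'J', 'm1'), (4, 'K', 'm2')]
df = spark.createDataFrame(rows, ['Key', 'ID', 'segment'])
```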
Solution 1:[1]
Use window functions. I assume - means a one-member group. If not, you can use a when/otherwise condition to blank the 1s out.
```python
import sys
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('ID').orderBy(F.desc('Key'))
df = (df.withColumn('key2', F.lag('segment').over(w))   # segment of the preceding row within each ID
        .withColumn('key2', F.col('key2').isNotNull())  # boolean: does a preceding row exist?
        .withColumn('key2', F.sum(F.col('key2').cast('integer'))
                             .over(w.rowsBetween(Window.currentRow, sys.maxsize)) + 1)  # cumulative group size
        .orderBy('ID', 'Key'))  # reorder the frame
df.show()
```
```
+---+---+-------+----+
|Key| ID|segment|key2|
+---+---+-------+----+
|  1|  A|     m1|   2|
|  2|  A|     m1|   2|
|  3|  B|     m1|   1|
|  4|  C|     m2|   1|
|  1|  D|     m1|   1|
|  2|  E|     m1|   1|
|  3|  F|     m1|   1|
|  4|  G|     m2|   1|
|  1|  J|     m1|   2|
|  2|  J|     m1|   3|
|  3|  J|     m1|   3|
|  4|  K|     m2|   1|
+---+---+-------+----+
```
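Note that this appends a key2 group-size column to the original frame rather than producing the exact pairs layout shown in the question. One way to get that layout directly (a sketch, not part of the original answer) is a self-join on ID that keeps ordered pairs with key1 < key2, unioned with the single-key IDs; pairs, singles, and result are illustrative names, and df is the source frame built above:
```python
left = df.select('ID', F.col('Key').alias('key1'))
right = df.select('ID', F.col('Key').alias('key2'))

# All ordered pairs of distinct keys within the same ID
pairs = left.join(right, 'ID').where(F.col('key1') < F.col('key2'))

# IDs with exactly one key keep that key and get an empty partner
singles = (df.groupBy('ID')
             .agg(F.count('*').alias('n'), F.min('Key').alias('key1'))
             .where(F.col('n') == 1)
             .drop('n')
             .withColumn('key2', F.lit(None).cast('long')))

result = pairs.unionByName(singles).orderBy('ID', 'key1', 'key2')
result.show()
```
Spark renders the missing partner as null; showing the literal - from the question would require casting the key columns to string first.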
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | wwnde |
