'Convert Multiple columns into a single row with a variable amount of columns

I have a spark dataframe containing businesses with their contact numbers in 2 columns, however some of my businesses are repeated with different contact info, for example:

Name:	Phone:
bus1	082...
bus1	087...
bus2	076...
bus3	081...
bus3	084...
bus3	086...

I want to have 3 lines, 1 for each business with varying phone numbers in each, for example:

Name:	Phone1:	Phone2:	Phone3:
bus1	082...	087...
bus2	076...
bus3	081...	084...	086...

I have tried using select('Name','Phone').distinct(), but I don't know how to pivot it to a single row matching on the 'Name' column... please help

Solution 1:^[1]

Try something as below -

Input DataFrame

df = spark.createDataFrame([('bus1', '082...'), ('bus1', '087...'), ('bus2', '076...'), ('bus3', '081...'),('bus3', '084...'),('bus3', '086...')], schema=["Name", "Phone"])
df.show()

+----+------+
|Name| Phone|
+----+------+
|bus1|082...|
|bus1|087...|
|bus2|076...|
|bus3|081...|
|bus3|084...|
|bus3|086...|
+----+------+

Collecting all the Phone values into an array using collect_list

from pyspark.sql.functions import *
from pyspark.sql.types import *

df1 = df.groupBy("Name").agg(collect_list(col("Phone")).alias("Phone")).select( "Name", "Phone")
df1.show(truncate=False)

+----+------------------------+
|Name|Phone                   |
+----+------------------------+
|bus1|[082..., 087...]        |
|bus2|[076...]                |
|bus3|[081..., 084..., 086...]|
+----+------------------------+

Splitting Phone into multiple columns

df1.select(['Name'] + [df1.Phone[x].alias(f"Phone{x+1}") for x in range(0,3)]).show(truncate=False)

+----+------+------+------+
|Name|Phone1|Phone2|Phone3|
+----+------+------+------+
|bus1|082...|087...|null  |
|bus2|076...|null  |null  |
|bus3|081...|084...|086...|
+----+------+------+------+

Solution 2:^[2]

First construct the phone array based on name, and then split the array into multiple columns.

df = df.groupBy('Name').agg(F.collect_list('Phone').alias('Phone'))
df = df.select('Name', *[F.col('Phone')[i].alias(f'Phone{str(i+1)}') for i in range(3)])
df.show(truncate=False)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	DKNY
Solution 2	è¿‡è¿‡æ‹›

'Convert Multiple columns into a single row with a variable amount of columns

Solution 1:[1]

Solution 2:[2]

Sources

Related Questions

Solution 1:^[1]

Solution 2:^[2]