Loop through columns in a dataframe and add a new column with the first non-null value found, using PySpark
I'm new to PySpark and trying to figure out how to achieve the desired result below.
I have a dataframe that contains several columns. I want to loop through the columns id1, id2 and id3, and once the first non-null value is found, add a new column containing that value. Once the value is found, no further looping is needed for that record.
The dataframe:
| name | id1 | hobby | id2 | gender | id3 | language |
|---|---|---|---|---|---|---|
| Mike | AAA-BBB | Fishing | | M | AAA-BBB | Eng |
| Louis | | | ABC-DDD | M | | |
| Peter | DSA-SDF | Hunting | DSA-SDF | M | DSA-SDF | Eng |
The desired dataframe:
| name | id1 | hobby | id2 | gender | id3 | language | id |
|---|---|---|---|---|---|---|---|
| Mike | AAA-BBB | Fishing | | M | AAA-BBB | Eng | AAA-BBB |
| Louis | | | ABC-DDD | M | | | ABC-DDD |
| Peter | DSA-SDF | Hunting | DSA-SDF | M | DSA-SDF | Eng | DSA-SDF |
Any help would be greatly appreciated.
Solution 1:[1]
You can do that with the following:
from pyspark.sql import functions

# Take id1 if it is not null, otherwise fall back to id2, then id3.
# A when-chain without an otherwise() leaves the new column null
# when all three id columns are null.
df = (df.withColumn("id", functions.when(df["id1"].isNotNull(), df["id1"])
                                   .when(df["id2"].isNotNull(), df["id2"])
                                   .when(df["id3"].isNotNull(), df["id3"])))
df.show()
+-----+-------+-------+-------+------+-------+--------+-------+
| name|    id1|  hobby|    id2|gender|    id3|language|     id|
+-----+-------+-------+-------+------+-------+--------+-------+
| Mike|AAA-BBB|Fishing|   null|     M|AAA-BBB|     Eng|AAA-BBB|
|Louis|   null|   null|ABC-DDD|     M|   null|    null|ABC-DDD|
|Peter|DSA-SDF|Hunting|DSA-SDF|     M|DSA-SDF|     Eng|DSA-SDF|
+-----+-------+-------+-------+------+-------+--------+-------+
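If the list of id columns grows or is built dynamically, the same result can be written more concisely with functions.coalesce, which returns the first non-null value among its arguments. A minimal sketch of that alternative, assuming the same id1/id2/id3 columns as above:

from pyspark.sql import functions

# coalesce scans its arguments left to right and keeps the first
# non-null value, equivalent to the when-chain above.
df = df.withColumn("id", functions.coalesce(df["id1"], df["id2"], df["id3"]))

# The same works for an arbitrary list of column names:
id_cols = ["id1", "id2", "id3"]
df = df.withColumn("id", functions.coalesce(*[functions.col(c) for c in id_cols]))

Because coalesce evaluates to the first non-null argument for each row, it matches the "no more looping once a value is found" behaviour asked for in the question.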
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | BoomBoxBoy |
