'Get single column from dummy columns in PySpark
I have a PySpark df like the following, where event, 2020-01 and 2020-02 are dummies (in total I have 18 month).
df = (
sc.parallelize([
("A", 1, 0, 1), ("B", 0, 1, 0), ("C", 0, 1, 0),
("D", 1, 1, 1),
]).toDF(["id", "event", "2020-01", "2020-02"])
)
id event 2020-01 2020-02
A 1 0 1
B 0 1 0
C 0 1 0
D 1 1 1
and I want to create a new df with id, event and month, where month is a column created from "2020-01" and "2020-02" where those columns == 1. So the desired df would look like the following:
id event month
A 1 2020-02
B 0 2020-01
C 0 2020-01
D 1 2020-01
D 1 2020-02
I found a pandas solution here, which does what I am looking for but my df is to big for pandas. I did not manage to get this solution to work. It just created each month for each ID.
Solution 1:[1]
You can do that with the following
from spark.sql import functions
df1 = df.select("id", "event").where(df["2020-01"] == 1).withColumn("month", functions.lit("2020-01"))
df2 = df.select("id", "event").where(df["2020-02"] == 1).withColumn("month", functions.lit("2020-02"))
df1 = df1.unionAll(df2).orderBy("id")
df1.show()
+---+-----+-------+
| id|event| month|
+---+-----+-------+
| A| 1|2020-02|
| B| 0|2020-01|
| C| 0|2020-01|
| D| 1|2020-01|
| D| 1|2020-02|
+---+-----+-------+
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | BoomBoxBoy |
