PySpark: add multiple columns based on categories from another column
I have a dataset that looks like this:
id | category | value
---+----------+------
 1 | a        | 3
 2 | a        | 3
 3 | a        | 3
 3 | b        | 1
 4 | a        | 1
 4 | b        | abc
The output I want is:
id | category_a | category_b
---+------------+-----------
 1 | 3          | null
 2 | 3          | null
 3 | 3          | 1
 4 | 1          | abc
In other words, I want to group by id, pivot on category, and create one dummy column per category. How can I transform the input into the expected output?
My approach is:
pivoted_df = df.groupBy("id") \
    .pivot("category") \
    .agg(F.lit(F.col("value")))
But I got this error:
pyspark.sql.utils.AnalysisException: Aggregate expression required for pivot, but '`value`' did not appear in any aggregate function.;
Update: the value column also contains non-numeric values.
For the category column, each id has at most two rows, one for each of the two categories a and b.
Solution 1:[1]
The error occurs because pivot requires every pivoted value to be wrapped in an aggregate function, and F.lit is not one. Since each (id, category) pair has at most one row, F.first safely picks that single value:
df = df.groupBy('id').pivot('category').agg(F.first('value'))
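A minimal, self-contained sketch of the above (the sample DataFrame mirrors the question's data; note that pivot names the new columns after the raw category values, so the renames to category_a/category_b are an assumption to match the expected output):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; "value" is a string column
# because it mixes numbers with text such as "abc".
df = spark.createDataFrame(
    [(1, "a", "3"), (2, "a", "3"), (3, "a", "3"),
     (3, "b", "1"), (4, "a", "1"), (4, "b", "abc")],
    ["id", "category", "value"],
)

# Passing the category list to pivot() avoids an extra pass
# over the data to discover the distinct values.
pivoted_df = (
    df.groupBy("id")
      .pivot("category", ["a", "b"])
      .agg(F.first("value"))
)

# pivot() names its output columns "a" and "b"; rename them
# to match the expected output.
pivoted_df = (
    pivoted_df
    .withColumnRenamed("a", "category_a")
    .withColumnRenamed("b", "category_b")
)

pivoted_df.orderBy("id").show()
Because each (id, category) pair occurs at most once here, F.first is deterministic; with duplicate pairs it would pick an arbitrary row, and an explicit aggregate such as F.max or F.collect_list would be the safer choice.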
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
