PySpark: add multiple columns based on categories from another column
I have a dataset that looks like this:
id | category | value
---+----------+------
 1 | a        | 3
 2 | a        | 3
 3 | a        | 3
 3 | b        | 1
 4 | a        | 1
 4 | b        | abc
The output I want is:
id | category_a | category_b
---+------------+-----------
 1 | 3          | null
 2 | 3          | null
 3 | 3          | 1
 4 | 1          | abc
In other words, I want to group by id, pivot on category, and create one dummy column per category. How can I transform the input into the expected output?
My approach is:
pivoted_df = df.groupBy("id") \
    .pivot("category") \
    .agg(F.lit(F.col("value")))
But I got this error:
pyspark.sql.utils.AnalysisException: Aggregate expression required for pivot, but '`value`' did not appear in any aggregate function.;
Update: the value column also contains non-numeric values.
For the category column, each id has at most two rows, one for each of the two categories a and b.
Solution 1:[1]
The error occurs because pivot requires every pivoted value to be wrapped in an aggregate function, and F.lit is not one. Since each (id, category) pair has at most one row, F.first safely picks that single value:
df = df.groupBy('id').pivot('category').agg(F.first('value'))
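A minimal, self-contained sketch of the above (the sample DataFrame mirrors the question's data; note that pivot names the new columns after the raw category values, so the renames to category_a/category_b are an assumption to match the expected output):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; "value" is a string column
# because it mixes numbers with text such as "abc".
df = spark.createDataFrame(
    [(1, "a", "3"), (2, "a", "3"), (3, "a", "3"),
     (3, "b", "1"), (4, "a", "1"), (4, "b", "abc")],
    ["id", "category", "value"],
)

# Passing the category list to pivot() avoids an extra pass
# over the data to discover the distinct values.
pivoted_df = (
    df.groupBy("id")
      .pivot("category", ["a", "b"])
      .agg(F.first("value"))
)

# pivot() names its output columns "a" and "b"; rename them
# to match the expected output.
pivoted_df = (
    pivoted_df
    .withColumnRenamed("a", "category_a")
    .withColumnRenamed("b", "category_b")
)

pivoted_df.orderBy("id").show()
Because each (id, category) pair occurs at most once here, F.first is deterministic; with duplicate pairs it would pick an arbitrary row, and an explicit aggregate such as F.max or F.collect_list would be the safer choice.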
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
