How to generate columns based on the unique values of a particular column in PySpark?

I have a dataframe as below

+----------+------------+---------------------+
|CustomerNo|size        |total_items_purchased|
+----------+------------+---------------------+
|  208261.0|          A |                    2|
|  208263.0|          C |                    1|
|  208261.0|          E |                    1|
|  208262.0|          B |                    2|
|  208264.0|          D |                    3|
+----------+------------+---------------------+

I have another dataframe, df, that contains CustomerNo values only. I have to create a column for each unique size and update the total_items_purchased in df.

My df table should look like

CustomerNo  size_A  size_B  size_C  size_D  size_E
208261.0         2       0       0       0       1
208262.0         0       2       0       0       0
208263.0         0       0       1       0       0
208264.0         0       0       0       3       0

Can anyone tell me how to do this?



Solution 1:[1]

You can use the pivot function to reshape the table:

from pyspark.sql import functions as F

# One row per customer, one column per distinct size value;
# missing customer/size combinations become 0.
df = (df.groupBy('CustomerNo')
      .pivot('size')
      .agg(F.first('total_items_purchased'))
      .na.fill(0))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Emma