Pyspark dataframe grouping
I'm using PySpark and I have a DataFrame like the one below:
| id | time | group |
|---|---|---|
| 1 | 4 | A |
| 1 | 14 | A |
| 1 | 22 | B |
| 2 | 16 | B |
| 2 | 23 | B |
| 2 | 100 | C |
| 3 | 13 | C |
| 3 | 10 | C |
I want to build a new column "result" that maps each group to the sum of its times, like this:
| id | result |
|:---|:---|
| 1 | [A -> 18, B -> 22] |
| 2 | [B -> 39, C -> 100] |
| 3 | [C -> 23] |
Solution 1:[1]
You need two levels of aggregation: first sum "time" per (id, group) pair, then collect the per-group sums into a map per id:
from pyspark.sql import functions as F

out = (
    df.groupBy("id", "group")
    .agg(F.sum("time").alias("time"))
    .groupBy("id")
    .agg(F.map_from_arrays(F.collect_list("group"),
                           F.collect_list("time")).alias("result"))
)
out.show()
+---+-------------------+
| id| result|
+---+-------------------+
| 1| {B -> 22, A -> 18}|
| 3| {C -> 23}|
| 2|{B -> 39, C -> 100}|
+---+-------------------+
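To sanity-check the expected numbers without a Spark session, the same two-level aggregation can be sketched in plain Python (a minimal illustration of the logic, not Spark code; the data is hard-coded from the question's table):

```python
from collections import defaultdict

# Rows from the question: (id, time, group)
rows = [
    (1, 4, "A"), (1, 14, "A"), (1, 22, "B"),
    (2, 16, "B"), (2, 23, "B"), (2, 100, "C"),
    (3, 13, "C"), (3, 10, "C"),
]

# Level 1: sum "time" for each (id, group) pair.
sums = defaultdict(int)
for id_, time, group in rows:
    sums[(id_, group)] += time

# Level 2: collect the per-group sums into one map per id.
result = defaultdict(dict)
for (id_, group), total in sums.items():
    result[id_][group] = total

print(dict(result))
# {1: {'A': 18, 'B': 22}, 2: {'B': 39, 'C': 100}, 3: {'C': 23}}
```

This mirrors the `groupBy("id", "group").agg(sum)` step followed by the `groupBy("id")` map-building step, and confirms the totals shown in the output above.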
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | anky |
