PySpark - how to select all columns to be used in groupby
I'm trying to chain a join and a groupby operation together. The inputs and the operations I want to perform are shown below. I want to group by all the columns except the one used in agg. Is there a way to do this without listing out all the column names, like groupby("colA", "colB")? I tried groupby(df1.*) but that didn't work. In this case I know that I'd like to group by all the columns in df1. Many thanks.
Input1:

| colA | ColB |
|---|---|
| A | 100 |
| B | 200 |
Input2:

| colAA | ColBB |
|---|---|
| A | Group1 |
| B | Group2 |
| A | Group2 |
df1.join(df2, df1.colA == df2.colAA, "left").drop("colAA").groupby("colA", "colB").agg(collect_set("colBB"))
# Is there a way that I do not need to list ("colA", "colB") in groupby? There will be many columns.
Output:

| colA | ColB | collect_set |
|---|---|---|
| A | 100 | (Group1, Group2) |
| B | 200 | (Group2) |
Solution 1:[1]
Based on your clarifying comments, use df1.columns:

from pyspark.sql.functions import collect_set

df1.join(df2, df1.colA == df2.colAA, "left").drop("colAA").groupby(df1.columns).agg(collect_set("colBB").alias('new')).show()
+----+----+----------------+
|colA|ColB| new|
+----+----+----------------+
| A| 100|[Group2, Group1]|
| B| 200| [Group2]|
+----+----+----------------+
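For reference, here is a minimal end-to-end sketch of this answer, assuming a local SparkSession; the DataFrame contents mirror the question's inputs:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([("A", 100), ("B", 200)], ["colA", "ColB"])
df2 = spark.createDataFrame(
    [("A", "Group1"), ("B", "Group2"), ("A", "Group2")],
    ["colAA", "ColBB"],
)

# groupby accepts a list of column names, so df1.columns supplies
# every column of df1 without spelling them out
result = (
    df1.join(df2, df1.colA == df2.colAA, "left")
    .drop("colAA")
    .groupby(df1.columns)
    .agg(collect_set("ColBB").alias("new"))
)
result.show()
```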
Solution 2:[2]
It's as simple as:
.groupby(df1.columns)
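If the rule is really "group by everything except the aggregated column", the list can also be derived from the joined frame itself rather than from df1. A small sketch, assuming the df1 and df2 defined above and that ColBB is the only column being aggregated:

```python
from pyspark.sql.functions import collect_set

joined = df1.join(df2, df1.colA == df2.colAA, "left").drop("colAA")

# keep every column except the one being aggregated
group_cols = [c for c in joined.columns if c != "ColBB"]
joined.groupby(group_cols).agg(collect_set("ColBB").alias("new")).show()
```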
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | wwnde |
| Solution 2 | Wojciech Pawlik |
