PySpark - how to select all columns to be used in groupby
I'm trying to chain a join and a groupby operation together. The inputs and the operations I want to perform are shown below. I want to group by all the columns except the one used in agg. Is there a way to do this without listing out all the column names, like groupby("colA", "colB")? I tried groupby(df1.*) but that didn't work. In this case I know that I'd like to group by all the columns in df1. Many thanks.
Input1:

| colA | ColB |
|---|---|
| A | 100 |
| B | 200 |
Input2:

| colAA | ColBB |
|---|---|
| A | Group1 |
| B | Group2 |
| A | Group2 |
df1.join(df2, df1.colA == df2.colAA, "left").drop("colAA").groupby("colA", "colB").agg(collect_set("colBB"))
# Is there a way that I do not need to list ("colA", "colB") in groupby? There will be many columns.
Output:

| colA | ColB | collect_set |
|---|---|---|
| A | 100 | (Group1, Group2) |
| B | 200 | (Group2) |
Solution 1:[1]
Based on your clarifying comments, use df1.columns:

from pyspark.sql.functions import collect_set

df1.join(df2, df1.colA == df2.colAA, "left").drop("colAA").groupby(df1.columns).agg(collect_set("colBB").alias('new')).show()
+----+----+----------------+
|colA|ColB| new|
+----+----+----------------+
| A| 100|[Group2, Group1]|
| B| 200| [Group2]|
+----+----+----------------+
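For reference, here is a minimal end-to-end sketch of this answer, assuming a local SparkSession; the DataFrame contents mirror the question's inputs:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([("A", 100), ("B", 200)], ["colA", "ColB"])
df2 = spark.createDataFrame(
    [("A", "Group1"), ("B", "Group2"), ("A", "Group2")],
    ["colAA", "ColBB"],
)

# groupby accepts a list of column names, so df1.columns supplies
# every column of df1 without spelling them out
result = (
    df1.join(df2, df1.colA == df2.colAA, "left")
    .drop("colAA")
    .groupby(df1.columns)
    .agg(collect_set("ColBB").alias("new"))
)
result.show()
```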
Solution 2:[2]
It's as simple as:
.groupby(df1.columns)
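If the rule is really "group by everything except the aggregated column", the list can also be derived from the joined frame itself rather than from df1. A small sketch, assuming the df1 and df2 defined above and that ColBB is the only column being aggregated:

```python
from pyspark.sql.functions import collect_set

joined = df1.join(df2, df1.colA == df2.colAA, "left").drop("colAA")

# keep every column except the one being aggregated
group_cols = [c for c in joined.columns if c != "ColBB"]
joined.groupby(group_cols).agg(collect_set("ColBB").alias("new")).show()
```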
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | wwnde |
| Solution 2 | Wojciech Pawlik |
