PySpark - aggregation
Say I have a dataframe as below:

| mid | bid | m_date1 | m_date2 | m_date3 |
|---|---|---|---|---|
| 100 | ws | | | 2022-02-01 |
| 200 | gs | 2022-02-01 | | |
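For reproducibility, here is a minimal sketch of how such a frame can be built, assuming the blank cells are nulls and the dates start out as strings:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample frame from the question; blank cells are assumed to be nulls.
df = spark.createDataFrame(
    [(100, 'ws', None, None, '2022-02-01'),
     (200, 'gs', '2022-02-01', None, None)],
    'mid int, bid string, m_date1 string, m_date2 string, m_date3 string',
)

# Cast the date strings to DateType so the date functions below apply cleanly.
df = df.select('mid', 'bid',
               *[F.to_date(c).alias(c) for c in ['m_date1', 'm_date2', 'm_date3']])
```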
Now I have an SQL aggregation as below:
SELECT
  mid,
  bid,
  min(next_day(m_date1, 'SAT')) AS dat1,
  min(next_day(m_date2, 'SAT')) AS dat2,
  min(next_day(m_date3, 'SAT')) AS dat3
FROM df
GROUP BY 1, 2
I am looking to implement the above aggregation in PySpark, but I am wondering whether I can use some form of iteration to produce dat1, dat2 and dat3, since the same min function is applied to each of those columns. I could use the aggregation syntax below for each column, but I want to avoid repeating the min function on every aggregated column.
df.groupBy('mid','bid').agg(...)
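Spelled out per column, the repetitive version I want to avoid would look something like this sketch (using next_day from pyspark.sql.functions):

```python
from pyspark.sql import functions as F

# Repetitive version: the same min(next_day(...)) pattern is written
# out by hand for every date column.
result = df.groupBy('mid', 'bid').agg(
    F.min(F.next_day('m_date1', 'Sat')).alias('dat1'),
    F.min(F.next_day('m_date2', 'Sat')).alias('dat2'),
    F.min(F.next_day('m_date3', 'Sat')).alias('dat3'),
)
```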
Thank you
Solution 1:[1]
A sample output would have been better. If I understood you correctly, you are after:
from pyspark.sql import functions as F
df.groupby('mid', 'bid').agg(*[F.min(i).alias(f"min{i}") for i in df.drop('mid', 'bid').columns]).show()
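To match the SQL exactly (a min over next_day rather than a plain min), the same comprehension can wrap each column in next_day first; a sketch, assuming the dat1/dat2/dat3 alias scheme from the question:

```python
from pyspark.sql import functions as F

# Same loop, but each date column is wrapped in next_day() so the
# aggregation matches the SQL: min(next_day(col, 'SAT')).
date_cols = df.drop('mid', 'bid').columns
df.groupBy('mid', 'bid').agg(
    *[F.min(F.next_day(c, 'Sat')).alias(f'dat{i}')
      for i, c in enumerate(date_cols, start=1)]
).show()
```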
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | wwnde |
