How to pass a parameter to the dictionary input of the PySpark agg function
From the PySpark docs, I can do:
gdf = df.groupBy(df.name)
sorted(gdf.agg({"*": "first"}).collect())
In my actual use case I have many variables, so I like that I can simply pass a dictionary. However, I need the equivalent of:
gdf = df.groupBy(df.name)
sorted(gdf.agg(F.first(col, ignorenulls=True)).collect())
so @lemon's suggestion won't work for me. How can I pass a parameter to first (i.e. ignorenulls=True)?
Solution 1:[1]
You can use list comprehension.
gdf.agg(*[F.first(x, ignorenulls=True).alias(x) for x in df.columns]).collect()
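The key mechanism here is the leading `*`, which unpacks the list of per-column expressions into separate arguments to `agg`. A minimal pure-Python sketch of that mechanism (no Spark session needed; `agg` below is a hypothetical stand-in for `GroupedData.agg`, and the expressions are modeled as strings):

```python
# Hypothetical stand-in for pyspark's GroupedData.agg, which accepts
# any number of column expressions: gdf.agg(expr1, expr2, ...)
def agg(*exprs):
    return list(exprs)

columns = ["name", "age", "city"]

# Build one "first(col, ignorenulls=True)" expression per column,
# mirroring F.first(x, ignorenulls=True).alias(x) in the solution.
exprs = [f"first({c}, ignorenulls=True) AS {c}" for c in columns]

# The * unpacks the list, so this is equivalent to
# agg(exprs[0], exprs[1], exprs[2]).
result = agg(*exprs)
```

The same pattern works for any PySpark aggregate function that takes extra parameters, since you call the function yourself inside the comprehension rather than naming it in a dict.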
Solution 2:[2]
Try calling the PySpark function directly, with the parameters stored in a dictionary:
import pyspark.sql.functions as F
gdf = df.groupBy(df.name)
parameters = {'col': '<your_column_name>', 'ignorenulls': True}
sorted(gdf.agg(F.first(**parameters)).collect())
Does it work for you?
P.S. ignorenulls is False by default, which is why it must be passed explicitly here.
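Solution 2 relies on `**` unpacking, which turns a dictionary into keyword arguments. A minimal sketch of just that mechanism (`first_` below is a hypothetical stand-in for `F.first`, whose signature is `first(col, ignorenulls=False)`):

```python
# Hypothetical stand-in for pyspark.sql.functions.first, which has
# the signature first(col, ignorenulls=False).
def first_(col, ignorenulls=False):
    return f"first({col}, ignorenulls={ignorenulls})"

# The dict keys must match the function's parameter names exactly.
parameters = {"col": "age", "ignorenulls": True}

# ** unpacks the dict into keyword arguments, so this is equivalent
# to first_(col="age", ignorenulls=True).
expr = first_(**parameters)
```

Because the keys are matched by name, a misspelled key (e.g. `'ignoreNulls'` vs `'ignorenulls'`) raises a TypeError rather than being silently ignored.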
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Emma |
| Solution 2 | |
