Spark: replace all values smaller than X by their sum
I have a dataframe, that has a type and a sub type (broadly speaking). Say something like:
What I'd like to do, for each type, is to sum all values that are smaller than X (say 100 here) and replace those rows with a single row whose sub-type is "Other", i.e.:
Using a window over (Type), I guess I could make two dfs (< 100 and >= 100); in the first I'd compute the sum, pick one row and hack it into the single "Other" row, then union the result with the >= 100 df. But that seems a rather clumsy way to do it?
(apologies, I don't have access to pyspark right now to do some code).
Solution 1:[1]
You could simply replace Sub-Type with "Other" for all rows with Value < 100 and then group by and sum:
from pyspark.sql import functions as F

(
    df
    # Relabel every row whose Value is below the threshold as "Other"
    .withColumn('Sub-Type', F.when(F.col('Value') < 100, 'Other').otherwise(F.col('Sub-Type')))
    # Summing then collapses all "Other" rows within each Type into a single row
    .groupby('Type', 'Sub-Type')
    .agg(
        F.sum('Value').alias('Value')
    )
)
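For reference, the same relabel-then-aggregate logic can be sketched in plain Python without a Spark runtime; the rows and column names below are hypothetical illustration data, not from the question:

```python
from collections import defaultdict

# Hypothetical example rows: (type, sub_type, value)
rows = [
    ('A', 'x', 250),
    ('A', 'y', 40),
    ('A', 'z', 30),
    ('B', 'x', 120),
    ('B', 'y', 90),
]

def collapse_small(rows, threshold=100):
    """Relabel sub-types whose value < threshold as 'Other',
    then sum values per (type, sub_type) key."""
    totals = defaultdict(int)
    for typ, sub, val in rows:
        # Same idea as F.when(...).otherwise(...): rewrite the key first
        key = (typ, 'Other' if val < threshold else sub)
        # Same idea as groupby(...).agg(F.sum(...)): accumulate per key
        totals[key] += val
    return dict(totals)

print(collapse_small(rows))
# {('A', 'x'): 250, ('A', 'Other'): 70, ('B', 'x'): 120, ('B', 'Other'): 90}
```

Because the relabelling happens before the aggregation, all small sub-types within one type naturally merge into one "Other" row, which is exactly why no split/union is needed.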
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Til Piffl |


