Spark: replace all values smaller than X by their sum

I have a dataframe that has a type and a sub-type (broadly speaking). Say something like:

[image: example dataframe with Type, Sub-Type and Value columns]

What I'd like to do is, for each type, sum all values that are smaller than X (say 100 here) and replace those rows with a single row whose sub-type is "Other", i.e.

[image: expected result, where for each Type the rows with Value below 100 are collapsed into one "Other" row holding their sum]

Using a window over(Type), I guess I could split into two dfs (< 100 and >= 100), sum the first, pick one row and hack it to get the single "Other" row, and union the result with the >= 100 one. But that seems a rather clumsy way to do it?

(Apologies, I don't have access to PySpark right now to write out the code.)
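For illustration, a rough, untested sketch of the split/sum/union idea described above (assuming columns named Type, Sub-Type and Value) might look like:

import pyspark.sql.functions as F

# Split the dataframe into the small rows and the rows to keep as-is
small = df.filter(F.col('Value') < 100)
large = df.filter(F.col('Value') >= 100)

# Collapse the small rows into one "Other" row per Type
other = (
    small
    .groupby('Type')
    .agg(F.sum('Value').alias('Value'))
    .withColumn('Sub-Type', F.lit('Other'))
    .select('Type', 'Sub-Type', 'Value')
)

# Glue the untouched rows and the "Other" rows back together
result = large.select('Type', 'Sub-Type', 'Value').unionByName(other)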



Solution 1:[1]

You could simply replace Sub-Type with "Other" for all rows with Value < 100 and then group by and sum:

import pyspark.sql.functions as F

(
    df
    # Relabel every small row as "Other", then aggregate per (Type, Sub-Type)
    .withColumn('Sub-Type', F.when(F.col('Value') < 100, 'Other').otherwise(F.col('Sub-Type')))
    .groupby('Type', 'Sub-Type')
    .agg(
        F.sum('Value').alias('Value')
    )
)
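
For reference, a minimal self-contained run of this approach (with made-up sample data, since the original tables only exist as images in the question) could look like:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data standing in for the question's dataframe
df = spark.createDataFrame(
    [('A', 'x', 250), ('A', 'y', 40), ('A', 'z', 30), ('B', 'u', 300), ('B', 'v', 80)],
    ['Type', 'Sub-Type', 'Value'],
)

result = (
    df
    .withColumn('Sub-Type', F.when(F.col('Value') < 100, 'Other').otherwise(F.col('Sub-Type')))
    .groupby('Type', 'Sub-Type')
    .agg(F.sum('Value').alias('Value'))
)
result.show()
# For this sample, A/x and B/u stay as-is, while the smaller rows collapse
# into one 'Other' row per Type (A/Other = 70, B/Other = 80).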

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Til Piffl