Grouping and summing columns and eliminating duplicates in PySpark
I have a DataFrame like the one below in PySpark:
df = spark.createDataFrame(
    [
        ('14_100_00', 'A', 25, 0),
        ('14_100_00', 'A', 0, 24),
        ('15_100_00', 'A', 20, 1),
        ('150_100', 'C', 21, 0),
        ('16', 'A', 0, 20),
        ('16', 'A', 20, 0),
    ],
    ("rust", "name", "value_1", "value_2"),
)
df.show()
+---------+----+-------+-------+
| rust|name|value_1|value_2|
+---------+----+-------+-------+
|14_100_00| A| 25| 0|
|14_100_00| A| 0| 24|
|15_100_00| A| 20| 1|
| 150_100| C| 21| 0|
| 16| A| 0| 20|
| 16| A| 20| 0|
+---------+----+-------+-------+
I am trying to update the value_1 and value_2 columns based on the conditions below:
- when the rust and name columns are the same, use the sum of value_1 as value_1 for that group
- when the rust and name columns are the same, use the sum of value_2 as value_2 for that group
Expected result:
+---------+----+-------+-------+
| rust|name|value_1|value_2|
+---------+----+-------+-------+
|14_100_00| A| 25| 24|
|15_100_00| A| 20| 1|
| 150_100| C| 21| 0|
| 16| A| 20| 20|
+---------+----+-------+-------+
I have tried this:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

w = Window.partitionBy("rust", "name")
df1 = df.withColumn("VALUE_1", f.sum("VALUE_1").over(w)).withColumn("VALUE_2", f.sum("VALUE_2").over(w))
df1.show()
+---------+----+-------+-------+
| rust|name|VALUE_1|VALUE_2|
+---------+----+-------+-------+
| 150_100| C| 21| 0|
| 16| A| 20| 20|
| 16| A| 20| 20|
|14_100_00| A| 25| 24|
|14_100_00| A| 25| 24|
|15_100_00| A| 20| 1|
+---------+----+-------+-------+
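I know I can remove the duplicate rows afterwards, since every row within a (rust, name) group is identical after the window sums (a minimal sketch; dropDuplicates is the standard DataFrame method):

# Every row in a (rust, name) group is now identical, so dropping
# duplicates on the grouping keys keeps one row per group.
df1_dedup = df1.dropDuplicates(["rust", "name"])
df1_dedup.show()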
Is there a better way to achieve this without having duplicates?
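One alternative I have been considering is a plain groupBy aggregation, which avoids producing the duplicate rows in the first place (a sketch of what I have in mind, reusing the functions import from above; I have not confirmed it is the best option):

# Aggregate once per (rust, name) group instead of using a window,
# so each group yields exactly one output row.
df2 = (
    df.groupBy("rust", "name")
      .agg(
          f.sum("value_1").alias("value_1"),
          f.sum("value_2").alias("value_2"),
      )
)
df2.show()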
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow