Grouping and summing columns and eliminating duplicates in PySpark
I have a DataFrame like the one below in PySpark:
df = spark.createDataFrame(
    [
        ('14_100_00', 'A', 25, 0),
        ('14_100_00', 'A', 0, 24),
        ('15_100_00', 'A', 20, 1),
        ('150_100', 'C', 21, 0),
        ('16', 'A', 0, 20),
        ('16', 'A', 20, 0),
    ],
    ("rust", "name", "value_1", "value_2"),
)
df.show()
+---------+----+-------+-------+
| rust|name|value_1|value_2|
+---------+----+-------+-------+
|14_100_00| A| 25| 0|
|14_100_00| A| 0| 24|
|15_100_00| A| 20| 1|
| 150_100| C| 21| 0|
| 16| A| 0| 20|
| 16| A| 20| 0|
+---------+----+-------+-------+
I am trying to update the value_1 and value_2 columns based on the conditions below:
- when the rust and name columns are the same, use the sum of value_1 as value_1 for that group
- when the rust and name columns are the same, use the sum of value_2 as value_2 for that group
Expected result:
+---------+----+-------+-------+
| rust|name|value_1|value_2|
+---------+----+-------+-------+
|14_100_00| A| 25| 24|
|15_100_00| A| 20| 1|
| 150_100| C| 21| 0|
| 16| A| 20| 20|
+---------+----+-------+-------+
I have tried this:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

w = Window.partitionBy("rust", "name")
df1 = df.withColumn("VALUE_1", f.sum("VALUE_1").over(w)).withColumn("VALUE_2", f.sum("VALUE_2").over(w))
df1.show()
+---------+----+-------+-------+
| rust|name|VALUE_1|VALUE_2|
+---------+----+-------+-------+
| 150_100| C| 21| 0|
| 16| A| 20| 20|
| 16| A| 20| 20|
|14_100_00| A| 25| 24|
|14_100_00| A| 25| 24|
|15_100_00| A| 20| 1|
+---------+----+-------+-------+
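I know I can remove the duplicate rows afterwards, since every row within a (rust, name) group is identical after the window sums (a minimal sketch; dropDuplicates is the standard DataFrame method):

# Every row in a (rust, name) group is now identical, so dropping
# duplicates on the grouping keys keeps one row per group.
df1_dedup = df1.dropDuplicates(["rust", "name"])
df1_dedup.show()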
Is there a better way to achieve this without having duplicates?
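One alternative I have been considering is a plain groupBy aggregation, which avoids producing the duplicate rows in the first place (a sketch of what I have in mind, reusing the functions import from above; I have not confirmed it is the best option):

# Aggregate once per (rust, name) group instead of using a window,
# so each group yields exactly one output row.
df2 = (
    df.groupBy("rust", "name")
      .agg(
          f.sum("value_1").alias("value_1"),
          f.sum("value_2").alias("value_2"),
      )
)
df2.show()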
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow