'How to compare two dataframes and calculate the differences in PySpark?
I have two dataframes and I am trying to write a function to compare the two dataframes so that it will return me the net changes to columns that are impacted.
DF1:
+---------------+------+------+-------+----------+
| City | Temp | Zone | Score | Activity |
+---------------+------+------+-------+----------+
| Atlanta | 10 | 1 | 100 | 400 |
+---------------+------+------+-------+----------+
| Chicago | 100 | 2 | 200 | 500 |
+---------------+------+------+-------+----------+
| Boston | 100 | 3 | 300 | 600 |
+---------------+------+------+-------+----------+
| San Francisco | 1000 | 4 | 400 | 700 |
+---------------+------+------+-------+----------+
DF2:
+---------------+------+------+-------+----------+
| City | Temp | Zone | Score | Activity |
+---------------+------+------+-------+----------+
| Atlanta | 10 | 1 | 150 | 400 |
+---------------+------+------+-------+----------+
| Chicago | 100 | 2 | 200 | 450 |
+---------------+------+------+-------+----------+
| Boston | 100 | 3 | 300 | 650 |
+---------------+------+------+-------+----------+
| San Francisco | 1200 | 4 | 400 | 750 |
+---------------+------+------+-------+----------+
I would like the result to be like:
+---------------+------+------+-------+----------+
| City | Temp | Zone | Score | Activity |
+---------------+------+------+-------+----------+
| Atlanta | 0 | 0 | 50 | 0 |
+---------------+------+------+-------+----------+
| Boston | 0 | 0 | 0 | -50 |
+---------------+------+------+-------+----------+
| San Francisco | 200 | 0 | 0 | 50 |
+---------------+------+------+-------+----------+
I am very new to PySpark, and am wondering how can I achieve this in PySpark?
I tried to do df2.substract(df1) but it just shows me the row in df2 that was not in df1, which is not very straightforward, if I just want to see net changes happened to any columns.
Notes: City name is the unique identifier. Each row is different.
Appreciate your help!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
