How to compare two dataframes and calculate the differences in PySpark?

I have two dataframes and I am trying to write a function that compares them and returns the net change in each affected column.

DF1:

+---------------+------+------+-------+----------+
| City          | Temp | Zone | Score | Activity |
+---------------+------+------+-------+----------+
| Atlanta       | 10   | 1    | 100   | 400      |
+---------------+------+------+-------+----------+
| Chicago       | 100  | 2    | 200   | 500      |
+---------------+------+------+-------+----------+
| Boston        | 100  | 3    | 300   | 600      |
+---------------+------+------+-------+----------+
| San Francisco | 1000 | 4    | 400   | 700      |
+---------------+------+------+-------+----------+

DF2:

+---------------+------+------+-------+----------+
| City          | Temp | Zone | Score | Activity |
+---------------+------+------+-------+----------+
| Atlanta       | 10   | 1    | 150   | 400      |
+---------------+------+------+-------+----------+
| Chicago       | 100  | 2    | 200   | 450      |
+---------------+------+------+-------+----------+
| Boston        | 100  | 3    | 300   | 650      |
+---------------+------+------+-------+----------+
| San Francisco | 1200 | 4    | 400   | 750      |
+---------------+------+------+-------+----------+

I would like the result to be like:

+---------------+------+------+-------+----------+
| City          | Temp | Zone | Score | Activity |
+---------------+------+------+-------+----------+
| Atlanta       | 0    | 0    | 50    | 0        |
+---------------+------+------+-------+----------+
| Boston        | 0    | 0    | 0     | -50      |
+---------------+------+------+-------+----------+
| San Francisco | 200  | 0    | 0     | 50       |
+---------------+------+------+-------+----------+

I am very new to PySpark and am wondering how I can achieve this.

I tried df2.subtract(df1), but it only shows me the rows in df2 that are not in df1, which is not very helpful when I just want to see the net change in each column.

Notes: City name is the unique identifier. Each row is different.

Appreciate your help!



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
