'Generate a report of mismatch Columns between 2 Pyspark dataframes
Team, we have a requirement to generate a report of mismatched columns based on key field between 2 Pyspark dataframes of exactly same structure.
Here is first dataframe-
>>> df.show()
+--------+----+----+----+----+----+----+----+----+
| key|col1|col2|col3|col4|col5|col6|col7|col8|
+--------+----+----+----+----+----+----+----+----+
| abcd| 123| xyz| a| ab| abc| def| qew| uvw|
| abcd1| 123| xyz| a| ab| abc| def| qew| uvw|
| abcd12| 123| xyz| a| ab| abc| def| qew| uvw|
| abcd123| 123| xyz| a| ab| abc| def| qew| uvw|
|abcd1234| 123| xyz| a| ab| abc| def| qew| uvw|
+--------+----+----+----+----+----+----+----+----+
And here is 2nd dataframe-
>>> df1.show()
+--------+----+----+----+----+----+----+----+----+
| key|col1|col2|col3|col4|col5|col6|col7|col8|
+--------+----+----+----+----+----+----+----+----+
| abcd| 123| xyz| a| ab| abc| def| qew| uvw|
| abcdx| 123| xyz| a| ab| abc| def| qew| uvw|
| abcd12| 123| xyz| a| abx| abc|defg| qew| uvw|
| abcd123| 123| xyz| a| ab| abc|defg| qew| uvw|
|abcd1234| 123| xyz| a| ab|abcd|defg| qew| uvw|
+--------+----+----+----+----+----+----+----+----+
Full Outer Join gives me this-
>>> dfFull=df.join(df1,'key','outer')
>>> dfFull.show()
+--------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| key|col1|col2|col3|col4|col5|col6|col7|col8|col1|col2|col3|col4|col5|col6|col7|col8|
+--------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| abcd12| 123| xyz| a| ab| abc| def| qew| uvw| 123| xyz| a| abx| abc|defg| qew| uvw|
| abcd1| 123| xyz| a| ab| abc| def| qew| uvw|null|null|null|null|null|null|null|null|
|abcd1234| 123| xyz| a| ab| abc| def| qew| uvw| 123| xyz| a| ab|abcd|defg| qew| uvw|
| abcd123| 123| xyz| a| ab| abc| def| qew| uvw| 123| xyz| a| ab| abc|defg| qew| uvw|
| abcdx|null|null|null|null|null|null|null|null| 123| xyz| a| ab| abc| def| qew| uvw|
| abcd| 123| xyz| a| ab| abc| def| qew| uvw| 123| xyz| a| ab| abc| def| qew| uvw|
+--------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
if i just look at col6, there are 5 values which mismatched for the "key" field (only value match is for last record).
>>> dfFull.select('key',df['col6'],df1['col6']).show()
+--------+----+----+
| key|col6|col6|
+--------+----+----+
| abcd12| def|defg|
| abcd1| def|null|
|abcd1234| def|defg|
| abcd123| def|defg|
| abcdx|null| def|
| abcd| def| def|
+--------+----+----+
I need to generate a report of something like this for all the columns. The mismatch sample can be any record's value from dataframes.
colName,NumofMismatch,mismatchSampleFromDf,misMatchSamplefromDf1
col6,5,def,defg
col7,2,null,qew
col8,2,null,uvw
col5,3,null,abc
It is a column wise summary based on key, saying how many values are mismatch between 2 dataframes.
Sid
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
