'How do you filter a dataframe to make sure values in two columns differ in Pyspark?

I am trying to compare two tables with the same columns and then return columns that conflict. For example: Table A:

emp_id emp_name
1 John
2 Mary

Table B:

emp_id emp_name
1 John
2 Karen
3 Steve

In this instance, I want to know that two different names conflict for 2. I do not care that there is an entry in one table that is not in the other, and I don't care if the entry is in both tables if they do not conflict.

So far my approach was to rename the columns as emp_name1 & 2, join the tables and then filter out null values meaning the name only appears in one list this way:

df = df.join(df2, how = 'outer', on = ['emp_id'])
#filter out null vals (meaning no conflict)
df = df.filter((df.emp_name1.isNotNull()) &(df.emp_name2.isNotNull()))

The next step would be to compare the values to see if they are the same, but when I try to do this, it does not work: df = df.filter((df.emp_name1 = df.emp_name2))

Is there a way to compare columns to each other in this way?



Solution 1:[1]

combined_df = pd.concat([df, df2])
print(combined_df[combined_df.duplicated()])

Output:

   emp_id emp_name
0       1     John

Or, since I'm not exactly sure what you're asking;

print(combined_df[~combined_df.duplicated()])
...
   emp_id emp_name
0       1     John
1       2     Mary
1       2    Karen
2       3    Steve
print(combined_df[~combined_df.duplicated(keep=False)])
...
   emp_id emp_name
1       2     Mary
1       2    Karen
2       3    Steve

Solution 2:[2]

You can join the tables on emp id as you have already done. Then use where otherwise clause to check if there is conflict. Assuming col name after joining as : emp_id, emp_name_1 and emp_name_2

final_df= df.withColumn("conflict_names", when(col("emp_name_1")!=col("emp_name_2"), col("emp_name_2")).otherwise(lit(None).cast(StringType())))

Solution 3:[3]

Your last filter condition is incorrect df = df.filter((df.emp_name1 = df.emp_name2)). It should be

df = df.filter((df.name1 != df.name2))

+---+-----+-----+
| id|name1|name2|
+---+-----+-----+
|  2| Mary|Karen|
+---+-----+-----+

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 BeRT2me
Solution 2 preacher
Solution 3 pltc