How do you filter a dataframe to make sure values in two columns differ in PySpark?
I am trying to compare two tables with the same columns and then return the rows whose values conflict. For example, Table A:
| emp_id | emp_name |
|---|---|
| 1 | John |
| 2 | Mary |
Table B:
| emp_id | emp_name |
|---|---|
| 1 | John |
| 2 | Karen |
| 3 | Steve |
In this instance, I want to know that two different names conflict for emp_id 2. I do not care about an entry that appears in only one table, and I do not care about an entry that appears in both tables if the names do not conflict.
So far, my approach was to rename the name columns to emp_name1 and emp_name2, join the tables, and then filter out rows with null values (a null means the emp_id appears in only one table):
df = df.join(df2, how = 'outer', on = ['emp_id'])
#filter out null vals (meaning no conflict)
df = df.filter((df.emp_name1.isNotNull()) &(df.emp_name2.isNotNull()))
The next step would be to compare the values to see if they are the same, but when I try to do this, it does not work:
df = df.filter((df.emp_name1 = df.emp_name2))
Is there a way to compare columns to each other in this way?
Solution 1:[1]
import pandas as pd

# df and df2 are pandas DataFrames here, not Spark DataFrames
combined_df = pd.concat([df, df2])
print(combined_df[combined_df.duplicated()])
Output:
emp_id emp_name
0 1 John
Or, since I'm not exactly sure what you're asking:
print(combined_df[~combined_df.duplicated()])
...
emp_id emp_name
0 1 John
1 2 Mary
1 2 Karen
2 3 Steve
print(combined_df[~combined_df.duplicated(keep=False)])
...
emp_id emp_name
1 2 Mary
1 2 Karen
2 3 Steve
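Solution 1 compares whole rows, so an id that exists in only one table (emp_id 3) still shows up. If the data is in pandas, one way to isolate only the conflicting ids is to merge on emp_id and compare the two name columns. This is a minimal sketch: the frames are rebuilt from the question's tables, and the _1/_2 suffixes are names chosen here for illustration.

```python
import pandas as pd

# Sample data copied from the question's Table A and Table B.
df = pd.DataFrame({"emp_id": [1, 2], "emp_name": ["John", "Mary"]})
df2 = pd.DataFrame({"emp_id": [1, 2, 3], "emp_name": ["John", "Karen", "Steve"]})

# Inner merge keeps only emp_ids present in both frames; the suffixes
# (an illustrative choice) rename the clashing emp_name columns.
merged = df.merge(df2, on="emp_id", how="inner", suffixes=("_1", "_2"))

# Keep only the ids whose two names differ.
conflicts = merged[merged["emp_name_1"] != merged["emp_name_2"]]
print(conflicts)
```

Because the merge is an inner join, emp_id 3 never reaches the comparison step.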
Solution 2:[2]
You can join the tables on emp_id as you have already done, then use a when/otherwise clause to check whether there is a conflict. Assuming the columns after joining are named emp_id, emp_name_1 and emp_name_2:
from pyspark.sql.functions import col, lit, when
final_df = df.withColumn("conflict_names", when(col("emp_name_1") != col("emp_name_2"), col("emp_name_2")).otherwise(lit(None).cast("string")))
Solution 3:[3]
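Your last filter condition, df = df.filter((df.emp_name1 = df.emp_name2)), is incorrect: a single = is assignment in Python, not a comparison, so the expression is a syntax error. To keep the rows where the two names differ, compare with != instead (the columns are called id, name1 and name2 in this example):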
df = df.filter((df.name1 != df.name2))
+---+-----+-----+
| id|name1|name2|
+---+-----+-----+
| 2| Mary|Karen|
+---+-----+-----+
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | BeRT2me |
| Solution 2 | preacher |
| Solution 3 | pltc |
