'Pandas better method to compare two dataframes and find entries that only exist in one
I have a DataFrame/spreadsheet that has columns for employee info (name, worksite) and also columns for total hours worked. My main goal is to find employees that exist in one file but not the other.
DataFrame ORIGINAL:
Name Site ....other columns
Anne A
Bob B
Charlie A
Dataframe NEW:
Name Site ....other columns
Anne A
Bob B
Doug B
DataFrame NEW is very similar to ORIGINAL with a few differences and these are the details I am looking to show
- Charlie/A was only in
ORIGINAL - Doug/B was only in
NEW
I found this solution, which works okay but I need to perform it twice to find records in one DataFrame and not the other, and then again but vice-versa.
Here is the code I have:
COLS = ['Name','Site'] # Columns to group by to find a 'unique' record
# Records in New, not in Original
df_right = ORIGINAL.merge(NEW.drop_duplicates(), on=COLS, how='right', indicator=True)
df_right = df_right[df_right._merge != 'both'] # Filter out records that exist in both.
# Records in Original, not in New
df_left = ORIGINAL.merge(NEW.drop_duplicates(), on=COLS, how='left', indicator=True)
df_left = df_left[df_left._merge != 'both']
df = pd.concat([df_left,df_right])
# df now contains Name/Site records that exist in one DataFrame but not the other
Is there a better way to perform this check instead of doing it twice and concatenating?
Solution 1:[1]
Looks like using 'outer' as the how was the solution
z = pd.merge(ORIGINAL, NEW, on=cols, how = 'outer', indicator=True)
z = z[z._merge != 'both'] # Filter out records from both
Output looks like this (after only showing the columns I care about)
Name Site _merge
Charlie A left_only
Doug B right_only
Solution 2:[2]
You can actually convert the dataframes into Indexes, and then simply use isin to check if the whole rows are in the other dataframe:
cols = ['Name', 'Site']
originalI = pd.Index(ORIGINAL[cols])
newI = pd.Index(NEW[cols])
out = pd.concat([
ORIGINAL[~originalI.isin(newI)].assign(from_df='ORIGINAL'),
NEW[~newI.isin(originalI)].assign(from_df='NEW'),
])
Output:
>>> out
Name Site from_df
2 Charlie A ORIGINAL
2 Doug B NEW
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Bijan |
| Solution 2 | richardec |
