'Non-matching rows between two DataFrames
I am new to Python and would like to seek you help on this please. I would like to find out the non-matching rows between 2 dataframes i.e. df1 and df2 with thousands of rows. They both contain the same number of columns with the same name.
df2 has 10 entries lesser than df1 which I am trying to find out what they are.
I have tried pd.concat([df1,df2]).drop_duplicates(keep=False) but it returns zero result.
What could be the reason? Any help/advice would be much appreciated. Thanks a lot.
Solution 1:[1]
The following code will remove rows in df1 that are present in df2
df1[~df1.isin(df2)]
Solution 2:[2]
concat combines two frames. You're trying to find the difference between two frames. This can be done with compare. As the doc example shows, given these two frames:
df = pd.DataFrame(
{
"col1": ["a", "a", "b", "b", "a"],
"col2": [1.0, 2.0, 3.0, np.nan, 5.0],
"col3": [1.0, 2.0, 3.0, 4.0, 5.0]
},
columns=["col1", "col2", "col3"],
)
df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0
You can find the different rows with compare :
df.compare(df2)
col1 col3
self other self other
0 a c NaN NaN
2 NaN NaN 3.0 4.0
In this case compare returns one row for every row that has a difference and only the columns that are actually different.
compare can return the equal values or all original values as well:
df.compare(df2, keep_equal=True)
col1 col3
self other self other
0 a c 1.0 1.0
2 b b 3.0 4.0
or
df.compare(df2, keep_shape=True, keep_equal=True)
col1 col2 col3
self other self other self other
0 a c 1.0 1.0 1.0 1.0
1 a a 2.0 2.0 2.0 2.0
2 b b 3.0 3.0 3.0 4.0
3 b b NaN NaN 4.0 4.0
4 a a 5.0 5.0 5.0 5.0
Solution 3:[3]
Just because you have 10 less entries doesn't mean you're going to find duplicates. You probably already have duplicates inside the first dataframe.
Demo:
# Length: 7
df1 = pd.DataFrame({'col1': list('AAABCDE'),
'col2': list('FFFGHIJ'),
'col3': list('1112345')})
# Length: 5
df2 = pd.DataFrame({'col1': list('ABCDE'),
'col2': list('FGHIJ'),
'col3': list('12345')})
Your code:
>>> pd.concat([df1,df2]).drop_duplicates(keep=False)
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
Try:
>>> len(df1.drop_duplicates())
5
>>> len(df2.drop_duplicates())
5
Solution 4:[4]
Assuming both df1 and df2 are Pandas Dataframe, the following code returns True for matching rows and false for the other:
print((df1 == df2).any(1))
If needed to check each and every columns in all the rows, try this:
print((df1 == df2).stack())
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | kelvt |
| Solution 2 | Panagiotis Kanavos |
| Solution 3 | Corralien |
| Solution 4 | Chithambara Karthikeyan |
