What's the fastest way to select all rows in one pandas dataframe that do not exist in another?
Beginning with two pandas dataframes of different shapes, what is the fastest way to select all rows in one dataframe that do not exist in the other (or, equivalently, to drop all rows in one dataframe that already exist in the other)? And are the fastest methods different for string-valued columns vs. numeric columns? The operation should be roughly equivalent to the code below.
import pandas as pd

string_df1 = pd.DataFrame({'latin': ['a', 'b', 'c'],
                           'greek': ['alpha', 'beta', 'gamma']})
string_df2 = pd.DataFrame({'latin': ['z', 'c'],
                           'greek': ['omega', 'gamma']})
numeric_df1 = pd.DataFrame({'A': [1, 2, 3],
                            'B': [1.01, 2.02, 3.03]})
numeric_df2 = pd.DataFrame({'A': [3, 9],
                            'B': [3.03, 9.09]})
def index_matching_rows(df1, df2, cols_to_match=None):
    '''
    return index of subset of rows of df1 that are equal to at least one row in df2
    '''
    if cols_to_match is None:
        cols_to_match = df1.columns
    df1 = df1.reset_index()
    m = df1.merge(df2, on=cols_to_match[0], suffixes=('1', '2'))
    query = '&'.join(['{0}1 == {0}2'.format(str(c)) for c in cols_to_match[1:]])
    m = m.query(query)
    return m['index']
print(string_df2.drop(index_matching_rows(string_df2, string_df1)))
print(numeric_df2.drop(index_matching_rows(numeric_df2, numeric_df1)))
Output:
latin greek
0 z omega
A B
1 9 9.09
Some naive performance testing:
copies = 10
big_sdf1 = pd.concat([string_df1, string_df1]*copies)
big_sdf2 = pd.concat([string_df2, string_df2]*copies)
big_ndf1 = pd.concat([numeric_df1, numeric_df1]*copies)
big_ndf2 = pd.concat([numeric_df2, numeric_df2]*copies)
%%timeit
big_sdf2.drop(index_matching_rows(big_sdf2, big_sdf1))
# copies = 10: 2.61 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 20: 4.44 ms ± 43.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 30: 18.4 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 40: 74.6 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 100: 19.2 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
big_ndf2.drop(index_matching_rows(big_ndf2, big_ndf1))
# copies = 10: 2.56 ms ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 20: 4.38 ms ± 75.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 30: 18.3 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 40: 76.5 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This code runs about as quickly for strings as for numeric data, and I think the runtime grows exponentially with the length of the dataframe (the string timings above fit 1.6*exp(0.094x)). I'm working with dataframes on the order of 1e5 rows, so this is not a solution for me.
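For completeness, here is a minimal sketch of how that exponential fit can be reproduced, assuming scipy is available (the timing values are the string-data means from the comments above):

import numpy as np
from scipy.optimize import curve_fit

# string-data timings from the comments above: (copies, mean ms per loop)
copies_tested = np.array([10, 20, 30, 40])
times_ms = np.array([2.61, 4.44, 18.4, 74.6])

def exp_model(x, a, b):
    return a * np.exp(b * x)

(a, b), _ = curve_fit(exp_model, copies_tested, times_ms, p0=(1.0, 0.1))
print(a, b)  # roughly 1.6 and 0.094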
Here's the same performance check for Raymond Kwok's (accepted) answer below in case anyone can beat it later. It's O(n).
%%timeit
big_sdf1_tuples = big_sdf1.apply(tuple, axis=1)
big_sdf2_tuples = big_sdf2.apply(tuple, axis=1)
big_sdf2_tuples.isin(big_sdf1_tuples)
# copies = 100: 4.82 ms ± 22 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1000: 44.6 ms ± 386 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 1e4: 450 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# copies = 1e5: 4.42 s ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
big_ndf1_tuples = big_ndf1.apply(tuple, axis=1)
big_ndf2_tuples = big_ndf2.apply(tuple, axis=1)
big_ndf2_tuples.isin(big_ndf1_tuples)
# copies = 100: 4.98 ms ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1000: 47 ms ± 288 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 1e4: 461 ms ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# copies = 1e5: 4.58 s ± 30.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Indexing into the longest dataframe with
big_sdf2_tuples.loc[~big_sdf2_tuples.isin(big_sdf1_tuples)]
to recover the equivalent of the output in my code above adds about 10 ms.
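Put together end to end on the small example frames from the question, the accepted approach looks like this (a sketch; the result matches the output shown earlier):

string_df1_tuples = string_df1.apply(tuple, axis=1)
string_df2_tuples = string_df2.apply(tuple, axis=1)

# keep rows of string_df2 whose (latin, greek) pair does not appear in string_df1
print(string_df2.loc[~string_df2_tuples.isin(string_df1_tuples)])
#   latin  greek
# 0     z  omega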
Solution 1:[1]
Beginning with two dataframes:
df1 = pd.DataFrame({'Runner': ['A', 'A', 'A', 'A'],
                    'Day': ['1', '3', '8', '9'],
                    'Miles': ['3', '4', '4', '2']})
df2 = df1.copy().drop([1, 3])
where the second has two fewer rows.
We can hash the rows:
df1_hashed = df1.apply(tuple, axis=1).apply(hash)
df2_hashed = df2.apply(tuple, axis=1).apply(hash)
and trust, as most people reasonably do, that two different rows are extremely unlikely to produce the same hash value. We can then get the rows of df1 that do not exist in df2:
df1[~df1_hashed.isin(df2_hashed)]
Runner Day Miles
1 A 3 4
3 A 9 2
As for the speed difference between string and integer columns, you can test it with your real data.
Note 1: you may actually remove .apply(hash) from both lines.
Note 2: check out the answer to this question for more on isin and the use of hash.
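If it helps, the two hashing lines and the isin lookup can be wrapped in a small helper (a sketch; the name rows_not_in is mine, not from the answer):

def rows_not_in(left, right, cols=None):
    # Return the rows of `left` whose values in `cols` do not appear in `right`,
    # comparing rows by hashing each one as a tuple, as in the answer above.
    if cols is None:
        cols = list(left.columns)
    left_hashed = left[cols].apply(tuple, axis=1).apply(hash)
    right_hashed = right[cols].apply(tuple, axis=1).apply(hash)
    return left.loc[~left_hashed.isin(right_hashed)]

# e.g. rows_not_in(df1, df2) reproduces the two-row result shown above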
Solution 2:[2]
pandas has a built-in hashing utility that is more than an order of magnitude faster than building series of tuples:
%%timeit
big_sdf1_hashed = pd.util.hash_pandas_object(big_sdf1)
big_sdf2_hashed = pd.util.hash_pandas_object(big_sdf2)
big_sdf1.loc[~big_sdf1_hashed.isin(big_sdf2_hashed)]
# copies = 100: 1.05 ms ± 9.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# copies = 1000: 1.99 ms ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1e4: 10.5 ms ± 47.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1e5: 126 ms ± 747 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 1e7: 14.1 s ± 78.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
big_ndf1_hashed = pd.util.hash_pandas_object(big_ndf1)
big_ndf2_hashed = pd.util.hash_pandas_object(big_ndf2)
big_ndf1.loc[~big_ndf1_hashed.isin(big_ndf2_hashed)]
# copies = 100: 496 µs ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# copies = 1000: 772 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# copies = 1e4: 3.88 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1e5: 67.5 ms ± 775 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that the difference in performance comes from creating the objects to be compared, rather than from the isin lookup itself. For copies = int(1e5):
%%timeit
big_ndf1_hashed = pd.util.hash_pandas_object(big_ndf1)
# 25 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
big_ndf1_tuples = big_ndf1.apply(tuple, axis=1)
# 2.53 s ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The hashed series is also less than a third of the size of the tuple series on disk (9 MB vs. 33 MB).
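One caveat worth noting: by default pd.util.hash_pandas_object also hashes the index, so two rows with identical values but different index labels get different hashes. To compare on row contents alone, pass index=False, e.g.:

# hash row contents only, ignoring the index, so identical rows always match
big_sdf1_hashed = pd.util.hash_pandas_object(big_sdf1, index=False)
big_sdf2_hashed = pd.util.hash_pandas_object(big_sdf2, index=False)
big_sdf1.loc[~big_sdf1_hashed.isin(big_sdf2_hashed)]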
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |

