What's the fastest way to select all rows in one pandas dataframe that do not exist in another?

Beginning with two pandas dataframes of different shapes, what is the fastest way to select all rows in one dataframe that do not exist in the other (or, equivalently, to drop all rows in one dataframe that already exist in the other)? And are the fastest methods different for string-valued columns vs. numeric columns? The operation should be roughly equivalent to the code below.

import pandas as pd

string_df1 = pd.DataFrame({'latin':['a', 'b', 'c'],
                           'greek':['alpha', 'beta', 'gamma']})
string_df2 = pd.DataFrame({'latin':['z', 'c'],
                           'greek':['omega', 'gamma']})

numeric_df1 = pd.DataFrame({'A':[1, 2, 3],
                            'B':[1.01, 2.02, 3.03]})
numeric_df2 = pd.DataFrame({'A':[3, 9],
                            'B':[3.03, 9.09]})

def index_matching_rows(df1, df2, cols_to_match=None):
    '''
    Return the index of the subset of rows of df1 that are equal to at least one row in df2.
    '''
    if cols_to_match is None:
        cols_to_match = df1.columns

    # keep the original index as a column so it survives the merge
    df1 = df1.reset_index()

    # merge on the first column only, then keep candidate pairs
    # whose remaining columns also match
    m = df1.merge(df2, on=cols_to_match[0], suffixes=('1', '2'))
    query = ' & '.join(['{0}1 == {0}2'.format(c) for c in cols_to_match[1:]])
    m = m.query(query)

    return m['index']

print(string_df2.drop(index_matching_rows(string_df2, string_df1)))
print(numeric_df2.drop(index_matching_rows(numeric_df2, numeric_df1)))

output

  latin  greek
0     z  omega

   A     B
1  9  9.09

Some naive performance testing:

copies = 10
big_sdf1 = pd.concat([string_df1, string_df1]*copies)
big_sdf2 = pd.concat([string_df2, string_df2]*copies)
big_ndf1 = pd.concat([numeric_df1, numeric_df1]*copies)
big_ndf2 = pd.concat([numeric_df2, numeric_df2]*copies)
%%timeit
big_sdf2.drop(index_matching_rows(big_sdf2, big_sdf1))
# copies =  10:      2.61 ms   ±  27.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies =  20:      4.44 ms   ±  43.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies =  30:     18.4  ms   ± 132   µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies =  40:     74.6  ms   ± 453   µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 100: 19.2       s   ± 112   ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
big_ndf2.drop(index_matching_rows(big_ndf2, big_ndf1))
# copies = 10:  2.56 ms ±    29.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 20:  4.38 ms ±    75.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 30: 18.3  ms ±   194   µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 40: 76.5  ms ± 1.76    ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

[Plot: timing data points vs. number of copies, with an exponential fit y = 1.6*exp(0.094x)]

This code runs about as quickly for strings as for numeric data, and it appears to grow exponentially with the length of the dataframe (the curve above, 1.6*exp(0.094x), was fit to the string timings). I'm working with dataframes on the order of 1e5 rows, so this is not a workable solution for me.
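For reference, here is one way such a fit can be produced; this is a sketch, not necessarily how the plot above was generated, and it assumes scipy is available. The data points are the string-dataframe timings listed in the comments above.

import numpy as np
from scipy.optimize import curve_fit

# mean %%timeit results for the string dataframes, in milliseconds
copies = np.array([10, 20, 30, 40, 100])
time_ms = np.array([2.61, 4.44, 18.4, 74.6, 19200.0])

def exp_model(x, a, b):
    return a * np.exp(b * x)

(a, b), _ = curve_fit(exp_model, copies, time_ms, p0=(1.0, 0.1))
print(a, b)  # roughly a = 1.6, b = 0.094 for this data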


Here's the same performance check for Raymond Kwok's (accepted) answer below in case anyone can beat it later. It's O(n).

%%timeit
big_sdf1_tuples = big_sdf1.apply(tuple, axis=1)
big_sdf2_tuples = big_sdf2.apply(tuple, axis=1)
big_sdf2_tuples.isin(big_sdf1_tuples)
# copies =  100:     4.82  ms ±     22   µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1000:    44.6   ms ±    386   µs per loop (mean ± std. dev. of 7 runs,  10 loops each)
# copies =  1e4:   450     ms ±  9.44    ms per loop (mean ± std. dev. of 7 runs,   1 loop each)
# copies =  1e5: 4.42       s ± 27.6     ms per loop (mean ± std. dev. of 7 runs,   1 loop each)
%%timeit
big_ndf1_tuples = big_ndf1.apply(tuple, axis=1)
big_ndf2_tuples = big_ndf2.apply(tuple, axis=1)
big_ndf2_tuples.isin(big_ndf1_tuples)
# copies =  100:     4.98 ms ±     28.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1000:    47    ms ±    288   µs per loop (mean ± std. dev. of 7 runs,  10 loops each)
# copies =  1e4:   461    ms ±  4.41    ms per loop (mean ± std. dev. of 7 runs,   1 loop each)
# copies =  1e5: 4.58      s ± 30.1     ms per loop (mean ± std. dev. of 7 runs,   1 loop each)

Indexing into the longest dataframe with

big_sdf2_tuples.loc[~big_sdf2_tuples.isin(big_sdf1_tuples)]

to recover the equivalent of the output in my code above adds about 10 ms.
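If you want the matching rows of the original dataframe back, rather than the tuple series, the same boolean mask can be applied to big_sdf2 itself; a small sketch (not part of the timing above):

# build the mask once, then index the original dataframe with it
mask = ~big_sdf2_tuples.isin(big_sdf1_tuples)
big_sdf2.loc[mask.to_numpy()]  # .to_numpy() sidesteps index alignment on the duplicated index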



Solution 1:[1]

Beginning with two dataframes:

df1 = pd.DataFrame({'Runner': ['A', 'A', 'A', 'A'],
                    'Day': ['1', '3', '8', '9'],
                    'Miles': ['3', '4', '4', '2']})

df2 = df1.copy().drop([1, 3])

where the second has two fewer rows.

We can hash the rows:

df1_hashed = df1.apply(tuple, axis=1).apply(hash)
df2_hashed = df2.apply(tuple, axis=1).apply(hash)

and trust, as most people would, that two different rows are extremely unlikely to end up with the same hashed value,

and get rows from df1 that do not exist in df2:

df1[~df1_hashed.isin(df2_hashed)]

  Runner Day Miles
1      A   3     4
3      A   9     2

As for the speed difference between strings and integers, I am sure you can test it with your real data.

Note 1: you may actually remove .apply(hash) from both lines.
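In other words, the tuple-only variant would look something like this (a sketch using the df1/df2 defined above; isin's membership test is hash-based internally anyway):

# same lookup without the explicit hash step
df1_tuples = df1.apply(tuple, axis=1)
df2_tuples = df2.apply(tuple, axis=1)
df1[~df1_tuples.isin(df2_tuples)]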

Note 2: check out the answer to this question for more on isin and the use of hash.

Solution 2:[2]

pandas has a built-in hashing utility, pd.util.hash_pandas_object, that's more than an order of magnitude faster than building series of tuples:

%%timeit
big_sdf1_hashed = pd.util.hash_pandas_object(big_sdf1)
big_sdf2_hashed = pd.util.hash_pandas_object(big_sdf2)
big_sdf1.loc[~big_sdf1_hashed.isin(big_sdf2_hashed)]
# copies =  100:      1.05 ms ±      9.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# copies = 1000:      1.99 ms ±     28.6  µs per loop (mean ± std. dev. of 7 runs,  100 loops each)
# copies =  1e4:     10.5  ms ±     47.5  µs per loop (mean ± std. dev. of 7 runs,  100 loops each)
# copies =  1e5:    126    ms ±    747    µs per loop (mean ± std. dev. of 7 runs,   10 loops each)
# copies =  1e7: 14.1       s ± 78.2      ms per loop (mean ± std. dev. of 7 runs,    1 loop each)
%%timeit
big_ndf1_hashed = pd.util.hash_pandas_object(big_ndf1)
big_ndf2_hashed = pd.util.hash_pandas_object(big_ndf2)
big_ndf1.loc[~big_ndf1_hashed.isin(big_ndf2_hashed)]
# copies =  100:    496 µs ±  12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# copies = 1000:    772 µs ±  12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# copies =  1e4:  3.88  ms ± 129   µs per loop (mean ± std. dev. of 7 runs,  100 loops each)
# copies =  1e5: 67.5   ms ± 775   µs per loop (mean ± std. dev. of 7 runs,   10 loops each)

Note that the difference in performance comes from creating the objects to be compared, not from searching the resulting series. For copies = int(1e5):

%%timeit
big_ndf1_hashed = pd.util.hash_pandas_object(big_ndf1)
# 25 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
big_ndf1_tuples = big_ndf1.apply(tuple, axis=1)
# 2.53 s ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The hashed series is also about three times smaller on disk than the tuple series (9 MB vs. 33 MB).
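A quick way to reproduce that comparison is to pickle both series and compare file sizes; a sketch with hypothetical file names, so the exact numbers will depend on how the figures above were measured:

import os

big_ndf1_hashed.to_pickle('hashed.pkl')   # uint64 hash series
big_ndf1_tuples.to_pickle('tuples.pkl')   # object series of tuples
print(os.path.getsize('hashed.pkl') / 1e6, 'MB')
print(os.path.getsize('tuples.pkl') / 1e6, 'MB')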

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
