Very slow filtering with multiple conditions in pandas dataframe
UPDATE: I have edited the question (and code) to make the problem clearer. I use synthetic data here, but imagine a large df of floods and a small one of significant floods. I want to add a reference to every row (of the large df) that is reasonably close to a significant flood.
I have 2 pandas dataframes (1 large and 1 small). In every iteration I want to create a subset of the small dataframe based on a few conditions that depend on the current row of the large df:
```python
import numpy as np
import pandas as pd
import time

SOME_THRESHOLD = 10.5
NUMBER_OF_ROWS = int(2e4)

large_df = pd.DataFrame(index=np.arange(NUMBER_OF_ROWS), data={'a': np.arange(NUMBER_OF_ROWS)})
small_df = large_df.loc[np.random.randint(0, NUMBER_OF_ROWS, 5)]
large_df['similar_past_flood_tag'] = None

count_time = 0
for ind, row in large_df.iterrows():
    start = time.time()
    # This line takes forever.
    df_tmp = small_df[(small_df.index < ind) &
                      (small_df['a'] > (row['a'] - SOME_THRESHOLD)) &
                      (small_df['a'] < (row['a'] + SOME_THRESHOLD))]
    count_time += time.time() - start
    if not df_tmp.empty:
        past_index = df_tmp.loc[df_tmp.index.max()]['a']
        large_df.loc[ind, 'similar_past_flood_tag'] = f'Similar to the large flood of {past_index}'

print(f'The total time of creating the subset df for 2e4 rows is: {count_time} seconds.')
```
The line that creates the subset takes a long time to compute:
```
The total time of creating the subset df for 2e4 rows is: 18.276793956756592 seconds.
```
This seems far too long to me. I have found similar questions, but none of the answers seemed to help (e.g. `query` and numpy conditions). Is there a way to optimize this?
Note: the code does what is expected - just very slow.
Solution 1:[1]
While your code is logically correct, building the boolean masks and slicing the DataFrame on every iteration adds up to a noticeable amount of time.
Here are some stats with %timeit:
- `(small_df.index<ind)`: ~30 µs
- `(small_df['a']>(row['a']-SOME_THRESHOLD))`: ~100 µs
- `(small_df['a']<(row['a']+SOME_THRESHOLD))`: ~100 µs
- After `&`-ing all three: ~500 µs
- Including the DataFrame slice: ~700 µs
That, multiplied by 20K iterations, is indeed ~14 seconds. :)
What you could do is take advantage of numpy broadcasting to compute the boolean matrix more efficiently, and then reconstruct the "valid" DataFrame. See below:
```python
l_ind = np.array(large_df.index)
s_ind = np.array(small_df.index)
l_a = np.array(large_df.a)
s_a = np.array(small_df.a)

# Broadcast the comparisons: rows correspond to large_df, columns to small_df.
arr1 = (l_ind[:, None] < s_ind[None, :])
arr2 = (((l_a[:, None] - SOME_THRESHOLD) < s_a[None, :]) &
        (s_a[None, :] < (l_a[:, None] + SOME_THRESHOLD)))
arr = arr1 & arr2

# Positions where all conditions hold.
large_valid_inds, small_valid_inds = np.where(arr)
pd.DataFrame({'large_ind': np.take(l_ind, large_valid_inds),
              'small_ind': np.take(s_ind, small_valid_inds)})
```
That gives you the following DF, which, if I understood the question properly, is the expected solution:
| | large_ind | small_ind |
|---|---|---|
| 0 | 1621 | 1631 |
| 1 | 1622 | 1631 |
| 2 | 1623 | 1631 |
| 3 | 1624 | 1631 |
| 4 | 1625 | 1631 |
| 5 | 1626 | 1631 |
| 6 | 1627 | 1631 |
| 7 | 1628 | 1631 |
| 8 | 1629 | 1631 |
| 9 | 1630 | 1631 |
| 10 | 1992 | 2002 |
| 11 | 1993 | 2002 |
| 12 | 1994 | 2002 |
| 13 | 1995 | 2002 |
| 14 | 1996 | 2002 |
| 15 | 1997 | 2002 |
| 16 | 1998 | 2002 |
| 17 | 1999 | 2002 |
| 18 | 2000 | 2002 |
| 19 | 2001 | 2002 |
| 20 | 8751 | 8761 |
| 21 | 8752 | 8761 |
| 22 | 8753 | 8761 |
| 23 | 8754 | 8761 |
| 24 | 8755 | 8761 |
| 25 | 8756 | 8761 |
| 26 | 8757 | 8761 |
| 27 | 8758 | 8761 |
| 28 | 8759 | 8761 |
| 29 | 8760 | 8761 |
| 30 | 10516 | 10526 |
| 31 | 10517 | 10526 |
| 32 | 10518 | 10526 |
| 33 | 10519 | 10526 |
| 34 | 10520 | 10526 |
| 35 | 10521 | 10526 |
| 36 | 10522 | 10526 |
| 37 | 10523 | 10526 |
| 38 | 10524 | 10526 |
| 39 | 10525 | 10526 |
| 40 | 18448 | 18458 |
| 41 | 18449 | 18458 |
| 42 | 18450 | 18458 |
| 43 | 18451 | 18458 |
| 44 | 18452 | 18458 |
| 45 | 18453 | 18458 |
| 46 | 18454 | 18458 |
| 47 | 18455 | 18458 |
| 48 | 18456 | 18458 |
| 49 | 18457 | 18458 |
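
If this pair table indeed captures the matches you are after, a hypothetical follow-up (not part of the original answer; `pairs`, `best` and `matched_a` are names introduced here for illustration) could map the pairs back onto `large_df` to rebuild the tag column from the question:

```python
# Hypothetical follow-up: store the pairs and rebuild the tag column.
pairs = pd.DataFrame({'large_ind': np.take(l_ind, large_valid_inds),
                      'small_ind': np.take(s_ind, small_valid_inds)})

# For each large-df row keep the match with the highest small-df index,
# mirroring df_tmp.index.max() in the original loop.
best = pairs.groupby('large_ind')['small_ind'].max()

# Look up the matched 'a' value and write the tag (assumes small_df's index
# labels are unique, as they are in this synthetic example).
matched_a = small_df.loc[best.values, 'a'].values
large_df.loc[best.index, 'similar_past_flood_tag'] = [
    f'Similar to the large flood of {a}' for a in matched_a
]
```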
Solution 2:[2]
In pandas, for-loops are much slower than column (vectorized) operations. So changing the calculation to loop over small_df instead of large_df will already give a big improvement:
```python
for ind, row in small_df.iterrows():
    df_tmp = large_df[ <some condition> ]
    # ... some other processing
```
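For concreteness, a possible fill-in of that loop with the question's conditions might look like the sketch below (illustrative only, not the answer's own code):

```python
# Hypothetical fill-in of the loop above, using the question's conditions.
# Iterate the few small_df rows in ascending index order so that, when a
# large_df row matches several floods, the one with the largest index wins
# (mirroring df_tmp.index.max() in the question).
for ind, row in small_df.sort_index().iterrows():
    mask = ((large_df.index > ind) &
            (large_df['a'] > row['a'] - SOME_THRESHOLD) &
            (large_df['a'] < row['a'] + SOME_THRESHOLD))
    large_df.loc[mask, 'similar_past_flood_tag'] = f"Similar to the large flood of {row['a']}"
```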
Even better for your case is to use a merge rather than a condition on large_df. The problem is that your merge is not on equal columns but on approximately equal values. To use this approach, you can truncate the column into buckets and merge on the bucket. Here's a hacky example.
```python
# Bucket 'a' into bins of width 2 * SOME_THRESHOLD and merge on the bucket.
small_df['a_rounded'] = (small_df['a'] / SOME_THRESHOLD / 2).astype(int)
large_df['a_rounded'] = (large_df['a'] / SOME_THRESHOLD / 2).astype(int)
merge_result = small_df.merge(large_df, on='a_rounded')

# Repeat with the bins shifted by half a bin width, to also catch pairs that
# are close together but fall on either side of a bin boundary.
small_df['a_rounded2'] = ((small_df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)
large_df['a_rounded2'] = ((large_df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)
merge_result2 = small_df.merge(large_df, on='a_rounded2')

total_merge_result = pd.concat([merge_result, merge_result2])
# Now remove duplicates and impose additional filters.
```
You can impose the additional filters on the result later.
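For instance, a rough sketch of that post-filtering step, assuming the original row labels were kept as columns before merging (e.g. via `reset_index()`, so they appear here as `index_x`/`index_y`; the duplicated `a` columns get pandas' default `_x`/`_y` suffixes):

```python
# Sketch only: 'index_x'/'index_y' assume the frames were merged via
# small_df.reset_index().merge(large_df.reset_index(), ...); adjust the
# names to your actual columns.
filtered = total_merge_result[
    (total_merge_result['index_x'] < total_merge_result['index_y']) &
    ((total_merge_result['a_x'] - total_merge_result['a_y']).abs() < SOME_THRESHOLD)
]

# The two rounding schemes can match the same pair twice; keep one copy,
# then, as in the question, keep the largest small-df index per large row.
filtered = filtered.drop_duplicates(subset=['index_x', 'index_y'])
best = filtered.groupby('index_y')['index_x'].max()
```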
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Zach Moshe |
| Solution 2 | ErnestScribbler |
