How to filter rows from a dataframe depending on the contents of other rows?
Say I have a dataframe defined as
pd.DataFrame({'col1': ['foo', '', '', 'foo', 'quux', 'baz', 'baz', 'baz'],
              'col2': ['', 'gb', '', 'de', 'gb', '', 'es', 'es'],
              'col3': [123, float("NaN"), 456, 723, 456, 123, 123, 721],
              'col4': ['', '', 'val1', 'val2', 'val3', '', 'val4', 'val5'],
              'value': [1, 1, .4, .5, .3, 1, .5, .4]})
Which looks like
| index | col1 | col2 | col3 | col4 | value |
|---|---|---|---|---|---|
| 0 | foo | | 123.0 | | 1.0 |
| 1 | | gb | NaN | | 1.0 |
| 2 | | | 456.0 | val1 | 0.4 |
| 3 | foo | de | 723.0 | val2 | 0.5 |
| 4 | quux | gb | 456.0 | val3 | 0.3 |
| 5 | baz | | 123.0 | | 1.0 |
| 6 | baz | es | 123.0 | val4 | 0.5 |
| 7 | baz | es | 721.0 | val5 | 0.4 |
I would like to filter this table to remove any rows where value equals 1.0, and also any rows that have the same values in the populated columns as those value == 1.0 rows. So in the above table, we would remove rows 0, 1, and 5 because value == 1.0, and also remove row 3 because col1 == 'foo', row 4 because col2 == 'gb', and row 6 because col1 == 'baz' AND col3 == 123. Rows 2 and 7 should be retained.
| index | col1 | col2 | col3 | col4 | value |
|---|---|---|---|---|---|
| 2 | | | 456.0 | val1 | 0.4 |
| 7 | baz | es | 721.0 | val5 | 0.4 |
What's the best way to do this? I could find all the rows where the value==1.0 and then iterate through them and filter out all the rows from the table that have the same values in the set columns, but iterating through dataframe rows isn't ideal. I also thought of doing a merge, but I'm also not sure how to tell a merge to ignore columns where there is no value set.
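One way to express the rule without hand-picking per-column value sets is to loop over just the value == 1.0 rows and build a vectorized mask per reference row. This is a sketch of my own (not taken from the answers below), and it assumes a strict reading of the rule: a row is removed only when it matches *every* populated column of some value == 1.0 row. Note that under that strict reading row 3 survives, since its col3 (723.0) differs from row 0's 123.0; matching on *any* shared populated column instead would also remove row 7.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['foo', '', '', 'foo', 'quux', 'baz', 'baz', 'baz'],
                   'col2': ['', 'gb', '', 'de', 'gb', '', 'es', 'es'],
                   'col3': [123, float("NaN"), 456, 723, 456, 123, 123, 721],
                   'col4': ['', '', 'val1', 'val2', 'val3', '', 'val4', 'val5'],
                   'value': [1, 1, .4, .5, .3, 1, .5, .4]})

cols = ['col1', 'col2', 'col3', 'col4']
refs = df[df['value'].eq(1.0)]   # reference rows, removed outright
drop = df['value'].eq(1.0)       # start the drop mask with those rows

for _, ref in refs.iterrows():
    # columns actually populated in this reference row
    populated = [c for c in cols if pd.notna(ref[c]) and ref[c] != '']
    # rows matching the reference on every populated column
    match = pd.Series(True, index=df.index)
    for c in populated:
        match &= df[c].eq(ref[c])
    drop |= match

result = df[~drop]
print(result.index.tolist())  # [2, 3, 7] under this strict reading
```

The outer loop only visits the (usually few) value == 1.0 rows; the per-row matching itself stays vectorized.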
Solution 1:[1]
Let us do
cond = df.loc[df.value == 1]
filtered = df[~(df.col1.isin(cond.col1[cond.col1 != '']) |
                df.col2.isin(cond.col2[cond.col2 != '']))]
filtered
Out[443]:
  col1 col2   col3  col4  value
2            456.0  val1    0.4
Solution 2:[2]
I'd suggest doing a treatment per column.
import numpy as np

# First, get the rows where value is 1.
temp = df.query('value == 1')
# Then collect all unique non-empty values from the columns of interest.
vals1 = temp.col1[temp.col1.ne('')].unique()
vals2 = temp.col2[temp.col2.ne('')].unique()
# Finally, filter.
df.loc[~(np.isin(df.col1, vals1) | np.isin(df.col2, vals2))]
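Run against the eight-row frame from the question (rebuilt below so the snippet is self-contained), this per-column treatment keeps only row 2 — note that row 7 is dropped as well, because its col1 matches the 'baz' collected from row 5:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['foo', '', '', 'foo', 'quux', 'baz', 'baz', 'baz'],
                   'col2': ['', 'gb', '', 'de', 'gb', '', 'es', 'es'],
                   'col3': [123, float("NaN"), 456, 723, 456, 123, 123, 721],
                   'col4': ['', '', 'val1', 'val2', 'val3', '', 'val4', 'val5'],
                   'value': [1, 1, .4, .5, .3, 1, .5, .4]})

temp = df.query('value == 1')
vals1 = temp.col1[temp.col1.ne('')].unique()   # ['foo', 'baz']
vals2 = temp.col2[temp.col2.ne('')].unique()   # ['gb']
out = df.loc[~(np.isin(df.col1, vals1) | np.isin(df.col2, vals2))]
print(out.index.tolist())  # [2] -- row 7 goes too, via col1 == 'baz'
```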
Solution 3:[3]
I usually go with boolean slicing with numpy, as this is straightforward and (for me) the most readable:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ['foo', '', '', 'foo', 'quux'],
                   'col2': ['', 'gb', '', 'de', 'gb'],
                   'col3': [123, float("NaN"), 456, 723, 456],
                   'col4': ['', '', 'val1', 'val2', 'val3'],
                   'value': [1, 1, .4, .5, .3]})
target = pd.Series({'value': 1.0, 'col1': 'foo', 'col2': 'gb'})
# determine which rows meet the target specifications
lg = np.all(df[target.index] == target, axis=1)
# using slicing
df_filtered = df[~lg]
# or, equivalently, using drop
df_filtered = df.drop(lg[lg].index)
The good thing about this is that you are flexible with regard to how to proceed with the logical vector lg or the interesting indices lg[lg].index =)
Solution 4:[4]
You can do:
import numpy as np

s = set(filter(lambda x: len(str(x)) > 0,
               np.ravel(df.loc[df['value'].eq(1.0)].fillna('')[['col1', 'col2']].values)))
df = df[~(df['col1'].isin(s) | df['col2'].isin(s))]
Solution 5:[5]
This should do the job:
eq1 = df[df['value'].eq(1)].replace('', float("NaN"))
df[~df.apply(lambda x: (eq1 == x).any(axis=None), axis=1)]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | BENY |
| Solution 2 | rafaelc |
| Solution 3 | max |
| Solution 4 | |
| Solution 5 | |
