Find changed rows (subset of rows) in two dataframes
I have two CSV files that I receive daily, each containing two columns: one a unique identifier and the other a value. I am trying to compare the values for every identifier and keep the rows for which the value has changed.
Id1 Value_today
abc 500
def 600
ghi 700
jkl 800
and
Id1 Value_Yesterday
abc 500
def 650
ghi 750
jkl 800
mno 900
The desired output would be
Id1 Value
def 650
ghi 750
mno 900
So far I have tried:
df4 = df1.merge(df2, how='outer', on='Id1')
df4['check'] = df4.Value_today == df4.Value_Yesterday
df4 = df4[df4['check'] == False]
But this returns an empty dataframe. Am I doing something wrong?
Solution 1:[1]
I'd do it this way:
In [11]: d1
Out[11]:
Id1 Value
0 abc 500
1 def 600
2 ghi 700
3 jkl 800
In [12]: d2
Out[12]:
Id1 Value
0 abc 500
1 def 650
2 ghi 750
3 jkl 800
4 mno 900
In [13]:
r = pd.concat([d1.assign(x='old'), d2.assign(x='new')], ignore_index=True) \
.drop_duplicates(subset=['Id1','Value'], keep=False) \
.set_index(['Id1','x']) \
.unstack() \
.reset_index()
r.columns = [col[1] if col[1] else col[0] for col in r.columns]
Result:
In [16]: r
Out[16]:
Id1 new old
0 def 650.0 600.0
1 ghi 750.0 700.0
2 mno 900.0 NaN
In [17]: r[['Id1','old','new']]
Out[17]:
Id1 old new
0 def 600.0 650.0
1 ghi 700.0 750.0
2 mno NaN 900.0
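For reference, the two-column recipe above can be run end-to-end as a self-contained script; the frames are rebuilt here from the question's sample data:

```python
import pandas as pd

# Sample frames from the question (yesterday's file also contains 'mno')
d1 = pd.DataFrame({'Id1': ['abc', 'def', 'ghi', 'jkl'],
                   'Value': [500, 600, 700, 800]})
d2 = pd.DataFrame({'Id1': ['abc', 'def', 'ghi', 'jkl', 'mno'],
                   'Value': [500, 650, 750, 800, 900]})

# Tag each frame, drop rows that appear unchanged in both,
# then pivot the old/new tag back out into columns
r = (pd.concat([d1.assign(x='old'), d2.assign(x='new')], ignore_index=True)
       .drop_duplicates(subset=['Id1', 'Value'], keep=False)
       .set_index(['Id1', 'x'])
       .unstack()
       .reset_index())
r.columns = [col[1] if col[1] else col[0] for col in r.columns]
print(r[['Id1', 'old', 'new']])
```

The key step is `drop_duplicates(..., keep=False)`: any (`Id1`, `Value`) pair that occurs in both files is removed entirely, so only changed or new rows survive the concat.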
More generic case (with multiple columns):
In [27]: d1
Out[27]:
ID a b c
0 1 1 2 3
1 2 4 5 6
2 3 7 8 9
In [28]: d2
Out[28]:
ID a b c
0 1 1 2 3
1 2 4 0 6
2 3 7 8 10
3 4 1 1 1
In [29]:
r = pd.concat([d1.assign(x='old'), d2.assign(x='new')], ignore_index=True) \
.drop_duplicates(subset=d1.columns.tolist(), keep=False) \
.set_index(['ID','x']) \
.unstack() \
.reset_index()
In [30]: r
Out[30]:
ID a b c
x new old new old new old
0 2 4.0 4.0 0.0 5.0 6.0 6.0
1 3 7.0 7.0 8.0 8.0 10.0 9.0
2 4 1.0 NaN 1.0 NaN 1.0 NaN
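The multi-column transcript above can likewise be reproduced as a standalone sketch (column names follow the example):

```python
import pandas as pd

# Multi-column sample frames from the transcript above
d1 = pd.DataFrame({'ID': [1, 2, 3],
                   'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})
d2 = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'a': [1, 4, 7, 1], 'b': [2, 0, 8, 1], 'c': [3, 6, 10, 1]})

# Rows identical across all of d1's columns are dropped; whatever survives
# differs between the two files (or exists in only one of them)
r = (pd.concat([d1.assign(x='old'), d2.assign(x='new')], ignore_index=True)
       .drop_duplicates(subset=d1.columns.tolist(), keep=False)
       .set_index(['ID', 'x'])
       .unstack()
       .reset_index())
print(r)
```

Note the resulting columns form a MultiIndex (`('a', 'new')`, `('a', 'old')`, …), which is what the `new`/`old` header pair in the transcript shows.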
Solution 2:[2]
This piece of code should work for any number of columns; it keeps every row from new_df that does not already exist in old_df, i.e. only the new and changed rows.
Note: The code assumes that the column count and column names in the two dataframes are exactly the same.
def filter_new_and_changed_rows(new_df, old_df):
    # Treat each full row as a tuple and keep the new_df rows
    # whose tuple never appears in old_df
    old_indexes = pd.MultiIndex.from_arrays([old_df[col] for col in old_df.columns])
    new_indexes = pd.MultiIndex.from_arrays([new_df[col] for col in new_df.columns])
    changed_df = new_df.loc[~new_indexes.isin(old_indexes)]
    return changed_df
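A quick usage sketch on the question's data (the function is repeated so the snippet runs standalone; the frame names `old`/`new` are illustrative):

```python
import pandas as pd

def filter_new_and_changed_rows(new_df, old_df):
    # Treat each full row as a tuple and keep new_df rows absent from old_df
    old_indexes = pd.MultiIndex.from_arrays([old_df[col] for col in old_df.columns])
    new_indexes = pd.MultiIndex.from_arrays([new_df[col] for col in new_df.columns])
    return new_df.loc[~new_indexes.isin(old_indexes)]

old = pd.DataFrame({'Id1': ['abc', 'def', 'ghi', 'jkl'],
                    'Value': [500, 600, 700, 800]})
new = pd.DataFrame({'Id1': ['abc', 'def', 'ghi', 'jkl', 'mno'],
                    'Value': [500, 650, 750, 800, 900]})

changed = filter_new_and_changed_rows(new, old)
print(changed)  # def 650, ghi 750, mno 900
```

Unlike Solution 1, this returns only the new values, without an old/new side-by-side view.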
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
