Find changed rows (subset of rows) in two dataframes

I have two CSV files that I receive daily, each containing two columns: a unique identifier and a value. I am trying to compare the value for each identifier and keep the rows where the value has changed.

Id1        Value_today
abc        500
def        600
ghi        700
jkl        800

and

Id1        Value_Yesterday
abc        500
def        650
ghi        750
jkl        800
mno        900

The desired output would be

Id1        Value
def        650
ghi        750
mno        900

So far I have tried:

df4 = df1.merge(df2,how='outer',on='Id1')
df4['check']=df4.Value_today==df4.Value_Yesterday
df4 = df4[df4['check'] == False]

But this returns an empty dataframe. Am I doing something wrong?



Solution 1:[1]

I'd do it this way:

In [11]: d1
Out[11]:
   Id1  Value
0  abc    500
1  def    600
2  ghi    700
3  jkl    800

In [12]: d2
Out[12]:
   Id1  Value
0  abc    500
1  def    650
2  ghi    750
3  jkl    800
4  mno    900

In [13]: %paste
r = pd.concat([d1.assign(x='old'), d2.assign(x='new')], ignore_index=True) \
      .drop_duplicates(subset=['Id1','Value'], keep=False) \
      .set_index(['Id1','x']) \
      .unstack() \
      .reset_index()
r.columns = [col[1] if col[1] else col[0] for col in r.columns]
## -- End pasted text --

Result:

In [16]: r
Out[16]:
   Id1    new    old
0  def  650.0  600.0
1  ghi  750.0  700.0
2  mno  900.0    NaN

In [17]: r[['Id1','old','new']]
Out[17]:
   Id1    old    new
0  def  600.0  650.0
1  ghi  700.0  750.0
2  mno    NaN  900.0
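For readers who want to reproduce the session above outside IPython, here is the same pipeline as a self-contained script, using the sample frames from the question:

```python
import pandas as pd

# Sample data from the question
d1 = pd.DataFrame({'Id1': ['abc', 'def', 'ghi', 'jkl'],
                   'Value': [500, 600, 700, 800]})
d2 = pd.DataFrame({'Id1': ['abc', 'def', 'ghi', 'jkl', 'mno'],
                   'Value': [500, 650, 750, 800, 900]})

# Tag each frame, drop (Id1, Value) pairs that appear in both
# (keep=False removes every copy of a duplicated pair), then
# pivot the survivors into 'old'/'new' columns per Id1
r = (pd.concat([d1.assign(x='old'), d2.assign(x='new')], ignore_index=True)
       .drop_duplicates(subset=['Id1', 'Value'], keep=False)
       .set_index(['Id1', 'x'])
       .unstack()
       .reset_index())
r.columns = [col[1] if col[1] else col[0] for col in r.columns]
print(r[['Id1', 'old', 'new']])
```

Note that the values come back as floats: the `mno` row has no old value, and the resulting NaN forces the whole column to float.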

More generic case (with multiple columns):

In [27]: d1
Out[27]:
   ID  a  b  c
0   1  1  2  3
1   2  4  5  6
2   3  7  8  9

In [28]: d2
Out[28]:
   ID  a  b   c
0   1  1  2   3
1   2  4  0   6
2   3  7  8  10
3   4  1  1   1

In [29]: %paste
r = pd.concat([d1.assign(x='old'), d2.assign(x='new')], ignore_index=True) \
      .drop_duplicates(subset=d1.columns.tolist(), keep=False) \
      .set_index(['ID','x']) \
      .unstack() \
      .reset_index()
## -- End pasted text --

In [30]: r
Out[30]:
  ID    a         b          c
x     new  old  new  old   new  old
0  2  4.0  4.0  0.0  5.0   6.0  6.0
1  3  7.0  7.0  8.0  8.0  10.0  9.0
2  4  1.0  NaN  1.0  NaN   1.0  NaN
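The multi-column session can likewise be run as a script; the only change from the two-column case is that `drop_duplicates` compares on every column of `d1`:

```python
import pandas as pd

# Sample multi-column data from the session above
d1 = pd.DataFrame({'ID': [1, 2, 3],
                   'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})
d2 = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'a': [1, 4, 7, 1], 'b': [2, 0, 8, 1], 'c': [3, 6, 10, 1]})

# Rows identical across all columns in both frames are dropped;
# unstack leaves a (column, old/new) MultiIndex on the columns
r = (pd.concat([d1.assign(x='old'), d2.assign(x='new')], ignore_index=True)
       .drop_duplicates(subset=d1.columns.tolist(), keep=False)
       .set_index(['ID', 'x'])
       .unstack()
       .reset_index())
print(r)
```

Here the columns stay as a MultiIndex (as in Out[30]); flatten them with the same list comprehension used in the two-column example if single-level names are needed.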

Solution 2:[2]

This piece of code should work for any number of columns; it keeps every row from new_df that does not already exist in old_df, i.e. only the new and changed rows.

Note: The code assumes that the column count and column names in the two dataframes are exactly the same.

import pandas as pd

def filter_new_and_changed_rows(new_df, old_df):
    # Treat each full row as a tuple key so rows are compared as wholes
    old_indexes = pd.MultiIndex.from_arrays([old_df[col] for col in old_df.columns])
    new_indexes = pd.MultiIndex.from_arrays([new_df[col] for col in new_df.columns])

    # Keep only the rows of new_df whose tuple key never appears in old_df
    changed_df = new_df.loc[~new_indexes.isin(old_indexes)]

    return changed_df
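Applied to the question's data, the function reproduces the desired output. The function body is repeated below so the snippet runs standalone; per the note above, both frames must use matching column names (here Id1/Value), and the yesterday frame is passed as new_df because the desired output keeps yesterday's values:

```python
import pandas as pd

def filter_new_and_changed_rows(new_df, old_df):
    old_indexes = pd.MultiIndex.from_arrays([old_df[col] for col in old_df.columns])
    new_indexes = pd.MultiIndex.from_arrays([new_df[col] for col in new_df.columns])
    return new_df.loc[~new_indexes.isin(old_indexes)]

# Both frames renamed to a common Id1/Value schema
today = pd.DataFrame({'Id1': ['abc', 'def', 'ghi', 'jkl'],
                      'Value': [500, 600, 700, 800]})
yesterday = pd.DataFrame({'Id1': ['abc', 'def', 'ghi', 'jkl', 'mno'],
                          'Value': [500, 650, 750, 800, 900]})

changed = filter_new_and_changed_rows(yesterday, today)
print(changed)
```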

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1
[2] Solution 2