'How to do "(df1 & not df2)" dataframe merge in pandas?
I have 2 pandas dataframes df1 & df2 with common columns/keys (x,y).
I want to merge do a "(df1 & not df2)" kind of merge on keys (x,y), meaning I want my code to return a dataframe containing rows with (x,y) only in df1 & not in df2.
SAS has an equivalent functionality
data final;
merge df1(in=a) df2(in=b);
by x y;
if a & not b;
run;
Who to replicate the same functionality in pandas elegantly? It would have been great if we can specify how="left-right" in merge().
Solution 1:[1]
I just upgraded to version 0.17.0 RC1 which was released 10 days ago. Just found out that pd.merge() have new argument in this new release called indicator=True to acheive this in pandonic way!!
df=pd.merge(df1,df2,on=['x','y'],how="outer",indicator=True)
df=df[df['_merge']=='left_only']
indicator: Add a column to the output DataFrame called _merge with information on the source of each row. _merge is Categorical-type and takes on a value of left_only for observations whose merge key only appears in 'left' DataFrame, right_only for observations whose merge key only appears in 'right' DataFrame, and both if the observation’s merge key is found in both.
Solution 2:[2]
Another way to do this is to use the index.
If i1 and i2 are indices (sets of indices) then i1.difference(i2) represent those indices in i1 and not in i2. Then if df is a dataframe indexed by the same index type , for instance i1=df.index then pd.DataFrame(index=i1.difference(i2)).join(df) are those entries in df whose index is not in the index i2.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | vaudt |
