'Get a new dataframe by not matching rows

I have two dataframes organised in this way:

df1
col_1 col_2 col_3
  a banana red 
  b apple blue
  c orange green
  d



df2
col_1 col_2 col_3
  a                                         
  b apple blue
  c orange green
  d

Both come from a dataframe that is complete in all its rows, the two dataframes above are the result of a filter that I applied in the column "col_2" and in the column "col_3" where I label NaN values everything that does not fit in the filter. I would like to compare the two dataframes obtained from the filter by isolating the rows that get the "Nan" label once I have applied a wider filter.

Example of an expected result:

[IN]df1.merge(df2, on= ["col_1", "col_2"])


[OUT] 
col_1 col_2 col_3
a banana red 

How can I do this? Thank you in advance for your reply

Let me explain better:

When I re-apply the filter on df1 by raising the thresholds, the values of those two columns tend to decrease. The original dataframe has about 50,000 rows without any null values. As I apply the filter to the original dataframe and raise the thresholds more and more, the null values in those two columns tend to increase, reducing the non-null values from 50,000 to 45,000 as I raise the thresholds. I am particularly interested in getting those 5,000 values that I lost from the preceding dataframe before I applied the filter. That is my goal.



Solution 1:[1]

I think you are trying to say : keep the rows that are unique (i.e., do not exactly match in all columns) between the 2 dataframes. In that case, a way to do that is to combine the 2 dataframes, and then remove all duplicate rows.

import pandas as pd

df1 = pd.DataFrame({
    'c1': ['a', 'b', 'c', 'd'],
    'c2': ['banana', 'apple', 'orange', None],
    'c3': ['red', 'blue', 'green', None]})

df2 = pd.DataFrame({
    'c1': ['a', 'b', 'c', 'd'],
    'c2': [None, 'apple', 'orange', None],
    'c3': [None, 'blue', 'green', None]})

print(df1)
print()
print(df2)

df_all_together = pd.concat([df1, df2])
df_unique_rows = df_all_together.drop_duplicates(subset=['c2', 'c3'], keep=False)

print(40*'-')
print(df_unique_rows)

  c1      c2     c3
0  a  banana    red
1  b   apple   blue
2  c  orange  green
3  d    None   None

  c1      c2     c3
0  a    None   None
1  b   apple   blue
2  c  orange  green
3  d    None   None
----------------------------------------
  c1      c2   c3
0  a  banana  red

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1