Pandas drop duplicates if reverse is present between two columns

I have a dataset with 2 columns like the following...

InteractorA InteractorB
AGAP028204  AGAP005846
AGAP028204  AGAP003428
AGAP028200  AGAP011124
AGAP028200  AGAP004335
AGAP028200  AGAP011356
AGAP028194  AGAP008414

I'm using Pandas and I want to drop rows that appear twice with the two columns simply reversed. For example, I want to go from this...

InteractorA InteractorB
AGAP002741  AGAP008026
AGAP008026  AGAP002741

To this...

InteractorA InteractorB
AGAP002741  AGAP008026

For all intents and purposes, the two rows are the same thing.

Is there a built-in method to handle this?



Solution 1:[1]

This is the cleanest solution I've managed to make work for my own purposes.

Create a column that holds each row's two values as a sorted list:

df['sorted_row'] = [sorted([a,b]) for a,b in zip(df.InteractorA, df.InteractorB)]

Lists are not hashable, so drop_duplicates can't work on them directly; convert the column to strings first:

df['sorted_row'] = df['sorted_row'].astype(str)

Drop the duplicates:

df.drop_duplicates(subset=['sorted_row'], inplace=True)
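The steps above can be sketched as one self-contained function. This is only an illustration under a few assumptions: the name drop_reversed_pairs is made up here, the sample frame is a shortened version of the question's data, and a sorted tuple is used as the key in place of the string conversion (tuples are hashable), so no helper column is left behind.

```python
import io
import pandas as pd

# Shortened sample based on the question's data
temp = """InteractorA InteractorB
AGAP028204  AGAP005846
AGAP002741  AGAP008026
AGAP008026  AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep=r"\s+")


def drop_reversed_pairs(frame, col_a="InteractorA", col_b="InteractorB"):
    """Keep the first occurrence of each unordered (col_a, col_b) pair.

    Hypothetical helper, not part of pandas.
    """
    # A sorted tuple is identical for (a, b) and (b, a), and it is hashable
    key = pd.Series(
        [tuple(sorted(pair)) for pair in zip(frame[col_a], frame[col_b])],
        index=frame.index,
    )
    # duplicated() marks every repeat after the first occurrence
    return frame[~key.duplicated()]


result = drop_reversed_pairs(df)
print(result)
```

The original frame is left untouched; the function returns a filtered copy that keeps the original index labels.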

Solution 2:[2]

I think the following would work:

In [37]:
import pandas as pd
import io
temp = """InteractorA InteractorB
AGAP028204  AGAP005846
AGAP028204  AGAP003428
AGAP028200  AGAP011124
AGAP028200  AGAP004335
AGAP028200  AGAP011356
AGAP028194  AGAP008414
AGAP002741  AGAP008026
AGAP008026  AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
df
Out[37]:
  InteractorA InteractorB
0  AGAP028204  AGAP005846
1  AGAP028204  AGAP003428
2  AGAP028200  AGAP011124
3  AGAP028200  AGAP004335
4  AGAP028200  AGAP011356
5  AGAP028194  AGAP008414
6  AGAP002741  AGAP008026
7  AGAP008026  AGAP002741

I downloaded your data and realised I had misunderstood what you wanted; the following should work instead.

First get the values that are unique:

In [72]:
df1 = df[~df.InteractorA.isin(df.InteractorB)]
df1.shape
Out[72]:
(2386, 2)

Now we want to get the duplicated rows but take the first value:

In [74]:

df2 = df[df.InteractorA.isin(df.InteractorB)]
df2 = df2.groupby('InteractorA').first().reset_index()
df2.shape
Out[74]:
(3074, 2)

Now concatenate the two dataframes:

In [75]:

merged = pd.concat([df1, df2], ignore_index=True)
merged.shape
Out[75]:
(5460, 2)

I think this is now correct.
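The shapes above come from the full dataset, which isn't reproduced here, so it may be worth checking a result like this directly. Below is a rough sanity check, assuming a made-up helper name has_reversed_duplicates: it treats each row as an unordered pair and reports whether any pair occurs more than once. It re-runs the same three steps on the eight-row sample from earlier in this answer.

```python
import io
import pandas as pd

temp = """InteractorA InteractorB
AGAP028204  AGAP005846
AGAP028204  AGAP003428
AGAP028200  AGAP011124
AGAP028200  AGAP004335
AGAP028200  AGAP011356
AGAP028194  AGAP008414
AGAP002741  AGAP008026
AGAP008026  AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep=r"\s+")


def has_reversed_duplicates(frame, col_a="InteractorA", col_b="InteractorB"):
    """Return True if any unordered (col_a, col_b) pair occurs more than once.

    Hypothetical helper, not part of pandas.
    """
    # frozenset ignores order, so (a, b) and (b, a) give the same key
    pairs = [frozenset(pair) for pair in zip(frame[col_a], frame[col_b])]
    return len(set(pairs)) < len(pairs)


# Same steps as above, applied to the small sample
df1 = df[~df.InteractorA.isin(df.InteractorB)]
df2 = df[df.InteractorA.isin(df.InteractorB)]
df2 = df2.groupby("InteractorA").first().reset_index()
merged = pd.concat([df1, df2], ignore_index=True)

print(has_reversed_duplicates(df))
print(has_reversed_duplicates(merged))
```

On this small sample the check still reports a repeated pair in merged, because the two reversed rows have different InteractorA values and so land in separate groups. That suggests the isin-based steps may not remove reversed rows in every case, so running a check like this on the full data seems worthwhile.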

Solution 3:[3]

I was looking to solve a similar problem today. The answer by A.Kot pointed me in the right direction. Below is a working example; the data preparation is copied from the answer by EdChum.

import io
import pandas as pd
temp = """InteractorA InteractorB
AGAP028204  AGAP005846
AGAP028204  AGAP003428
AGAP028200  AGAP011124
AGAP028200  AGAP004335
AGAP028200  AGAP011356
AGAP028194  AGAP008414
AGAP002741  AGAP008026
AGAP008026  AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')

# One-liner to drop the duplicates
# (frozenset rather than set: drop_duplicates needs hashable values)
df.loc[df.apply(lambda x: frozenset(x[['InteractorA', 'InteractorB']]), axis=1).drop_duplicates().index]
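A row-wise apply calls a Python function once per row, which can get slow on large frames. As a possible alternative (this is a different technique, not the one-liner above), the two columns can be sorted within each row using NumPy and the result passed to duplicated(). The sketch below assumes both columns hold plain strings with no missing values; the four-row frame is a shortened version of the sample data.

```python
import numpy as np
import pandas as pd

# Shortened sample based on the question's data
df = pd.DataFrame(
    {
        "InteractorA": ["AGAP028204", "AGAP028200", "AGAP002741", "AGAP008026"],
        "InteractorB": ["AGAP005846", "AGAP011124", "AGAP008026", "AGAP002741"],
    }
)

cols = ["InteractorA", "InteractorB"]

# Sort the two values within each row so (a, b) and (b, a) look identical
sorted_pairs = pd.DataFrame(np.sort(df[cols].to_numpy(), axis=1), index=df.index)

# Keep the rows whose sorted pair has not been seen before
result = df[~sorted_pairs.duplicated()]
print(result)
```

The original column order of the surviving rows is preserved, since the sorted copy is only used to build the boolean mask.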

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 A.Kot
Solution 2
Solution 3 ilyavs