Pandas: drop duplicates if the reversed pair is present across two columns
I have a dataset with 2 columns like the following...
InteractorA InteractorB
AGAP028204 AGAP005846
AGAP028204 AGAP003428
AGAP028200 AGAP011124
AGAP028200 AGAP004335
AGAP028200 AGAP011356
AGAP028194 AGAP008414
I'm using Pandas and I want to drop rows that appear twice with the columns simply reversed, going from this...
InteractorA InteractorB
AGAP002741 AGAP008026
AGAP008026 AGAP002741
To this...
InteractorA InteractorB
AGAP002741 AGAP008026
As they are, for all intents and purposes, the same thing.
Is there a built-in method to handle this?
Solution 1:[1]
This is the cleanest solution I've managed to make work for my own purposes.
Create a column holding each pair as a sorted list:
df['sorted_row'] = [sorted([a, b]) for a, b in zip(df.InteractorA, df.InteractorB)]
Lists are unhashable, so drop_duplicates can't use that column directly; convert it to a string first:
df['sorted_row'] = df['sorted_row'].astype(str)
Then drop the duplicates:
df.drop_duplicates(subset=['sorted_row'], inplace=True)
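A vectorized variant of the steps above (my own sketch, not from the original answer): sorting each pair across the two columns with NumPy gives the same canonical key without the intermediate list column or the string conversion.

```python
import io

import numpy as np
import pandas as pd

# Small sample in the same shape as the question's data.
temp = """InteractorA InteractorB
AGAP028204 AGAP005846
AGAP002741 AGAP008026
AGAP008026 AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')

# np.sort(..., axis=1) orders each row's two IDs alphabetically,
# so (A, B) and (B, A) produce the same key row.
key = pd.DataFrame(np.sort(df[['InteractorA', 'InteractorB']].values, axis=1))
deduped = df[~key.duplicated()]
print(deduped)
```

The last row is dropped because its sorted key collides with the row before it; the original two-column layout is preserved, with no helper column left behind.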
Solution 2:[2]
I think the following would work:
In [37]:
import pandas as pd
import io
temp = """InteractorA InteractorB
AGAP028204 AGAP005846
AGAP028204 AGAP003428
AGAP028200 AGAP011124
AGAP028200 AGAP004335
AGAP028200 AGAP011356
AGAP028194 AGAP008414
AGAP002741 AGAP008026
AGAP008026 AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
df
Out[37]:
InteractorA InteractorB
0 AGAP028204 AGAP005846
1 AGAP028204 AGAP003428
2 AGAP028200 AGAP011124
3 AGAP028200 AGAP004335
4 AGAP028200 AGAP011356
5 AGAP028194 AGAP008414
6 AGAP002741 AGAP008026
7 AGAP008026 AGAP002741
I downloaded your data and initially misunderstood what you wanted; the following should now work:
In [72]:
# First get the rows whose InteractorA never appears as an InteractorB
df1 = df[~df.InteractorA.isin(df.InteractorB)]
df1.shape
Out[72]:
(2386, 2)
Now take the rows whose InteractorA does appear as an InteractorB, and keep only the first row per InteractorA:
In [74]:
df2 = df[df.InteractorA.isin(df.InteractorB)]
df2 = df2.groupby('InteractorA').first().reset_index()
df2.shape
Out[74]:
(3074, 2)
Now concatenate the two DataFrames:
In [75]:
merged = pd.concat([df1, df2], ignore_index=True)
merged.shape
Out[75]:
(5460, 2)
I think this is now correct.
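Whichever approach is used, a quick sanity check (my own addition, not part of this answer) is to canonicalize each surviving pair by sorting it: if any reversed duplicate slipped through, two rows would share a canonical key.

```python
import io

import pandas as pd

# A deduplicated result in the same shape as the question's data.
temp = """InteractorA InteractorB
AGAP028204 AGAP005846
AGAP002741 AGAP008026"""
deduped = pd.read_csv(io.StringIO(temp), sep=r'\s+')

# Canonical key: the alphabetically sorted pair, so (A, B) == (B, A).
canonical = deduped.apply(
    lambda row: tuple(sorted([row['InteractorA'], row['InteractorB']])), axis=1
)
# If a reversed duplicate survived, the canonical keys would collide.
assert canonical.is_unique
print("no reversed duplicates remain")
```

Tuples are hashable, so `is_unique` works directly on the resulting Series.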
Solution 3:[3]
I was looking to solve a similar problem today. The answer by A.Kot pointed me in the right direction. Below is a working example; the data preparation is copied from the answer by EdChum.
import io
import pandas as pd
temp = """InteractorA InteractorB
AGAP028204 AGAP005846
AGAP028204 AGAP003428
AGAP028200 AGAP011124
AGAP028200 AGAP004335
AGAP028200 AGAP011356
AGAP028194 AGAP008414
AGAP002741 AGAP008026
AGAP008026 AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
# One-liner to drop the duplicates (frozenset is hashable, unlike set,
# so drop_duplicates can compare the pairs without raising)
df.loc[df.apply(lambda x: frozenset(x[['InteractorA', 'InteractorB']]), axis=1).drop_duplicates().index]
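For reference, a self-contained run of the same idea (my own verification sketch): I build the duplicate key with frozenset rather than set, since frozenset is hashable and drop_duplicates needs hashable values. On the eight-row sample, only the last row is a reversed duplicate, so seven rows remain.

```python
import io

import pandas as pd

# The eight-row sample from earlier in this article.
temp = """InteractorA InteractorB
AGAP028204 AGAP005846
AGAP028204 AGAP003428
AGAP028200 AGAP011124
AGAP028200 AGAP004335
AGAP028200 AGAP011356
AGAP028194 AGAP008414
AGAP002741 AGAP008026
AGAP008026 AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')

# frozenset({A, B}) == frozenset({B, A}), so reversed pairs share one key;
# drop_duplicates keeps the index of each key's first occurrence.
keep = df.apply(
    lambda row: frozenset(row[['InteractorA', 'InteractorB']]), axis=1
).drop_duplicates().index
result = df.loc[keep]
print(len(result))
```

The last row (index 7) is dropped because its frozenset matches that of index 6.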
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | A.Kot |
| Solution 2 | EdChum |
| Solution 3 | ilyavs |
