Pandas: remove duplicates based on substring [duplicate]
I have the following 2 columns, from a Pandas DataFrame:
antecedents consequents
apple orange
orange apple
apple water
apple pineapple
water lemon
lemon water
I would like to remove pairs that appear in both orders (once as antecedent and consequent, once the other way round), keeping only the first occurrence, and thus obtain:
antecedents consequents
apple orange
apple water
apple pineapple
water lemon
How can I achieve that using Pandas?
Solution 1:[1]
Convert the two columns of each row to a frozenset, so that the order within the pair no longer matters, and test for duplicates with Series.duplicated:
df2 = df[~df[['antecedents', 'consequents']].apply(frozenset, axis=1).duplicated()]
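To make the frozenset approach easier to follow, here is a minimal sketch that rebuilds the question's data and splits the one-liner into steps. The intermediate names `keys` and `mask` are only for illustration and do not come from the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "antecedents": ["apple", "orange", "apple", "apple", "water", "lemon"],
    "consequents": ["orange", "apple", "water", "pineapple", "lemon", "water"],
})

# One order-insensitive key per row: frozenset({"apple", "orange"})
# is equal to frozenset({"orange", "apple"}).
keys = df[["antecedents", "consequents"]].apply(frozenset, axis=1)

# True for every row whose key has already appeared further up.
mask = keys.duplicated()

# Keep only the first occurrence of each unordered pair.
df2 = df[~mask]
print(df2)
```

Because the mask is applied to the original df, the surviving rows keep their original index labels and column order.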
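Alternatively, sort the values within each row using numpy.sort, then test for duplicates with DataFrame.duplicated:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.sort(df[['antecedents', 'consequents']], axis=1), index=df.index)
df2 = df[~df1.duplicated()]
print(df2)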
antecedents consequents
0 apple orange
2 apple water
3 apple pineapple
4 water lemon
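The sorting variant can also be wrapped in a small helper, which may be handy if the same check is needed on several DataFrames. This is only a sketch: the function name `drop_unordered_duplicates` and its `keep` parameter are hypothetical and not part of pandas, although `keep` is passed straight through to DataFrame.duplicated.

```python
import numpy as np
import pandas as pd

def drop_unordered_duplicates(frame, cols, keep="first"):
    """Hypothetical helper: drop rows whose values in `cols` repeat
    another row's values in any order."""
    # Sort the values inside each row so (a, b) and (b, a) look identical.
    normalized = pd.DataFrame(
        np.sort(frame[cols].to_numpy(), axis=1), index=frame.index
    )
    # Filter the original frame, so values and index labels are unchanged.
    return frame[~normalized.duplicated(keep=keep)]

# Sample data copied from the question.
df = pd.DataFrame({
    "antecedents": ["apple", "orange", "apple", "apple", "water", "lemon"],
    "consequents": ["orange", "apple", "water", "pineapple", "lemon", "water"],
})

first_kept = drop_unordered_duplicates(df, ["antecedents", "consequents"])
last_kept = drop_unordered_duplicates(df, ["antecedents", "consequents"], keep="last")
print(first_kept)
```

Note that sorting requires the values in each row to be mutually comparable, so a row mixing strings with NaN would raise a TypeError here. The frozenset approach only requires the values to be hashable.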
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
