'Dataframe check if 2 columns contain same elements [duplicate]
I have a data frame and with 2 columns X & Y.
df = pd.DataFrame({
'X': ['a', 'a,b,c', 'a,d', 'e,f', 'a,c,d,f', 'e'],
'Y': ['a', 'a,c,b', 'd,a', 'e,g', 'a,d,f,g', 'e']
})
I want to create a new column('Match') in the dataframe such if the columns X & Y have the same elements, then True else False.
df = pd.DataFrame({
'X': ['a', 'a,b,c', 'a,d', 'e,f', 'a,c,d,f', 'e'],
'Y': ['a', 'a,c,b', 'd,a', 'e,g', 'a,d,f,g', 'e'],
'Match':['True','True','True','False','False','True']
})
Kindly help me with this
Solution 1:[1]
You can try split the column to list then sort and compare.
df['Match2'] = df['X'].str.split(',').apply(sorted) == df['Y'].str.split(',').apply(sorted)
Or you can convert list to set and compare depending on if you want duplicated
df['Match2'] = df['X'].str.split(',').apply(set) == df['Y'].str.split(',').apply(set)
print(df)
X Y Match Match2
0 a a True True
1 a,b,c a,c,b True True
2 a,d d,a True True
3 e,f e,g False False
4 a,c,d,f a,d,f,g False False
5 e e True True
To avoid repeating, you can do
df['Match'] = df[['X', 'Y']].apply(lambda col: col.str.split(',').apply(sorted)).eval('X == Y')
Solution 2:[2]
Lots of ways to do this, one way would be to explode your arrays, sort them and match for equality.
import numpy as np
df1 = df.stack()\
.str.split(',')\
.explode()\
.sort_values()\
.groupby(level=[0,1])\
.agg(list).unstack(1)
df['match'] = np.where(df1['X'].eq(df1['Y']),True,False)
X Y match
0 a a True
1 a,b,c a,c,b True
2 a,d d,a True
3 e,f e,g False
4 a,c,d,f a,d,f,g False
5 e e True
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ynjxsjmh |
| Solution 2 | Umar.H |
