'Dataframe check if 2 columns contain same elements [duplicate]

I have a data frame and with 2 columns X & Y.

df = pd.DataFrame({ 
'X': ['a', 'a,b,c', 'a,d', 'e,f', 'a,c,d,f', 'e'], 
'Y': ['a', 'a,c,b', 'd,a', 'e,g', 'a,d,f,g', 'e']
})

I want to create a new column('Match') in the dataframe such if the columns X & Y have the same elements, then True else False.

df = pd.DataFrame({ 
'X': ['a', 'a,b,c', 'a,d', 'e,f', 'a,c,d,f', 'e'], 
'Y': ['a', 'a,c,b', 'd,a', 'e,g', 'a,d,f,g', 'e'],
'Match':['True','True','True','False','False','True']
})

Kindly help me with this



Solution 1:[1]

You can try split the column to list then sort and compare.

df['Match2'] = df['X'].str.split(',').apply(sorted) == df['Y'].str.split(',').apply(sorted)

Or you can convert list to set and compare depending on if you want duplicated

df['Match2'] = df['X'].str.split(',').apply(set) == df['Y'].str.split(',').apply(set)
print(df)

         X        Y  Match  Match2
0        a        a   True    True
1    a,b,c    a,c,b   True    True
2      a,d      d,a   True    True
3      e,f      e,g  False   False
4  a,c,d,f  a,d,f,g  False   False
5        e        e   True    True

To avoid repeating, you can do

df['Match'] = df[['X', 'Y']].apply(lambda col: col.str.split(',').apply(sorted)).eval('X == Y')

Solution 2:[2]

Lots of ways to do this, one way would be to explode your arrays, sort them and match for equality.

import numpy as np

df1 = df.stack()\
        .str.split(',')\
            .explode()\
            .sort_values()\
            .groupby(level=[0,1])\
            .agg(list).unstack(1)


df['match'] = np.where(df1['X'].eq(df1['Y']),True,False)

         X        Y  match
0        a        a   True
1    a,b,c    a,c,b   True
2      a,d      d,a   True
3      e,f      e,g  False
4  a,c,d,f  a,d,f,g  False
5        e        e   True

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ynjxsjmh
Solution 2 Umar.H