Encoding categorical features stored as lists in a Pandas DataFrame

I have a Pandas Series in which categorical data is stored as lists. I am trying to extract these values and encode them as categorical features for ML. I am not sure whether the encoders from sklearn.preprocessing, or any other out-of-the-box solution, work with lists.

import pandas as pd

df = pd.DataFrame({
                   'id': [0, 1, 2],
                   'col1': ['yy','xx','zz'],
                   'col2': ['oo','pp','rr'],
                   'cats': [['A','B','C'],
                            ['U','O','T'],
                            ['T','C','U']]
                 })

Expected output:

id col1 col2 A B C U O T 
0  yy   oo   1 1 1 0 0 0
1  xx   pp   0 0 0 1 1 1
2  zz   rr   0 0 1 1 0 1


Solution 1:[1]

Second answer

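If only one column holds lists, explode still works as long as you handle that column separately: set id as the index, explode the list column on its own, cross-tabulate it against the index, and concatenate the result back onto the remaining columns: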

>>> df = pd.DataFrame({
...     'id': [0, 1, 2],
...     'col1': ['yy', 'xx', 'zz'],
...     'col2': ['oo', 'pp', 'rr'],
...     'cats': [['A', 'B', 'C'],
...              ['U', 'O', 'T'],
...              ['T', 'C', 'U']]
... })
>>> indexed = df.set_index('id')
>>> nolists = indexed.drop(columns=['cats'])
>>> exp = indexed['cats'].explode()
>>> enc = pd.crosstab(exp.index, exp)
>>> result = pd.concat([nolists, enc], axis=1).rename_axis(index='id').reset_index()
>>> result
   id col1 col2  A  B  C  O  T  U
0   0   yy   oo  1  1  1  0  0  0
1   1   xx   pp  0  0  0  1  1  1
2   2   zz   rr  0  0  1  0  1  1
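
For reuse, the steps above can be wrapped in a small helper. This is only a sketch: the name encode_list_column and its parameters are made up for illustration, and it assumes the item labels do not clash with existing column names. It clips the counts to 1, because crosstab counts occurrences and a list that repeats an item would otherwise give a value greater than 1.

```python
import pandas as pd


def encode_list_column(df, list_col, id_col):
    """Replace list_col with one 0/1 indicator column per distinct item.

    Hypothetical helper for illustration; assumes id_col values are unique.
    """
    # One row per (id, item) pair.
    exploded = df[[id_col, list_col]].explode(list_col)
    # crosstab counts occurrences; clip so repeated items still yield 1.
    counts = pd.crosstab(exploded[id_col], exploded[list_col]).clip(upper=1)
    counts.columns.name = None
    # Match the id column of df against the index of the indicator table.
    out = df.drop(columns=[list_col]).join(counts, on=id_col)
    # Ids with no matching crosstab row (for example, an empty list)
    # come back as NaN from the join; treat them as all zeros.
    out[counts.columns] = out[counts.columns].fillna(0).astype(int)
    return out


df = pd.DataFrame({
    'id': [0, 1, 2],
    'col1': ['yy', 'xx', 'zz'],
    'col2': ['oo', 'pp', 'rr'],
    'cats': [['A', 'B', 'C'],
             ['U', 'O', 'T'],
             ['T', 'C', 'U']],
})
encoded = encode_list_column(df, 'cats', 'id')
print(encoded)
```

The indicator columns come out in sorted order (A, B, C, O, T, U), as in the result above, rather than in order of first appearance.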

First answer

Are you sure that's your expected output? Why are there two U columns?

Try this:

>>> df = pd.DataFrame({
...     'id': [0, 1, 2],
...     'cats': [['A', 'B', 'C'],
...              ['U', 'O', 'T'],
...              ['T', 'C', 'U']]
... })
>>> exp = df.explode('cats')
>>> enc = pd.crosstab(exp['id'], exp['cats'])
>>> enc
cats  A  B  C  O  T  U
id                    
0     1  1  1  0  0  0
1     0  0  0  1  1  1
2     0  0  1  0  1  1
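
As a cross-check on the explode/crosstab result, here is an alternative sketch that stays within pandas string methods: join each list into one delimited string, then split it back into indicator columns with str.get_dummies. It assumes every item is a string and that no item contains the separator, which is chosen here as '|'.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2],
    'col1': ['yy', 'xx', 'zz'],
    'col2': ['oo', 'pp', 'rr'],
    'cats': [['A', 'B', 'C'],
             ['U', 'O', 'T'],
             ['T', 'C', 'U']],
})

# Assumption: '|' never occurs inside an item, so it is safe as a separator.
joined = df['cats'].str.join('|')
# One 0/1 column per distinct item, with columns in sorted order.
dummies = joined.str.get_dummies(sep='|')
# Put the indicators next to the remaining columns (row order is unchanged).
result = pd.concat([df.drop(columns=['cats']), dummies], axis=1)
print(result)
```

Unlike crosstab, this produces indicators rather than counts, so a list that repeats an item still gets a 1.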

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1