Encoding categorical features stored as lists in Pandas DataFrame
I have a Pandas Series in which categorical data is stored as lists. I am trying to extract the values and encode them as categorical features for ML. I am not sure whether the encoders in sklearn.preprocessing, or any other out-of-the-box solution, work with lists.
import pandas as pd
df = pd.DataFrame({
    'id': [0, 1, 2],
    'col1': ['yy', 'xx', 'zz'],
    'col2': ['oo', 'pp', 'rr'],
    'cats': [['A', 'B', 'C'],
             ['U', 'O', 'T'],
             ['T', 'C', 'U']]
})
Expected output:
id col1 col2  A  B  C  U  O  T
 0   yy   oo  1  1  1  0  0  0
 1   xx   pp  0  0  0  1  1  1
 2   zz   rr  0  0  1  1  0  1
Solution 1:[1]
Second answer
If only one column holds lists, explode still works: set that column aside, explode it on its own, and join the encoded result back to the remaining columns:
>>> df = pd.DataFrame({
...     'id': [0, 1, 2],
...     'col1': ['yy', 'xx', 'zz'],
...     'col2': ['oo', 'pp', 'rr'],
...     'cats': [['A', 'B', 'C'],
...              ['U', 'O', 'T'],
...              ['T', 'C', 'U']]
... })
>>> indexed = df.set_index('id')
>>> nolists = indexed.drop(columns=['cats'])  # every column except the lists
>>> exp = indexed['cats'].explode()           # one row per (id, category) pair
>>> enc = pd.crosstab(exp.index, exp)         # count categories per id
>>> result = pd.concat([nolists, enc], axis=1).rename_axis(index='id').reset_index()
>>> result
   id col1 col2  A  B  C  O  T  U
0   0   yy   oo  1  1  1  0  0  0
1   1   xx   pp  0  0  0  1  1  1
2   2   zz   rr  0  0  1  0  1  1
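The explode-and-crosstab steps can also be wrapped in a small helper. This is only a sketch: the name encode_list_column is made up for illustration, it assumes the DataFrame has a unique, single-level index, and it assumes no category value clashes with an existing column name. Since crosstab counts occurrences, the sketch clips the counts to 1 so that a list with a repeated value still yields a 0/1 indicator.

```python
import pandas as pd


def encode_list_column(df, column):
    """Hypothetical helper: replace a column of lists with 0/1 indicator columns."""
    # One row per (original index label, category) pair.
    exploded = df[column].explode()
    # Count categories per index label, then cap counts at 1 so repeats stay binary.
    indicators = pd.crosstab(exploded.index, exploded.to_numpy()).clip(upper=1)
    # Restore any index labels missing from the crosstab as all-zero rows.
    indicators = indicators.reindex(df.index, fill_value=0)
    # Drop the axis labels that crosstab adds so the join looks like plain columns.
    indicators.columns.name = None
    indicators.index.name = df.index.name
    return df.drop(columns=[column]).join(indicators)


df = pd.DataFrame({
    'id': [0, 1, 2],
    'col1': ['yy', 'xx', 'zz'],
    'col2': ['oo', 'pp', 'rr'],
    'cats': [['A', 'B', 'C'],
             ['U', 'O', 'T'],
             ['T', 'C', 'U']],
})
result = encode_list_column(df, 'cats')
print(result)
```

The indicator columns come out in sorted order (A, B, C, O, T, U), as in the output above, rather than in order of first appearance.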
First answer
Are you sure that's your expected output? Why are there two U columns?
Try this:
>>> df = pd.DataFrame({
...     'id': [0, 1, 2],
...     'cats': [['A', 'B', 'C'],
...              ['U', 'O', 'T'],
...              ['T', 'C', 'U']]
... })
>>> exp = df.explode('cats')
>>> enc = pd.crosstab(exp['id'], exp['cats'])
>>> enc
cats  A  B  C  O  T  U
id
0     1  1  1  0  0  0
1     0  0  0  1  1  1
2     0  0  1  0  1  1
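Since the question also asks about sklearn.preprocessing: one encoder there that accepts an iterable of label collections is MultiLabelBinarizer. The following is a minimal sketch rather than a drop-in answer. It assumes scikit-learn is installed, and the variable names mlb and encoded are chosen here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    'id': [0, 1, 2],
    'col1': ['yy', 'xx', 'zz'],
    'col2': ['oo', 'pp', 'rr'],
    'cats': [['A', 'B', 'C'],
             ['U', 'O', 'T'],
             ['T', 'C', 'U']],
})

# Fit on the lists and get a dense 0/1 array, one column per distinct category.
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df['cats']),
    columns=mlb.classes_,  # learned categories, in sorted order
    index=df.index,        # keep row alignment with the original frame
)

# Swap the list column for the indicator columns.
result = df.drop(columns=['cats']).join(encoded)
print(result)
```

One reason to prefer this over crosstab in an ML pipeline is that the fitted mlb can be reused with mlb.transform on new data, so training and inference share the same column layout.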
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
