How to transform text categories into a vector of ones and zeros in Python for classification?

I'm preparing data for multilabel classification and I want to transform text categories into a vector of zeros and ones.

Example input (dataframe):

['a']
['b']
['e', 'd']
['a', 'g', 'x']

etc.

Example output:

[1 0 0 0 0 0 0 ... 0]
[0 1 0 0 0 0 0 ... 0]
[0 0 0 1 1 0 0 ... 0]
[1 0 0 0 0 0 1 ... 0]

etc.

There are 200 categories. I haven't found a way to do this for multilabel classification. Any help would be appreciated.



Solution 1:[1]

Try using a combination of pd.get_dummies (designed for one-hot encoding) and explode + groupby(level=0):

out = pd.get_dummies(df['a'].explode()).groupby(level=0).sum()

Output:

>>> out
   a  b  d  e  g  x
0  1  0  0  0  0  0
1  0  1  0  0  0  0
2  0  0  1  1  0  0
3  1  0  0  0  1  1

Or a little shorter:

out = df['a'].str.join(',').str.get_dummies(',')

Output:

>>> out
   a  b  d  e  g  x
0  1  0  0  0  0  0
1  0  1  0  0  0  0
2  0  0  1  1  0  0
3  1  0  0  0  1  1

Then store each row as a list in a new column:

df['class'] = out.to_numpy().tolist()

Output:

>>> df
           a               class
0        [a]  [1, 0, 0, 0, 0, 0]
1        [b]  [0, 1, 0, 0, 0, 0]
2     [e, d]  [0, 0, 1, 1, 0, 0]
3  [a, g, x]  [1, 0, 0, 0, 1, 1]
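
One thing to watch with 200 categories: get_dummies only creates columns for labels that actually occur in the data, so the vector length can change between datasets. Below is a minimal sketch of one way to pin the columns to a known vocabulary with reindex. The sample frame mirrors the question, and `all_categories` is a hypothetical stand-in for the real list of 200 labels.

```python
import pandas as pd

# Sample data mirroring the question; the column name 'a' follows the answer above.
df = pd.DataFrame({'a': [['a'], ['b'], ['e', 'd'], ['a', 'g', 'x']]})

# Assumed full label vocabulary (hypothetical; the real one would have 200 entries).
all_categories = ['a', 'b', 'c', 'd', 'e', 'g', 'x']

out = (
    pd.get_dummies(df['a'].explode())      # one indicator row per (row, label) pair
      .groupby(level=0).sum()              # collapse back to one row per original index
      .clip(upper=1)                       # guard against a label repeated within a row
      .reindex(columns=all_categories, fill_value=0)  # add unseen labels as all-zero columns
      .astype(int)
)

df['class'] = out.to_numpy().tolist()
print(df)
```

With this vocabulary, the row ['e', 'd'] becomes [0, 0, 0, 1, 1, 0, 0], and the column order is the same on every dataset you encode.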

Solution 2:[2]

You can also use MultiLabelBinarizer from sklearn:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
 
s = mlb.fit_transform(df['a'])
>>> s
array([[1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0],
       [1, 0, 0, 0, 1, 1]])
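
The array on its own does not say which column is which label. Here is a short sketch, assuming the same sample frame as in the question, that recovers the column labels from `classes_`, maps vectors back to labels with `inverse_transform`, and fixes the column order up front with the `classes` parameter. The vocabulary list is a hypothetical stand-in for the real 200 categories.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data mirroring the question; the column name 'a' follows the answer above.
df = pd.DataFrame({'a': [['a'], ['b'], ['e', 'd'], ['a', 'g', 'x']]})

mlb = MultiLabelBinarizer()
s = mlb.fit_transform(df['a'])

# classes_ holds the sorted labels, one per output column.
encoded = pd.DataFrame(s, columns=mlb.classes_, index=df.index)

# inverse_transform maps each indicator row back to a tuple of labels.
labels = mlb.inverse_transform(s)

# Passing classes fixes the columns and their order, including labels
# that never occur in this data (hypothetical vocabulary).
fixed = MultiLabelBinarizer(classes=['a', 'b', 'c', 'd', 'e', 'g', 'x'])
s_fixed = fixed.fit_transform(df['a'])

print(encoded)
print(labels)
print(s_fixed)
```

Fitting once on the training labels (or passing `classes`) and then calling `transform` on new data keeps the vector layout consistent between training and prediction.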

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 BENY