How to transform text categories into a vector of ones and zeros in Python for classification?
I'm preparing data for multilabel classification and I want to transform text categories into a vector of zeros and ones.
Example input (dataframe):
['a']
['b']
['e', 'd']
['a', 'g', 'x']
etc. Example output:
[1 0 0 0 0 0 0 ... 0]
[0 1 0 0 0 0 0 ... 0]
[0 0 0 1 1 0 0 ... 0]
[1 0 0 0 0 0 1 ... 0]
etc.
There are 200 categories. I haven't found a way to do this for multilabel classification. Any help would be appreciated.
Solution 1:[1]
Try using a combination of pd.get_dummies (designed for one-hot encoding), explode, and groupby(level=0):
out = pd.get_dummies(df['a'].explode()).groupby(level=0).sum()
Output:
>>> out
   a  b  d  e  g  x
0  1  0  0  0  0  0
1  0  1  0  0  0  0
2  0  0  1  1  0  0
3  1  0  0  0  1  1
Or a little shorter:
out = df['a'].str.join(',').str.get_dummies(',')
Output:
>>> out
   a  b  d  e  g  x
0  1  0  0  0  0  0
1  0  1  0  0  0  0
2  0  0  1  1  0  0
3  1  0  0  0  1  1
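One caveat with the join-based variant: it breaks if a label itself contains the separator. A minimal sketch, assuming made-up labels that contain commas, which joins on '|' instead (any character that never occurs in a label would do):

```python
import pandas as pd

# Hypothetical labels for illustration; one of them contains a comma,
# so ',' cannot be used as the separator here.
df = pd.DataFrame({"a": [["red, dark"], ["blue"], ["blue", "red, dark"]]})

# Join each list on a character that never appears inside a label,
# then split on that same character to build the indicator columns.
sep = "|"
out = df["a"].str.join(sep).str.get_dummies(sep)

print(out)
```

With ',' as the separator, "red, dark" would be split into two separate columns.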
Then store each row of the result as a list in a new column:
df['class'] = out.to_numpy().tolist()
Output:
>>> df
           a               class
0        [a]  [1, 0, 0, 0, 0, 0]
1        [b]  [0, 1, 0, 0, 0, 0]
2     [e, d]  [0, 0, 1, 1, 0, 0]
3  [a, g, x]  [1, 0, 0, 0, 1, 1]
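Note that get_dummies only creates columns for labels that actually occur in the data. If the full set of 200 categories is known up front, the result can be reindexed against it so every vector has the same length and order. A minimal sketch, assuming a hypothetical `all_categories` list (shortened here to seven entries) and the same toy frame as above:

```python
import pandas as pd

# Toy frame mirroring the question; the column name 'a' follows the answer.
df = pd.DataFrame({"a": [["a"], ["b"], ["e", "d"], ["a", "g", "x"]]})

# Hypothetical full label vocabulary; the real one would have about 200 entries.
# 'c' never occurs in the data, so get_dummies alone would not create it.
all_categories = ["a", "b", "c", "d", "e", "g", "x"]

# One row per (original index, label), one-hot encode, then sum per original row.
out = pd.get_dummies(df["a"].explode()).groupby(level=0).sum()

# Add all-zero columns for unseen labels and fix the column order.
out = out.reindex(columns=all_categories, fill_value=0).astype(int)

df["class"] = out.to_numpy().tolist()
print(df)
```

If a row can list the same label twice, sum() would produce a 2 in that position; replacing it with max() keeps the values at 0 or 1.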
Solution 2:[2]
Alternatively, use MultiLabelBinarizer from scikit-learn:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s = mlb.fit_transform(df['a'])
>>> s
array([[1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0],
       [1, 0, 0, 0, 1, 1]])
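The columns of the array follow mlb.classes_, and the encoding can be reversed with inverse_transform. A minimal sketch, assuming a hypothetical `all_categories` list passed through the `classes` parameter so that labels missing from the sample still get a column:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy frame mirroring the question.
df = pd.DataFrame({"a": [["a"], ["b"], ["e", "d"], ["a", "g", "x"]]})

# Hypothetical full label vocabulary; 'c' does not occur in the data.
all_categories = ["a", "b", "c", "d", "e", "g", "x"]

# Passing classes= fixes both the set and the order of the output columns.
mlb = MultiLabelBinarizer(classes=all_categories)
y = mlb.fit_transform(df["a"])

# Column i of y corresponds to mlb.classes_[i].
print(list(mlb.classes_))
print(y)

# Map the indicator rows back to tuples of labels.
decoded = mlb.inverse_transform(y)
print(decoded)
```

For a large label set, MultiLabelBinarizer(sparse_output=True) returns a SciPy sparse matrix instead of a dense array, which may save memory when most entries are zero.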
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | BENY |
