Missing categorical data should be encoded with an all-zero one-hot vector
I am working on a machine learning project with very sparsely labeled data. There are several categorical features, with roughly one hundred distinct classes across them.
For example:
0 red
1 blue
2 <missing>
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

color_cat = pd.DataFrame(['red', 'blue', np.nan])
color_enc = OneHotEncoder(sparse=True, handle_unknown='ignore')
color_one_hot = color_enc.fit_transform(color_cat)
After I put these through scikit-learn's OneHotEncoder, I expect the missing value to be encoded as 00, since the docs state that handle_unknown='ignore' causes the encoder to return an all-zero array for categories not seen during fit. Substituting another value, for example with [SimpleImputer][1], is not an option for me.
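For reference, the documented handle_unknown='ignore' behavior does produce an all-zero row, but only for values that were absent at fit time. A minimal sketch (the category values here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on known categories only.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['red'], ['blue']])

# A value unseen at fit time is encoded as an all-zero row.
unknown_row = enc.transform([['green']]).toarray()
print(unknown_row)  # [[0. 0.]]
```

The question is therefore how to make the encoder treat NaN as such an unseen value instead of as a category of its own.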
What I expect:
0 10
1 01
2 00
Instead, OneHotEncoder treats the missing value as another category.
What I get:
0 100
1 010
2 001
I have seen the related question How to handle missing values (NaN) in categorical data when using scikit-learn OneHotEncoder?, but the solutions there do not work for me: I explicitly require a zero vector.
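One workaround, a sketch rather than a definitive answer: fit the encoder on the non-missing rows only, then replace NaN with a sentinel string (the name `'__missing__'` is an arbitrary choice here) before transforming. Since the sentinel was never seen at fit time, handle_unknown='ignore' maps it to an all-zero row:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'blue', np.nan]})

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(df.dropna())  # learn categories from non-missing rows only

# '__missing__' is a hypothetical sentinel absent from the fitted categories,
# so handle_unknown='ignore' encodes it as an all-zero row.
one_hot = enc.transform(df.fillna('__missing__')).toarray()
print(one_hot)
# [[0. 1.]
#  [1. 0.]
#  [0. 0.]]
```

Columns come out in sorted category order ('blue', 'red'), so the red row is 01, the blue row is 10, and the missing row is 00, as required.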
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
