Missing categorical data should be encoded with an all-zero one-hot vector
I am working on a machine learning project with very sparsely labeled data. There are several categorical features, with roughly one hundred distinct classes across them.
For example:
0 red
1 blue
2 <missing>
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

color_cat = pd.DataFrame(['red', 'blue', np.nan])
color_enc = OneHotEncoder(sparse=True, handle_unknown='ignore')
color_one_hot = color_enc.fit_transform(color_cat)
After I put these through scikit-learn's OneHotEncoder, I expect the missing value to be encoded as 00, since the docs state that handle_unknown='ignore' causes the encoder to return an all-zero array for categories not seen during fit. Substituting another value, for example with [SimpleImputer][1], is not an option for me.
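For reference, the documented handle_unknown='ignore' behavior does produce an all-zero row, but only for values that were absent at fit time. A minimal sketch (the category values here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on known categories only.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['red'], ['blue']])

# A value unseen at fit time is encoded as an all-zero row.
unknown_row = enc.transform([['green']]).toarray()
print(unknown_row)  # [[0. 0.]]
```

The question is therefore how to make the encoder treat NaN as such an unseen value instead of as a category of its own.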
What I expect:
0 10
1 01
2 00
Instead, OneHotEncoder treats the missing value as another category.
What I get:
0 100
1 010
2 001
I have seen the related question How to handle missing values (NaN) in categorical data when using scikit-learn OneHotEncoder?, but the solutions there do not work for me: I explicitly require a zero vector.
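One workaround, a sketch rather than a definitive answer: fit the encoder on the non-missing rows only, then replace NaN with a sentinel string (the name `'__missing__'` is an arbitrary choice here) before transforming. Since the sentinel was never seen at fit time, handle_unknown='ignore' maps it to an all-zero row:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'blue', np.nan]})

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(df.dropna())  # learn categories from non-missing rows only

# '__missing__' is a hypothetical sentinel absent from the fitted categories,
# so handle_unknown='ignore' encodes it as an all-zero row.
one_hot = enc.transform(df.fillna('__missing__')).toarray()
print(one_hot)
# [[0. 1.]
#  [1. 0.]
#  [0. 0.]]
```

Columns come out in sorted category order ('blue', 'red'), so the red row is 01, the blue row is 10, and the missing row is 00, as required.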
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
