'sklearn.preprocessing.OneHotEncoder and the way to read it

I have been using one-hot encoding for a while now in all pre-processing data pipelines that I have had.

But I have run into an issue now that I am trying to pre-process new data automatically with flask server running a model.

TLDR of what I am trying to do is to search new data for a specific Date, region and type and run a .predict on it.

The problem arises as after I search for a specific data point I have to change the columns from objects to the one-hot encoded ones.

My question is, how do I know which column is for which category inside a feature? As I have around 240 columns after one hot encoding.



Solution 1:[1]

IIUC, use get_feature_names_out():

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 1, 0],
                   'C': [0, 2, 2], 'D': [0, 1, 1]})

ohe = OneHotEncoder()
data = ohe.fit_transform(df)
df1 = pd.DataFrame(data.toarray(), columns=ohe.get_feature_names_out(), dtype=int)

Output:

>>> df
   A  B  C  D
0  0  3  0  0
1  1  1  2  1
2  2  0  2  1


>>> df1
   A_0  A_1  A_2  B_0  B_1  B_3  C_0  C_2  D_0  D_1
0    1    0    0    0    0    1    1    0    1    0
1    0    1    0    0    1    0    0    1    0    1
2    0    0    1    1    0    0    0    1    0    1

>>> pd.Series(ohe.get_feature_names_out()).str.rsplit('_', 1).str[0]
0    A
1    A
2    A
3    B
4    B
5    B
6    C
7    C
8    D
9    D
dtype: object

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Corralien