How to reduce the number of columns after one-hot encoding
I am working with a dataset in which a categorical column (Nationalities) has to be converted to a numeric form, because I need to apply a couple of ML techniques to it. I used one-hot encoding (OHE) for the conversion, but it produces a total of 227 columns. Is there a way to reduce the number of columns obtained after one-hot encoding? Thanks.
Solution 1:[1]
You can use pd.factorize, which encodes each distinct category as an integer code in a single column:
df['Nationalities_numeric'] = pd.factorize(df['Nationalities'])[0]
print(df)
# Output
  Nationalities  Nationalities_numeric
0        France                      0
1         Spain                      1
2        Italia                      2
3        France                      0
4        Italia                      2
5       Germany                      3
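pd.factorize returns the unique labels alongside the integer codes, which helps when the codes need to be mapped back to names later. The following is a minimal sketch on the same toy column as in the setup; the variable names codes, uniques and recovered are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Nationalities': ['France', 'Spain', 'Italia',
                                     'France', 'Italia', 'Germany']})

# factorize returns a pair: integer codes and the unique labels,
# numbered in order of first appearance
codes, uniques = pd.factorize(df['Nationalities'])
df['Nationalities_numeric'] = codes

# Each code is a position in `uniques`, so indexing recovers the labels
recovered = list(uniques[codes])

print(list(codes))    # [0, 1, 2, 0, 2, 3]
print(list(uniques))  # ['France', 'Spain', 'Italia', 'Germany']
```

Note that the codes are arbitrary identifiers: Spain (1) is not "less than" Italia (2) in any meaningful sense. Tree-based models usually cope with this, while models that treat the column as a magnitude may read an order into it that is not there.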
Instead of pd.get_dummies, which creates one column per distinct category:
df = df.join(pd.get_dummies(df['Nationalities'], dtype=int))
print(df)
# Output
  Nationalities  France  Germany  Italia  Spain
0        France       1        0       0      0
1         Spain       0        0       0      1
2        Italia       0        0       1      0
3        France       1        0       0      0
4        Italia       0        0       1      0
5       Germany       0        1       0      0
Setup:
import pandas as pd

df = pd.DataFrame({'Nationalities': ['France', 'Spain', 'Italia',
                                     'France', 'Italia', 'Germany']})
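To make the difference in width concrete, the two encodings can be compared side by side. This is a small sketch on the toy data above; the names one_hot and label_codes are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Nationalities': ['France', 'Spain', 'Italia',
                                     'France', 'Italia', 'Germany']})

n_categories = df['Nationalities'].nunique()

# One-hot encoding: one column per distinct category
one_hot = pd.get_dummies(df['Nationalities'], dtype=int)

# Integer encoding: a single 1-D array of codes, whatever the category count
label_codes = pd.factorize(df['Nationalities'])[0]

print(n_categories)       # 4
print(one_hot.shape)      # (6, 4)
print(label_codes.shape)  # (6,)
```

With the 227 distinct nationalities from the question, the one-hot frame would have 227 columns, while the factorized version would still be a single column.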
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Corralien |
