One-hot vs Grouping for Feature Engineering [closed]

Note: I don't have 10 rep yet, so I can't post images.

I'm working with the Adult Census dataset for some ML practice (the goal is to predict which observed people have an annual income greater than $50K/year) and have a question about feature engineering.

The dataset has a number of columns, 8 of which are categorical once education is dropped (the integer education.num already encodes it): workclass, marital.status, occupation, relationship, race, sex, native.country, and income.

As a first step, I changed income to 1 for >$50K/year and 0 for <=$50K/year.

data['income'] = data['income'].replace({'<=50K': 0, '>50K': 1})

However, when looking at the other variables, I need some guidance on how to approach them. For example, the 'workclass' column:

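import matplotlib.pyplot as plt
import seaborn as sns

# Bar height is the mean of the 0/1 income column, i.e. the share earning >50K
plt.figure(figsize=(15, 5))
sns.barplot(x=data['workclass'], y=data['income'])
plt.xlabel('Working Class')
plt.ylabel('Likelihood of income > 50K')
plt.show()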

[Image: bar plot of likelihood of income > 50K by workclass]

My first idea was to use one-hot encoding. However, columns like workclass, native.country, race, marital.status, and occupation are all unordered, and encoding every level of each would create nearly 100 columns.
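To make the column blow-up concrete, here is a minimal sketch of one-hot encoding with `pd.get_dummies`. The `toy` frame and its values are made up for illustration and only stand in for the real data; the point is that each categorical column contributes one new column per distinct level.

```python
import pandas as pd

# Toy stand-in for the Adult Census data (illustrative values, not real rows).
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp-inc"],
    "race": ["White", "Black", "White", "Asian-Pac-Islander"],
    "age": [39, 50, 38, 53],
})

categorical_cols = ["workclass", "race"]

# Distinct levels per column: one-hot encoding adds one column per level.
levels_per_col = toy[categorical_cols].nunique()

# Encode only the categorical columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=categorical_cols)

print(levels_per_col)
print(encoded.columns.tolist())
```

Running `data[categorical_cols].nunique().sum()` on the real frame would give the exact number of columns one-hot encoding would add, before deciding whether grouping is needed.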

My next idea was to manually group the values of each column based on the probability that a given value has an income >$50K, choosing the groups from plots like the one below.

[Image: bar plot of likelihood of income > 50K by marital.status]
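The same information the bar plots show can be read off numerically. The following is a rough sketch, assuming income has already been mapped to 0/1; the `toy` rows are invented for illustration. It computes the share of >50K earners per level along with how many rows back each estimate, which helps avoid grouping on levels that have very few observations.

```python
import pandas as pd

# Toy stand-in: income is already encoded as 0/1 (illustrative rows only).
toy = pd.DataFrame({
    "marital.status": [
        "Married-civ-spouse", "Never-married", "Married-civ-spouse",
        "Divorced", "Never-married", "Married-civ-spouse",
    ],
    "income": [1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column is the share of positives; size is the row count.
rate_by_level = (
    toy.groupby("marital.status")["income"]
    .agg(rate="mean", count="size")
    .sort_values("rate", ascending=False)
)

print(rate_by_level)
```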

Going by this, my inclination for each column would be

Column	Feature Engineering Decision
workclass	Drop
marital.status	Group (married + present = 1, not married/estranged = 0)
occupation	Group (white collar jobs (exec, prof, tech, sales) = 1, blue collar (all else) = 0)
race	Unsure; only 5 values, so could one-hot or group by white vs non-white?
relationship	Group (Husband or Wife = 1, no marital relationship = 0)
sex	One-hot, or Male = 1, Female = 0; unsure, need input
native.country	Many values; I think grouping by US vs non-US makes the most sense
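A few of the groupings in the table could be sketched as binary flags like this. The `toy` frame is illustrative, and the specific labels used (`Married-civ-spouse`, `Married-AF-spouse`, `United-States`, `Male`) are assumptions about how the values are spelled in the dataset, so they should be checked against `data[col].unique()` first. The names `is_married`, `is_us`, and `is_male` are hypothetical.

```python
import pandas as pd

# Toy stand-in for three of the categorical columns (illustrative rows only).
toy = pd.DataFrame({
    "marital.status": ["Married-civ-spouse", "Never-married", "Divorced"],
    "native.country": ["United-States", "Mexico", "United-States"],
    "sex": ["Male", "Female", "Female"],
})

# Assumed set of labels that count as "married + present"; adjust as needed.
married_present = ["Married-civ-spouse", "Married-AF-spouse"]

grouped = pd.DataFrame({
    # 1 if the label is in the chosen group, else 0
    "is_married": toy["marital.status"].isin(married_present).astype(int),
    # US vs non-US
    "is_us": (toy["native.country"] == "United-States").astype(int),
    # A two-level column needs only one 0/1 flag, not two one-hot columns
    "is_male": (toy["sex"] == "Male").astype(int),
})

print(grouped)
```

For a two-level column such as sex, a single 0/1 flag carries the same information as a pair of one-hot columns, so the two options listed in the table are equivalent in practice.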

Here is a link to the full Jupyter notebook with graphs for all categorical variables. Can you help me decide whether this is the right way to feature engineer the columns in my dataset?

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
