One-hot vs Grouping for Feature Engineering [closed]
Note: I don't have 10 rep yet, so I can't post images.
I'm working with the Adult Census dataset (the goal is to predict which people have an annual income greater than $50K/year) for some ML practice, and I have a question about feature engineering.
Of the dataset's columns, 8 are categorical: workclass, education (dropped because the integer education.num exists), marital.status, occupation, relationship, race, sex, native.country, and income.
In my analysis, I first recoded income as 1 for >$50K/year and 0 for <=$50K/year:
data['income'] = data['income'].replace({'<=50K': 0, '>50K': 1})
However, I need some guidance on how to approach the other variables. Take the 'workclass' column, for example:
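import matplotlib.pyplot as plt
import seaborn as sns

# The bar height is the mean of the 0/1 income column, i.e. the share earning >50K
plt.figure(figsize=(15, 5))
sns.barplot(x=data['workclass'], y=data['income'])
plt.xlabel('Working Class')
plt.ylabel('Likelihood of income > 50K')
plt.show()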
My first idea was to use one-hot encoding. However, workclass, native.country, race, marital.status, and occupation are all unordered, and one-hot encoding all of them would create nearly 100 columns.
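A minimal sketch of how the one-hot column count could be checked before committing to it, assuming pandas. The frame `toy` and its rows are made up for illustration and are not the real Adult data; on the real data the same two calls would be applied to `data`.

```python
import pandas as pd

# Toy stand-in for the Adult data; the rows are illustrative only.
toy = pd.DataFrame({
    "workclass": ["Private", "Self-emp-inc", "Private", "State-gov"],
    "race": ["White", "Black", "White", "Asian-Pac-Islander"],
    "sex": ["Male", "Female", "Female", "Male"],
    "age": [39, 50, 38, 53],
})

cat_cols = ["workclass", "race", "sex"]

# Each categorical column contributes one dummy column per distinct value.
expected_new_cols = int(toy[cat_cols].nunique().sum())

# get_dummies replaces the listed columns with their dummy columns.
encoded = pd.get_dummies(toy, columns=cat_cols)

print(expected_new_cols)  # number of dummy columns
print(encoded.shape[1])   # dummy columns plus the untouched 'age' column
```

Passing `drop_first=True` to `pd.get_dummies` drops one dummy per categorical column, which slightly reduces the count and avoids perfectly collinear columns in linear models.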
My next idea was to manually group the values in each column based on the probability of that value having an income >$50K, judged from plots like the workclass bar plot.
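The numbers behind such a plot can also be tabulated directly, which may make the grouping less of an eyeball judgment. This is a rough sketch assuming pandas and a 0/1 `income` column as created earlier; the frame `toy` is made-up illustration data, and the `rate` and `n` column names are my own.

```python
import pandas as pd

# Toy stand-in: 'income' is already recoded to 0/1.
toy = pd.DataFrame({
    "workclass": ["Private", "Private", "Self-emp-inc",
                  "Self-emp-inc", "State-gov", "Private"],
    "income": [0, 1, 1, 1, 0, 0],
})

# Share of rows with income == 1 per category, plus the group size,
# so that rates based on very few rows are easy to spot.
summary = (
    toy.groupby("workclass")["income"]
       .agg(rate="mean", n="size")
       .sort_values("rate", ascending=False)
)

print(summary)
```

Looking at `n` alongside `rate` matters because a category with only a handful of rows can show an extreme rate by chance.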
Going by this, my inclination for each column would be:
| Column | Feature Engineering Decision |
|---|---|
| workclass | Drop |
| marital.status | Group (married+present = 1, not married/estranged = 0) |
| occupation | Group (white collar jobs (exec, prof, tech, sales) = 1, blue collar (all else) = 0) |
| race | Unsure; only 5 values, so could one-hot or group by white vs non-white? |
| relationship | Group (Husband or Wife = 1, no marital relationship = 0) |
| sex | One-hot, or Male = 1, Female = 0; unsure, need input |
| native.country | Many values; I think grouping by US vs non-US makes the most sense |
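For concreteness, a few of the binary groupings in the table could be sketched as below, assuming pandas. The frame `toy` is made-up illustration data, the output column names (`is_married`, `is_us`, `is_male`) are hypothetical, and the set of labels counted as "married+present" is my own reading of the table rather than a settled choice.

```python
import pandas as pd

# Toy stand-in for three of the categorical columns; rows are illustrative.
toy = pd.DataFrame({
    "marital.status": ["Married-civ-spouse", "Never-married", "Divorced"],
    "native.country": ["United-States", "Mexico", "United-States"],
    "sex": ["Male", "Female", "Male"],
})

# Assumption: these labels are the ones treated as "married + present".
married_present = {"Married-civ-spouse", "Married-AF-spouse"}

features = pd.DataFrame({
    # 1 if the label is in the chosen set, else 0
    "is_married": toy["marital.status"].isin(married_present).astype(int),
    # 1 for US, 0 for every other country
    "is_us": (toy["native.country"] == "United-States").astype(int),
    # A two-valued column needs only a single 0/1 flag
    "is_male": (toy["sex"] == "Male").astype(int),
})

print(features)
```

For a two-valued column such as sex, a single 0/1 flag carries the same information as a two-column one-hot encoding, so the two options in the table are equivalent there.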
Here is a link to the full Jupyter notebook with graphs for all categorical variables. So, can you help me decide whether this is the right way to engineer the features in my dataset?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
