Cannot convert column values to binary values
I have this census dataset and I'm trying to replace the values of the "Income" column with either 1 or 0 (1 for ">50K", 0 for "<=50K"). My code is below, but I get the error shown. I also tried "train['Income'].replace({0:'<=50K.',1:'>50K.'},inplace=True)" and that failed too. Does anybody have an idea how to solve this? Thanks!
Link to the data CSV file: https://github.com/amandawang-dev/census-training/blob/master/census-training.csv
Error I got:
KeyError                                  Traceback (most recent call last)
<ipython-input-5-e590c8a3ce79> in <module>()
     20
     21 #train['Gender'] = train['Gender'].str.contains('Male').astype(int)
---> 22 income_to_numeric(train)
     23 print(train['Income'])
     24
<ipython-input-5-e590c8a3ce79> in income_to_numeric(x)
     17 def income_to_numeric(x):
     18     income = {'>50K.': 1,'<=50K.': 0}
---> 19     x.Income = [income[item] for item in x.Income]
     20
     21 #train['Gender'] = train['Gender'].str.contains('Male').astype(int)
<ipython-input-5-e590c8a3ce79> in <listcomp>(.0)
     17 def income_to_numeric(x):
     18     income = {'>50K.': 1,'<=50K.': 0}
---> 19     x.Income = [income[item] for item in x.Income]
     20
     21 #train['Gender'] = train['Gender'].str.contains('Male').astype(int)
KeyError: '<=50K'
The data columns look like this:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 10 columns):
Age 48842 non-null int64
EducationNum 48842 non-null int64
MaritalStatus 48842 non-null object
Occupation 48842 non-null object
Relationship 48842 non-null object
Race 48842 non-null object
Gender 48842 non-null object
Hours/Week 48842 non-null int64
Country 48842 non-null object
Income 48842 non-null object
dtypes: int64(3), object(7)
memory usage: 3.7+ MB
None
Code:
import pandas as pd
import numpy as np
from sklearn import preprocessing, metrics
from sklearn.model_selection import train_test_split

train = pd.read_csv('census-training.csv')

# Treat '?' as missing, then fill each column with its most frequent value
train = train.replace('?', np.nan)
for column in train.columns:
    train[column].fillna(train[column].mode()[0], inplace=True)

# My original code to binarize the gender values:
def gender_to_numeric(x):
    # TODO: return 1 if gender is Male, 0 otherwise
    gender = {'Male': 1, 'Female': 0}
    x.Gender = [gender[item] for item in x.Gender]

gender_to_numeric(train)  # this works, no error

def income_to_numeric(x):
    income = {'>50K.': 1, '<=50K.': 0}
    x.Income = [income[item] for item in x.Income]

income_to_numeric(train)  # this raises the KeyError shown above
Solution 1:[1]
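The KeyError: '<=50K' in the traceback means the Income column contains the value '<=50K' (no trailing period), while the keys in your income dictionary are '>50K.' and '<=50K.' (with a trailing period), so the lookup fails. Putting ints into a column that currently holds strings (dtype object) is not the problem in itself: assigning a whole new list replaces the column, which is why gender_to_numeric works. OneHotEncoder is standard for machine learning, but it has a different output (by default, one column per category).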
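To see which labels the column actually holds before building a lookup dictionary, list its distinct values. The sketch below uses a small made-up frame standing in for the CSV; the mix of labels with and without a trailing period, and the stray space, are assumptions suggested by the KeyError, not something confirmed from the file.

```python
import pandas as pd

# Hypothetical stand-in for the census data; the real file may differ.
train = pd.DataFrame({"Income": ["<=50K", ">50K", "<=50K.", ">50K.", " <=50K"]})

# Distinct labels and their counts; stray spaces or periods show up here.
print(train["Income"].value_counts())

# The dictionary from the question.
income = {">50K.": 1, "<=50K.": 0}

# Labels present in the column but missing from the dictionary keys.
# Any entry in this list would raise KeyError in the list comprehension.
missing = sorted(set(train["Income"]) - set(income))
print(missing)
```

If the list printed at the end is not empty, the dictionary needs more keys or the labels need cleaning first.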
One way to make a dummy variable column:
train['Gender'] = train.Gender.map({'Male': 1,'Female': 0})
Another way:
train['Gender'] = train['Gender'].str.contains('Male').astype(int)
train['Gender'].str.contains('Male') makes a column of True and False values; astype(int) turns those Booleans into 1s and 0s.
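The same approach carries over to the Income column. This is a minimal sketch, assuming the labels may carry stray whitespace or a trailing period (an assumption, not confirmed from the CSV); the toy frame is hypothetical and stands in for the real data.

```python
import pandas as pd

# Hypothetical sample; the real column may contain other variants.
train = pd.DataFrame({"Income": ["<=50K", ">50K", "<=50K.", ">50K.", " >50K"]})

# Normalise the labels: strip whitespace, then drop any trailing period.
labels = train["Income"].str.strip().str.rstrip(".")

# map() gives NaN for labels missing from the dict instead of raising KeyError.
encoded = labels.map({">50K": 1, "<=50K": 0})

# Fail loudly if any label was not covered by the dictionary.
assert encoded.notna().all(), labels[encoded.isna()].unique()

train["Income"] = encoded.astype(int)
print(train["Income"].tolist())
```

replace can do the same job, but its dictionary must go from the existing strings to the new codes. The attempt in the question, {0: '<=50K.', 1: '>50K.'}, is the other way round, so it searches the column for 0 and 1 and changes nothing.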
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |