Cannot convert column values to binary values

I have a census dataset and I'm trying to replace the values of the "Income" column with either 1 or 0 (1 for ">50K", 0 for "<=50K"). My code is below, along with the error I get. I also tried "train['Income'].replace({0:'<=50K.',1:'>50K.'},inplace=True)", and that failed too. Does anybody have an idea how to solve this? Thanks!

Link to the data CSV file: https://github.com/amandawang-dev/census-training/blob/master/census-training.csv

Error I got:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-e590c8a3ce79> in <module>()
     20 
     21 #train['Gender'] = train['Gender'].str.contains('Male').astype(int)
---> 22 income_to_numeric(train)
     23 print(train['Income'])
     24 

<ipython-input-5-e590c8a3ce79> in income_to_numeric(x)
     17 def income_to_numeric(x):
     18     income = {'>50K.': 1,'<=50K.': 0}
---> 19     x.Income = [income[item] for item in x.Income]
     20 
     21 #train['Gender'] = train['Gender'].str.contains('Male').astype(int)

<ipython-input-5-e590c8a3ce79> in <listcomp>(.0)
     17 def income_to_numeric(x):
     18     income = {'>50K.': 1,'<=50K.': 0}
---> 19     x.Income = [income[item] for item in x.Income]
     20 
     21 #train['Gender'] = train['Gender'].str.contains('Male').astype(int)

KeyError: '<=50K'

The data columns look like this:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 10 columns):
Age              48842 non-null int64
EducationNum     48842 non-null int64
MaritalStatus    48842 non-null object
Occupation       48842 non-null object
Relationship     48842 non-null object
Race             48842 non-null object
Gender           48842 non-null object
Hours/Week       48842 non-null int64
Country          48842 non-null object
Income           48842 non-null object
dtypes: int64(3), object(7)
memory usage: 3.7+ MB
None

Code:

import pandas as pd
import numpy as np
from sklearn import preprocessing, metrics
from sklearn.model_selection import train_test_split


train=pd.read_csv('census-training.csv')
train = train.replace('?', np.nan)

for column in train.columns:
    train[column].fillna(train[column].mode()[0], inplace=True)
# My original code to binarize the gender values:
def gender_to_numeric(x):
    # TODO: return 1 if gender is Male, 0 otherwise
    gender = {'Male': 1, 'Female': 0}
    x.Gender = [gender[item] for item in x.Gender]
gender_to_numeric(train)  # this works, no error

def income_to_numeric(x):
    income = {'>50K.': 1,'<=50K.': 0} 
    x.Income = [income[item] for item in x.Income]

income_to_numeric(train)  # this raises the KeyError shown above


Solution 1:[1]

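The KeyError: '<=50K' means the Income column contains the label '<=50K' (without a trailing period), while the keys of your dictionary are '>50K.' and '<=50K.' (with one), so the lookup fails on the first label that does not match a key. Putting ints into a column that currently holds strings (dtype object) is not a problem in itself, because the assignment replaces the whole column. OneHotEncoder is the standard tool for encoding categorical features for machine learning, but by default it produces one column per category rather than a single 0/1 column.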
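To check which labels a dictionary cannot translate, it can help to compare the column's unique values with the dictionary keys. The snippet below is only a sketch: `income_col` is a small hypothetical stand-in for train['Income'], and the real file may contain other variants (for example, labels with leading spaces).

```python
import pandas as pd

# Hypothetical stand-in for train['Income']; the real column may differ.
income_col = pd.Series(['<=50K', '>50K', '<=50K.', '>50K.', '<=50K'])

# The dictionary from the question, whose keys all end in a period.
income = {'>50K.': 1, '<=50K.': 0}

# Labels that occur in the data but have no key in the dictionary.
unmapped = sorted(set(income_col.unique()) - set(income))
print(unmapped)
```

Any label listed in `unmapped` would raise a KeyError in the list comprehension from the question.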

One way to make a dummy variable column:

train['Gender'] = train.Gender.map({'Male': 1,'Female': 0})

Another way:

train['Gender'] = train['Gender'].str.contains('Male').astype(int)

train['Gender'].str.contains('Male') makes a column of True and False. astype(int) turns that Boolean into 1s and 0s.
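The same idea can be applied to the Income column once the labels are normalized. This is a minimal sketch, assuming the labels differ only by surrounding whitespace or a trailing period; the sample frame is hypothetical, so check the real values with train['Income'].unique() first.

```python
import pandas as pd

# Hypothetical sample; assumes labels differ only by whitespace or a trailing period.
train = pd.DataFrame({'Income': ['<=50K', '>50K.', ' <=50K.', '>50K']})

# Normalize the labels: trim whitespace, then drop any trailing period.
cleaned = train['Income'].str.strip().str.rstrip('.')

# map() yields NaN for labels missing from the dictionary instead of raising KeyError.
income_num = cleaned.map({'>50K': 1, '<=50K': 0})

# Fail loudly if any label was left unmapped, then store the result as integers.
assert income_num.notna().all(), cleaned[income_num.isna()].unique()
train['Income'] = income_num.astype(int)
print(train['Income'].tolist())
```

Under the same assumption, train['Income'].str.contains('>50K').astype(int) should give the same column without any cleaning step, since both spellings of the high-income label contain '>50K'.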

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1