TargetEncoder fails when the target variable is a string

I'm trying to encode a dataset containing a mixture of categorical and numeric features using the TargetEncoder from category_encoders.

The dataset has the following structure:

>>> df.head()

[Image: output of df.head(); not available]

The target column contains only two distinct strings: "<=50K" and ">50K".

When encoding the dataset using the target encoder like so:

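import category_encoders as ce

feat = df.iloc[:, :-1]  # all columns except the last are features
targ = df.iloc[:, -1]   # the last column is the target
targenc = ce.TargetEncoder(verbose=1, return_df=True)
dff = targenc.fit_transform(feat, targ)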

I got the following exception:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/.pyenv/versions/3.9.9/envs/ml-environment/lib/python3.9/site-packages/pandas/core/nanops.py in _ensure_numeric(x)
   1602         try:
-> 1603             x = float(x)
   1604         except (TypeError, ValueError):

ValueError: could not convert string to float: '<=50K<=50K<=50K<=50K<=50K<=50K<=50K>50K>50K>50K>50K>50K<=50K<=50K<=50K<=50K<=50K<=50K>50K>50K<=50K<=50K<=50K<=50K>50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K>50K<=50K<=50K<=50K<=50K<=50K>50K>50K<=50K>50K<=50K<=50K<=50K<=50K<=50K<=50K>50K<=50K<=50K<=50K>50K>50K<=50K<=50K>50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K>50K<=50K>50K<=50K<=50K>50K<=50K<=50K<=50K>50K<=50K>50K>50K<=50K<=50K>50K>50K<=50K<=50K<=50K>50K<=50K<=50K<=50K<=50K>50K>50K<=50K<=50K<=50K<=50K>50K<=50K<=50K<=50K<=50K<=50K>50K<=50K>50K<=50K>50K<=50K<=50K<=50K<=50K<=50K<=50K>50K<=50K<=50K<=50K>50K<=50K<=50K<=50K>50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K>50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K>50K<=50K>50K<=50K<=...(this goes on for a while)

I then tried to replace the two target values with 1 and 0 and the program worked:

mapp = {
    '<=50K': 0,
    '>50K': 1
}

dff = targenc.fit_transform(feat, targ.map(mapp))

Why doesn't it work with strings? And why do I need to pass the target variable at all?



Solution 1:[1]

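If you pass the target as strings, the encoder cannot tell that the problem is binary: a string column could contain any number of distinct labels, which would make it a multiclass problem. Since this is a binary classification problem (`<=50K` versus `>50K`), the target must be given as binary numeric labels (0 and 1).

The target has to be passed at all because target encoding is supervised: each category is replaced by a (smoothed) mean of the target values observed for that category, and a mean cannot be computed over strings.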
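To make the role of the target concrete, here is a minimal sketch of the idea behind target encoding in plain Python. It is not the category_encoders implementation (that one also applies smoothing toward the global mean); the function `target_encode` and the toy `workclass` and `income` lists are hypothetical names made up for illustration.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the plain mean of the targets seen for it.

    Illustrative only: no smoothing and no handling of unseen categories.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y  # raises TypeError if y is a string such as '<=50K'
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories]


# Hypothetical toy data, loosely modelled on the question's dataset.
workclass = ["Private", "Private", "State-gov", "Private", "State-gov"]
income = ["<=50K", ">50K", "<=50K", ">50K", ">50K"]

# Map the two string labels to 0/1 first, as in the question's workaround.
label_map = {"<=50K": 0, ">50K": 1}
numeric_income = [label_map[v] for v in income]

encoded = target_encode(workclass, numeric_income)
print(encoded)
```

Each encoded value is the fraction of rows in that category with the label `>50K`, which is why the encoder needs a numeric target. Calling `target_encode(workclass, income)` with the raw strings fails in this sketch because the labels cannot be summed.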

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 GabrielBoehme