TargetEncoder failing when the target variable is a string
I'm trying to encode a dataset containing a mixture of categorical and numeric features using the TargetEncoder from category_encoders.
The dataset is structured as follows:
>>> df.head()
The target column contains only two string values, "<=50K" and ">50K".
When encoding the dataset using the target encoder like so:
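import category_encoders as ce
feat = df.iloc[:, :-1]  # every column except the last is a feature
targ = df.iloc[:, -1]   # the last column is the target
targenc = ce.TargetEncoder(verbose=1, return_df=True)
dff = targenc.fit_transform(feat, targ)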
I got the following exception:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/.pyenv/versions/3.9.9/envs/ml-environment/lib/python3.9/site-packages/pandas/core/nanops.py in _ensure_numeric(x)
1602 try:
-> 1603 x = float(x)
1604 except (TypeError, ValueError):
ValueError: could not convert string to float: '<=50K<=50K<=50K<=50K<=50K<=50K<=50K>50K>50K>50K>50K>50K<=50K<=50K<=50K<=50K<=50K<=50K>50K>50K<=50K<=50K<=50K<=50K>50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K>50K<=50K<=50K<=50K<=50K<=50K>50K>50K<=50K>50K<=50K<=50K<=50K<=50K<=50K<=50K>50K<=50K<=50K<=50K>50K>50K<=50K<=50K>50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K>50K<=50K>50K<=50K<=50K>50K<=50K<=50K<=50K>50K<=50K>50K>50K<=50K<=50K>50K>50K<=50K<=50K<=50K>50K<=50K<=50K<=50K<=50K>50K>50K<=50K<=50K<=50K<=50K>50K<=50K<=50K<=50K<=50K<=50K>50K<=50K>50K<=50K>50K<=50K<=50K<=50K<=50K<=50K<=50K>50K<=50K<=50K<=50K>50K<=50K<=50K<=50K>50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K>50K<=50K<=50K<=50K<=50K<=50K<=50K<=50K>50K<=50K>50K<=50K<=...(this goes on for a while)
I then tried to replace the two target values with 1 and 0 and the program worked:
mapp = {
    '<=50K': 0,
    '>50K': 1
}
dff = targenc.fit_transform(feat, targ.map(lambda x: mapp[x]))
Why doesn't it work with strings? And why do I need to pass the target variable at all?
Solution 1:[1]
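If you pass the target as strings, it becomes a multiclass classification problem, since there could be any number of distinct strings to classify. Since this is a binary classification problem (greater than 50K, or less than or equal to 50K), the target must be given as binary numeric labels. The target is needed in the first place because target encoding replaces each category with a statistic of the target (typically its mean) over the rows in that category, and a mean can only be computed on numbers. That is why the traceback shows pandas failing to convert the concatenated strings to a float.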
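To make this concrete, here is a minimal pure-Python sketch of plain mean target encoding. It is not the category_encoders implementation (it ignores any smoothing the library applies), and the function name `mean_target_encode` and the toy data are hypothetical, chosen only for illustration. It shows why the target is required and why string labels fail while 0/1 labels work.

```python
from collections import defaultdict


def mean_target_encode(categories, targets):
    """Replace each category with the mean of the targets seen for it.

    Hypothetical helper for illustration; not the library's algorithm.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y  # raises TypeError when y is a string such as '<=50K'
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories]


# Toy data (an assumption, not taken from the question's dataset)
workclass = ["Private", "Private", "State-gov", "Private", "State-gov"]
income = ["<=50K", ">50K", "<=50K", ">50K", "<=50K"]

# String targets cannot be averaged, so the encoding fails
try:
    mean_target_encode(workclass, income)
    string_target_failed = False
except TypeError:
    string_target_failed = True

# Mapping the two labels to 0 and 1 makes the mean well defined
label_map = {"<=50K": 0, ">50K": 1}
numeric_income = [label_map[value] for value in income]
encoded = mean_target_encode(workclass, numeric_income)
```

With 0/1 labels, each encoded value is the fraction of rows in that category whose income is ">50K", which is the information the encoder needs the target for.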
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | GabrielBoehme |

