Why should LabelEncoder from sklearn be used only for the target variable?
I was trying to create a pipeline with a LabelEncoder to transform categorical values.
```python
cat_variable = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('lencoder', LabelEncoder())
])

num_variable = SimpleImputer(strategy='mean')

preprocess = ColumnTransformer(transformers=[
    ('categorical', cat_variable, cat_columns),
    ('numerical', num_variable, num_columns)
])

model = RandomForestRegressor(n_estimators=100, random_state=0)

final_pipe = Pipeline(steps=[
    ('preprocessor', preprocess),
    ('model', model)
])

scores = -1 * cross_val_score(final_pipe, X_train, y,
                              cv=5, scoring='neg_mean_absolute_error')
```
But this is throwing a TypeError:
```text
TypeError: fit_transform() takes 2 positional arguments but 3 were given
```
On further reading, I found out that transformers like LabelEncoder are not supposed to be used on features and should only be used on the prediction target, as the documentation states:
> class sklearn.preprocessing.LabelEncoder
>
> Encode target labels with value between 0 and n_classes-1.
>
> This transformer should be used to encode target values, i.e. y, and not the input X.
My question is, why can we not use LabelEncoder on feature variables and are there any other transformers that have a condition like this?
Solution 1:[1]
LabelEncoder can be used to normalize labels or to transform non-numerical labels into numerical ones. For categorical input features you should use OneHotEncoder instead.
The difference:
```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

le = LabelEncoder()
le.fit_transform([1, 2, 2, 6])
# array([0, 1, 1, 2])

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit_transform([[1], [2], [2], [6]]).toarray()
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.]])
```
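To make this concrete for the pipeline in the question, here is a sketch with OneHotEncoder swapped in for LabelEncoder; the toy DataFrame, target values, and column lists are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for the question's X_train / y.
X_train = pd.DataFrame({
    'color': ['red', 'blue', np.nan, 'red'],
    'size': [1.0, np.nan, 3.0, 4.0],
})
y = [10.0, 12.0, 14.0, 16.0]
cat_columns, num_columns = ['color'], ['size']

cat_variable = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),  # instead of LabelEncoder
])
preprocess = ColumnTransformer(transformers=[
    ('categorical', cat_variable, cat_columns),
    ('numerical', SimpleImputer(strategy='mean'), num_columns),
])
final_pipe = Pipeline(steps=[
    ('preprocessor', preprocess),
    ('model', RandomForestRegressor(n_estimators=10, random_state=0)),
])
final_pipe.fit(X_train, y)  # no TypeError: OneHotEncoder accepts (X, y=None)
```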
Solution 2:[2]
LabelEncoder, by design, has to be used on the target variable and not on feature variables. This implies that the signatures of the .fit(), .transform() and .fit_transform() methods of the LabelEncoder class differ from those of transformers meant to be applied to features:
```text
fit(y)            vs  fit(X[, y])
transform(y)      vs  transform(X)
fit_transform(y)  vs  fit_transform(X[, y])
```

or, with explicit signatures,

```text
fit(self, y)            vs  fit(self, X, y=None)
transform(self, y)      vs  transform(self, X)
fit_transform(self, y)  vs  fit_transform(self, X, y=None)
```

respectively for LabelEncoder-like transformers (i.e. transformers to be applied to the target) and for transformers to be applied to features.
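This signature difference can be checked directly with inspect (a quick verification, not part of the original answer):

```python
import inspect
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# LabelEncoder.fit takes only y; OneHotEncoder.fit takes X (plus an ignored y).
print(inspect.signature(LabelEncoder.fit))   # (self, y)
print(inspect.signature(OneHotEncoder.fit))  # (self, X, y=None)
```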
This same design also holds for LabelBinarizer and MultiLabelBinarizer. I would suggest reading the Transforming the prediction target (y) paragraph of the User Guide.
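For completeness, here is a minimal sketch of the intended, target-side usage of these classes:

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

y = ['cat', 'dog', 'dog', 'bird']

le = LabelEncoder()
print(le.fit_transform(y))  # [1 2 2 0]; classes_ are sorted: ['bird' 'cat' 'dog']

lb = LabelBinarizer()
print(lb.fit_transform(y))  # one indicator column per class
```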
This said, here are a couple of considerations describing what happens when you try to use LabelEncoder in a Pipeline or in a ColumnTransformer:
Pipelines and ColumnTransformers are about transforming and fitting data, not targets. They somehow "assume" the target is already in a state that the estimator can use. Within this github issue and the ones referenced in it you can follow the long-standing discussion about making it possible to enable pipelines to transform the target, too. This is also summarized within this sklearn FAQ.
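For the regression setting in the question, scikit-learn does already provide TransformedTargetRegressor, which applies a transform to y at fit time and its inverse at predict time; a sketch with made-up data:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.exp(X.ravel() / 10.0)  # target living on a multiplicative scale

# The inner regressor is fit on log1p(y); predictions are mapped back via expm1.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)
preds = model.predict(X)  # already back on the original scale of y
```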
The specific reason you're getting

```text
TypeError: fit_transform() takes 2 positional arguments but 3 were given
```

is the following (seen here from the perspective of a ColumnTransformer): when calling either .fit_transform() or .fit() on the ColumnTransformer instance, the method ._fit_transform() is called in turn on X and y, and it triggers the call of ._fit_transform_one(), which is where the error arises. Indeed, it calls .fit_transform() on the transformer instance (your LabelEncoder), and here the different method signature comes into play:

```python
with _print_elapsed_time(message_clsname, message):
    if hasattr(transformer, "fit_transform"):
        res = transformer.fit_transform(X, y, **fit_params)
    else:
        res = transformer.fit(X, y, **fit_params).transform(X)
```

Indeed, .fit_transform() is called as (self, X, y) ("[...] 3 were given") while LabelEncoder expects (self, y) only ("[...] takes 2 positional arguments"). Following the code within the Pipeline class, it can be seen that the same happens there.

As already said, an alternative to label-encoding that can be applied to feature variables (and therefore used in pipelines and column transformers) is OrdinalEncoder (available from version 0.20). On this point, I would suggest reading Difference between OrdinalEncoder and LabelEncoder.
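A minimal sketch of OrdinalEncoder composing inside a Pipeline where LabelEncoder would raise the TypeError:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

X = np.array([['red'], ['blue'], ['red'], ['green']], dtype=object)

# OrdinalEncoder has the feature-transformer signature fit(X, y=None),
# so Pipeline can pass both X and y through it without error.
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder()),
])
print(pipe.fit_transform(X))
# categories are sorted (blue=0, green=1, red=2), so rows come out 2., 0., 2., 1.
```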
Solution 3:[3]
You can use OrdinalEncoder for categorical variables.
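A small sketch of that; the handle_unknown option shown here assumes scikit-learn >= 0.24:

```python
from sklearn.preprocessing import OrdinalEncoder

# Map categories unseen at fit time to -1 instead of raising at transform time.
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit([['low'], ['medium'], ['high']])
print(enc.transform([['medium'], ['unseen']]))
# categories_ are sorted (high=0, low=1, medium=2) -> [[2.], [-1.]]
```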
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Danylo Baibak |
| Solution 2 | |
| Solution 3 | richardec |
