Why should LabelEncoder from sklearn be used only for the target variable?
I was trying to create a pipeline with a LabelEncoder to transform categorical values.
```python
cat_variable = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('lencoder', LabelEncoder())
])

num_variable = SimpleImputer(strategy='mean')

preprocess = ColumnTransformer(transformers=[
    ('categorical', cat_variable, cat_columns),
    ('numerical', num_variable, num_columns)
])

model = RandomForestRegressor(n_estimators=100, random_state=0)

final_pipe = Pipeline(steps=[
    ('preprocessor', preprocess),
    ('model', model)
])

scores = -1 * cross_val_score(final_pipe, X_train, y,
                              cv=5, scoring='neg_mean_absolute_error')
```
But this is throwing a TypeError:
```text
TypeError: fit_transform() takes 2 positional arguments but 3 were given
```
On further reading, I found out that transformers like LabelEncoder are not supposed to be used on features and should only be used on the prediction target, as the documentation states:
> class sklearn.preprocessing.LabelEncoder
>
> Encode target labels with value between 0 and n_classes-1.
>
> This transformer should be used to encode target values, i.e. y, and not the input X.
My question is, why can we not use LabelEncoder on feature variables and are there any other transformers that have a condition like this?
Solution 1:[1]
LabelEncoder can be used to normalize labels or to transform non-numerical labels into numerical ones. For categorical input features you should use OneHotEncoder instead.
The difference:
```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

le = LabelEncoder()
le.fit_transform([1, 2, 2, 6])
# array([0, 1, 1, 2])

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit_transform([[1], [2], [2], [6]]).toarray()
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.]])
```
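To make this concrete for the pipeline in the question, here is a sketch with OneHotEncoder swapped in for LabelEncoder; the toy DataFrame, target values, and column lists are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for the question's X_train / y.
X_train = pd.DataFrame({
    'color': ['red', 'blue', np.nan, 'red'],
    'size': [1.0, np.nan, 3.0, 4.0],
})
y = [10.0, 12.0, 14.0, 16.0]
cat_columns, num_columns = ['color'], ['size']

cat_variable = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),  # instead of LabelEncoder
])
preprocess = ColumnTransformer(transformers=[
    ('categorical', cat_variable, cat_columns),
    ('numerical', SimpleImputer(strategy='mean'), num_columns),
])
final_pipe = Pipeline(steps=[
    ('preprocessor', preprocess),
    ('model', RandomForestRegressor(n_estimators=10, random_state=0)),
])
final_pipe.fit(X_train, y)  # no TypeError: OneHotEncoder accepts (X, y=None)
```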
Solution 2:[2]
LabelEncoder, by design, has to be used on the target variable and not on feature variables. This implies that the signatures of the .fit(), .transform() and .fit_transform() methods of the LabelEncoder class differ from those of transformers meant to be applied to features:
```text
fit(y)            vs  fit(X[, y])
transform(y)      vs  transform(X)
fit_transform(y)  vs  fit_transform(X[, y])
```

or, with explicit signatures,

```text
fit(self, y)            vs  fit(self, X, y=None)
transform(self, y)      vs  transform(self, X)
fit_transform(self, y)  vs  fit_transform(self, X, y=None)
```

respectively for LabelEncoder-like transformers (i.e. transformers to be applied to the target) and for transformers to be applied to features.
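This signature difference can be checked directly with inspect (a quick verification, not part of the original answer):

```python
import inspect
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# LabelEncoder.fit takes only y; OneHotEncoder.fit takes X (plus an ignored y).
print(inspect.signature(LabelEncoder.fit))   # (self, y)
print(inspect.signature(OneHotEncoder.fit))  # (self, X, y=None)
```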
This same design also holds for LabelBinarizer and MultiLabelBinarizer. I would suggest reading the Transforming the prediction target (y) paragraph of the User Guide.
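For completeness, here is a minimal sketch of the intended, target-side usage of these classes:

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

y = ['cat', 'dog', 'dog', 'bird']

le = LabelEncoder()
print(le.fit_transform(y))  # [1 2 2 0]; classes_ are sorted: ['bird' 'cat' 'dog']

lb = LabelBinarizer()
print(lb.fit_transform(y))  # one indicator column per class
```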
This said, here are a couple of considerations describing what happens when you try to use LabelEncoder in a Pipeline or in a ColumnTransformer:
Pipelines and ColumnTransformers are about transforming and fitting data, not targets. They somehow "assume" the target is already in a state that the estimator can use. Within this github issue and the ones referenced in it you can follow the long-standing discussion about making it possible to enable pipelines to transform the target, too. This is also summarized within this sklearn FAQ.
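For the regression setting in the question, scikit-learn does already provide TransformedTargetRegressor, which applies a transform to y at fit time and its inverse at predict time; a sketch with made-up data:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.exp(X.ravel() / 10.0)  # target living on a multiplicative scale

# The inner regressor is fit on log1p(y); predictions are mapped back via expm1.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)
preds = model.predict(X)  # already back on the original scale of y
```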
The specific reason you're getting

```text
TypeError: fit_transform() takes 2 positional arguments but 3 were given
```

is the following (seen here from the perspective of a ColumnTransformer): when calling either .fit_transform() or .fit() on the ColumnTransformer instance, the method ._fit_transform() is called in turn on X and y, and it triggers the call of ._fit_transform_one(), which is where the error arises. Indeed, it calls .fit_transform() on the transformer instance (your LabelEncoder), and here the different method signature comes into play:

```python
with _print_elapsed_time(message_clsname, message):
    if hasattr(transformer, "fit_transform"):
        res = transformer.fit_transform(X, y, **fit_params)
    else:
        res = transformer.fit(X, y, **fit_params).transform(X)
```

Indeed, .fit_transform() is called as (self, X, y) ("[...] 3 were given") while LabelEncoder expects (self, y) only ("[...] takes 2 positional arguments"). Following the code within the Pipeline class, it can be seen that the same happens there.

As already said, an alternative to label-encoding that can be applied to feature variables (and therefore used in pipelines and column transformers) is OrdinalEncoder (available from version 0.20). On this point, I would suggest reading Difference between OrdinalEncoder and LabelEncoder.
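A minimal sketch of OrdinalEncoder composing inside a Pipeline where LabelEncoder would raise the TypeError:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

X = np.array([['red'], ['blue'], ['red'], ['green']], dtype=object)

# OrdinalEncoder has the feature-transformer signature fit(X, y=None),
# so Pipeline can pass both X and y through it without error.
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder()),
])
print(pipe.fit_transform(X))
# categories are sorted (blue=0, green=1, red=2), so rows come out 2., 0., 2., 1.
```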
Solution 3:[3]
You can use OrdinalEncoder for categorical variables.
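A small sketch of that; the handle_unknown option shown here assumes scikit-learn >= 0.24:

```python
from sklearn.preprocessing import OrdinalEncoder

# Map categories unseen at fit time to -1 instead of raising at transform time.
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit([['low'], ['medium'], ['high']])
print(enc.transform([['medium'], ['unseen']]))
# categories_ are sorted (high=0, low=1, medium=2) -> [[2.], [-1.]]
```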
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Danylo Baibak |
| Solution 2 | |
| Solution 3 | richardec |
