Avoiding data leakage when using BaggingClassifier (Regressor) with feature scaling (StandardScaler)
I am running bagging with LogisticRegression. Since the latter uses regularization, the features must be scaled. Because bagging draws a sample (with replacement) from the original training set, scaling should take place after that resampling: scaling the original data set and then drawing a sample amounts to data leakage. This is similar to how scaling is commonly misused with CV: it is wrong to scale the whole data set first and then feed it to CV.
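For the CV case the standard fix is to put the scaler inside a pipeline, so it is refit on the training folds of each split only. A minimal sketch with synthetic data, just to illustrate the analogy I have in mind:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leaky: the scaler sees the whole data set before CV splits it.
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# Leak-free: the scaler is refit on the training folds of each split.
clean_scores = cross_val_score(make_pipeline(StandardScaler(), LogisticRegression()), X, y, cv=5)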
It appears that there are no built-in tools to avoid data leakage with bagging (see the code below), but I may be wrong. Any help will be appreciated.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# c_values, skf, all_model_features and y_train are defined earlier in my script.
single_log_reg = LogisticRegression(solver="liblinear", random_state=np.random.RandomState(18))
bagged_logistic = BaggingClassifier(single_log_reg, n_estimators=100, random_state=np.random.RandomState(42))

# The scaler sits outside the bagging step, so it is fit on the full
# training folds before the bootstrap samples are drawn.
logit_bagged_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler(with_mean=False)),
    ('bagged_logit', bagged_logistic)
])

logit_bagged_grid = {'bagged_logit__base_estimator__C': c_values,
                     'bagged_logit__max_features': [100, 200, 400, 600, 800, 1000]}

logit_bagged_searcher = GridSearchCV(estimator=logit_bagged_pipeline, param_grid=logit_bagged_grid, cv=skf,
                                     scoring="roc_auc", n_jobs=6, verbose=4)
logit_bagged_searcher.fit(all_model_features, y_train)
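The only workaround I can think of is to make the base estimator itself a pipeline, so that BaggingClassifier clones it and (I believe) refits the scaler on each bootstrap sample. The sketch below shows what I mean, but I am not sure this is the intended usage or that it actually scales per sample as I expect:

from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The scaler lives inside the base estimator, so each bagged clone
# should fit its own StandardScaler on its own bootstrap sample.
scaled_log_reg = Pipeline(steps=[
    ('scaler', StandardScaler(with_mean=False)),
    ('logit', LogisticRegression(solver="liblinear"))
])

bagged_scaled_logit = BaggingClassifier(scaled_log_reg, n_estimators=100, random_state=42)

# If this works, the grid key would presumably become
# 'base_estimator__logit__C' rather than 'bagged_logit__base_estimator__C'.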