Passing `sample_weight` parameter to classifier in imblearn pipeline when using over/under sampling transformer
Context: I am using an imblearn Pipeline as follows:
import numpy as np
from sklearn import compose, preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline  # imblearn Pipeline, so a sampler step is allowed

# Synthetic Minority Over-sampling Technique for Nominal and Continuous features
features_cat_mask = np.in1d(self.X_features, self.X_features_cat)
self.imbalance_transformer = SMOTENC(categorical_features=features_cat_mask)

# Add binary column indicators for categorical features
self.column_transformer = compose.make_column_transformer(
    (preprocessing.OneHotEncoder(handle_unknown='ignore', sparse=False),
     self.X_features_cat),
    remainder='passthrough')

# Impute NaN values
simple_imputer = SimpleImputer(strategy='median')

model = RandomForestClassifier(n_jobs=-1,
                               criterion='entropy',
                               class_weight='balanced_subsample')

self.clf = Pipeline(steps=[("imbalance_transformer", self.imbalance_transformer),
                           ("column_transformer", self.column_transformer),
                           ("simple_imputer", simple_imputer),
                           ("classifier", model)])
Previously, before adding the imblearn SMOTENC step, I passed sample_weight with the following call:
self.clf.fit(self.X_train,
             self.y_train,
             classifier__sample_weight=self.sample_weight)
Here self.sample_weight is built from a column (named 'sample_weight') of the original dataframe that X_train and y_train are derived from.
However, since adding the imblearn sampler, the number of rows it outputs is NOT equal to the number of rows in the original dataframe that sample_weight comes from, and I get the following error: ValueError: sample_weight.shape == (1208,), expected (1830,)!
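To make the shape mismatch concrete, here is a minimal, self-contained sketch with made-up data (the dataframe df, the column names, and the sizes are hypothetical, not my real project code): the weight vector is aligned with the original rows, but SMOTENC appends synthetic minority rows, so the resampled X handed to the classifier is longer than the weights.
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTENC

# Toy stand-in for the real dataframe (hypothetical column names and sizes)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'cat': rng.integers(0, 3, size=100),               # nominal feature
    'num': rng.normal(size=100),                       # continuous feature
    'sample_weight': rng.uniform(0.5, 2.0, size=100),  # per-row weights
    'target': np.r_[np.ones(20, dtype=int), np.zeros(80, dtype=int)],
})

X = df[['cat', 'num']]
y = df['target']
sample_weight = df['sample_weight'].to_numpy()  # aligned with the ORIGINAL 100 rows

# SMOTENC appends synthetic minority rows, so its output is longer than its input
smote = SMOTENC(categorical_features=np.array([True, False]), random_state=0)
X_res, y_res = smote.fit_resample(X, y)

print(len(X), len(X_res), len(sample_weight))  # 100 160 100 -> weights no longer line up
Passing that sample_weight through classifier__sample_weight therefore fails, because the classifier sees the resampled rows but only the original number of weights.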
Question: What are some recommended techniques for passing sample_weight to the model when using an imblearn sampler that changes the number of rows passed to the RF model?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow