Passing the `sample_weight` parameter to a classifier in an imblearn pipeline when using an over/under-sampling transformer

Context: I am using an imblearn Pipeline as follows:

        # Synthetic Minority Over-sampling Technique for Nominal and Continuous features
        features_cat_mask = np.in1d(self.X_features, self.X_features_cat)
        self.imbalance_transformer = SMOTENC(categorical_features=features_cat_mask)

        # Add binary column indicators for categorical features
        self.column_transformer = compose.make_column_transformer(
            (preprocessing.OneHotEncoder(handle_unknown='ignore',
                                         sparse=False), self.X_features_cat),
            remainder='passthrough')

        # Impute NaN values
        simple_imputer = SimpleImputer(strategy='median')

        model = RandomForestClassifier(n_jobs=-1,
                                       criterion='entropy',
                                       class_weight='balanced_subsample')

        self.clf = Pipeline(steps=[("imbalance_transformer", self.imbalance_transformer),
                                   ("column_transformer", self.column_transformer),
                                   ("simple_imputer", simple_imputer),
                                   ("classifier", model)])

Before I added imblearn's SMOTENC, I passed sample_weight to the classifier like this:

        self.clf.fit(self.X_train,
                     self.y_train,
                     classifier__sample_weight=self.sample_weight)

Here, self.sample_weight comes from a column (named 'sample_weight') of the original dataframe that X_train and y_train are built from.

However, now that the pipeline includes SMOTENC, the number of rows it outputs is NOT equal to the number of rows in the original dataframe that sample_weight comes from, and I get the following error: ValueError: sample_weight.shape == (1208,), expected (1830,)!

Question: What are some recommended techniques for passing sample_weight to the model when using an imblearn transformer that changes the number of rows in the dataframe passed to the RF model?
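One possible workaround, sketched here under stated assumptions rather than as an established imblearn recipe, is to take the sampler out of the pipeline and carry the weight through it as an extra continuous column, so every row the sampler emits (kept, duplicated, or synthesized) comes back with a weight. The helper name `resample_with_weights` is hypothetical. The sketch assumes X is a 2-D numeric array and that `sampler` is any object exposing `fit_resample(X, y)` that has been configured for the augmented matrix (for SMOTENC, the `categorical_features` mask would need one extra non-categorical entry for the weight column).

```python
import numpy as np


def resample_with_weights(sampler, X, y, sample_weight):
    """Resample X, y, and sample_weight together (hypothetical helper).

    Assumes X is a 2-D numeric array and that `sampler` exposes
    fit_resample(X, y) and expects one extra trailing continuous column.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(sample_weight, dtype=float)
    if w.shape != (X.shape[0],):
        raise ValueError(
            f"sample_weight.shape == {w.shape}, expected {(X.shape[0],)}"
        )

    # Carry the weight through the sampler as the last column of X.
    X_aug = np.column_stack([X, w])
    X_res, y_res = sampler.fit_resample(X_aug, y)
    X_res = np.asarray(X_res)

    # Split the weight column back off the resampled matrix.
    return X_res[:, :-1], np.asarray(y_res), X_res[:, -1]
```

The returned weights have one entry per resampled row, so the remaining steps (column transformer, imputer, classifier) can be fit as a pipeline without the sampler, passing `classifier__sample_weight=w_res` as before. Two caveats: with SMOTE-style samplers the weight column takes part in the nearest-neighbour search, and synthetic rows receive weights interpolated from their parent rows, which may or may not be what the weighting scheme intends.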



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source