'Sklearn pass fit() parameters to xgboost in pipeline

Similar to How to pass a parameter to only one part of a pipeline object in scikit learn? I want to pass parameters to only one part of a pipeline. Usually, it should work fine like:

estimator = XGBClassifier()
pipeline = Pipeline([
        ('clf', estimator)
    ])

and executed like

pipeline.fit(X_train, y_train, clf__early_stopping_rounds=20)

but it fails with:

    /usr/local/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
        114         """
        115         Xt, yt, fit_params = self._pre_transform(X, y, **fit_params)
    --> 116         self.steps[-1][-1].fit(Xt, yt, **fit_params)
        117         return self
        118 

    /usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/sklearn.py in fit(self, X, y, sample_weight, eval_set, eval_metric, early_stopping_rounds, verbose)
        443                               early_stopping_rounds=early_stopping_rounds,
        444                               evals_result=evals_result, obj=obj, feval=feval,
    --> 445                               verbose_eval=verbose)
        446 
        447         self.objective = xgb_options["objective"]

    /usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, learning_rates, xgb_model, callbacks)
        201                            evals=evals,
        202                            obj=obj, feval=feval,
    --> 203                            xgb_model=xgb_model, callbacks=callbacks)
        204 
        205 

    /usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/training.py in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks)
         97                                end_iteration=num_boost_round,
         98                                rank=rank,
    ---> 99                                evaluation_result_list=evaluation_result_list))
        100         except EarlyStopException:
        101             break

    /usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/callback.py in callback(env)
        196     def callback(env):
        197         """internal function"""
    --> 198         score = env.evaluation_result_list[-1][1]
        199         if len(state) == 0:
        200             init(env)

    IndexError: list index out of range

Whereas a

estimator.fit(X_train, y_train, early_stopping_rounds=20)

works just fine.



Solution 1:[1]

For the early stopping rounds, you must always specify the validation set given by the argument eval_set. Here is how the error in your code can be fixed.

pipeline.fit(X_train, y_train, clf__early_stopping_rounds=20, clf__eval_set=[(test_X, test_y)])

Solution 2:[2]

This is the solution: https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/13755/xgboost-early-stopping-and-other-issues both early_stooping_rounds and the watchlist / eval_set need to be passed. Unfortunately, this does not work for me, as the variables on the watchlist would require a preprocessing step which is only applied in the pipeline / I would need to apply this step manually.

Solution 3:[3]

I recently used the following steps to use the eval metric and eval_set parameters for Xgboost.

1. create the pipeline with the pre-processing/feature transformation steps:

This was made from a pipeline defined earlier which includes the xgboost model as the last step.

pipeline_temp = pipeline.Pipeline(pipeline.cost_pipe.steps[:-1])  

2. Fit this Pipeline

X_trans = pipeline_temp.fit_transform(X_train[FEATURES],y_train)

3. Create your eval_set by applying the transformations to the test set

eval_set = [(X_trans, y_train), (pipeline_temp.transform(X_test), y_test)]

4. Add your xgboost step back into the Pipeline

 pipeline_temp.steps.append(pipeline.cost_pipe.steps[-1])

5. Fit the new pipeline by passing the Parameters

pipeline_temp.fit(X_train[FEATURES], y_train,
             xgboost_model__eval_metric = ERROR_METRIC,
             xgboost_model__eval_set = eval_set)

6. Persist the Pipeline if you wish to.

joblib.dump(pipeline_temp, save_path)

Solution 4:[4]

Here's a solution that works in a Pipeline with GridSearchCV:

Over-ride the XGBRegressor or XGBClssifier.fit() Function

  • This step uses train_test_split() to select the specified number of validation records from X for the eval_set and then passes the remaining records along to fit().
  • A new parameter eval_test_size is added to .fit() to control the number of validation records. (see train_test_split test_size documenation)
  • **kwargs passes along any other parameters added by the user for the XGBRegressor.fit() function.
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import train_test_split

class XGBRegressor_ES(XGBRegressor):
    
    def fit(self, X, y, *, eval_test_size=None, **kwargs):
        
        if eval_test_size is not None:
        
            params = super(XGBRegressor, self).get_xgb_params()
            
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=eval_test_size, random_state=params['random_state'])
            
            eval_set = [(X_test, y_test)]
            
            # Could add (X_train, y_train) to eval_set 
            # to get .eval_results() for both train and test
            #eval_set = [(X_train, y_train),(X_test, y_test)] 
            
            kwargs['eval_set'] = eval_set
            
        return super(XGBRegressor_ES, self).fit(X_train, y_train, **kwargs) 

Example Usage

Below is a multistep pipeline that includes multiple transformations to X. The pipeline's fit() function passes the new evaluation parameter to the XGBRegressor_ES class above as xgbr__eval_test_size=200. In this example:

  • X_train contains text documents passed to the pipeline.
  • XGBRegressor_ES.fit() uses train_test_split() to select 200 records from X_train for the validation set and early stopping. (This could also be a percentage such as xgbr__eval_test_size=0.2)
  • The remaining records in X_train are passed along to XGBRegressor.fit() for the actual fit().
  • Early stopping may now occur after 75 rounds of unchanged boosting for each cv fold in a gridsearch.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_regression
   
xgbr_pipe = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                     ('vt',VarianceThreshold()),
                     ('scaler', StandardScaler()),
                     ('Sp', SelectPercentile()),
                     ('xgbr',XGBRegressor_ES(n_estimators=2000,
                                             objective='reg:squarederror',
                                             eval_metric='mae',
                                             learning_rate=0.0001,
                                             random_state=7))    ])

X_train = train_idxs['f_text'].values
y_train = train_idxs['Pct_Change_20'].values

Example Fitting the Pipeline:

%time xgbr_pipe.fit(X_train, y_train, 
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae', 
                    xgbr__early_stopping_rounds=75)

Example Fitting GridSearchCV:

learning_rate = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3]
param_grid = dict(xgbr__learning_rate=learning_rate)

grid_search = GridSearchCV(xgbr_pipe, param_grid, scoring="neg_mean_absolute_error", n_jobs=-1, cv=10)
grid_result = grid_search.fit(X_train, y_train, 
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae', 
                    xgbr__early_stopping_rounds=75)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Aashita Kesarwani
Solution 2 Georg Heiler
Solution 3 gdv820
Solution 4 Jake Drew