How to deal with overfitting of an XGBoost classifier?

I use xgboost for multi-class classification of spectrogram images (data link: automotive target classification). There are 5 classes; the training set contains 20,000 samples (5,000 per class), the test set contains 5,000 samples (1,000 per class), and the original image size is 144×400. This is my code snippet:

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

train_data, train_label, test_data, test_label = load_data(data_dir, resampleX=4, resampleY=5)
scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)
cv_params = {'n_estimators': [100,200,300,400,500], 'learning_rate': [0.01, 0.1]}
other_params = {'learning_rate': 0.1,  'n_estimators': 100, 
                'max_depth': 5, 'min_child_weight': 1, 'seed': 27, 'nthread': 6,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 
                'reg_alpha': 0, 'reg_lambda': 1,
                'objective': 'multi:softmax', 'num_class': 5}
model = XGBClassifier(**other_params)
classifier = GridSearchCV(estimator=model, param_grid=cv_params, cv=3, verbose=1, n_jobs=6)
classifier.fit(train_data, train_label)
print("The best parameters are %s with a score of %0.2f" % (classifier.best_params_, classifier.best_score_))

During hyperparameter tuning, following https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/, I first tuned n_estimators with GridSearchCV (n_estimators=[100, 200, 300, 400, 500]) on the training data and then evaluated on the test data. After that I ran GridSearchCV over both 'n_estimators' and 'learning_rate' together.

The best hyperparameters are n_estimators=500 together with learning_rate=0.1, with best_score_=0.83. When I classify with this best estimator, I get 100% accuracy on the training data, but on the test data only a precision of [0.864 0.777 0.895 0.856 0.882] and a recall of [0.941 0.919 0.764 0.874 0.753]. I suspect that n_estimators=500 is overfitting, but I don't know how to choose n_estimators and learning_rate at this step.
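For reference, one way to choose n_estimators without grid-searching it is to fix the learning rate, give the model a generous upper bound on the number of trees, and let early stopping on a held-out validation split decide where to stop. Below is a minimal sketch of that idea (not part of my original code), assuming XGBoost >= 1.6 (where early_stopping_rounds is a constructor parameter) and reusing train_data/train_label from the snippet above:

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# hold out 20% of the training data purely for early stopping
X_tr, X_val, y_tr, y_val = train_test_split(
    train_data, train_label, test_size=0.2, stratify=train_label, random_state=27)

es_model = XGBClassifier(
    n_estimators=1000,           # upper bound; early stopping trims it
    learning_rate=0.1,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='multi:softmax',   # num_class is inferred from the labels by the sklearn wrapper
    eval_metric='mlogloss',
    early_stopping_rounds=30,    # stop once validation loss stalls for 30 rounds
    random_state=27,
    n_jobs=6)
es_model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("best iteration:", es_model.best_iteration)

The reported best_iteration is then a data-driven choice of n_estimators for that learning rate; a smaller learning_rate typically needs more boosting rounds.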

To reduce dimensionality, I tried PCA, but more than 3500 components are needed to reach 95% explained variance, so I use downsampling instead, as shown in the code.
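To make the two options concrete, here is a hypothetical sketch (the exact behaviour of load_data and its resampleX/resampleY arguments is my assumption) that checks the PCA variance threshold and applies simple stride-based downsampling to an array images of shape (n_samples, 144, 400):

import numpy as np
from sklearn.decomposition import PCA

flat = images.reshape(len(images), -1)            # (n_samples, 144*400) = (n_samples, 57600)

# Option A: PCA keeping 95% of the variance (in my case this needed > 3500 components)
pca = PCA(n_components=0.95, svd_solver='full')
pca.fit(flat)
print("components for 95% variance:", pca.n_components_)

# Option B: plain stride-based downsampling (roughly how I assume resampleX/resampleY behave),
# keeping every 4th row and every 5th column: (n_samples, 36, 80) -> 2880 features per sample
down = images[:, ::4, ::5].reshape(len(images), -1)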

Sorry for the incomplete info earlier; I hope it is clear this time. Many thanks!



Solution 1:[1]

Why not try Optuna for XGBoost hyperparameter tuning, with pruning and with the early_stopping_rounds parameter of XGBoost?

Here is a notebook of mine, as a guide only. The XGBoost version must be 1.6 or later, though, since early_stopping_rounds is handled differently (passed to the fit() method) in XGBoost versions below 1.6.

https://www.kaggle.com/code/josephramon/sba-optuna-xgboost
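As a rough illustration of that combination (my own minimal sketch, not the notebook above), an Optuna study can prune unpromising trials via XGBoostPruningCallback while early stopping caps the number of boosting rounds inside each trial. It assumes XGBoost >= 1.6 and the train_data/train_label arrays from the question:

import optuna
from optuna.integration import XGBoostPruningCallback
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X_tr, X_val, y_tr, y_val = train_test_split(
    train_data, train_label, test_size=0.2, stratify=train_label, random_state=27)

def objective(trial):
    params = {
        'n_estimators': 1000,   # upper bound; early stopping trims it per trial
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 8),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
        'objective': 'multi:softmax',
        'eval_metric': 'mlogloss',
        'early_stopping_rounds': 30,
        'random_state': 27,
        'n_jobs': 6,
    }
    model = XGBClassifier(
        **params,
        # prune this trial early if its validation loss is clearly worse than others
        callbacks=[XGBoostPruningCallback(trial, 'validation_0-mlogloss')])
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    return model.best_score     # best validation mlogloss reached before stopping

study = optuna.create_study(direction='minimize',
                            pruner=optuna.pruners.MedianPruner(n_warmup_steps=5))
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)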

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: J R