CatBoost overfits training data but test performance increases

I'm training CatBoost on a dataset of 41k observations and ~60 features. The dataset is a longitudinal series (9 years) that is spatially distributed. At the moment I'm just using random resampling of the data, ignoring spatial and temporal dependencies. Model selection is performed with 5-fold CV, and some data are held out as an external test set.
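For concreteness, here is roughly what the current pipeline looks like (a minimal sketch in Python with the catboost and scikit-learn packages; the data loading, the column names, and the mapping of mtry / min_n / tree_depth onto CatBoost's rsm / min_data_in_leaf / depth are assumptions for illustration, not my exact code):

```python
# Minimal sketch of the current random-resampling setup (illustrative only).
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("data.csv")                      # hypothetical file
X, y = df.drop(columns="target"), df["target"]    # hypothetical column names

# External held-out test set, drawn at random (ignores space and time).
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

params = dict(
    depth=10,                    # tree_depth = 10
    learning_rate=0.05,          # learn rate = 0.05
    rsm=37 / X.shape[1],         # roughly mtry = 37 out of ~60 features
    min_data_in_leaf=458,        # min_n = 458
    grow_policy="Depthwise",     # min_data_in_leaf needs a non-symmetric grow policy
    iterations=1000,
    eval_metric="AUC",
    verbose=False,
)

# Internal 5-fold CV with random folds.
cv_aucs = []
for tr_idx, va_idx in StratifiedKFold(5, shuffle=True, random_state=42).split(X_dev, y_dev):
    model = CatBoostClassifier(**params).fit(X_dev.iloc[tr_idx], y_dev.iloc[tr_idx])
    cv_aucs.append(roc_auc_score(y_dev.iloc[va_idx],
                                 model.predict_proba(X_dev.iloc[va_idx])[:, 1]))
print("internal validation AUC:", np.mean(cv_aucs))

# Refit on all development data, then score the training and external test sets.
final = CatBoostClassifier(**params).fit(X_dev, y_dev)
print("training AUC:", roc_auc_score(y_dev, final.predict_proba(X_dev)[:, 1]))
print("external test AUC:", roc_auc_score(y_test, final.predict_proba(X_test)[:, 1]))
```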

The best result I get with CatBoost uses the following hyperparameters:

mtry=37, min_n = 458, tree_depth = 10, learn rate = 0.05

training AUC = .962

internal validation AUC = .867

external test AUC = .870

The difference between the training and test AUC is quite large, which suggests overfitting.

A second hyperparameter configuration reduces the gap between training and test performance, but the test performance decreases as well.

mtry=19, min_n = 976, tree_depth = 8, learn rate = 0.0003

training AUC = .846

internal validation AUC = .841

external test AUC = .836

I'd be tempted to go with the first configuration, since it gives me the best result on the test set. On the other hand, the second result seems more robust to me, since training and test performance are quite similar. In addition, the second result might be closer to the "true" performance I would get with a spatially or temporally blocked resampling strategy (see the sketch below).
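To make that concrete, this is the kind of blocked resampling I have in mind (a sketch reusing the names from the snippet above; the year column is an assumption, and spatial blocking would work the same way with a spatial cluster ID as the grouping variable):

```python
# Temporally blocked CV: no year appears in both the training and the
# validation side of a fold. "year" is an assumed column name.
from catboost import CatBoostClassifier
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score

groups = df.loc[X_dev.index, "year"]

blocked_aucs = []
for tr_idx, va_idx in GroupKFold(n_splits=5).split(X_dev, y_dev, groups):
    model = CatBoostClassifier(**params).fit(X_dev.iloc[tr_idx], y_dev.iloc[tr_idx])
    blocked_aucs.append(roc_auc_score(y_dev.iloc[va_idx],
                                      model.predict_proba(X_dev.iloc[va_idx])[:, 1]))
print("blocked CV AUC:", sum(blocked_aucs) / len(blocked_aucs))
```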

So my question is: should I be concerned about the difference between training and test performance, or, as long as the test performance doesn't suffer (the usual consequence of overfitting), should I not worry about it and pick the first hyperparameter configuration?



Solution 1:[1]

Your intuition that "the second result might be closer to the 'true' result" is good. When a model is overfitting, take even its performance on the validation and test sets with a grain of salt. It may be that the pattern the model memorized during training still performs well on validation and test for now, but this is a strong signal that the model is inflexible to variance, which in most cases will show up over time.

Therefore, yes, you should be concerned about the difference between training and test performance, and not simply select the model with the best test score. The difference in test performance between these two models is relatively small. Based on the little I know of what you have tried, I'd suggest iterating more to see whether you can recover a few points of performance while still eliminating the overfitting.
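For example, rather than shrinking the learning rate all the way to 0.0003, you could keep a moderate learning rate and lean on regularization and early stopping, something like the following (a sketch against the Python catboost API; the values and the X_train / X_valid names are illustrative assumptions, not tuned recommendations):

```python
# Illustrative middle ground: moderate learning rate, stronger regularization,
# and early stopping on a validation set. Values are starting points only.
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    depth=8,
    learning_rate=0.03,
    l2_leaf_reg=10,              # stronger leaf-value regularization
    bootstrap_type="Bernoulli",
    subsample=0.8,               # row subsampling per tree
    rsm=0.5,                     # feature subsampling
    iterations=2000,
    eval_metric="AUC",
    verbose=False,
)
# X_train / y_train and X_valid / y_valid are assumed, pre-split sets.
model.fit(
    X_train, y_train,
    eval_set=(X_valid, y_valid),
    early_stopping_rounds=100,   # stop once validation AUC stops improving
    use_best_model=True,
)
```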

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: K. Thorspear