subsample, colsample_bytree, colsample_bylevel in XGBClassifier() Python 3.x

I've spent a good deal of time trying to find out what "subsample", "colsample_bytree", and "colsample_bylevel" actually do in XGBClassifier(), but I can't find a clear explanation. Can someone please briefly explain what they do?

Thanks!



Solution 1:[1]

The idea behind "subsample", "colsample_bytree", and "colsample_bylevel" comes from Random Forests. There, you build an ensemble of many trees and then combine their outputs (by voting or averaging) when making a prediction.

The "random" part comes from randomly sampling the training samples for each tree (bootstrapping) and from building each tree (actually, each node of each tree) using only a random subset of the attributes.

In other words, for each tree in a random forest you:

  1. Select a random sample from the dataset to train this tree;
  2. For each node of this tree, use a random subset of the features. This helps reduce overfitting and decorrelates the trees.
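
The two steps above can be sketched in a few lines of plain Python. This is only an illustration of the sampling idea, not how any particular library implements it; the helper names `bootstrap_rows` and `node_features` and the sizes used are made up for the example.

```python
import random

def bootstrap_rows(n_rows, rng):
    """Step 1: draw n_rows row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def node_features(n_features, k, rng):
    """Step 2: pick k distinct candidate features for one node's split."""
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)               # fixed seed so the run is repeatable
rows = bootstrap_rows(10, rng)       # rows used to train one tree
feats = node_features(9, 3, rng)     # features considered at one node

print("rows for this tree:", rows)
print("features for this node:", feats)
```

Because the rows are drawn with replacement, some indices usually appear more than once and others not at all, which is what makes each tree see a slightly different dataset.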

Like a random forest, XGBoost is an ensemble of weak models that together give robust and accurate results, although it builds its trees sequentially (boosting) rather than independently. The weak models can be decision trees, which can be randomized in the same way as in random forests. In this case:

  • "subsample" is the fraction of the training samples (randomly selected) that will be used to train each tree.
  • "colsample_bytree" is the fraction of features (randomly selected) that will be used to train each tree.
  • "colsample_bylevel" is the fraction of features (randomly selected) that will be considered at each depth level of each tree; the per-node (per-split) equivalent is "colsample_bynode".
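
For reference, here is a minimal sketch of how these parameters are passed to XGBClassifier. The values (0.8, 0.5, 50 trees) and the synthetic data are arbitrary choices for illustration, not recommendations, and the model is only fitted if xgboost and numpy happen to be installed.

```python
# Illustrative values only; tune them for your own data.
params = {
    "n_estimators": 50,
    "subsample": 0.8,          # each tree is trained on a random 80% of the rows
    "colsample_bytree": 0.8,   # each tree is given a random 80% of the columns
    "colsample_bylevel": 0.5,  # columns are sampled again at each depth level
    "random_state": 0,
}

try:
    import numpy as np
    from xgboost import XGBClassifier
except ImportError:
    XGBClassifier = None  # not installed: the dict above still shows the spelling

if XGBClassifier is not None:
    # Small synthetic binary classification problem (made up for this example).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    model = XGBClassifier(**params)
    model.fit(X, y)
    print(model.predict(X[:5]))
```

Note the spelling: the parameter is "colsample_bytree", with no underscore between "by" and "tree".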

Solution 2:[2]

Hope someone finds my answer helpful:

  1. colsample_bytree - a random subsample of columns is drawn once for every new tree.
  2. colsample_bylevel - a random subsample of columns is drawn every time a new depth level is reached. E.g. in a tree with 3 levels, A & B may be chosen on the 1st level, B & C on the 2nd, etc. Note: this sample is drawn from the columns chosen in item 1 (colsample_bytree).
  3. colsample_bynode - a random subsample of columns is drawn for every split (left or right branch). So a level may use 2 different subsamples if it has both a left and a right split. Note: this sample is drawn from the columns chosen in item 2, just as item 2 draws from item 1.
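
The nesting described in the list can be simulated with the standard library. This is a rough sketch, not XGBoost's actual code: the helper `sample_fraction`, the column names A to H, and the rounding rule (round down, keep at least one column) are assumptions made for the example.

```python
import random

def sample_fraction(columns, fraction, rng):
    """Keep a random fraction of the columns (at least one), preserving order."""
    k = max(1, int(fraction * len(columns)))
    chosen = set(rng.sample(columns, k))
    return [c for c in columns if c in chosen]

rng = random.Random(42)
all_columns = list("ABCDEFGH")

# Each stage samples only from what the previous stage kept.
tree_cols = sample_fraction(all_columns, 0.5, rng)   # colsample_bytree
level_cols = sample_fraction(tree_cols, 0.5, rng)    # colsample_bylevel
node_cols = sample_fraction(level_cols, 0.5, rng)    # colsample_bynode

print("tree :", tree_cols)
print("level:", level_cols)
print("node :", node_cols)
```

In a real tree, the level-wise draw would be repeated for every new level and the node-wise draw for every split, so different levels and nodes of the same tree can end up with different columns.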

Docs: https://xgboost.readthedocs.io/en/latest/parameter.html

Search the page (Cmd+F) for the following to find the relevant part: colsample_bytree, colsample_bylevel, colsample_bynode [default=1]

Solution 3:[3]

This article has a sweet and short visual explanation:

https://medium.com/analytics-vidhya/xgboost-colsample-by-hyperparameters-explained-6c0bac1bdc1

But basically, you specify the fraction of columns (features) to use for each tree, level, or node. If you set it to 0.5, half of your columns will be used. The parameters build on top of each other, since each tree has levels and each level has nodes, so:

colsample_by* parameters work cumulatively. For instance, the combination {'colsample_bytree':0.5, 'colsample_bylevel':0.5, 'colsample_bynode':0.5} with 64 features will leave 8 features to choose from at each split.
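
The arithmetic behind that figure is just repeated multiplication: 64 × 0.5 × 0.5 × 0.5 = 8. A small helper makes it easy to try other combinations. The function name `features_per_split` is made up here, and rounding down with a minimum of one feature per stage is an assumption rather than a documented guarantee.

```python
def features_per_split(n_features, bytree=1.0, bylevel=1.0, bynode=1.0):
    """Apply the three fractions in order: tree, then level, then node."""
    n = n_features
    for fraction in (bytree, bylevel, bynode):
        n = max(1, int(fraction * n))  # assumed rounding: floor, at least 1
    return n

print(features_per_split(64, 0.5, 0.5, 0.5))  # 64 -> 32 -> 16 -> 8
```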

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Álvaro Salgado
Solution 2 SleeplessChallenger
Solution 3 PeJota