Should min_child_samples accept a subsample ratio float?
Let's say I have a training dataset of 1000 samples in my dataframe, but the number of samples will increase over time, and I will want to automatically and regularly retrain my model to pick up new information.
If I set min_child_samples to 100, then a leaf needs at least 100 samples, which is 10% of my current data. But as my dataset grows, that fixed count becomes a smaller and smaller fraction of the data, so leaves can form around some permutations of the "wrong type of data", and my model may suffer as a result.
What I might actually want is to be able to specify that 10% of samples are needed to form a leaf, but min_child_samples does not accept a 0.1 float value (as, say, colsample_bytree does, albeit for a different thing).
I can work around this by setting min_child_samples = int(len(X_train)/10) in my model definition, but this feels like something that either I shouldn't have to do, or shouldn't be doing for some other reason that I'm not aware of.
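To make the workaround concrete, here is a minimal sketch of the conversion I have in mind. It assumes the parameter is the one from LightGBM's scikit-learn wrapper; resolve_min_child_samples is a hypothetical helper of my own, not part of any library, and rounding up is just one reasonable choice.

```python
import math


def resolve_min_child_samples(value, n_samples):
    """Hypothetical helper: map a fraction or an absolute count to an int.

    A float in (0, 1.0] is read as a fraction of n_samples; an int is
    passed through unchanged as an absolute count.
    """
    if isinstance(value, bool):
        raise TypeError("value must be an int or a float, not a bool")
    if isinstance(value, float):
        if not 0.0 < value <= 1.0:
            raise ValueError("a float value must be in (0, 1.0]")
        # Round up so the fraction acts as a lower bound, and never go below 1.
        return max(1, math.ceil(value * n_samples))
    if isinstance(value, int):
        if value < 1:
            raise ValueError("an int value must be at least 1")
        return value
    raise TypeError("value must be an int or a float")


# The resolved minimum scales with the dataset instead of staying fixed.
for n in (1000, 2000, 10000):
    print(n, resolve_min_child_samples(0.1, n))

# Assumed usage at each retrain (requires lightgbm, so left as a comment):
# model = lightgbm.LGBMClassifier(
#     min_child_samples=resolve_min_child_samples(0.1, len(X_train)),
# )
```

Recomputing the value at every retrain keeps the leaf minimum at roughly the same share of the data, which is the behaviour I would expect a float argument to provide.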
So, am I misunderstanding the concept of this parameter, or should it accept a float value <= 1.0 (read as a fraction of the training samples) as well as an absolute int count?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
