Python - decision tree in lightgbm with odd values
I am trying to fit a single decision tree using the Python module lightgbm. However, I find the output a little strange. I have 15 explanatory variables, and the numerical response variable has the following summary statistics:
count 653.000000
mean 31.503813
std 11.838267
min 13.750000
25% 22.580000
50% 28.420000
75% 38.250000
max 76.750000
Name: X2, dtype: float64
I do the following to fit the tree: I first construct the Dataset object
import lightgbm

df_train = lightgbm.Dataset(
    df,                                     # The data
    label=df[response],                     # The response series
    feature_name=features,                  # A list with names of all explanatory variables
    categorical_feature=categorical_vars,   # A list with names of the categorical ones
)
Next, I define the parameters and fit the model:
param = {
    # make it a single tree:
    'objective': 'regression',
    'bagging_freq': 0,       # Disable bagging
    'feature_fraction': 1,   # Don't randomly select features; consider all
    'num_trees': 1,
    # tuning parameters
    'max_leaves': 20,
    'max_depth': -1,
    'min_data_in_leaf': 20
}
model = lightgbm.train(param, df_train)
From the model I extract the leaves of the tree as:
tree = model.trees_to_dataframe()[[
    'right_child',
    'node_depth',
    'value',
    'count']]
leaves = tree[tree.right_child.isnull()]
print(leaves)
right_child node_depth value count
5 None 6 29.957982 20
6 None 6 30.138253 28
8 None 6 30.269373 34
9 None 6 30.404353 38
12 None 6 30.528705 33
13 None 6 30.651690 62
14 None 5 30.842856 59
17 None 5 31.080432 51
19 None 6 31.232860 21
20 None 6 31.358547 26
22 None 5 31.567571 43
23 None 5 31.795345 46
28 None 6 32.034321 27
29 None 6 32.247890 24
31 None 6 32.420886 22
32 None 6 32.594289 21
34 None 5 32.920932 20
35 None 5 33.210205 22
37 None 4 33.809376 36
38 None 4 34.887632 20
Now, if you look at the values, they range from (approximately) 30 to 35. This is far from capturing the distribution (shown above with min = 13.75 and max = 76.75) of the response variable.
Can anyone explain to me what is going on here?
Follow Up Based On Accepted Answer:
I tried to add 'learning_rate':1 and 'min_data_in_bin':1 to the parameter dict which resulted in the following tree:
right_child node_depth value count
5 None 6 16.045500 20
6 None 6 17.824074 27
8 None 6 19.157500 36
9 None 6 20.529730 37
12 None 6 21.805834 36
13 None 6 23.048387 62
14 None 5 24.975263 57
17 None 5 27.335385 52
19 None 6 29.006800 25
20 None 6 30.234286 21
22 None 5 32.221591 44
23 None 5 34.472272 44
28 None 6 36.808889 27
29 None 6 38.944583 24
31 None 6 40.674546 22
32 None 6 42.408572 21
34 None 5 45.675000 20
35 None 5 48.567728 22
37 None 4 54.559445 36
38 None 4 65.341999 20
This is much more desirable. It means that lightgbm can now be used to mimic the behavior of a single decision tree with categorical features. Unlike sklearn, lightgbm honors "true" categorical variables, whereas in sklearn one needs to one-hot encode all categorical variables, which can degrade the tree considerably; see this Kaggle post.
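The accepted answer is not reproduced in this post, but the two tables are consistent with one reading, offered here as an assumption rather than a statement about LightGBM's internals. With the default learning_rate of 0.1, each leaf value of the first tree appears to be pulled toward the mean of the response: mean + learning_rate * (leaf_mean - mean). The sketch below checks that arithmetic on four leaves whose counts are identical in both tables (nodes 5, 34, 37 and 38). The function and variable names are illustrative only.

```python
# Mean of the response, taken from the describe() output in the question.
response_mean = 31.503813
default_learning_rate = 0.1  # LightGBM's default when 'learning_rate' is not set

# node index -> (value with default settings, value with learning_rate = 1)
# Both numbers are copied from the two leaf tables in the post.
leaf_values = {
    5: (29.957982, 16.045500),
    34: (32.920932, 45.675000),
    37: (33.809376, 54.559445),
    38: (34.887632, 65.341999),
}


def shrink(unshrunk_value, mean, learning_rate):
    """Pull a leaf value toward the mean by the learning rate (assumed formula)."""
    return mean + learning_rate * (unshrunk_value - mean)


for node, (reported, unshrunk) in leaf_values.items():
    predicted = shrink(unshrunk, response_mean, default_learning_rate)
    print(f"leaf {node}: predicted {predicted:.6f}, reported {reported:.6f}")
```

Leaves whose counts differ between the two runs (for example node 20, with 26 versus 21 observations) hold different observations, presumably because min_data_in_bin changed the candidate split points, so they are not expected to agree exactly.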
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
