Python - decision tree in lightgbm with odd values

I am trying to fit a single decision tree using the Python module lightgbm. However, I find the output a little strange. I have 15 explanatory variables, and the numerical response variable has the following summary statistics:

count    653.000000
mean      31.503813
std       11.838267
min       13.750000
25%       22.580000
50%       28.420000
75%       38.250000
max       76.750000
Name: X2, dtype: float64

I do the following to fit the tree: I first construct the Dataset object

df_train = lightgbm.Dataset(
    df,                                   # the training data
    label=df[response],                   # the response series
    feature_name=features,                # names of all explanatory variables
    categorical_feature=categorical_vars  # names of the categorical ones
)
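For readers reproducing this, the snippet assumes `df`, `response`, `features`, and `categorical_vars` are already defined. A minimal sketch of how they might be derived (the toy DataFrame and column names below are made up):

```python
import pandas as pd

# Toy stand-ins for the real data; only the wiring matters here.
df = pd.DataFrame({
    "X2": [22.58, 28.42, 38.25, 76.75],              # numerical response
    "size": [1.2, 3.4, 2.2, 0.9],                    # numerical feature
    "region": pd.Categorical(["a", "b", "a", "c"]),  # categorical feature
})
response = "X2"
features = [c for c in df.columns if c != response]
categorical_vars = df[features].select_dtypes("category").columns.tolist()

print(features)          # ['size', 'region']
print(categorical_vars)  # ['region']
```

Note that lightgbm expects categorical columns to be integer- or category-typed, so using the pandas `category` dtype (as above) is the usual route.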

Next, I define the parameters and fit the model:

param = {
    # make it a single tree:
    'objective': 'regression',
    'bagging_freq': 0,       # disable bagging
    'feature_fraction': 1,   # don't subsample features; consider all of them
    'num_trees': 1,

    # tuning parameters
    'max_leaves': 20,        # alias of num_leaves
    'max_depth': -1,
    'min_data_in_leaf': 20
}

model = lightgbm.train(param, df_train)

From the model I extract the leaves of the tree as:

tree = model.trees_to_dataframe()[[
    'right_child',
    'node_depth',
    'value',
    'count'
]]

leaves = tree[tree.right_child.isnull()]

print(leaves)

   right_child  node_depth      value  count
5         None           6  29.957982     20
6         None           6  30.138253     28
8         None           6  30.269373     34
9         None           6  30.404353     38
12        None           6  30.528705     33
13        None           6  30.651690     62
14        None           5  30.842856     59
17        None           5  31.080432     51
19        None           6  31.232860     21
20        None           6  31.358547     26
22        None           5  31.567571     43
23        None           5  31.795345     46
28        None           6  32.034321     27
29        None           6  32.247890     24
31        None           6  32.420886     22
32        None           6  32.594289     21
34        None           5  32.920932     20
35        None           5  33.210205     22
37        None           4  33.809376     36
38        None           4  34.887632     20
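As a sanity check on the table above, the leaf counts partition the training data: they sum to the 653 rows reported in the summary statistics, and no leaf falls below `min_data_in_leaf`:

```python
# Leaf counts from the tree printed above.
counts = [20, 28, 34, 38, 33, 62, 59, 51, 21, 26,
          43, 46, 27, 24, 22, 21, 20, 22, 36, 20]

print(sum(counts))  # 653 -- every training row lands in exactly one leaf
print(min(counts))  # 20  -- consistent with min_data_in_leaf = 20
```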

Now, if you look at the values, they range from roughly 30 to 35. This is far from capturing the distribution of the response variable (shown above with min = 13.75 and max = 76.75).

Can anyone explain to me what is going on here?

Follow Up Based On Accepted Answer:

I added 'learning_rate': 1 and 'min_data_in_bin': 1 to the parameter dict, which resulted in the following tree:

   right_child  node_depth      value  count
5         None           6  16.045500     20
6         None           6  17.824074     27
8         None           6  19.157500     36
9         None           6  20.529730     37
12        None           6  21.805834     36
13        None           6  23.048387     62
14        None           5  24.975263     57
17        None           5  27.335385     52
19        None           6  29.006800     25
20        None           6  30.234286     21
22        None           5  32.221591     44
23        None           5  34.472272     44
28        None           6  36.808889     27
29        None           6  38.944583     24
31        None           6  40.674546     22
32        None           6  42.408572     21
34        None           5  45.675000     20
35        None           5  48.567728     22
37        None           4  54.559445     36
38        None           4  65.341999     20
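For context on why these parameters matter: with boost_from_average enabled (the default for regression), a single tree's leaf value is roughly base + learning_rate * (leaf_mean - base), so the default learning_rate of 0.1 shrinks every leaf toward the global mean. A toy illustration of that shrinkage, not LightGBM itself, with made-up numbers:

```python
# Illustration of leaf-value shrinkage (not LightGBM itself; numbers made up).
base = 31.5        # global mean of the response (what boost_from_average uses)
leaf_mean = 14.0   # mean of the response inside one particular leaf

def leaf_value(leaf_mean, base, learning_rate):
    # One boosting step starting from the average prediction:
    return base + learning_rate * (leaf_mean - base)

print(leaf_value(leaf_mean, base, 0.1))  # 29.75 -- pulled toward the mean
print(leaf_value(leaf_mean, base, 1.0))  # 14.0  -- the raw leaf mean
```

With learning_rate = 1 the shrinkage term vanishes, which is why the second tree's leaf values span the full range of the response.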

This second tree is much more desirable. It means we can use lightgbm to mimic the behavior of a single decision tree with categorical features. Unlike sklearn, lightgbm handles "true" categorical variables natively, whereas sklearn requires one-hot encoding all categorical variables, which can perform quite poorly; see this kaggle post.
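The one-hot blowup that the kaggle post warns about is easy to see with pandas (a made-up high-cardinality column):

```python
import pandas as pd

# A made-up categorical column with 50 levels. One-hot encoding, which
# sklearn's trees require, turns it into 50 binary columns; LightGBM can
# instead split on the single categorical column directly.
s = pd.Series([f"cat_{i % 50}" for i in range(200)], name="region")
onehot = pd.get_dummies(s)
print(onehot.shape)  # (200, 50): one column per category level
```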



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source