How to get each individual tree's prediction in xgboost?

Using xgboost.Booster.predict, I can only get the aggregated prediction of all the trees, or the predicted leaf index of each tree. But how can I get the prediction value of each individual tree?



Solution 1:[1]

The xgboost.core.Booster class has two methods that allow you to do so:

  1. First, xgboost.core.Booster.predict with the parameter pred_leaf set to True gives you the predicted leaf indices. Then it is just a matter of looking up the scores of those leaves.

  2. To get the leaf scores, we resort to the method xgboost.core.Booster.dump_model, which dumps the structure of the tree ensemble as plain text or JSON. The dump contains the leaf scores.

Below I show an example.

First, train an xgboost model on the Iris dataset.

import json

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import datasets

# Load data
iris = datasets.load_iris()
X, y = iris.data, iris.target
y = (y == 1).astype(int)

# Fit a model
n_estimators = 10
max_depth = 10
model = xgb.XGBClassifier(
    n_estimators=n_estimators,
    max_depth=max_depth,
    min_child_weight=1)
model.fit(X, y)
booster = model.get_booster()

Then, get the predicted leaf indices.

pred_leaf_index = booster.predict(
    xgb.DMatrix(X),
    pred_leaf=True
).reshape(X.shape[0], n_estimators)
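
Each row of pred_leaf_index now holds one leaf index per tree. A quick sanity check on the shape (the exact indices depend on the fitted trees):

# One row per sample, one column per tree in the ensemble.
assert pred_leaf_index.shape == (X.shape[0], n_estimators)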

To get the leaf scores, we dump the model as a JSON file. The resulting dump contains the tree structure.

# Dump the model and load the dump
model_json_path = '/tmp/model.json'
booster.dump_model(model_json_path, dump_format='json')
with open(model_json_path, 'r') as f:
    model_dict = json.loads(f.read())
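
Each item of model_dict describes one tree as a nested dictionary: split nodes carry keys such as split, split_condition and children, while terminal nodes carry nodeid and leaf. A quick way to inspect this (key names as produced by dump_model; the exact splits depend on the fit):

# Peek at the root node of the first tree.
model_dict[0].keys()
# e.g. dict_keys(['nodeid', 'depth', 'split', 'split_condition',
#                 'yes', 'no', 'missing', 'children'])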

Now, the following is perhaps the most complex part of this process. The functions below extract the leaf scores, first for a single tree and then for the entire ensemble:


def get_tree_leaf_scores(tree):
    """Retrieve a single tree leaf scores.

    Parameters
    ----------
    tree : dict
        A dictionary representing a single xgboost decision tree 
        (one item of the dump generated by `booster.dump_model`).

    Returns
    -------
    leafs : list
        One dictionary per terminal (leaf) node of the tree, each
        containing at least its 'nodeid' and 'leaf' score.
    """

    if 'leaf' in tree:
        return tree
    else:
        branch_0 = get_tree_leaf_scores(tree['children'][0])
        branch_1 = get_tree_leaf_scores(tree['children'][1])

        if not isinstance(branch_0, list):
            branch_0 = [branch_0]
        if not isinstance(branch_1, list):
            branch_1 = [branch_1]

        return branch_0 + branch_1
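
For instance, applied to the first tree of the dump, the function returns one dictionary per leaf:

first_tree_leafs = get_tree_leaf_scores(model_dict[0])
len(first_tree_leafs)   # number of terminal nodes in tree 0
first_tree_leafs[0]     # e.g. {'nodeid': 1, 'leaf': -0.555556}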

def get_trees_leaf_as_dataframe(model_dict):
    """Retrieve the tree ensemble leaf scores.

    Parameters
    ----------
    model_dict : dict
        The dictionary from loading the dump resulting from:
        `xgboost.core.Booster.dump_model`

    Returns
    -------
    trees_leaf_df : pandas.DataFrame
        Tree/node ids with their leaf score.
    """
    # Get tree nodes
    trees_leaf_df = []
    for tree_idx, tree in enumerate(model_dict):
        tree_leafs = get_tree_leaf_scores(tree)
        tree_leafs = pd.DataFrame(tree_leafs)
        tree_leafs['treeid'] = tree_idx

        trees_leaf_df.append(tree_leafs)

    trees_leaf_df = pd.concat(
        trees_leaf_df
    ).sort_values(['treeid', 'nodeid'])

    trees_leaf_df['id'] = \
        trees_leaf_df.apply(
            lambda x: '%s-%s' % (int(x['treeid']), int(x['nodeid'])), axis=1)

    trees_leaf_df = trees_leaf_df[
        ['treeid', 'nodeid', 'id', 'leaf']
    ].set_index('id')

    return trees_leaf_df

Here is how you get the leaf scores as a DataFrame:

trees_leaf_df = get_trees_leaf_as_dataframe(model_dict)
trees_leaf_df.head()

Out[1]: 
     treeid  nodeid      leaf
id                           
0-1       0       1 -0.555556
0-4       0       4 -0.528000
0-6       0       6 -0.120000
0-7       0       7  0.150000
0-8       0       8  0.550000

At this point we are ready to get the model's predicted leaf scores, with the help of the following function:


def get_pred_leaf_scores(pred_leaf_index, trees_leaf_df):
    """Map each instance's predicted leaf indices to leaf scores.

    Returns
    -------
    pred_leaf_scores : pandas.DataFrame
        One row per instance, one column per tree.
    """
    # Uses the module-level `n_estimators` defined when fitting the model.
    tree_ids = range(n_estimators)
    pred_leaf_scores = []
    for single_instance_pred_leafs in pred_leaf_index:
        tree_node_id_predictions = [
            '%s-%s' % (treeid, nodeid)
            for treeid, nodeid in zip(tree_ids, single_instance_pred_leafs)]

        single_instance_pred_leaf_scores = trees_leaf_df.loc[
            tree_node_id_predictions]['leaf'].values

        pred_leaf_scores.append(single_instance_pred_leaf_scores)

    return pd.DataFrame(pred_leaf_scores)
pred_leaf_scores = get_pred_leaf_scores(pred_leaf_index, trees_leaf_df)
pred_leaf_scores
Out[2]:
            0         1         2  ...         7         8         9
0   -0.555556 -0.434605 -0.373621  ... -0.248634 -0.231758 -0.215499
1   -0.555556 -0.434605 -0.373621  ... -0.248634 -0.231758 -0.215499
2   -0.555556 -0.434605 -0.373621  ... -0.248634 -0.231758 -0.215499
3   -0.555556 -0.434605 -0.373621  ... -0.248634 -0.231758 -0.215499
4   -0.555556 -0.434605 -0.373621  ... -0.248634 -0.231758 -0.215499
..        ...       ...       ...  ...       ...       ...       ...
145 -0.528000 -0.410725 -0.374272  ... -0.072375 -0.236201 -0.058543
146 -0.528000 -0.410725 -0.374272  ... -0.024406 -0.236201 -0.185685
147 -0.528000 -0.410725 -0.374272  ... -0.072375 -0.236201 -0.058543
148 -0.528000 -0.410725 -0.374272  ... -0.250879 -0.236201 -0.215589
149 -0.528000 -0.410725 -0.374272  ... -0.072375 -0.236201 -0.058543

[150 rows x 10 columns]    

If you want to make sure that the leaf scores yield the same probability predictions, do the following:

def from_leafs_scores_to_proba(pred_leaf_scores):
    """Convert per-tree leaf scores to positive-class probabilities."""

    # The logit (log-odds) is the sum of the leaf scores across trees.
    logit = pred_leaf_scores.sum(axis=1)

    # Apply the logistic function to recover the probability.
    pos_class_probability = 1 / (1 + np.exp(-logit))

    return pos_class_probability

y_scores_from_leafs = from_leafs_scores_to_proba(pred_leaf_scores)

y_scores_from_leafs.values[:10]
Out[9]: 
array([0.03715579, 0.03715579, 0.03715579, 0.03715579, 0.03715579,
       0.03715579, 0.03715579, 0.03715579, 0.03715579, 0.03715579])
y_scores = model.predict_proba(X)[:, 1]
y_scores[:10]
Out[10]: 
array([0.03715578, 0.03715578, 0.03715578, 0.03715578, 0.03715578,
       0.03715578, 0.03715578, 0.03715578, 0.03715578, 0.03715578],
      dtype=float32)
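
The tiny difference in the last digit is just float32 rounding. To check the match programmatically, a tolerance-based comparison does the job:

np.testing.assert_allclose(
    y_scores_from_leafs.values, y_scores, rtol=1e-5)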

Solution 2:[2]

A much simpler solution is the following.

In Python, you can dump the trees as a list of strings:

For example:

m = xgb.XGBClassifier(max_depth=2, n_estimators=3).fit(X, y)
m.get_booster().get_dump()

This is what you'll get:

booster[0]:
0:[sincelastrun<23.2917] yes=1,no=2,missing=2
    1:[sincelastrun<18.0417] yes=3,no=4,missing=4
        3:leaf=-0.0965415
        4:leaf=-0.0679503
    2:[sincelastrun<695.025] yes=5,no=6,missing=6
        5:leaf=-0.0992546
        6:leaf=-0.0984374
booster[1]:
0:[sincelastrun<23.2917] yes=1,no=2,missing=2
    1:[sincelastrun<16.8917] yes=3,no=4,missing=4
        3:leaf=-0.0928132
        4:leaf=-0.0676056
    2:[sincelastrun<695.025] yes=5,no=6,missing=6
        5:leaf=-0.0945284
        6:leaf=-0.0937463
booster[2]:
0:[sincelastrun<23.2917] yes=1,no=2,missing=2
    1:[sincelastrun<18.175] yes=3,no=4,missing=4
        3:leaf=-0.0878571
        4:leaf=-0.0610089
    2:[sincelastrun<695.025] yes=5,no=6,missing=6
        5:leaf=-0.0904395
        6:leaf=-0.0896808
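
The text dump is easy to read, but to recover the leaf values as numbers you have to parse the strings yourself. A minimal sketch using a regular expression (assuming the leaf=<float> format shown above):

import re

# One list of leaf scores per tree, parsed from the text dump.
leaf_scores_per_tree = [
    [float(v) for v in re.findall(r'leaf=(-?[\d.eE+-]+)', tree_str)]
    for tree_str in m.get_booster().get_dump()
]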

Solution 3:[3]

Recently, xgboost introduced a slicing API, and Raul's answer, while valid, is overly complicated.

To get the individual per-tree predictions, all you need to do is iterate over the booster object.

individual_preds = []
for tree_ in model.get_booster():
    individual_preds.append(
        tree_.predict(xgb.DMatrix(X))
    )

Note, however, that those individual predictions are not individual contributions; summing them up will not reproduce the final prediction. For that we need to transform them back into log-odds and then sum up (this works because the default base_score of 0.5 contributes logit(0.5) = 0 to each tree's output):

from scipy.special import expit as sigmoid, logit as inverse_sigmoid
individual_preds = np.vstack(individual_preds)
individual_logits = inverse_sigmoid(individual_preds)
final_logits = individual_logits.sum(axis=0)
final_preds = sigmoid(final_logits)

A fully reproducible example, replicating Raul's approach:

import numpy as np
import xgboost as xgb
from sklearn import datasets
from scipy.special import expit as sigmoid, logit as inverse_sigmoid

# Load data
iris = datasets.load_iris()
X, y = iris.data, (iris.target == 1).astype(int)

# Fit a model
model = xgb.XGBClassifier(
    n_estimators=10,
    max_depth=10,
    use_label_encoder=False,
    objective='binary:logistic'
)
model.fit(X, y)
booster_ = model.get_booster()

# Extract individual predictions
individual_preds = []
for tree_ in booster_:
    individual_preds.append(
        tree_.predict(xgb.DMatrix(X))
    )
individual_preds = np.vstack(individual_preds)

# Aggregated individual predictions to final predictions
individual_logits = inverse_sigmoid(individual_preds)
final_logits = individual_logits.sum(axis=0)
final_preds = sigmoid(final_logits)

# Verify correctness
xgb_preds = booster_.predict(xgb.DMatrix(X))
np.testing.assert_almost_equal(final_preds, xgb_preds)
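
As a side note, the slicing API also lets you take sub-ensembles directly, so you can get cumulative predictions without summing logits by hand. A sketch, assuming a recent xgboost release (booster slicing was introduced around version 1.4):

# Probability predictions using only the first 5 trees.
first_five_trees = booster_[0:5]
partial_preds = first_five_trees.predict(xgb.DMatrix(X))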

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution   Source
Solution 1 Raul
Solution 2 Vojtech Stas
Solution 3 Ufos