'How is the scikit-learn implementation of a decision tree different from the theoretical definition of the decision tree algorithm?

I'm trying to understand the inner workings of the sklearn decision tree classifier (using entropy as the split criterion). To start, I created a baseline decision tree using the theoretical definition of the model from 'Classification and Regression Trees' by Brieman et al. and adapting some code I found online. My model takes in binary variables for a binary classification problem (and uses entropy as the split criterion).

Comparing these models, I saw that sklearn outperformed my baseline on all datasets I used for testing (usually by several % points of accuracy, a small margin but still significant). I've went through the code for my baseline model with my PhD supervisor and we have found no errors. After searching through the sklearn source code on Github, I can't see any differences in the code that would account for this difference in performance.

Does the sklearn tree implement some extra steps that would explain this? Some automatic pruning step, data preprocessing or some alternative approach to tree building different to Brieman et al.?

Edit: Here is the code for the baseline model:

# Function to calculate entropy, k is occurrences of target, n is sample size
def H(k,n):
    # Sample proportion
    p = k/n
    res = -p*np.log(p) - (1-p)*np.log(1-p)
    return res

# Now begin the decision tree code
# Create a class for Node, each split/leaf in the tree is a node
class Node:
    # Initialise the characteristics of Node
    def __init__(self, predicted_class):
        self.predicted_class = predicted_class
        self.feature_index = 0
        self.threshold = 0
        self.left = None
        self.right = None

# Create a class for decision tree
class Our_DTClassifier:
    # Set max depth as inputted by user
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        
    def fit(self, X, y):
        # Sets the possible values for y ('classes'), in our case = {0, 1}
        self.n_classes_ = len(set(y))
        # Sets number of features in X
        self.n_features_ = X.shape[1]
        # Runs a function below to grow tree
        self.tree_ = self._grow_tree(X, y)
        
    # Runs a function below to predict
    def predict(self, X):
        return [self._predict(inputs) for inputs in X]

    # Finds best variable to split on based on the chosen entropy
    def _best_split(self, X, y):
        # n is how many people are at this node
        n = y.size
        # In case n is <=1, return None
        if n <= 1:
            return None, None
        # A count of how many have the target in this subset
        k = np.sum(y == 1)
        
        # Calculate the entropy of this node prior to any split
        # If none of the splits decrease entropy, we do not split
        init_ent = H(k,n)
        
        # Initialise best feature and best threshold variables
        best_idx, best_thr = None, None
        
        # Loop over each feature to calculate the entropy of the split
        for idx in range(self.n_features_):
            # A count of individuals in this subset who have the feature
            n_x = sum(X[:, idx] == 1)
            if n_x > 0 :
                if n_x != n:
                    # A count of individuals in this subset who have the feature AND the target
                    k_x = sum(np.logical_and(X[:, idx] == 1, y == 1)) 
                    # Calculate entropy for each new subset
                    left_ent = H(k_x, n_x)
                    right_ent = H(k-k_x,n-n_x)
                    
                    # Weighted sum to find total Entropy for this split
                    split_ent = (left_ent-right_ent)*((n_x)/(n)) + right_ent

                    # Save this split if it results in entropy lower than initial
                    if split_ent < init_ent:
                        init_ent = split_ent
                        best_idx = idx
                        # As we use bianry variables, 'threshold' is always 0.5 to split between 
                        # the possible outcomes {0,1}
                        best_thr = 0.5
        return best_idx, best_thr

    def _grow_tree(self, X, y, depth=0):
        # Make an array [count of people with target = 0, count of people with target = 1]
        num_samples_per_class = [np.sum(y == i) for i in range(self.n_classes_)]
        
        # Predicted class is whatever whatever target outcome (0 or 1) is most common in y
        predicted_class = np.argmax(num_samples_per_class)
        
        # Create an object of class Node with predicted_class=predicted_class
        node = Node(predicted_class=predicted_class)
        
        # Check if max depth has been reached
        if depth < self.max_depth:
            # Use best split function to find best feature to split on
            idx, thr = self._best_split(X, y)
            
            # Check if _best_split ran correctly
            if idx is not None:
                
                # Select values in X less than the threshold and store their indices
                indices_left = X[:, idx] < thr
                
                # Store the X and y values corresponding to the less than threshold values (left)
                X_left, y_left = X[indices_left], y[indices_left]
                
                # Store the rest as greater than threshold (right)
                X_right, y_right = X[~indices_left], y[~indices_left]
                
                # Store the feature being split on for this node
                node.feature_index = idx
                
                # Store the threshold for this node (will always be 0.5)
                node.threshold = thr
                
                # Recursive step, continue to grow the tree for each child node
                node.left = self._grow_tree(X_left, y_left, depth + 1)
                node.right = self._grow_tree(X_right, y_right, depth + 1)
        return node

    # Predict function
    def _predict(self, inputs):
        # select the tree structure
        node = self.tree_
        # Run the new data point through the tree until you reach a leaf
        while node.left:
            # Check if this person has the feature at this node split and send them left or right accordingly
            if inputs[node.feature_index] < node.threshold:
                node = node.left
            else:
                node = node.right
        # return the predicted class of the leaf node
        return node.predicted_class

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source

'How is the scikit-learn implementation of a decision tree different from the theoretical definition of the decision tree algorithm?

Sources

Related Questions