Why does the decision tree algorithm in Python change on every run?
I am following a Udemy course on data science with Python. The course focuses on the output of the algorithms and less on the algorithms themselves. In particular, I am fitting a decision tree. Every time I run the algorithm in Python, even with the same samples, it gives me a slightly different decision tree. I asked the tutors and they told me, "The decision trees does not guarantee the same results each run because of its nature." Can someone explain why in more detail, or recommend a good book about it?
I built the decision tree for my data with these imports:
import numpy as np
import pandas as pd
from sklearn import tree
and these commands:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X,y)
where X is my feature data and y is my target data.
Thank you
Solution 1:[1]
The DecisionTreeClassifier class is documented here:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Its constructor takes many arguments. In Python, arguments may have default values, and here every argument has one, so you can even call the constructor with an empty argument list, like this:
clf = tree.DecisionTreeClassifier()
The parameter of interest, random_state, is documented like this:
random_state: int, RandomState instance or None, default=None
So your call is equivalent to, among many other things:
clf = tree.DecisionTreeClassifier(random_state=None)
The None value tells the library that you are not providing a seed (that is, an initial state) for the underlying pseudo-random number generator. Hence, the generator has to be seeded some other way.
With random_state=None, scikit-learn falls back on NumPy's global random number generator, which, unless you seed it yourself, is initialized from operating-system entropy (or from the clock where that is unavailable). So each run of your script starts from a different initial state and gets a different sequence of pseudo-random numbers. Hence, a different tree.
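The effect of a missing seed can be seen without scikit-learn at all. The sketch below uses Python's standard random module purely as an analogy (scikit-learn itself relies on NumPy's generator), and the helper name draw is made up for this illustration:

```python
import random

def draw(seed, count=5):
    """Return `count` pseudo-random integers from a generator seeded with `seed`."""
    # seed=None lets Python pick the initial state itself (OS entropy or the clock).
    rng = random.Random(seed)
    return [rng.randrange(1000) for _ in range(count)]

# Same explicit seed: the two sequences are identical.
seeded_a = draw(42)
seeded_b = draw(42)
print(seeded_a == seeded_b)

# No seed: each generator starts from its own state,
# so the sequences will almost certainly differ.
unseeded_a = draw(None)
unseeded_b = draw(None)
print(unseeded_a, unseeded_b)
```

A decision tree fitted with random_state=None is in the same situation as the unseeded calls here.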
You might want to try forcing the seed. For example:
clf = tree.DecisionTreeClassifier(random_state=42)
and see whether the tree still changes from run to run.
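A quick way to check this is to fit the classifier twice with the same seed and compare the resulting trees. The sketch below assumes the iris dataset bundled with scikit-learn as stand-in data (the question's own X and y are not available), and the helper fit_and_describe is a name invented for this example:

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Stand-in data; replace with your own X and y.
X, y = load_iris(return_X_y=True)

def fit_and_describe(seed):
    """Fit a tree with the given random_state and return its rules as text."""
    clf = tree.DecisionTreeClassifier(random_state=seed)
    clf.fit(X, y)
    return tree.export_text(clf)

fixed_a = fit_and_describe(42)
fixed_b = fit_and_describe(42)

# With the same seed and the same data, both fits produce the same tree.
print(fixed_a == fixed_b)
```

Passing None instead of 42 to the same helper may or may not yield different text on a given pair of runs, since that depends on whether the data contains equally good candidate splits.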
As for why a decision tree requires pseudo-random numbers at all, one discussion of the topic puts it this way:
According to scikit-learn’s “best” and “random” implementation [4], both the “best” splitter and the “random” splitter use a Fisher-Yates-based algorithm to compute a permutation of the features array.
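The Fisher-Yates algorithm is the most common way to compute a random permutation. Because the order in which the features are examined is random, two candidate splits that improve the criterion equally can be chosen differently from one run to the next, which is how an unseeded fit can produce a different tree from the same data. Also, if stopped before completion, the algorithm can be used to extract a random subset of the data sample, for example if you need a random 10% of the sample to be excluded from the data fitting and set aside for a later cross-validation step.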
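For reference, here is a minimal pure-Python sketch of the Fisher-Yates shuffle and of its early-stopping variant. It illustrates the textbook algorithm, not scikit-learn's internal code, and the function and variable names are chosen for this example only:

```python
import random

def fisher_yates_shuffle(items, rng):
    """Return a new list holding a uniformly random permutation of items."""
    result = list(items)
    # Walk from the last position down, swapping each slot with a
    # randomly chosen slot at or before it.
    for i in range(len(result) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        result[i], result[j] = result[j], result[i]
    return result

def partial_fisher_yates(items, k, rng):
    """Return k distinct random items by stopping the shuffle after k steps."""
    pool = list(items)
    n = len(pool)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and len(items)")
    for i in range(k):
        j = rng.randint(i, n - 1)  # pick from the not-yet-fixed tail
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]

features = ["f0", "f1", "f2", "f3", "f4"]
perm_a = fisher_yates_shuffle(features, random.Random(42))
perm_b = fisher_yates_shuffle(features, random.Random(42))
print(perm_a == perm_b)  # same seed, same permutation

# A random 10% of 100 sample indices, e.g. to hold out from fitting.
held_out = partial_fisher_yates(range(100), 10, random.Random(0))
print(sorted(held_out))
```

The random generator is passed in explicitly so that the caller decides whether the result is reproducible, which mirrors the role of random_state in the classifier.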
Side note: in some circumstances, non-reproducibility can become a pain point, for example if you want to study the influence of an external parameter, say a global bias added to the Y values. In that case, you don't want uncontrolled changes in the random numbers to blur the effects of your parameter changes. Hence the need for the API to provide some way to control the seed value.
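As a sketch of such a controlled experiment, the code below holds the seed fixed while varying one parameter. The choice of max_depth as the parameter, the iris data, and the helper name leaves_by_depth are all assumptions made for illustration:

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Stand-in data; replace with your own X and y.
X, y = load_iris(return_X_y=True)

def leaves_by_depth(seed):
    """Fit one tree per max_depth value and record how many leaves it has."""
    counts = {}
    for depth in (1, 2, 3):
        clf = tree.DecisionTreeClassifier(max_depth=depth, random_state=seed)
        clf.fit(X, y)
        counts[depth] = clf.get_n_leaves()
    return counts

first = leaves_by_depth(seed=0)
second = leaves_by_depth(seed=0)

# With the seed held fixed, repeating the sweep gives the same numbers,
# so any change between depths is due to max_depth alone.
print(first == second)
print(first)
```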
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
