Decision tree score always returns 1

I got a 100% score on my test set when training a decision tree, which seems strange given that I set max_depth = 2. I don't understand what I did wrong. I split my data into train and test sets, but the classifier still scores 1. Here is my code.

This is my dataset, songs.csv. A little about it: it has 400 rows and the class distribution is almost uniform, so I don’t understand why the decision tree gives such a perfect score even with max_depth = 2.

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score

df = pd.read_csv('D:Projects/datasets/songs.csv')

X = df.drop(['lyrics', 'song', 'artist'], axis=1)
y = df.artist

le = LabelEncoder()
le.fit(X.genre.unique())
X.genre = le.transform(X.genre)
le.fit(y.unique())
y = pd.Series(le.transform(y))

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.35)
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=2)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
accuracy_score(y_test, predictions)


Solution 1:[1]

The problem probably comes from the dataset. I tried a RandomForestClassifier and also got 1.0 accuracy. I also evaluated on both a validation set and a test set, but the result did not change. This is the text representation of the fitted tree; note that every split uses feature_1 alone:

>>> text_representation = tree.export_text(clf)
>>> print(text_representation)

|--- feature_1 <= 0.50
|   |--- class: 2
|--- feature_1 >  0.50
|   |--- feature_1 <= 1.50
|   |   |--- class: 1
|   |--- feature_1 >  1.50
|   |   |--- class: 0
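A tree that separates three classes with two thresholds on a single column suggests that this column determines the label on its own. The following is a minimal sketch of how one might check that with pandas. It assumes feature_1 is the label-encoded genre column, which the output above does not confirm, and it uses a small made-up DataFrame in place of songs.csv (the `year` column and all values are hypothetical).

```python
import pandas as pd

# Hypothetical stand-in for songs.csv: every artist sits in exactly one genre.
df = pd.DataFrame({
    "artist": ["A", "A", "B", "B", "C", "C"],
    "genre":  ["rock", "rock", "pop", "pop", "rap", "rap"],
    "year":   [1999, 2004, 2010, 2012, 2001, 2015],  # made-up extra column
})

# Count the distinct artists inside each genre.
# If every count is 1, knowing the genre is enough to know the artist.
artists_per_genre = df.groupby("genre")["artist"].nunique()
print(artists_per_genre)

# The contingency table shows the same thing: one non-zero cell per row.
print(pd.crosstab(df["genre"], df["artist"]))

genre_determines_artist = bool((artists_per_genre == 1).all())
print("genre alone identifies the artist:", genre_determines_artist)
```

If the same check comes out true on the real data, a perfect score at max_depth=2 is expected rather than a bug.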

Here is my full code:

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn import ensemble
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score

df = pd.read_csv('songs.csv')

X = df.drop(['lyrics', 'song', 'artist'], axis=1)
y = df.artist

le = LabelEncoder()
le.fit(X.genre.unique())
X.genre = le.transform(X.genre)
le.fit(y.unique())
y = pd.Series(le.transform(y))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) # 0.25 x 0.75 = 0.1875 of the full data

clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=2)
clf.fit(X_train, y_train)
predictions = clf.predict(X_val)
print(accuracy_score(y_val, predictions))

predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))

text_representation = tree.export_text(clf)
print(text_representation)

clf = ensemble.RandomForestClassifier(criterion='entropy')
clf.fit(X_train, y_train)
predictions = clf.predict(X_val)
print(accuracy_score(y_val, predictions))

predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))
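To find out which column feature_1 actually refers to, one option is to pass the column names to `tree.export_text` and to look at `feature_importances_`. The sketch below runs on synthetic data, not on songs.csv: the `year` column is a hypothetical uninformative feature, and the label is constructed to depend only on `genre`, mimicking the tree printed above.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Synthetic stand-in for the real data (assumption: label depends only on genre).
rng = np.random.default_rng(0)
n = 300
genre = rng.integers(0, 3, size=n)               # label-encoded genre: 0, 1, 2
X = pd.DataFrame({
    "year": rng.integers(1990, 2020, size=n),    # hypothetical noise column
    "genre": genre,
})
y = pd.Series(2 - genre)                         # class fully determined by genre

clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(X, y)

# Named splits instead of feature_0, feature_1, ...
text_representation = tree.export_text(clf, feature_names=list(X.columns))
print(text_representation)

# Share of the total impurity reduction credited to each column.
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

On the real dataset, an importance close to 1.0 for a single column would indicate that this column carries essentially all the information about the artist.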

Solution 2:[2]

I trained a decision-tree model using the code and dataset you provided, and it seems to work as expected. A perfect classification score is not unusual for a small dataset like this one, and the task appears simple enough to be solved perfectly by a tree of depth 2. There seems to be nothing wrong with the code.

You can visualise the resulting tree using tree.plot_tree:

[Image: plot of the fitted decision tree]
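A minimal sketch of such a plot is shown here. It is not the code used for the original image: the data is synthetic, and the feature names `year` and `genre` are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
from sklearn import tree

# Synthetic stand-in for songs.csv (assumption: the class depends only on genre).
rng = np.random.default_rng(0)
n = 300
genre = rng.integers(0, 3, size=n)
X = np.column_stack([rng.integers(1990, 2020, size=n), genre])
y = 2 - genre

clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(X, y)

# Draw the fitted tree; each box shows the split, entropy, sample counts and class.
fig, ax = plt.subplots(figsize=(8, 5))
annotations = tree.plot_tree(
    clf,
    feature_names=["year", "genre"],  # hypothetical column names
    class_names=["0", "1", "2"],
    filled=True,
    ax=ax,
)
# fig.savefig("decision_tree.png")  # or plt.show() with an interactive backend
```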

As a sanity check, I also observed that the model accuracy drops below 70% when the depth of the tree is limited to 1. So I think the code you have provided is fine.
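The drop at depth 1 has a simple explanation: a tree of depth 1 has only two leaves, so it can predict at most two distinct classes. With three roughly balanced classes (as in the tree printed in Solution 1), that caps accuracy at about two thirds. The comparison can be sketched as follows on synthetic data; the data-generating assumptions (three classes determined by one feature, plus a hypothetical noise column) are illustrative and not taken from songs.csv.

```python
import numpy as np
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 400 rows, three near-uniform classes determined by one feature.
rng = np.random.default_rng(0)
n = 400
genre = rng.integers(0, 3, size=n)
X = np.column_stack([rng.integers(1990, 2020, size=n), genre])  # [noise, genre]
y = 2 - genre

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.35, random_state=0, stratify=y
)

scores = {}
for depth in (1, 2):
    clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    scores[depth] = accuracy_score(y_test, clf.predict(X_test))
    # A depth-1 tree has 2 leaves, so one of the 3 classes is never predicted.
    print(f"max_depth={depth}: leaves={clf.get_n_leaves()}, accuracy={scores[depth]:.3f}")
```

A depth-1 score near two thirds next to a perfect depth-2 score is the pattern one would expect when a single three-valued feature encodes the label.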

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 uozcan12
Solution 2 Aravind G.