Scikit-learn spam mail prediction code always predicts the same result
The code: spam mail prediction
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
raw_mail_data=pd.read_csv("mail_data.csv")
mail_data=raw_mail_data.where( (pd.notnull(raw_mail_data)),"" )
mail_data.loc[mail_data["Category"]=="spam","Category"]=0
mail_data.loc[mail_data["Category"]=="ham","Category"]=1
X=mail_data["Message"]
Y=mail_data["Category"]
X_train, X_test, Y_train, Y_test = train_test_split(X,
                                                    Y,
                                                    test_size=0.2,
                                                    random_state=42)
feature_extraction= TfidfVectorizer(min_df=1,stop_words="english",lowercase=True)
X_train_features=feature_extraction.fit_transform(X_train)
X_test_features=feature_extraction.transform(X_test)
Y_train=Y_train.astype('int')
Y_test=Y_test.astype('int')
model = LogisticRegression()
model.fit(X_train_features,Y_train)
prediction_on_training_data=model.predict(X_train_features)
accuracy_on_training_data=accuracy_score(Y_train,prediction_on_training_data)
print("Accuracy on training data:",accuracy_on_training_data)
prediction_on_test_data=model.predict(X_test_features)
accuracy_on_test_data=accuracy_score(Y_test,prediction_on_test_data)
print("Accuracy on test data:",accuracy_on_test_data)
inputs=input("please type a message.")
input_mail=[str(inputs)]
input_data_features=feature_extraction.transform(input_mail)
print("input_data_features:",input_data_features)
prediction=model.predict(input_data_features)
print("prediction:",prediction)
if prediction[0]==1:
    print("Normal mail",prediction[0])
elif prediction[0]==0:
    print("spam mail",prediction[0])
else:
    print("unknown condition")
Even when I enter typical spam content as input (discounts etc.), I never get a result of 0 (i.e. spam mail). The model does not classify it correctly and always returns 1. What is the reason for this? The accuracy scores look normal (about 96%) for both the training and test sets. Is there a mistake in the code that causes the same result every time, or should I try another algorithm such as a decision tree?
Solution 1:[1]
Instead of using accuracy for model evaluation, you should use a measure that works well with class imbalance.
Have a look at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html
Balanced accuracy averages the recall obtained on each class, so a class with only a few samples counts as much as the large one.
With plain accuracy, a classifier can score well by always saying "not spam", because most mails are not spam.
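To illustrate the point, here is a minimal sketch comparing the two metrics on a classifier that always answers "ham". The labels are made up: the 87/13 split is an assumption chosen for illustration, not the distribution of mail_data.csv, and the encoding (1 = ham, 0 = spam) follows the question.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical ground truth: 87 ham (1) and 13 spam (0).
# The split is an assumption for illustration only.
y_true = np.array([1] * 87 + [0] * 13)

# A degenerate classifier that always predicts "ham".
y_pred = np.ones_like(y_true)

# Plain accuracy rewards the majority-class guess.
acc = accuracy_score(y_true, y_pred)

# Balanced accuracy averages per-class recall:
# recall(ham) = 1.0, recall(spam) = 0.0.
bal_acc = balanced_accuracy_score(y_true, y_pred)

print("accuracy:", acc)
print("balanced accuracy:", bal_acc)
```

Under these assumed labels the plain accuracy looks good while the balanced accuracy drops to chance level, which is the pattern to look for when the question's model never outputs 0. Replacing accuracy_score with balanced_accuracy_score in the question's evaluation step would be one way to check this on the real data.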
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Nikolas Rieble |
