Scikit-learn spam mail prediction code always predicts the same result

The code: spam mail prediction


import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score


raw_mail_data=pd.read_csv("mail_data.csv")

mail_data=raw_mail_data.where( (pd.notnull(raw_mail_data)),"" )



mail_data.loc[mail_data["Category"]=="spam","Category"]=0

mail_data.loc[mail_data["Category"]=="ham","Category"]=1


X=mail_data["Message"]        
Y=mail_data["Category"]       

X_train, X_test, Y_train, Y_test = train_test_split(X, 
                                                    Y, 
                                                    test_size=0.2,  
                                                    random_state=42)


# lowercase expects a boolean, not the string "True"
feature_extraction = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)




X_train_features=feature_extraction.fit_transform(X_train) 
X_test_features=feature_extraction.transform(X_test) 




Y_train=Y_train.astype('int')
Y_test=Y_test.astype('int')


model = LogisticRegression()
model.fit(X_train_features,Y_train)



prediction_on_training_data=model.predict(X_train_features)
accuracy_on_training_data=accuracy_score(Y_train,prediction_on_training_data)
print("Accuracy on training data:",accuracy_on_training_data)



prediction_on_test_data=model.predict(X_test_features)
accuracy_on_test_data=accuracy_score(Y_test,prediction_on_test_data)
print("Accuracy on test data:",accuracy_on_test_data)



inputs=input("please type a message.")

input_mail=[str(inputs)]

input_data_features=feature_extraction.transform(input_mail)

print("input_data_features:",input_data_features)

prediction=model.predict(input_data_features)
print("prediction:",prediction)

if prediction[0]==1:
   print("Normal mail",prediction[0])
elif  prediction[0]==0:
   print("spam mail",prediction[0])
else:
   print("unknown condition")

Even when I enter typical spam content as input (discounts, etc.), I never get 0 (i.e. spam mail). The model does not classify it correctly and always returns 1. What is the reason for this? The accuracy scores look normal (about 96%) for both the training and test sets. Is a mistake in the code causing the same result every time, or should I try another algorithm such as a decision tree?



Solution 1:[1]

Instead of using accuracy for model evaluation, you should use a measure that works well with class imbalance.

Have a look at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html

Balanced accuracy averages the accuracy (recall) obtained on each class, so classes with only a small number of samples count as much as the large ones.
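
As a minimal sketch of what that per-class averaging means, the toy labels below (hypothetical data, using the question's encoding of 0 for spam and 1 for ham) suggest that `balanced_accuracy_score` matches the mean of the per-class recalls:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical toy labels, same encoding as the question: 0 = spam, 1 = ham.
y_true = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
# One of the two spam mails is misclassified as ham.
y_pred = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1])

# Recall computed separately for each class (spam first, then ham).
per_class_recall = recall_score(y_true, y_pred, average=None)
manual_balanced = float(per_class_recall.mean())

plain = float(accuracy_score(y_true, y_pred))
balanced = float(balanced_accuracy_score(y_true, y_pred))

print("accuracy:", plain)                   # 9 of 10 correct
print("balanced accuracy:", balanced)       # mean of 0.5 (spam) and 1.0 (ham)
print("mean per-class recall:", manual_balanced)
```

Plain accuracy hides the missed spam mail here, while the balanced score drops because half of the small spam class was misclassified.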

When only accuracy is considered, a classifier can score well simply by always answering "not spam" (label 1 in your code), because most mails are not spam.
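
To illustrate, here is a rough sketch of such an "always not spam" predictor. The 13/87 class split is an assumption made up for this example and is not taken from `mail_data.csv`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced labels: 13 spam (0) and 87 ham (1).
y_true = np.array([0] * 13 + [1] * 87)

# A "classifier" that ignores its input and always answers ham (1).
y_always_ham = np.ones_like(y_true)

baseline_accuracy = float(accuracy_score(y_true, y_always_ham))
baseline_balanced = float(balanced_accuracy_score(y_true, y_always_ham))

print("accuracy:", baseline_accuracy)           # looks good despite finding no spam
print("balanced accuracy:", baseline_balanced)  # no better than chance for two classes
```

On your own data, comparing `Y_train.value_counts()` with the reported accuracy would show whether the 96% is much better than this kind of baseline.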

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Nikolas Rieble