Improving accuracy of NLP text classification model [closed]
I'm coding a spam email classification program for a project, using a CSV dataset of around 8,500 emails and labels (0 for non-scam, 1 for scam). The average email is around 100-200 words, I'd say. I've imported the file with Pandas, cleaned the dataset (stopwords, punctuation) and used the scikit-learn library to initialise three models (Naive Bayes, Logistic Regression and KNN):
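For reference, the sanitize_text function that the vectorizers below use as their analyzer does the cleaning described above; a minimal sketch of it (assuming NLTK's English stopword list, since the exact implementation isn't shown) is:

import string
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))  # assumes nltk.download('stopwords') has been run

def sanitize_text(text):
    # strip punctuation, lowercase, drop stopwords; returns the token list
    # that CountVectorizer/TfidfVectorizer expect from a custom analyzer
    text = text.translate(str.maketrans('', '', string.punctuation))
    return [w for w in text.lower().split() if w not in STOPWORDS]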
def nb_model(dataset):
    print("\n\nNaive bayes")
    ds = dataset
    from sklearn.feature_extraction.text import CountVectorizer
    count_vec = CountVectorizer(analyzer=sanitize_text, ngram_range=(1, 2))
    mail_tokens = count_vec.fit_transform(ds['mail'])  # analyse and vectorise the text in ds['mail']
    print("tokens shape: ", mail_tokens.shape)
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(mail_tokens, ds['phishing'], test_size=0.20)
    from sklearn.naive_bayes import MultinomialNB
    model = MultinomialNB()  # multinomial NB suits count features
    model = model.fit(x_train, y_train)
    # result calculations
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
    pred = model.predict(x_train)
    print(classification_report(y_train, pred))
    print("\nTrain Acc: ", accuracy_score(y_train, pred))
    pred = model.predict(x_test)
    print(classification_report(y_test, pred))
    print("\nTest Acc: ", accuracy_score(y_test, pred))
    sample = "Your chance to receive a FREE Dyson Vacuum If you wish to unsubscribe from future mailings please click here or write to:6130 W Flamingo Rd. Las Vegas, NV 89103 unsubscribe here"
    # the vectorizer's analyzer already applies sanitize_text, so no manual call is needed
    data = [sample]
    data_pd = pd.Series(data)
    var = count_vec.transform(data_pd).toarray()
    result = model.predict(var)
    print("sample result: ", result)
def log_model(dataset):
    print("\n\nLogistic regression")
    ds = dataset
    # from sklearn.preprocessing import MinMaxScaler
    # scaler = MinMaxScaler()
    # ds = scaler.fit_transform(ds)
    from sklearn.feature_extraction.text import TfidfVectorizer
    vec = TfidfVectorizer(analyzer=sanitize_text, ngram_range=(1, 2))
    mail_tokens = vec.fit_transform(ds['mail'])
    x = mail_tokens  # features
    y = ds['phishing']
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
    from sklearn.linear_model import LogisticRegression
    log = LogisticRegression()
    log = log.fit(x_train, y_train)
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
    pred = log.predict(x_train)
    print(classification_report(y_train, pred))
    print("\nTrain Acc: ", accuracy_score(y_train, pred))
    pred = log.predict(x_test)
    print(classification_report(y_test, pred))
    print("\nTest Acc: ", accuracy_score(y_test, pred))
    sample = "Your chance to receive a FREE Dyson Vacuum If you wish to unsubscribe from future mailings please click here or write to:6130 W Flamingo Rd. Las Vegas, NV 89103 unsubscribe here"
    # the vectorizer's analyzer already applies sanitize_text, so no manual call is needed
    data = [sample]
    data_pd = pd.Series(data)
    var = vec.transform(data_pd).toarray()
    result = log.predict(var)
    print("sample result: ", result)
def knn_model(dataset):
    print("\n\nKNN")
    ds = dataset
    from sklearn.feature_extraction.text import TfidfVectorizer
    vec = TfidfVectorizer(analyzer=sanitize_text, ngram_range=(1, 2))
    mail_tokens = vec.fit_transform(ds['mail'])
    x = mail_tokens  # features
    y = ds['phishing']
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)
    from sklearn.neighbors import KNeighborsClassifier
    knn = KNeighborsClassifier()
    knn = knn.fit(x_train, y_train)
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
    pred = knn.predict(x_train)
    print(classification_report(y_train, pred))
    print("\nTrain Acc: ", accuracy_score(y_train, pred))
    pred = knn.predict(x_test)
    print(classification_report(y_test, pred))
    print("\nTest Acc: ", accuracy_score(y_test, pred))
    sample = "Your chance to receive a FREE Dyson Vacuum If you wish to unsubscribe from future mailings please click here or write to:6130 W Flamingo Rd. Las Vegas, NV 89103 unsubscribe here"
    # the vectorizer's analyzer already applies sanitize_text, so no manual call is needed
    data = [sample]
    data_pd = pd.Series(data)
    var = vec.transform(data_pd).toarray()
    result = knn.predict(var)
    print("sample result: ", result)
The problem is that I'm getting low test accuracy across all three models, along with low recall/F1 scores, and I'm not sure what values they should be:
Naive bayes
tokens shape:  (8384, 8239)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3102
           1       1.00      1.00      1.00      3605

    accuracy                           1.00      6707
   macro avg       1.00      1.00      1.00      6707
weighted avg       1.00      1.00      1.00      6707

Train Acc:  1.0

              precision    recall  f1-score   support

           0       1.00      0.00      0.00       799
           1       0.52      1.00      0.69       878

    accuracy                           0.52      1677
   macro avg       0.76      0.50      0.35      1677
weighted avg       0.75      0.52      0.36      1677

Test Acc:  0.5241502683363148
sample result:  [1]
Logistic regression

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3139
           1       1.00      1.00      1.00      3568

    accuracy                           1.00      6707
   macro avg       1.00      1.00      1.00      6707
weighted avg       1.00      1.00      1.00      6707

Train Acc:  1.0

              precision    recall  f1-score   support

           0       1.00      0.00      0.00       762
           1       0.55      1.00      0.71       915

    accuracy                           0.55      1677
   macro avg       0.77      0.50      0.35      1677
weighted avg       0.75      0.55      0.39      1677

Test Acc:  0.5462134764460346
sample result:  [1]
KNN

              precision    recall  f1-score   support

           0       0.68      0.78      0.73      3150
           1       0.78      0.67      0.72      3557

    accuracy                           0.72      6707
   macro avg       0.73      0.72      0.72      6707
weighted avg       0.73      0.72      0.72      6707

Train Acc:  0.7213359176979275

              precision    recall  f1-score   support

           0       0.46      1.00      0.63       751
           1       1.00      0.04      0.07       926

    accuracy                           0.47      1677
   macro avg       0.73      0.52      0.35      1677
weighted avg       0.76      0.47      0.32      1677

Test Acc:  0.46869409660107336
sample result:  [0]
Could it be an issue with the initialisation and training of the models?
Solution 1:
This question is not really appropriate for Stack Overflow, as it is about machine learning methodology rather than coding, so it will likely be moved to Cross Validated. That said:
First, you should fit your encoders after splitting, and only on the training set; otherwise the vectorizer learns its vocabulary (and IDF weights) from test documents, leaking information into training.
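As a minimal sketch of that order, assuming the same ds['mail'] and ds['phishing'] columns used in the question:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# split the raw texts first, before fitting anything
x_train_txt, x_test_txt, y_train, y_test = train_test_split(
    ds['mail'], ds['phishing'], test_size=0.20, random_state=0)

vec = TfidfVectorizer(ngram_range=(1, 2))
x_train = vec.fit_transform(x_train_txt)  # vocabulary and IDF learned from train only
x_test = vec.transform(x_test_txt)        # test set is only transformed, never fitted on

Wrapping the vectorizer and classifier in an sklearn Pipeline gives the same guarantee with less bookkeeping, and also keeps cross-validation leak-free.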
Second, you are overfitting (perfect train accuracy, near-chance test accuracy). This can be combated by reducing the complexity of your models, by reducing dimensionality, and by adding regularization. You can reduce the output dimensions of TfidfVectorizer by adjusting max_df and min_df, which ignore tokens that appear in too many or too few documents respectively. sklearn adds L2 regularization to logistic regression by default, but elastic net may be more useful for tf-idf encoded data.
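A sketch of both adjustments; the thresholds and the l1_ratio here are illustrative starting points, not tuned values:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# ignore tokens that appear in more than 50% of documents (too common)
# or in fewer than 5 documents (too rare) to shrink the feature space
vec = TfidfVectorizer(ngram_range=(1, 2), max_df=0.5, min_df=5)

# elastic net mixes l1 and l2 penalties; it requires the saga solver
log = LogisticRegression(penalty='elasticnet', solver='saga',
                         l1_ratio=0.5, max_iter=1000)

The l1 component of elastic net can zero out uninformative n-gram weights entirely, which is why it tends to help with the very wide, sparse matrices tf-idf produces.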