Improving accuracy of NLP text classification model [closed]

I'm coding a spam email classification program for a project, using a CSV dataset of around 8,500 emails with labels (0 for non-scam, 1 for scam). The average email is roughly 100-200 words. I've imported the file with Pandas, cleaned the text (removing stopwords and punctuation via a sanitize_text helper; a sketch of it follows the snippets below), and used scikit-learn to set up three models (Naive Bayes, Logistic Regression and KNN):

def nb_model(dataset):
  print("\n\nNaive bayes")
  ds = dataset
  count_vec = CountVectorizer(analyzer=sanitize_text, ngram_range=(1,2))
  mail_tokens = count_vec.fit_transform(ds['mail'])  # analyse and vectorize the text in ds['mail']
  print("tokens shape: ", mail_tokens.shape)
  
  from sklearn.model_selection import train_test_split
  x_train, x_test, y_train, y_test = train_test_split(mail_tokens, ds['phishing'], test_size=0.20)

  # model fit (this step is missing from the snippet as pasted; MultinomialNB assumed)
  from sklearn.naive_bayes import MultinomialNB
  model = MultinomialNB()
  model = model.fit(x_train, y_train)

  #result calculations
  from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
  pred = model.predict(x_train)
  print(classification_report(y_train, pred))
  print("\nTrain Acc: ", accuracy_score(y_train, pred))

  pred = model.predict(x_test)
  print(classification_report(y_test, pred))
  print("\nTest Acc: ", accuracy_score(y_test, pred))

  
  sample = "Your chance to receive a FREE Dyson Vacuum If you wish to unsubscribe from future mailings please click here or write to:6130 W Flamingo Rd. Las Vegas, NV 89103 unsubscribe here"
  sample = sanitize_text(sample)
  data = [sample]
  data_pd = pd.Series(data)
  var = count_vec.transform(data_pd).toarray()
  result = model.predict(var)
  print("sample result: ", result)

def log_model(dataset):
  print("\n\nLogistic regression")
  ds = dataset
  # from sklearn.preprocessing import MinMaxScaler
  # scaler = MinMaxScaler()
  # ds = scaler.fit_transform(ds)
  
  
  from sklearn.feature_extraction.text import TfidfVectorizer
  vec = TfidfVectorizer(analyzer=sanitize_text, ngram_range=(1,2))
  mail_tokens = vec.fit_transform(ds['mail'])
  

  x = mail_tokens  # features
  y = ds['phishing']

  from sklearn.model_selection import train_test_split
  x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)

  from sklearn.linear_model import LogisticRegression
  log = LogisticRegression()
  log = log.fit(x_train, y_train)

  from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
  pred = log.predict(x_train)
  print(classification_report(y_train, pred))
  print("\nTrain Acc: ", accuracy_score(y_train, pred))

  pred = log.predict(x_test)
  print(classification_report(y_test, pred))
  print("\nTest Acc: ", accuracy_score(y_test, pred))

  sample = "Your chance to receive a FREE Dyson Vacuum If you wish to unsubscribe from future mailings please click here or write to:6130 W Flamingo Rd. Las Vegas, NV 89103 unsubscribe here"
  sample = sanitize_text(sample)
  data = [sample]
  data_pd = pd.Series(data)
  var = vec.transform(data_pd).toarray()
  result = log.predict(var)
  print("sample result: ", result)

def knn_model(dataset):
  print("\n\nKNN")
  ds = dataset
  
  from sklearn.feature_extraction.text import TfidfVectorizer
  vec = TfidfVectorizer(analyzer=sanitize_text, ngram_range=(1, 2))
  mail_tokens = vec.fit_transform(ds['mail'])
  

  x = mail_tokens  # features
  y = ds['phishing']

  from sklearn.model_selection import train_test_split
  x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

  from sklearn.neighbors import KNeighborsClassifier
  knn = KNeighborsClassifier()
  knn = knn.fit(x_train, y_train)

  from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
  pred = knn.predict(x_train)
  print(classification_report(y_train, pred))
  print("\nTrain Acc: ", accuracy_score(y_train, pred))

  pred = knn.predict(x_test)
  print(classification_report(y_test, pred))
  print("\nTest Acc: ", accuracy_score(y_test, pred))

  sample = "Your chance to receive a FREE Dyson Vacuum If you wish to unsubscribe from future mailings please click here or write to:6130 W Flamingo Rd. Las Vegas, NV 89103 unsubscribe here"
  sample = sanitize_text(sample)
  data = [sample]
  data_pd = pd.Series(data)
  var = vec.transform(data_pd).toarray()
  result = knn.predict(var)
  print("sample result: ", result)
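
A sketch of sanitize_text, which isn't shown above: based on the cleaning described (punctuation and stopword removal), it is assumed to look something like the NLTK-based version below. Because it is passed as the vectorizers' analyzer=, it receives one raw email string and should return a list of tokens.

import string
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOPWORDS = set(stopwords.words('english'))

def sanitize_text(text):
  # lower-case, strip punctuation, split into words, drop English stopwords
  text = text.translate(str.maketrans('', '', string.punctuation)).lower()
  return [word for word in text.split() if word not in STOPWORDS]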

The problem is that I'm getting low test accuracy across all three models, along with low recall and F1 scores, although I'm not sure what values I should be aiming for:

Naive bayes
tokens shape:  (8384, 8239)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3102
           1       1.00      1.00      1.00      3605

    accuracy                           1.00      6707
   macro avg       1.00      1.00      1.00      6707
weighted avg       1.00      1.00      1.00      6707


Train Acc:  1.0
              precision    recall  f1-score   support

           0       1.00      0.00      0.00       799
           1       0.52      1.00      0.69       878

    accuracy                           0.52      1677
   macro avg       0.76      0.50      0.35      1677
weighted avg       0.75      0.52      0.36      1677


Test Acc:  0.5241502683363148
sample result:  [1]


Logistic regression
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3139
           1       1.00      1.00      1.00      3568

    accuracy                           1.00      6707
   macro avg       1.00      1.00      1.00      6707
weighted avg       1.00      1.00      1.00      6707


Train Acc:  1.0
              precision    recall  f1-score   support

           0       1.00      0.00      0.00       762
           1       0.55      1.00      0.71       915

    accuracy                           0.55      1677
   macro avg       0.77      0.50      0.35      1677
weighted avg       0.75      0.55      0.39      1677


Test Acc:  0.5462134764460346
sample result:  [1]

KNN
              precision    recall  f1-score   support

           0       0.68      0.78      0.73      3150
           1       0.78      0.67      0.72      3557

    accuracy                           0.72      6707
   macro avg       0.73      0.72      0.72      6707
weighted avg       0.73      0.72      0.72      6707


Train Acc:  0.7213359176979275
              precision    recall  f1-score   support

           0       0.46      1.00      0.63       751
           1       1.00      0.04      0.07       926

    accuracy                           0.47      1677
   macro avg       0.73      0.52      0.35      1677
weighted avg       0.76      0.47      0.32      1677


Test Acc:  0.46869409660107336
sample result:  [0]

Could it be an issue with the initialisation and training of the models?



Solution 1:

This question is not really appropriate for Stack Overflow, as it is about machine learning methodology rather than coding, so it will likely be moved to Cross Validated. That said:

  1. You should fit your encoders (the CountVectorizer/TfidfVectorizer) after splitting, and only on the training set, then merely transform the test set (see the first sketch below).

  2. You are overfitting. This can be combated by reducing the complexity of your models, by reducing dimensionality, and by adding regularization. You can reduce the output dimensionality of TfidfVectorizer by adjusting max_df and min_df, which ignore tokens that appear in too many or too few documents respectively. sklearn applies l2 regularization to logistic regression by default, but elastic net may be more useful for tf-idf encoded data (see the second sketch below).
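
A minimal sketch of point 1, assuming the same ds DataFrame ('mail'/'phishing' columns) and sanitize_text helper from the question: split the raw emails first, then fit the vectorizer on the training split only and merely transform the test split.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

x_train_raw, x_test_raw, y_train, y_test = train_test_split(
    ds['mail'], ds['phishing'], test_size=0.20, random_state=0)

vec = TfidfVectorizer(analyzer=sanitize_text)
x_train = vec.fit_transform(x_train_raw)  # fit the vocabulary on the training emails only
x_test = vec.transform(x_test_raw)        # transform (no fitting) the held-out emails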

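And a sketch of point 2, continuing from the split above; the min_df, max_df and l1_ratio values here are placeholders to tune, not recommendations.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# prune very rare and very common tokens to shrink the feature space
vec = TfidfVectorizer(analyzer=sanitize_text, min_df=5, max_df=0.5)
x_train = vec.fit_transform(x_train_raw)
x_test = vec.transform(x_test_raw)

# elastic net regularization (requires the saga solver)
log = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=1000)
log.fit(x_train, y_train)
print("Test Acc:", log.score(x_test, y_test))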