Applying k-fold cross-validation (stratified 10-fold) to my text classification model

I need help with the following. I have a data frame with two columns: class (0/1) and text. After cleansing (lemmatizing, removing stopwords, etc.), I split the data as follows:

#splitting dataset

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    dataset['text_lemm_nostop'], dataset['class'], test_size=0.3, random_state=50)

Then I used n-gram:

#using n-gram

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=5, ngram_range=(2, 2), max_features=5000).fit(X_train)
print('No. of features:', len(vect.get_feature_names_out()))  # how many features

Then I did the vectorizing:

X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)

Then I started applying ML algorithms (logistic regression, Naive Bayes, random forest, etc.); I will share only the logistic regression:

#Logistic regression

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

from sklearn.metrics import classification_report, accuracy_score, roc_auc_score

predictions = model.predict(X_test_vectorized)  # X_test is already vectorized above
print("AUC Score is: ", roc_auc_score(y_test, predictions), '\n')
print("Accuracy:", accuracy_score(y_test, predictions), '\n')
print('Classification Report:\n', classification_report(y_test, predictions))

My Questions:

1. Is what I am doing fine if I go with a normal split (30% test)? (I feel there may be issues with the n-gram code!)

2. If I want to use k-fold cross-validation (i.e., stratified 10-fold), how could I do that in my code?

Appreciate any help and support!!



Solution 1:[1]

A few comments:

  • Generally I don't see any obvious mistake; it seems fine to me.
  • Do you really mean to use only bigrams (n=2)? This implies that you don't use unigrams (single words), so you might miss some information this way. It's more common to use ngram_range=(1,2), for instance (see the first sketch after this list). It's probably also worth trying unigrams only, at least to compare the performance.
  • To perform 10-fold cross-validation, I'd suggest you use KFold because it's the simplest method; since you specifically want stratified folds, StratifiedKFold works the same way while preserving the class proportions in each fold. In this case it's almost the same code, except that you insert it inside a loop which iterates over the 10 folds. Each time, you fit the model on a specific training set (90%) and evaluate it on the remaining 10% test set. Both training and test sets are obtained by selecting the corresponding indexes returned by KFold (see the second sketch below).
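For the n-gram point, a minimal sketch of the vectorizer with unigrams plus bigrams, reusing the min_df and max_features values from the question:

#unigrams + bigrams

from sklearn.feature_extraction.text import CountVectorizer

# (1, 2) keeps single words as well as bigrams
vect = CountVectorizer(min_df=5, ngram_range=(1, 2), max_features=5000).fit(X_train)
print('No. of features:', len(vect.get_feature_names_out()))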
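For the cross-validation point, here is a minimal sketch of the loop described above, using StratifiedKFold since the question asks for stratified folds (it has the same interface as KFold). The column names and hyperparameters are taken from the question; note that the vectorizer is re-fitted on each fold's training portion so that the fold's test set doesn't leak into the features:

#stratified 10-fold cross-validation

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = dataset['text_lemm_nostop'].to_numpy()
y = dataset['class'].to_numpy()

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=50)
accuracies = []
for train_idx, test_idx in skf.split(X, y):
    # select this fold's training (90%) and test (10%) sets by index
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

    # fit the vectorizer on the fold's training set only
    vect = CountVectorizer(min_df=5, ngram_range=(1, 2), max_features=5000).fit(X_tr)

    model = LogisticRegression()
    model.fit(vect.transform(X_tr), y_tr)

    predictions = model.predict(vect.transform(X_te))
    accuracies.append(accuracy_score(y_te, predictions))

print('Mean accuracy over 10 folds:', np.mean(accuracies))

The same loop works for the other models (Naive Bayes, random forest, etc.) by swapping LogisticRegression for the estimator of interest.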

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Erwan