Higher Testing Accuracy and Lower Training Accuracy
I am rather new to NLP, and I am running into a situation where my training accuracy is around 70% but my test accuracy is 80%. I have roughly 6000 entries from 2020 to use as training data and 300 entries from the first quarter of 2021 to use as test data (due to the unavailability of Q2, Q3, and Q4 data). Each entry contains at least 2-3 paragraphs.
I have set up cross-validation using RepeatedStratifiedKFold with 10 splits and 3 repeats, and GridSearchCV with C=0.1 and kernel='linear'. I set up stop words (customized somewhat, e.g. to include the top 100 common names, month names, and some other common words that don't carry much meaning in my setting), lowercased everything, and used the Snowball stemmer. The resulting confusion matrix for the test set is:
[[165 34]
[ 27 96]]
with an F1 score of 81%. However, upon examining the cross-validation results on my training set, the mean accuracy was 0.720 (+/- 0.036 std).
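Roughly, the setup corresponds to the sketch below (the TF-IDF vectorizer, the placeholder stop-word list, and variable names such as X_train_text are assumptions on my part; data loading is omitted):

```python
# Rough sketch of the setup (assumed names: X_train_text / y_train hold the 2020
# entries and labels, X_test_text / y_test the Q1 2021 entries and labels;
# data loading and the full custom stop-word list are omitted).
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, f1_score

stemmer = SnowballStemmer("english")
custom_stop_words = {"january", "february", "march"}  # placeholder for the customized list

def tokenize_and_stem(text):
    # lowercase, drop custom stop words, then apply the Snowball stemmer
    return [stemmer.stem(tok) for tok in text.lower().split()
            if tok not in custom_stop_words]

# TF-IDF is an assumption; the exact vectorizer is not stated above
vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, lowercase=True)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
grid = GridSearchCV(SVC(), param_grid={"C": [0.1], "kernel": ["linear"]},
                    cv=cv, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_score_)  # cross-validated mean accuracy, ~0.720 here

y_pred = grid.predict(X_test)
print(confusion_matrix(y_test, y_pred))  # [[165 34], [27 96]] on the Q1 2021 set
print(f1_score(y_test, y_pred))          # ~0.81
```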
I am trying to figure out why there is a 9% difference between the training and test sets, with the test set getting the higher result, and I am also not sure what else I could do to further improve the accuracy.
My goal is to predict labels for the unavailable Q2, Q3, and Q4 data and ultimately compare those three quarters once the data becomes available.
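Once the later quarters arrive, the idea would be to reuse the fitted vectorizer and classifier from the sketch above, along these lines (X_q2_text is a hypothetical name for the Q2 entries):

```python
# Hypothetical follow-up: once Q2 entries arrive, reuse the fitted vectorizer
# and grid-searched classifier from above (X_q2_text is an assumed name).
X_q2 = vectorizer.transform(X_q2_text)  # reuse the vocabulary fitted on 2020 data
q2_predictions = grid.predict(X_q2)
```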
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
