'Predicting score from text data and non-text data

This is for an assignment. I have ~100 publications on a dummy website. Each publication has been given an artificial score indicating how successful it is. I need to predict what affects that value.

So far, I've scraped the information that might affect the score and saved them into individual lists. The relevant part of my loop looks like:

for publicationurl:

Predictor 1 = 3.025

Predictor 2 = Journal A

Predictor 3 = 0

Response Variable = 42.5

Title = Sentence

Abstract = Paragraph

I can resolve most of that by putting predictors 1-3 and the response into a dataframe then doing regression. The bit that is tripping me up is the Title and Abstract text. I can strip them of punctuation and remove stopwords, but after that I'm not sure how to actually analyse them alongside the other predictors. I was looking into doing some kind of text-similarity comparison between high-high and high-low scoring pairs and basing whether the title and abstract affect the score based off of that, but am hoping there is a much neater method that allows me to actually put that text into a predictive model too.

I currently have 5 predictors besides the text and there are ~40,000 words in total across all titles and abstracts if any of that affects what kind of method works best. Ideally I'd like to end up being able to put everything into a single predictive model, but any method that can lead me to a working solution is good.



Solution 1:[1]

This would be an ideal situation for using Multinomial Naive Bayes. This is a relatively simple yet quite powerful method to classify the class of a text. If this is an introductory exercise I'm 99% sure that you're prof is expecting something with NB to solve the given problem.

I would recommend a library like sklearn which should make the task almost trivial. If you're interested in the intuition behind NB, this YouTube video should serve as a good introduction. Start off by going over some examples/blog posts, google should provide you with countless examples. Then modify the code to fit your use case. You could either group the articles in two classes e.g. score <=5 = bad, score > 5 = good. A next step would be to predict more than two classes like explained here.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 picklepick