Identifying bigrams using Gensim Phraser that contain the word "not" for sentiment analysis

I am working on a sentiment analysis project where I am analyzing a corpus of documents, and I am specifically not removing the word "not" as a stopword, so that I can use it to determine if a text agrees or disagrees with something. For instance, there is a difference between "not effective" and "effective" when discussing the COVID vaccine.

However, my phraser is not identifying any bigrams with the word "not." I presume this is because that token exists in such large numbers (particularly because I expanded contractions, so "isn't" -> "is not"), that the scoring function simply scores all bigrams with "not" too low. This would be because the standard phrase scoring function is:

score(worda, wordb) = (bigram_count - min_count) * len_vocab / (worda_count * wordb_count)

(where min_count is a hyperparameter)

So, since "not" exists many thousands of times in the database, worda_count will be very large, leading to a large denominator and dropping the score considerably.
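To make the effect concrete, here is a minimal sketch that assumes the default scorer computes (bigram_count - min_count) / worda_count / wordb_count * len_vocab and that the phrase threshold is left at Gensim's default of 10.0. The function name `default_score` and all of the counts are hypothetical, chosen only to show how a very frequent first word drags the score down.

```python
def default_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Plain-Python mirror of the assumed default scoring formula."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


# Hypothetical corpus statistics (not taken from any real corpus).
LEN_VOCAB = 50_000
MIN_COUNT = 5
THRESHOLD = 10.0  # assumed default threshold for accepting a bigram

# Same second word and same bigram count; only the first word's frequency differs.
rare_first_word = default_score(200, 300, 150, LEN_VOCAB, MIN_COUNT)
frequent_first_word = default_score(40_000, 300, 150, LEN_VOCAB, MIN_COUNT)

print(f"rare first word:     {rare_first_word:.3f}")
print(f"frequent first word: {frequent_first_word:.3f}")
print("frequent first word passes threshold:", frequent_first_word > THRESHOLD)
```

With these made-up numbers the bigram whose first word occurs 200 times clears the threshold easily, while the identical bigram count behind a first word that occurs 40,000 times falls well below it.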

Is there a way to get around this, so "not" bigrams are scored effectively?

I can think of a few options off the top of my head:

Write my own scoring function that effectively has two scoring formulas: the standard scoring formula, and a different scoring formula if the first word is "not".
I could include "not" in a list of connector_words, but gensim.models.phrases.Phraser specifically indicates that these connector words cannot be at the beginning or end of a phrase.

Solution 1:^[1]

As you've discovered, the Phrases functionality in Gensim is pretty crude: it only combines words based on a meaning-oblivious statistical analysis. It's more likely to be helpful in promoting certain noun-phrases ('new_york') or idioms than generic syntactical reversals-of-meaning (as with an added 'not'). So whether you'll want to use it at all, I'm not sure.

You could try the most simpleminded thing possible: preprocess to always attach 'not' to the following word. Maybe it'll help!
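A minimal sketch of that simpleminded approach, assuming the text is already tokenized and lowercased and that contractions have been expanded so negation always appears as the token 'not'. The helper name `attach_not` and the underscore join are illustrative choices, not anything Gensim provides.

```python
def attach_not(tokens):
    """Merge every 'not' with the token that follows it, e.g. not_effective."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "not" and i + 1 < len(tokens):
            # Glue 'not' onto the next token and skip past both.
            merged.append("not_" + tokens[i + 1])
            i += 2
        else:
            # Ordinary token, or a trailing 'not' with nothing to attach to.
            merged.append(tokens[i])
            i += 1
    return merged


sentence = "the vaccine is not effective".split()
print(attach_not(sentence))
```

This would run before any Phrases training, so the merged tokens simply look like ordinary vocabulary items to everything downstream.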

You could also try some expensive grammar-aware preprocessing - the sort that labels words with parts-of-speech, & further identifies which other words/word-ranges a particular 'not' modifies. That might allow you to conditionally connect the 'not' to later words – maybe even non-contiguous words – & perhaps that will provide a lift to downstream sentiment-analysis.
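As a rough sketch of the grammar-aware idea, the code below assumes a dependency parser has already run and that it marks negation words with a `neg` label pointing at the word they modify (spaCy's English models use a label of that name, but treat the exact label as an assumption for whichever parser you choose). The `ParsedToken` type, the `attach_negations` helper, and the hand-written parse are all hypothetical stand-ins for real parser output.

```python
from typing import List, NamedTuple


class ParsedToken(NamedTuple):
    text: str
    dep: str   # dependency label assigned by the parser (assumed scheme)
    head: int  # index of the token this one depends on


def attach_negations(parsed: List[ParsedToken]) -> List[str]:
    """Fold each negation word into the word it modifies, even if not adjacent."""
    negated = {tok.head for tok in parsed if tok.dep == "neg"}
    result = []
    for index, tok in enumerate(parsed):
        if tok.dep == "neg":
            continue  # the standalone 'not' is dropped; its head carries the marker
        result.append("not_" + tok.text if index in negated else tok.text)
    return result


# Hand-written parse of "the vaccine does not really work" (illustrative only).
example = [
    ParsedToken("the", "det", 1),
    ParsedToken("vaccine", "nsubj", 5),
    ParsedToken("does", "aux", 5),
    ParsedToken("not", "neg", 5),
    ParsedToken("really", "advmod", 5),
    ParsedToken("work", "ROOT", 5),
]
print(attach_negations(example))
```

Here 'not' is connected to 'work' even though 'really' sits between them, which plain adjacent-token merging would miss. Whether the extra parsing cost pays off for sentiment analysis is something only an experiment on your own data can tell you.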

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	gojomo
