Identifying bigrams using Gensim Phraser that contain the word "not" for sentiment analysis
I am working on a sentiment analysis project where I am analyzing a corpus of documents, and I am specifically not removing the word "not" as a stopword, so that I can use it to determine if a text agrees or disagrees with something. For instance, there is a difference between "not effective" and "effective" when discussing the COVID vaccine.
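However, my phraser is not identifying any bigrams with the word "not." I presume this is because that token occurs so often (particularly because I expanded contractions, so "isn't" -> "is not") that the scoring function simply scores all bigrams with "not" too low. This would be because the standard phrase scoring function is:

$$\text{score}(a, b) = \frac{(\text{bigram\_count} - \text{min\_count}) \times \text{len\_vocab}}{\text{worda\_count} \times \text{wordb\_count}}$$

(where `min_count` is a hyperparameter, `bigram_count` is the number of times the two words occur together, and `len_vocab` is the vocabulary size)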
So, since "not" exists many thousands of times in the database, worda_count will be very large, leading to a large denominator and dropping the score considerably.
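To make the effect concrete, here is a minimal sketch that restates the default score in plain Python and evaluates it for two bigrams that differ only in how common the first word is. All counts below are hypothetical, chosen only for illustration; the function name `default_score` is mine, and the threshold of 10.0 is assumed to be the Gensim default for `Phrases`.

```python
def default_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Plain-Python restatement of the default phrase score described above."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


# Hypothetical corpus statistics (not taken from any real dataset).
LEN_VOCAB = 50_000
MIN_COUNT = 5
THRESHOLD = 10.0  # assumed default threshold for accepting a bigram

# A bigram whose first word is moderately rare.
rare_first = default_score(
    worda_count=300, wordb_count=400, bigram_count=205,
    len_vocab=LEN_VOCAB, min_count=MIN_COUNT,
)

# The same co-occurrence count, but the first word is as frequent as "not".
not_first = default_score(
    worda_count=40_000, wordb_count=400, bigram_count=205,
    len_vocab=LEN_VOCAB, min_count=MIN_COUNT,
)

print(f"rare first word: {rare_first:.3f}, passes: {rare_first > THRESHOLD}")
print(f"'not' first word: {not_first:.3f}, passes: {not_first > THRESHOLD}")
```

With these made-up numbers, the only thing that changes between the two calls is `worda_count`, and that alone moves the score from well above the threshold to well below it.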
Is there a way to get around this, so "not" bigrams are scored effectively?
I can think of a few options off the top of my head:
Write my own scoring function that effectively has two scoring formulas: the standard scoring formula, and a different scoring formula if the first word is "not".
I could include "not" in a list of `connector_words`, but the `gensim.models.phrases.Phraser` documentation specifically indicates that these connector words cannot be at the beginning or end of a phrase.
Solution 1:[1]
As you've discovered, the Phrases functionality in Gensim is pretty crude: it only combines words based on a meaning-oblivious statistical analysis. It's more likely to be helpful in promoting certain noun-phrases ('new_york') or idioms than generic syntactical reversals-of-meaning (as with an added 'not'). So whether you'll want to use it at all, I'm not sure.
You could try the most simpleminded thing possible: preprocess to always attach 'not' to the following word. Maybe it'll help!
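A minimal sketch of that simpleminded preprocessing, assuming each document is already a list of lowercase tokens with contractions expanded. The function name `attach_not` and the underscore joiner are illustrative choices, not part of Gensim:

```python
def attach_not(tokens, negator="not", joiner="_"):
    """Merge every `negator` token with the token that follows it.

    A trailing negator with nothing after it is left unchanged.
    """
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] == negator and i + 1 < len(tokens):
            # Glue "not" onto the next token and skip past both.
            merged.append(tokens[i] + joiner + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


example = ["the", "vaccine", "is", "not", "effective"]
print(attach_not(example))
```

Running this before any phrase detection means tokens such as `not_effective` enter the vocabulary as single units, so they no longer depend on the bigram score at all. Whether that helps the downstream sentiment model is something to check empirically.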
You could also try some expensive grammar-aware preprocessing - the sort that labels words with parts-of-speech, & further identifies which other words/word-ranges a particular 'not' modifies. That might allow you to conditionally connect the 'not' to later words – maybe even non-contiguous words – & perhaps that will provide a lift to downstream sentiment-analysis.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | gojomo |

