How to interpret Python NLTK bigram likelihood ratios?
I'm trying to figure out how to properly interpret nltk's "likelihood ratio" given the below code (taken from this question).
import nltk.collocations
import nltk.corpus
import collections
bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)
# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))
# Sort each word's followers by score, highest first.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])
prefix_keys['baseball']
With the following output:
[('game', 32.11075451975229),
('cap', 27.81891372457088),
('park', 23.509042621473505),
('games', 23.10503351305401),
("player's", 16.22787286342467),
('rightfully', 16.22787286342467),
[...]
Looking at the docs, it looks like the likelihood ratio printed next to each bigram is from
"Scores ngrams using likelihood ratios as in Manning and Schutze 5.3.4."
That section is covered in this article, which states on pg. 22:
One advantage of likelihood ratios is that they have a clear intuitive interpretation. For example, the bigram powerful computers is e^(.5*82.96) = 1.3*10^18 times more likely under the hypothesis that computers is more likely to follow powerful than its base rate of occurrence would suggest. This number is easier to interpret than the scores of the t test or the χ² test which we have to look up in a table.
What I'm confused about is what the "base rate of occurrence" would be when I run the nltk code above on my own data. Would it be safe to say, for example, that "game" is 32 times more likely to appear next to "baseball" in the current dataset than in average use of standard English? Or is it that "game" is more likely to appear next to "baseball" than other words are to appear next to "baseball" within the same dataset?
Any help/guidance towards a clearer interpretation or example is much appreciated!
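For reference, here is a minimal sketch of how a score of this kind can be computed by hand, following the binomial formulation in Manning and Schutze 5.3.4 as I understand it. The function names and the counts in the usage lines are hypothetical, and NLTK's own implementation may differ in detail (for example in how it smooths zero counts), so treat this as an illustration rather than a reproduction of the library's numbers.

```python
import math


def log_l(k, n, x):
    """Log of the binomial likelihood x**k * (1 - x)**(n - k).

    Terms with a zero exponent are treated as contributing 0,
    which avoids evaluating log(0) when x is 0 or 1.
    """
    def xlogy(a, b):
        return 0.0 if a == 0 else a * math.log(b)

    return xlogy(k, x) + xlogy(n - k, 1.0 - x)


def likelihood_ratio(c12, c1, c2, n):
    """Return -2 * log(lambda) for the bigram (w1, w2).

    c12: count of the bigram "w1 w2"
    c1:  count of w1
    c2:  count of w2
    n:   total number of tokens in the corpus
    """
    p = c2 / n                    # H1: w2 is equally likely after any word
    p1 = c12 / c1                 # H2: rate of w2 after w1
    p2 = (c2 - c12) / (n - c1)    # H2: rate of w2 after any other word
    log_lambda = (
        log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
        - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2)
    )
    return -2.0 * log_lambda


# Hypothetical counts, not taken from the Brown corpus.
score = likelihood_ratio(c12=8, c1=50, c2=200, n=1_000_000)
# Per the quoted passage, exp(0.5 * score) is how many times more likely
# the data are under the "dependent" hypothesis than under independence.
factor = math.exp(0.5 * score)
```

Note that every count in this sketch comes from the same corpus, so under this formulation the "base rate" is the overall rate of the second word in that corpus (`c2 / n`), not a rate from general English. If that reading is right, the score itself is not a "times more likely" figure; the quoted passage gets its multiplier by exponentiating half the score.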
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow