Good-Turing discounting language model: replace test tokens not included in the vocabulary with <UNK>

In the code below, I want to build a bigram language model with Good-Turing discounting. The training files are the first 150 files of the WSJ treebank, and the test files are the remaining 49.

Problem: After building the model and loading the test data, I must check the test tokens and replace those not included in the vocabulary with <UNK>. However, I do not know how to access the learned model's vocabulary. Could you help with the code here?

Any assistance is much appreciated. Thank you in advance.

from nltk.corpus import treebank
from nltk.util import pad_sequence
from nltk.util import bigrams, trigrams
from nltk.lm import Laplace
from nltk.probability import SimpleGoodTuringProbDist
from nltk import FreqDist
from nltk.lm.preprocessing import padded_everygram_pipeline 

# training data
train_treebank = []
for fileid in treebank.fileids()[:150]:
    train_treebank.extend(treebank.sents(fileid))

# training bigrams
train_bigrams = []
for sent in train_treebank :
    train_bigrams.append(list(bigrams(pad_sequence(sent,
                      pad_left=True, left_pad_symbol="<START>",
                      pad_right=True, right_pad_symbol="<END>",
                      n=2))))
    
train_bigrams_onelist = [item for sublist in train_bigrams for item in sublist]

# learn good turing language model
freq_dist_bigrams = FreqDist(train_bigrams_onelist)
model = SimpleGoodTuringProbDist(freq_dist_bigrams)

# test data
test_treebank = []
for fileid in treebank.fileids()[150:]:  # len(treebank.fileids()) == 199
    test_treebank.extend(treebank.sents(fileid))

# replace test tokens not included in vocabulary by <UNK>
# how to do it ?
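
One possible approach, sketched under assumptions: SimpleGoodTuringProbDist is built from bigram counts, so rather than asking the model for a vocabulary, the vocabulary can be collected directly from the training tokens and then used to filter the test sentences. The helper names build_vocab and replace_unknown, the min_count parameter, and the toy sentences below are hypothetical and only for illustration; they are not part of NLTK.

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(sentences, min_count=1):
    """Collect every token that occurs at least min_count times in the training sentences."""
    counts = Counter(token for sent in sentences for token in sent)
    return {token for token, count in counts.items() if count >= min_count}

def replace_unknown(sentences, vocab, unk=UNK):
    """Return a copy of the sentences with out-of-vocabulary tokens replaced by unk."""
    return [[tok if tok in vocab else unk for tok in sent] for sent in sentences]

# Toy data standing in for train_treebank and test_treebank (assumption: lists of token lists)
toy_train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
toy_test = [["the", "bird", "sat"], ["a", "cat"]]

vocab = build_vocab(toy_train)
toy_test_unk = replace_unknown(toy_test, vocab)
print(toy_test_unk)
```

With the variables from the question, this would be called as vocab = build_vocab(train_treebank) followed by test_treebank = replace_unknown(test_treebank, vocab). An equivalent vocabulary could likely also be recovered from the keys of freq_dist_bigrams, for example {w for bg in freq_dist_bigrams for w in bg}, which would additionally contain the padding symbols. Note that with min_count=1 the training data never contains <UNK>, so every bigram involving <UNK> is unseen at test time; if that is not desired, one option is to raise min_count and apply replace_unknown to the training sentences as well before counting bigrams.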


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
