Good-Turing discounting language model: replace test tokens not included in the vocabulary by <UNK>
In the code below, I want to build a bigram language model with Good-Turing discounting. The training files are the first 150 files of the WSJ treebank, and the test files are the remaining 49.
Problem: After building the model and loading the test data, I must check the test tokens and replace those not included in the vocabulary by <UNK>. However, I do not know how to access the learned model's vocabulary. Could you help with the code here?
Any assistance is much appreciated. Thank you in advance.
from nltk.corpus import treebank
from nltk.util import pad_sequence
from nltk.util import bigrams, trigrams
from nltk.lm import Laplace
from nltk.probability import SimpleGoodTuringProbDist
from nltk import FreqDist
from nltk.lm.preprocessing import padded_everygram_pipeline
# training data
train_treebank = []
for j in range(150):
    for i in treebank.sents(treebank.fileids()[j]):
        train_treebank.append(i)
# training bigrams
train_bigrams = []
for sent in train_treebank:
    train_bigrams.append(list(bigrams(pad_sequence(sent,
        pad_left=True, left_pad_symbol="<START>",
        pad_right=True, right_pad_symbol="<END>",
        n=2))))
train_bigrams_onelist = [item for sublist in train_bigrams for item in sublist]
# learn good turing language model
freq_dist_bigrams = FreqDist(train_bigrams_onelist)
model = SimpleGoodTuringProbDist(freq_dist_bigrams)
# test data
test_treebank = []
for j in range(150, 199):  # len(treebank.fileids()) = 199
    for i in treebank.sents(treebank.fileids()[j]):
        test_treebank.append(i)
# replace test tokens not included in vocabulary by <UNK>
# how to do it ?
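One possible approach, offered as a sketch rather than a definitive answer: SimpleGoodTuringProbDist is built from a FreqDist of bigrams, so it is probably simpler to derive the vocabulary directly from the training sentences than to look for it on the model. The helper names build_vocabulary and replace_oov below are hypothetical (they are not part of NLTK), and the toy sentences stand in for train_treebank and test_treebank so the snippet runs without the corpus.

```python
UNK = "<UNK>"

def build_vocabulary(train_sents, pad_symbols=("<START>", "<END>")):
    """Collect every token type seen in training, plus the padding symbols.

    Hypothetical helper: train_sents is assumed to be a list of token lists,
    like train_treebank in the question.
    """
    vocab = {token for sent in train_sents for token in sent}
    vocab.update(pad_symbols)
    return vocab

def replace_oov(sents, vocab, unk=UNK):
    """Return a copy of sents with out-of-vocabulary tokens replaced by unk."""
    return [[tok if tok in vocab else unk for tok in sent] for sent in sents]

# Toy stand-ins for train_treebank / test_treebank (lists of token lists).
toy_train = [["the", "market", "fell"], ["the", "index", "rose"]]
toy_test = [["the", "market", "rose", "sharply"], ["bonds", "fell"]]

vocab = build_vocabulary(toy_train)
toy_test_unk = replace_oov(toy_test, vocab)
print(toy_test_unk)
# [['the', 'market', 'rose', '<UNK>'], ['<UNK>', 'fell']]
```

With the real data, the same calls would presumably be build_vocabulary(train_treebank) and replace_oov(test_treebank, vocab). Note that if <UNK> never occurs in the training data, every test bigram containing it is unseen by the model; a common remedy is to also map rare training tokens to <UNK> before counting bigrams. NLTK's nltk.lm.Vocabulary class, with its unk_cutoff argument and lookup method, may cover both steps, though that is worth checking against the documentation for your NLTK version.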
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
