NLTK POS classifier using custom provided data
I was given code in the following format:
1 i PRP
2 'd MD
3 like VB
4 to TO
5 go VB
6 to IN
7 a DT
8 fancy JJ
9 restaurant NN
10 . .
I am having trouble getting it into a format that nltk.UnigramTagger will accept. Here is what I've tried so far.
import nltk
from nltk import word_tokenize

# Read the training file and tokenize it
with open('POS-training.txt', 'r') as f:
    content = f.read()
text = word_tokenize(content)

# Remove the leading numeric indices from the token stream
cleaned_txt = [tok for tok in text if not tok.isdigit()]

# Pair each word with its tag by zipping one iterator against itself
it = iter(cleaned_txt)
tuple_txt = [*zip(it, it)]
print(tuple_txt[:50])
Output:
[('i', 'PRP'), ("'d", 'MD'), ('like', 'VB'), ('to', 'TO'), ('go', 'VB'), ('to', 'IN'), ('a', 'DT'), ('fancy', 'JJ'), ('restaurant', 'NN'), ('.', '.'), ('i', 'PRP'), ("'d", 'MD'), ('like', 'VB'), ('french', 'JJ'), ('food', 'NN'), ('.', '.'), ('next', 'JJ'), ('thursday', 'NN'), ('.', '.'), ('next', 'JJ'), ('thursday', 'NN'), ('.', '.')]
It provides a list of tuples as expected but when trying the following:
uni = nltk.UnigramTagger(train=tuple_txt)
I get the following error:
ValueError: not enough values to unpack (expected 2, got 1)
I've looked through the documentation and can't find anything relevant. Does anyone know how to fix this?
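For reference, nltk.UnigramTagger expects its training data as a list of tagged *sentences* (a list of lists of (word, tag) tuples), not a flat list of tuples, which is why unpacking fails on the individual strings. Below is a minimal sketch of one way to build that nested structure directly from the line-oriented file format, assuming the leading index restarts at 1 for each new sentence; the sample string here is hypothetical, mirroring the format shown above.

```python
import io

# Hypothetical sample mirroring the training file: one "index word tag"
# triple per line, with the index restarting at 1 for each sentence.
sample = """\
1 i PRP
2 'd MD
3 like VB
4 to TO
5 go VB
6 to IN
7 a DT
8 fancy JJ
9 restaurant NN
10 . .
1 i PRP
2 'd MD
3 like VB
4 french JJ
5 food NN
6 . .
"""

train_sents = []   # list of sentences, each a list of (word, tag) tuples
sentence = []
for line in io.StringIO(sample):   # for the real data: open('POS-training.txt')
    parts = line.split()
    if not parts:
        continue
    idx, word, tag = parts
    if idx == '1' and sentence:    # index reset marks the start of a new sentence
        train_sents.append(sentence)
        sentence = []
    sentence.append((word, tag))
if sentence:                       # flush the final sentence
    train_sents.append(sentence)

print(train_sents[0][:3])  # [('i', 'PRP'), ("'d", 'MD'), ('like', 'VB')]

# The nested structure can then be passed to the tagger:
# import nltk
# uni = nltk.UnigramTagger(train=train_sents)
```

Parsing per line also avoids the word_tokenize step entirely, since the file is already tokenized.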
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
