NLTK POS classifier using custom provided data
I was given code in the following format:
1 i PRP
2 'd MD
3 like VB
4 to TO
5 go VB
6 to IN
7 a DT
8 fancy JJ
9 restaurant NN
10 . .
I am having trouble getting it into a format that nltk.UnigramTagger will accept. Here is what I've tried so far.
import nltk
from nltk import word_tokenize

# Read the training file and tokenize it
with open('POS-training.txt', 'r') as f:
    content = f.read()
text = word_tokenize(content)

# Remove the leading numeric indices from the token stream
cleaned_txt = [tok for tok in text if not tok.isdigit()]

# Pair each word with its tag by zipping one iterator against itself
it = iter(cleaned_txt)
tuple_txt = [*zip(it, it)]
print(tuple_txt[:50])
Output:
[('i', 'PRP'), ("'d", 'MD'), ('like', 'VB'), ('to', 'TO'), ('go', 'VB'), ('to', 'IN'), ('a', 'DT'), ('fancy', 'JJ'), ('restaurant', 'NN'), ('.', '.'), ('i', 'PRP'), ("'d", 'MD'), ('like', 'VB'), ('french', 'JJ'), ('food', 'NN'), ('.', '.'), ('next', 'JJ'), ('thursday', 'NN'), ('.', '.'), ('next', 'JJ'), ('thursday', 'NN'), ('.', '.')]
It provides a list of tuples as expected but when trying the following:
uni = nltk.UnigramTagger(train=tuple_txt)
I get the following error:
ValueError: not enough values to unpack (expected 2, got 1)
I've looked through the documentation and can't find anything relevant. Does anyone know how to fix this?
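For reference, nltk.UnigramTagger expects its training data as a list of tagged *sentences* (a list of lists of (word, tag) tuples), not a flat list of tuples, which is why unpacking fails on the individual strings. Below is a minimal sketch of one way to build that nested structure directly from the line-oriented file format, assuming the leading index restarts at 1 for each new sentence; the sample string here is hypothetical, mirroring the format shown above.

```python
import io

# Hypothetical sample mirroring the training file: one "index word tag"
# triple per line, with the index restarting at 1 for each sentence.
sample = """\
1 i PRP
2 'd MD
3 like VB
4 to TO
5 go VB
6 to IN
7 a DT
8 fancy JJ
9 restaurant NN
10 . .
1 i PRP
2 'd MD
3 like VB
4 french JJ
5 food NN
6 . .
"""

train_sents = []   # list of sentences, each a list of (word, tag) tuples
sentence = []
for line in io.StringIO(sample):   # for the real data: open('POS-training.txt')
    parts = line.split()
    if not parts:
        continue
    idx, word, tag = parts
    if idx == '1' and sentence:    # index reset marks the start of a new sentence
        train_sents.append(sentence)
        sentence = []
    sentence.append((word, tag))
if sentence:                       # flush the final sentence
    train_sents.append(sentence)

print(train_sents[0][:3])  # [('i', 'PRP'), ("'d", 'MD'), ('like', 'VB')]

# The nested structure can then be passed to the tagger:
# import nltk
# uni = nltk.UnigramTagger(train=train_sents)
```

Parsing per line also avoids the word_tokenize step entirely, since the file is already tokenized.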
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
