AttributeError: 'spacy.tokenizer.Tokenizer' object has no attribute 'tokens_from_list'

Sorry, I don't know how to fix this error.

'spacy.tokenizer.Tokenizer' object has no attribute 'tokens_from_list'

The code that produces the error is below.

import re
import spacy
from sklearn.feature_extraction.text import CountVectorizer

regexp = re.compile('(?u)\\b\\w\\w+\\b')
en_nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
old_tokenizer = en_nlp.tokenizer

# AttributeError is raised here: Tokenizer no longer has 'tokens_from_list'
en_nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(
    regexp.findall(string))

# Tokenize with the modified pipeline, then return the lemmas.
def custom_tokenizer(document):
    doc_spacy = en_nlp(document)
    return [token.lemma_ for token in doc_spacy]

# text_train is the list of training documents (defined elsewhere)
lemma_vect = CountVectorizer(tokenizer=custom_tokenizer, min_df=5)
X_train_lemma = lemma_vect.fit_transform(text_train)
print("X_train_lemma.shape: {}".format(X_train_lemma.shape))

# For comparison: CountVectorizer with its default tokenization
vect = CountVectorizer(min_df=5).fit(text_train)
X_train = vect.transform(text_train)
print("X_train.shape: {}".format(X_train.shape))

Please help me; I have wasted a lot of time trying to solve this error.



Solution 1

Am I seeing it correctly that you are using SpaCy, but overwriting its tokenizer with a custom regex-based one, and then discarding everything the pipeline produces except the tokens and their lemmas?

If that is the case, you shouldn't be using SpaCy at all; just apply the pattern yourself. I would take a close look at the pattern, though, and check whether it really does what you want:

import re
pattern = re.compile('(?u)\\b\\w\\w+\\b')

# print the substrings that match
print(pattern.findall("Sorry, I don't know how to fix this error."))
> ['Sorry', 'don', 'know', 'how', 'to', 'fix', 'this', 'error']

# print the substrings between matches
print(pattern.split("Sorry, I don't know how to fix this error."))
> ['', ', I ', "'t ", ' ', ' ', ' ', ' ', ' ', '.']
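
If you do want to keep SpaCy for its lemmatization, the AttributeError itself comes from the fact that Tokenizer.tokens_from_list was removed in newer SpaCy versions; a custom tokenizer is now expected to return a Doc built from a word list. Here is a minimal sketch of that replacement (assuming SpaCy v2/v3 with en_core_web_sm installed):

import re
import spacy
from spacy.tokens import Doc

regexp = re.compile('(?u)\\b\\w\\w+\\b')
en_nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# A custom tokenizer must return a Doc; build one directly from the
# regex matches instead of calling the removed tokens_from_list.
en_nlp.tokenizer = lambda string: Doc(en_nlp.vocab, words=regexp.findall(string))

doc = en_nlp("Sorry, I don't know how to fix this error.")
print([token.lemma_ for token in doc])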

Generally, if all you need is tokenization, I would not recommend SpaCy: it is quite slow because it does a lot more than just tokenize.

One alternative to SpaCy is NLTK.

import nltk
# nltk.download('punkt')  # one-time download of the tokenizer models
sentence = "Sorry, I don't know how to fix this error."
tokens = nltk.word_tokenize(sentence)
print(tokens)
> ['Sorry', ',', 'I', 'do', "n't", 'know', 'how', 'to', 'fix', 'this', 'error', '.']
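
As a side note on the CountVectorizer part: the pattern (?u)\b\w\w+\b is exactly scikit-learn's default token_pattern, so if you drop the lemmatization step you do not need a custom tokenizer at all. A small sketch (with a hypothetical one-sentence corpus):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["Sorry, I don't know how to fix this error."]

# The default token_pattern is r"(?u)\b\w\w+\b" -- the same regex as
# in the question -- so the plain vectorizer tokenizes identically.
vect = CountVectorizer(min_df=1).fit(texts)
print(sorted(vect.vocabulary_))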

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
