Leave out some words from lemmatisation in Python
While lemmatising text, I've noticed that some words are not lemmatised correctly, so I would like to exclude them from lemmatisation and keep their original form.
I've created a "nolemmawords" list of such words, but I'm not sure how to apply it.
from pymystem3 import Mystem

# words_regex, stopwords_list and nolemmawords are defined elsewhere
def find_words(text, regex = words_regex):
    tokens = regex.findall(text.lower())
    return [w for w in tokens if w.isalpha() and len(w) > 2]

mystem = Mystem()

def lemmatize(words, lemmer = mystem, stopwords = stopwords_list):
    lemmas = lemmer.lemmatize(' '.join(words))
    return [w for w in lemmas if w not in stopwords
            and w.isalpha()]

def preprocess(text):
    return lemmatize(find_words(text))
I tried the following, but it removes those words entirely, which is not what I want. Could you please help?
def find_words(text, regex = words_regex):
    tokens = regex.findall(text.lower())
    return [w for w in tokens if w.isalpha() and w not in nolemmawords and len(w) > 2]
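One possible direction, offered as a sketch rather than a tested solution: instead of filtering the words out in find_words, decide per token whether to lemmatise it, and pass protected words through unchanged. The names lemmatize_except, lemmatize_word, keep and toy_lemmatize below are hypothetical and not part of the original code. The toy dictionary only stands in for a real lemmatiser so that the example runs without pymystem3.

```python
def lemmatize_except(words, lemmatize_word, keep=frozenset(), stopwords=frozenset()):
    """Lemmatise each word unless it is listed in `keep`.

    `lemmatize_word` is any callable mapping one word to its lemma
    (hypothetical parameter; plug in the real lemmatiser here).
    """
    result = []
    for w in words:
        if w in keep:
            # Protected word: keep the original form, skip lemmatisation.
            result.append(w)
            continue
        lemma = lemmatize_word(w)
        # Same filtering as the original lemmatize(): drop stopwords
        # and anything that is not purely alphabetic.
        if lemma.isalpha() and lemma not in stopwords:
            result.append(lemma)
    return result


# Toy stand-in for a real lemmatiser (assumption, for illustration only).
toy_lemmas = {"cats": "cat", "running": "run", "data": "datum"}

def toy_lemmatize(word):
    return toy_lemmas.get(word, word)


nolemmawords = {"data"}
print(lemmatize_except(["cats", "data", "running", "the"],
                       toy_lemmatize,
                       keep=nolemmawords,
                       stopwords={"the"}))
```

With Mystem, the callable could plausibly be something like lambda w: mystem.lemmatize(w)[0], assuming the first element of the returned list is the lemma of a single-word input. Calling the lemmatiser once per word is simpler but slower than one call on the joined text, so this is a trade-off to check against your corpus size.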
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
