SpaCy lemmatizer does not work with upper-case words
The spaCy lemmatizer does not work with upper-case words. I have a transformer that is part of my sklearn pipeline:
from typing import Generator

import spacy
from sklearn.base import BaseEstimator, TransformerMixin
from spacy.tokens import Doc

class SpacyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.nlp = spacy.load(
            name='en_core_web_sm',
            disable=["parser", "ner"]
        )

    def fit(self, X, y, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params) -> list:
        return [
            ' '.join(SpacyTransformer.process_document(doc))
            for doc in self.nlp.pipe(X)
        ]

    @staticmethod
    def process_document(doc: Doc) -> Generator[str, None, None]:
        # Tokenize document
        for token in doc:
            # Remove non-alphabetic tokens
            if not token.is_alpha:
                continue
            # Stopword removal
            if token.is_stop:
                continue
            # Lemmatization (token is now a plain string, not a Token)
            token = token.lemma_
            print(token)
            # Case folding
            token = token.casefold()
            # Yield token
            yield token
And the outcome is very surprising.
spacy_transformer.fit_transform(["INDICATIONS: Indications indications, Ischemic cardiomyopathy RESULTS results patient TEETH The company Apple is looking at buying U.K. startups"], [1])
>> ['indications indication indication ischemic cardiomyopathy results result patient tooth company apple look buy startup']
At first I wondered why I still sometimes get the plural form of very basic words, and then I discovered that those words were originally written in upper case. Is there a way to automatically case-fold (lowercase) the tokens that go through nlp.pipe? I want to perform this operation just once, during the single loop over the corpus/doc, as it works now.
The problem is that if I lowercase a token before case folding / lemmatizing, the Token object becomes a string and I can no longer perform lemmatization. How can I resolve this issue?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
