SpaCy lemmatizer does not work with upper case words

The spaCy lemmatizer does not lemmatize upper-case words. I have a transformer that is part of my sklearn pipeline:

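from typing import Generator

import spacy
from sklearn.base import BaseEstimator, TransformerMixin
from spacy.tokens import Doc


class SpacyTransformer(BaseEstimator, TransformerMixin):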
  def __init__(self):
    self.nlp = spacy.load(
      name='en_core_web_sm', 
      disable=["parser", "ner"]
    )
  
  def fit(self, X, y, **fit_params):
    return self
  
  def transform(self, X, y=None, **fit_params) -> list:
    return [
      ' '.join(SpacyTransformer.process_document(doc)) 
      for doc in self.nlp.pipe(X)
    ]


  @staticmethod
  def process_document(doc: Doc) -> Generator[str, None, None]:
    # Tokenize document
    for token in doc:
      # Skip tokens that are not purely alphabetic (digits, punctuation)
      if not token.is_alpha:
        continue
        
      # Stopword removal
      if token.is_stop:
        continue
        
      # Lemmatization
      token = token.lemma_
      print(token)
      
      # Case folding
      token = token.casefold()

      # Yield token
      yield token

And the outcome is very surprising.

spacy_transformer.fit_transform(["INDICATIONS: Indications indications, Ischemic cardiomyopathy RESULTS results patient TEETH The company Apple is looking at buying U.K. startups"], [1])
>> ['indications indication indication ischemic cardiomyopathy results result patient tooth company apple look buy startup']

At first I wondered why some very basic words were still in their plural form, and then I discovered that those words were originally written in upper case. Is there a way to handle case folding (lower-casing the tokens) automatically inside nlp.pipe? I want to perform this operation just once, in a single pass over the corpus, as the code does now.

The problem is that if I lower-case a token before lemmatizing it, the Token object becomes a plain string and I can no longer lemmatize it. How can I resolve this issue?
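One possible direction, sketched here under assumptions rather than offered as a confirmed fix, is to lower-case the raw text lazily before it reaches nlp.pipe. The corpus is then still traversed only once, and every token is still a Token when lemma_ is read. The helper names lowercased and lemmatize_corpus are hypothetical, and nlp is assumed to be a pipeline loaded as in the transformer above. Lower-casing the input may also change how proper nouns such as Apple are tagged, so whether the lemmas come out as hoped depends on the model.

```python
from typing import Iterable, Iterator


def lowercased(texts: Iterable[str]) -> Iterator[str]:
    """Lazily lower-case each raw text before spaCy sees it."""
    for text in texts:
        yield text.lower()


def lemmatize_corpus(nlp, texts: Iterable[str]) -> Iterator[str]:
    """Yield one space-joined string of lemmas per input text.

    `nlp` is assumed to be a loaded spaCy pipeline, for example
    spacy.load("en_core_web_sm", disable=["parser", "ner"]).
    """
    # The generator is consumed by nlp.pipe, so the corpus is read once.
    for doc in nlp.pipe(lowercased(texts)):
        yield " ".join(
            token.lemma_.casefold()
            for token in doc
            # Same filters as process_document: alphabetic, non-stop tokens.
            if token.is_alpha and not token.is_stop
        )
```

Inside the transformer, transform could then return list(lemmatize_corpus(self.nlp, X)) instead of calling self.nlp.pipe(X) directly.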



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
