Python | Unable to extract the list of names from the text

I am running the code below to extract the list of names from text1, which holds the merged text of several PDFs. However, the code returns only one name instead of every name in the input. I tried changing the pattern, but that did not help.

Code:

import spacy
from spacy.matcher import Matcher

# load pre-trained model
nlp = spacy.load('en_core_web_sm')

# initialize matcher with a vocab
matcher = Matcher(nlp.vocab)

def extract_name(resume_text):
    nlp_text = nlp(resume_text)
    #print(nlp_text)
    
    # First name and Last name are always Proper Nouns
    pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
    
    #matcher.add('NAME', None, [pattern])
    matcher.add('NAME', [pattern], on_match=None)
    
    matches = matcher(nlp_text)
    
    for match_id, start, end in matches:
        span = nlp_text[start:end]
        #print(span)
        return span.text

Execution: extract_name(text1)

Output: 'VIKRAM RATHOD'

Expected output: a list of all names in text1



Solution 1:[1]

To address your question, start by declaring the matcher:

self._nlp = spacy.load("en_core_web_lg") 
self._matcher = Matcher(self._nlp.vocab)  

As a general best practice, remove all punctuation first:

import string

# Map every punctuation character to a space, then trim the ends
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
sentence = sentence.translate(table).strip()
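
As a rough, standalone sketch of this step (the sample sentence and the `strip_punctuation` helper name are made up for illustration), every punctuation character is mapped to a space so that neighbouring words stay separated:

```python
import string

# One space per punctuation character, so both arguments have equal length
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_punctuation(sentence):
    """Replace punctuation with spaces and trim the ends (hypothetical helper)."""
    return sentence.translate(PUNCT_TABLE).strip()


cleaned = strip_punctuation("Name: Vikram Rathod, Pune.")
print(cleaned)
```

Runs of spaces are left in place; `" ".join(cleaned.split())` collapses them if that matters for later processing.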

To also catch a middle name, use this pattern:

pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN',"OP": "*"},{'POS': 'PROPN'}]
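
To see what this pattern matches without loading a model, here is a small sketch on a hand-built `Doc` whose part-of-speech tags are supplied manually. The words and tags are invented for illustration (real tags would come from the loaded pipeline), and the sketch assumes spaCy v3, where `Doc` accepts a `pos` argument:

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Hand-tagged toy document; in practice nlp(text) would assign the tags
words = ["Vikram", "Rathod", "met", "Anita", "Rao", "Desai", "yesterday"]
pos = ["PROPN", "PROPN", "VERB", "PROPN", "PROPN", "PROPN", "ADV"]
doc = Doc(vocab, words=words, pos=pos)

matcher = Matcher(vocab)
pattern = [{"POS": "PROPN"}, {"POS": "PROPN", "OP": "*"}, {"POS": "PROPN"}]
matcher.add("NAME", [pattern])  # register the pattern once, not on every call

# Collect every match instead of returning on the first one
found = {doc[start:end].text for _, start, end in matcher(doc)}
print(sorted(found))
```

Because of the `*` operator, a three-word name also yields its two-word sub-spans as separate matches, which is why span length matters when choosing among the matches.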

Now loop over all the matches and store them in a dict keyed by rule name, keeping only the longest span for each rule:

New_list_of_matches = {}
for match_id, start, end in matches:
    # Get the string representation of the rule name, e.g. 'NAME'
    string_id = self.NlpObj._nlp.vocab.strings[match_id]
    # Split the matched span into its individual words
    span = str(self.NlpObj._doc[start:end]).split()
    if string_id in New_list_of_matches:
        # Replace the stored span only if the new one is longer
        if len(span) > New_list_of_matches[string_id]['lenofSpan']:
            New_list_of_matches[string_id] = {'span': span, 'lenofSpan': len(span)}
    else:
        New_list_of_matches[string_id] = {'span': span, 'lenofSpan': len(span)}

It is important to keep the length of the span, so that you can distinguish two-word names from three-word names (those with a middle name).
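
The dictionary above keeps a single longest span per rule name. If the goal is a list of every name in the text, one possible variation is to keep the longest span among each group of overlapping matches instead. The sketch below uses plain tuples shaped like the `(match_id, start, end)` triples the matcher returns; the `longest_spans` helper, the sample tokens and the sample matches are hypothetical:

```python
def longest_spans(matches):
    """Keep the longest match among overlapping (match_id, start, end) triples."""
    # Longest first; ties broken by earliest start
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept, used = [], set()
    for match_id, start, end in ordered:
        # Accept a match only if none of its token positions is already taken
        if not used.intersection(range(start, end)):
            kept.append((match_id, start, end))
            used.update(range(start, end))
    return sorted(kept, key=lambda m: m[1])  # back to document order


tokens = ["Vikram", "Rathod", "met", "Anita", "Rao", "Desai", "yesterday"]
# Overlapping candidates, as a pattern with the `*` operator could produce them
matches = [(1, 0, 2), (1, 3, 5), (1, 4, 6), (1, 3, 6)]
names = [" ".join(tokens[s:e]) for _, s, e in longest_spans(matches)]
print(names)
```

spaCy ships a comparable utility, `spacy.util.filter_spans`, which applies the same idea to `Span` objects.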

Now build the name from the stored span:

for keys, items in New_list_of_matches.items():
    if keys == 'NAME':
        if len(items['span']) == 2:
            # First and last name
            Name = items['span'][items['lenofSpan']-2] + ' ' + items['span'][items['lenofSpan']-1]
        elif len(items['span']) == 3:
            # First, middle and last name, each separated by a space
            Name = items['span'][items['lenofSpan']-3] + ' ' + items['span'][items['lenofSpan']-2] + ' ' + items['span'][items['lenofSpan']-1]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1