Python | Unable to extract the list of names from the text
I am running the code below to extract the list of names from text1, which holds the merged text of several PDFs. The code returns only one name instead of every name in the input. I tried changing the patterns, but that didn't help.
Code:
import spacy
from spacy.matcher import Matcher

# load pre-trained model
nlp = spacy.load('en_core_web_sm')

# initialize matcher with a vocab
matcher = Matcher(nlp.vocab)

def extract_name(resume_text):
    nlp_text = nlp(resume_text)
    #print(nlp_text)

    # First name and Last name are always Proper Nouns
    pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
    #matcher.add('NAME', None, [pattern])
    matcher.add('NAME', [pattern], on_match=None)

    matches = matcher(nlp_text)
    for match_id, start, end in matches:
        span = nlp_text[start:end]
        #print(span)
        return span.text
Execution: extract_name(text1) O/P: 'VIKRAM RATHOD'
Expected O/P: List of all names in the text1
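One likely reason for the single result is that the function returns as soon as the loop reaches its first match. The sketch below collects every matched span into a list instead. It is a minimal illustration, not the accepted fix: `extract_names` and `build_pipeline` are hypothetical names, and the pipeline is passed in as arguments so the collecting logic can be read on its own.

```python
def extract_names(resume_text, nlp, matcher):
    """Return the text of every span the matcher finds, not just the first."""
    doc = nlp(resume_text)
    names = []
    for _match_id, start, end in matcher(doc):
        # Append instead of returning, so the loop visits every match
        names.append(doc[start:end].text)
    return names


def build_pipeline(model="en_core_web_sm"):
    """Hypothetical helper: load spaCy and register the pattern once."""
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load(model)
    matcher = Matcher(nlp.vocab)
    # spaCy v3 signature: add(key, patterns)
    matcher.add("NAME", [[{"POS": "PROPN"}, {"POS": "PROPN"}]])
    return nlp, matcher
```

Usage would look like `nlp, matcher = build_pipeline()` followed by `extract_names(text1, nlp, matcher)`. Registering the pattern once, outside the extraction function, also avoids re-adding it on every call.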
Solution 1:[1]
To address your question:
First, declare the matcher:
self._nlp = spacy.load("en_core_web_lg")
self._matcher = Matcher(self._nlp.vocab)
As a general best practice, remove all punctuation first:
import string
# Map every punctuation character to a space
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
sentence = sentence.translate(table).strip()
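The punctuation step can be wrapped in a small standalone function. This is only a sketch using the standard library; `strip_punctuation` is a hypothetical name, and collapsing the leftover whitespace is an extra step not in the snippet above. Note that removing punctuation may change how the tagger labels some tokens, so it is worth comparing results with and without it.

```python
import string

# Translation table: each ASCII punctuation character becomes a space
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_punctuation(sentence):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    cleaned = sentence.translate(_PUNCT_TABLE)
    return " ".join(cleaned.split())
```

For example, `strip_punctuation("Vikram Rathod, M.Sc.")` gives `"Vikram Rathod M Sc"`.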
To also catch middle names, allow zero or more proper nouns between the first and last token:
pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN', 'OP': '*'}, {'POS': 'PROPN'}]
Now loop over all the matches and add them to a dict, keeping the longest span for each pattern name:
New_list_of_matches = {}
for match_id, start, end in matches:
    string_id = self.NlpObj._nlp.vocab.strings[match_id]  # Get string representation
    span = str(self.NlpObj._doc[start:end]).split()
    if string_id in New_list_of_matches:
        # Replace the stored span only if this one is longer
        if len(span) > New_list_of_matches[string_id]['lenofSpan']:
            New_list_of_matches[string_id] = {'span': span, 'lenofSpan': len(span)}
    else:
        New_list_of_matches[string_id] = {'span': span, 'lenofSpan': len(span)}
It is important to store the length of the span: that way you can distinguish two-word names from three-word names (those with a middle name).
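Because the optional middle token makes the matcher report overlapping spans (the two-word pieces as well as the full three-word name), and because the question asks for every name rather than one, it may help to keep the longest span at each position instead of a single longest span overall. The following is a plain-Python sketch of that idea working on `(start, end)` token offsets; `keep_longest_non_overlapping` is a hypothetical name. spaCy also ships `spacy.util.filter_spans`, which does a similar job on `Span` objects.

```python
def keep_longest_non_overlapping(offsets):
    """Keep the longest spans and drop any span that overlaps a kept one.

    `offsets` is an iterable of (start, end) token offsets, end exclusive.
    """
    # Longest spans first; ties go to the span that starts earlier
    ordered = sorted(offsets, key=lambda se: (-(se[1] - se[0]), se[0]))
    kept = []
    taken = set()  # token positions already covered by a kept span
    for start, end in ordered:
        positions = range(start, end)
        if not any(i in taken for i in positions):
            kept.append((start, end))
            taken.update(positions)
    return sorted(kept)
```

With matcher output, the offsets could be gathered as `[(start, end) for _, start, end in matches]`, and each surviving pair sliced from the doc to get one name per position.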
Now build the name from the stored span:
for keys, items in New_list_of_matches.items():
    if keys == 'NAME':
        if len(items['span']) == 2:
            # First name + last name
            Name = items['span'][items['lenofSpan']-2] + ' ' + items['span'][items['lenofSpan']-1]
        elif len(items['span']) == 3:
            # First name + middle name + last name, separated by spaces
            Name = items['span'][items['lenofSpan']-3] + ' ' + items['span'][items['lenofSpan']-2] + ' ' + items['span'][items['lenofSpan']-1]
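Since the `'OP': '*'` pattern can also match names with more than three tokens, the index arithmetic could be replaced by a join over the whole span. A minimal sketch, assuming the dict layout shown above (`span` holding the token strings); `build_name` and `name_for_key` are hypothetical helpers:

```python
def build_name(span_tokens):
    """Join the tokens of a matched span into one space-separated name."""
    return " ".join(token.strip() for token in span_tokens if token.strip())


def name_for_key(matches_by_key, key="NAME"):
    """Return the stored name for `key`, or None if the key never matched."""
    entry = matches_by_key.get(key)
    return build_name(entry["span"]) if entry else None
```

This handles two, three, or more tokens the same way, so no separate branch per span length is needed.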
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
