Stanza (Stanford NLP) does not work when parallel processing rows in a data frame

I have a dataframe with 800,000 rows, and for each row I want to find the person mentioned in the comment (row.comment). I want to use Stanza because it has higher accuracy, and I parallelized the loop over df.iterrows() with multiprocessing to speed up execution. Stanza finds the names correctly when I run it without multiprocessing, and the same parallel code works when I use SpaCy instead, so the problem seems to be specific to running Stanza under multiprocessing.

import multiprocessing as mp
import stanza

# preprocess_comment() and df are defined elsewhere (not shown).
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner') # initialize English neural pipeline
def stanza_function(arg):
    person_name = ''  # default value, so the return below never hits an unbound name
    try:
        idx, row = arg
        comment = preprocess_comment(str(row['comment'])) # Retrieve body of the comment
        doc = nlp(str(comment))
        persons_mentioned = [ent.text for ent in doc.ents if ent.type == 'PERSON']
        if len(persons_mentioned) == 1:
            person_name = persons_mentioned[0]
    except Exception as exc:
        print(f"Error: {exc!r}")  # show the actual exception instead of hiding it

    return person_name

def spacy_function(arg):
    # NER is a spaCy pipeline object created elsewhere (not shown).
    idx, row = arg
    comment = preprocess_comment(str(row['comment'])) # Retrieve body of the comment
    person_name = ''
    comment_NER = NER(str(comment)) # Run NER on the comment
    persons_mentioned = [ent.text for ent in comment_NER.ents if ent.label_ == 'PERSON']
    print(persons_mentioned)
    if len(persons_mentioned) == 1:
        person_name = persons_mentioned[0]
    return person_name

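# The guard stops worker processes from re-running this block when they import the module.
if __name__ == '__main__':
    with mp.Pool(processes=mp.cpu_count()) as pool:
        persons = pool.map(stanza_function, list(df.iterrows()))
    df['person_name'] = persons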
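
For comparison, here is a minimal sketch of a pattern that may be worth trying with model-backed pipelines: build the pipeline once inside each worker through the Pool initializer, instead of creating it in the parent and relying on the workers inheriting it. This is not a confirmed fix for the problem above. The names make_pipeline, init_worker, find_person and run are hypothetical, and make_pipeline returns a toy stand-in so the sketch runs without Stanza; in the real setting it would return the stanza.Pipeline from the question and find_person would read doc.ents as in stanza_function. The sketch also sends only the comment strings to the workers rather than whole (idx, row) tuples, which keeps the pickled payload small.

```python
import multiprocessing as mp

_nlp = None  # one pipeline object per worker process


def make_pipeline():
    """Hypothetical loader, a stand-in so this sketch runs without Stanza.

    In the question's setting this would instead return
    stanza.Pipeline(lang='en', processors='tokenize,ner').
    """
    def toy_ner(text):
        # Toy rule (assumption): treat capitalised words as PERSON mentions.
        return [word for word in text.split() if word.istitle()]
    return toy_ner


def init_worker():
    # Called once in every worker, so the pipeline is built inside the
    # child process rather than inherited from the parent.
    global _nlp
    _nlp = make_pipeline()


def find_person(comment):
    # Same rule as in the question: keep the name only if exactly one is found.
    persons = _nlp(comment)
    return persons[0] if len(persons) == 1 else ''


def run(comments, processes=2, start_method=None, chunksize=1000):
    # start_method=None uses the platform default; "spawn" can be passed
    # to try starting workers as fresh interpreters instead of forking.
    ctx = mp.get_context(start_method)
    with ctx.Pool(processes=processes, initializer=init_worker) as pool:
        return pool.map(find_person, comments, chunksize=chunksize)
```

With the dataframe from the question, this could be called as df['person_name'] = run(df['comment'].astype(str).tolist()), placed under an if __name__ == '__main__': guard. Any preprocessing such as preprocess_comment would go inside find_person.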

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
