Is the Stanza library very slow?
I have two pieces of code that count the number of sentences in a text file. The two options produce different results, and Option 2 (Stanza) is very slow. Is Option 2 (Stanza) more accurate? How can I speed up Option 2 (Stanza)? Thanks a lot!
Option 1 (regular expression): The following code takes 2 seconds and the output is 1444.
import requests
from bs4 import BeautifulSoup
import re
sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")
def identify_sentences(input_text: str):
    """Returns all sentences in the input text"""
    sentences = re.findall(sentence_regex, input_text)
    return sentences
r=requests.get("https://www.sec.gov/Archives/edgar/data/861439/0000912057-94-000263.txt", headers={"User-Agent": "b2g"})
content=r.content.decode('utf8')
soup=BeautifulSoup(content, "html5lib")
text=soup.text
sentences=identify_sentences(text)
len(sentences)
Option 2 (Stanza): The following code takes 6 minutes and the output is 2481.
import requests
from bs4 import BeautifulSoup
import stanza
nlp=stanza.Pipeline(lang='en', processors='tokenize, pos, ner')
r=requests.get("https://www.sec.gov/Archives/edgar/data/861439/0000912057-94-000263.txt", headers={"User-Agent": "b2g"})
content=r.content.decode('utf8')
soup=BeautifulSoup(content, "html5lib")
text=soup.text
doc=nlp(text)
sentences=doc.sentences
len(sentences)
Solution 1:[1]
Two answers:
- If all you want to do is split text into sentences, then your pipeline should simply be
  nlp=stanza.Pipeline(lang='en', processors='tokenize')
  and that will be much faster than the pipeline you show, which also runs a part-of-speech tagger and a named entity recognizer.
- But, yes, running Stanza is way slower than simply matching against a single regex. There should be many places where it works differently and better, because exclamation marks, question marks, and especially periods often occur in the middle of English sentences (e.g., here!). You'll have to decide for yourself whether the better accuracy is worth it to you.
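To make the point about mid-sentence periods concrete, here is a minimal sketch that runs the regex from Option 1 on two short strings. Both sample strings are made up for illustration and are not taken from the SEC filing in the question.

```python
import re

# Same pattern as Option 1 in the question.
sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")

# Made-up sample: two sentences, containing an abbreviation and a decimal.
sample = "Dr. Smith reviewed the filing. Revenue rose 3.5 percent in 1994!"
matches = sentence_regex.findall(sample)
for m in matches:
    print(repr(m))

# Made-up sample: a heading with no closing punctuation, then one sentence.
heading_sample = "ANNUAL REPORT\n\nThe company filed on time."
heading_matches = sentence_regex.findall(heading_sample)
print(len(heading_matches))
```

On the first string the regex returns three matches for two sentences, because the period in "Dr." is treated as a sentence end, while the decimal in "3.5" is handled by the `\.\d` branch. On the second string it returns a single match, because the heading has no terminal punctuation and is absorbed into the sentence that follows. Which of these effects matters more for a given document is not something this sketch can tell you, but merging of unpunctuated headings and table rows may be one reason a regex reports fewer sentences than a trained tokenizer on a filing like this.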
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Christopher Manning |
