Separate texts into sentences: NLTK vs spaCy
I want to separate texts into sentences.
Looking on Stack Overflow, I found:
WITH NLTK
from nltk.tokenize import sent_tokenize
text = """Hello Mr. Smith, how are you doing today? The weather is great, and the city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)
WITH SPACY
from spacy.lang.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
nlp.add_pipe('sentencizer')  # spaCy v3 API; in v2 this was nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]  # sent.text replaces the removed sent.string
The question is: what happens in the background that makes spaCy do it differently, with a so-called create_pipe? Sentences are important for training your own word embeddings for NLP. There should be a reason why spaCy does not include a sentence tokenizer directly out of the box.
Thanks.
NOTE: Be aware that a simple .split('.') does not work: the text contains several decimal numbers and other kinds of tokens containing '.'.
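To see why a naive split fails, and why even a slightly smarter heuristic still does, here is a small sketch using only the standard library. The sample text and the regex pattern are illustrative assumptions, not what NLTK or spaCy do internally:

```python
import re

text = "The price is 3.14 dollars. Mr. Smith agreed. Next item costs 2.50."

# Naive split: breaks inside decimal numbers and after abbreviations.
naive = text.split(".")
print(len(naive))  # 7 fragments for what are really 3 sentences

# Crude heuristic: split only on '.' followed by whitespace and a capital.
sentences = re.split(r"(?<=\.)\s+(?=[A-Z])", text)
print(sentences)
# Decimals now survive, but 'Mr.' is still wrongly treated as a boundary,
# which is exactly why trained/rule-based tokenizers exist.
```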
Solution 1:[1]
The processing pipeline in spaCy has a modular setup, more information is given here: https://spacy.io/usage/processing-pipelines. You define which parts you want by defining the pipes. There are some use-cases where potentially you don't need sentences, like when you only want a bag-of-words representation. So I guess that may be why the sentencizer is not always included automatically - but it's there if you need it.
Note that English() is just a blank language class (essentially a tokenizer) rather than a trained model - you can find more useful pre-trained statistical models here: https://spacy.io/models/en
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
