Separate texts into sentences: NLTK vs spaCy
I want to separate texts into sentences.
Looking on Stack Overflow, I found:
WITH NLTK
from nltk.tokenize import sent_tokenize
text = """Hello Mr. Smith, how are you doing today? The weather is great, and the city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)
WITH SPACY
from spacy.lang.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
nlp.add_pipe('sentencizer')  # spaCy v3 API; in v2 this was nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]  # sent.text replaces the removed sent.string
The question is: what happens in the background that makes spaCy do it differently, with a so-called create_pipe? Sentences are important for training your own word embeddings for NLP. There should be a reason why spaCy does not include a sentence tokenizer directly out of the box.
Thanks.
NOTE: Be aware that a simple .split('.') does not work: the text contains several decimal numbers and other kinds of tokens containing '.'.
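To see why a naive split fails, and why even a slightly smarter heuristic still does, here is a small sketch using only the standard library. The sample text and the regex pattern are illustrative assumptions, not what NLTK or spaCy do internally:

```python
import re

text = "The price is 3.14 dollars. Mr. Smith agreed. Next item costs 2.50."

# Naive split: breaks inside decimal numbers and after abbreviations.
naive = text.split(".")
print(len(naive))  # 7 fragments for what are really 3 sentences

# Crude heuristic: split only on '.' followed by whitespace and a capital.
sentences = re.split(r"(?<=\.)\s+(?=[A-Z])", text)
print(sentences)
# Decimals now survive, but 'Mr.' is still wrongly treated as a boundary,
# which is exactly why trained/rule-based tokenizers exist.
```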
Solution 1:[1]
The processing pipeline in spaCy has a modular setup, more information is given here: https://spacy.io/usage/processing-pipelines. You define which parts you want by defining the pipes. There are some use-cases where potentially you don't need sentences, like when you only want a bag-of-words representation. So I guess that may be why the sentencizer is not always included automatically - but it's there if you need it.
Note that English() is just a blank language class (essentially a tokenizer) rather than a trained model - you can find more useful pre-trained statistical models here: https://spacy.io/models/en
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
