How to expand the set of abbreviations for NLTK's sentence tokenizer?

How to avoid NLTK's sentence tokenizer splitting on abbreviations consisting of two words?

The solution provided in this answer works well for abbreviations consisting of a single word, but fails on two-word abbreviations such as et al. For example,

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

extra_abbreviation = ['et al','e.coli']
punkt_param = PunktParameters()
punkt_param.abbrev_types = set(extra_abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)

test = 'Elsa et al. reported a new method, e. g., a new method. This is E. Coli., and something else.'
tokenizer.tokenize(test)

returns four sentences instead of the expected two:

['Elsa et al.',
 'reported a new method, e. g., a new method.',
 'This is E.',
 'Coli., and something else.']


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
