Make spaCy tokenizer not split on /

How do I modify the English tokenizer to prevent splitting tokens on the '/' character?

For example, the string "12/AB/568793" should come out as a single token, but the default tokenizer splits it:


import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("12/AB/568793")

for t in doc:
    print(f"[{t.pos_} {t.text}]")

# produces
#[NUM 12]
#[SYM /]
#[ADJ AB/568793]


Solution 1:[1]

The approach is a variation on removing a rule, as shown under "Modifying existing rule sets" in the spaCy tokenizer documentation: filter out the one default infix rule that splits on '/', then rebuild the tokenizer's infix_finditer:


nlp = spacy.load('en_core_web_md')
infixes = nlp.Defaults.infixes
# There appears to be just one default infix rule that splits on '/'
assert len([x for x in infixes if '/' in x]) == 1
# Remove that rule, then rebuild the tokenizer's infix matcher
infixes = [x for x in infixes if '/' not in x]
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
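
With that rule removed, re-running the original example should produce a single token. A minimal check, assuming the same en_core_web_md pipeline loaded above:


doc = nlp("12/AB/568793")
print([t.text for t in doc])
# expected: ['12/AB/568793'], no longer split on '/'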

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Dave