Make spaCy tokenizer not split on /
How do I modify the English tokenizer to prevent splitting tokens on the '/' character?
For example, the following string should be one token, but the default tokenizer splits it into three:
import spacy
nlp = spacy.load('en_core_web_md')
doc = nlp("12/AB/568793")
for t in doc:
    print(f"[{t.pos_} {t.text}]")
# produces
#[NUM 12]
#[SYM /]
#[ADJ AB/568793]
Solution 1
The approach is a variation on the "Modifying existing rule sets" example in the spaCy tokenizer documentation: remove the one infix rule that splits on '/', then rebuild the tokenizer's infix pattern:
import spacy

nlp = spacy.load('en_core_web_md')
infixes = nlp.Defaults.infixes
# there appears to be just one default infix rule that splits on '/'
assert len([x for x in infixes if '/' in x]) == 1
# remove that rule, then recompile the infix regex and swap it into the tokenizer
infixes = [x for x in infixes if '/' not in x]
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
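To confirm the change, re-running the original example should now produce a single token (a minimal check; the exact POS tag assigned to the merged token may vary):
doc = nlp("12/AB/568793")
for t in doc:
    print(f"[{t.pos_} {t.text}]")
# now prints one line, e.g. [NUM 12/AB/568793]
# (the tag may differ, but the text is no longer split on '/')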
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Dave |
