spaCy: How to implement a special lookbehind in the word tokenizer?

I'm working on a text corpus in which many individual tokens contain punctuation characters such as : - ) ( @, for example TMI-Cu(OH). I therefore want to customize the tokenizer so that it does not split on : - ) ( @ when these characters are tightly enclosed (no whitespace) by letters or digits.

From this post, I learned that I can modify infix_finditer to achieve this. However, that solution still splits on ) when the ) is not followed by a letter or digit, as the following example demonstrates:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # Infixes: split inside tokens only on these punctuation characters
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    # The default prefixes/suffixes are kept, so ')' is still split off as a suffix
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = custom_tokenizer(nlp)

test_str0 = 'This is TMI-Cu(OH), and somethig else'
doc0 = nlp(test_str0)
[token.text for token in doc0]

The output is ['This', 'is', 'TMI-Cu(OH', ')', ',', 'and', 'somethig', 'else'], where the single token TMI-Cu(OH) has been split into the two tokens ['TMI-Cu(OH', ')'].

Is it possible to implement a 'lookbehind' behavior in the tokenizer? That is, for a ')' that is followed by a non-word/non-digit character, before splitting on it to create a new token, first look behind to check whether there is any whitespace between the ')' and its paired '('. If there is no whitespace, don't split.
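
For illustration, here is a minimal, standalone sketch of that check in plain Python. The should_split helper is hypothetical (it is not part of spaCy or of the answer below); it only makes the intended logic concrete:

def should_split(text, i):
    # Hypothetical helper (not part of spaCy): given the index i of a ')'
    # in text, walk back to the paired '(' and report a split only if
    # whitespace occurs somewhere between the pair.
    depth = 0
    for j in range(i, -1, -1):
        if text[j] == ')':
            depth += 1
        elif text[j] == '(':
            depth -= 1
            if depth == 0:
                return any(ch.isspace() for ch in text[j:i])
    return True  # no paired '(' found, so split as usual

text = 'This is TMI-Cu(OH), and somethig else'
print(should_split(text, text.index(')')))  # False: '(OH)' contains no whitespace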



Solution 1:[1]

You need to remove the \) pattern from the suffixes:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    infix_re = re.compile(r'''(?:[^\w\s]|_)(?<![-:@()])''') # Match any special char except -, :, @, ( and )
    suffixes = nlp.Defaults.suffixes
    # If suffixes is a tuple on your spaCy version, convert it first:
    # suffixes = list(nlp.Defaults.suffixes)
    suffixes.remove(r'\)')   # Remove the `\)` pattern from the suffixes
    suffix_re = compile_suffix_regex(suffixes)
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = custom_tokenizer(nlp)

test_str0 = 'This is TMI-Cu(OH), and somethig else'
doc0 = nlp(test_str0)
print([token.text for token in doc0])

Output:

['This', 'is', 'TMI-Cu(OH)', ',', 'and', 'somethig', 'else']

Note that the (?:[^\w\s]|_)(?<![-:@()]) regex used for infix matching matches any character that is neither a word character nor whitespace (with _ added back explicitly, since \w covers it), and the (?<![-:@()]) lookbehind then rejects the match when the character just consumed is one of -, :, @, ( or ).
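
As a quick standalone check of what the pattern matches (the sample string here is made up for this demonstration):

import re

infix_re = re.compile(r'(?:[^\w\s]|_)(?<![-:@()])')
# The comma, underscore, dots and percent sign match;
# the exempted -, :, @, ( and ) characters do not.
print(infix_re.findall('TMI-Cu(OH), a_b e.g. 50%'))
# => [',', '_', '.', '.', '%']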

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
[1] Solution 1: Wiktor Stribiżew