Rule-based matching doesn't return the same result as a plain regex in Python

I run a regex in Python and get the result I want:

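import re

s = """
LOONNEY    BVBA     0431.568 836
Cock   number     1542 222. 325
"""
expression = r"(\d+)(\.|\s)(\d+)?(\.|\s|\.\s)(\d+)"
# print every full match found in the text
for match in re.finditer(expression, s):
    print(match.group())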

This returns:

0431.568 836
1542 222. 325

Then I moved the regex into spaCy's rule-based matching (EntityRuler), and nothing was matched:

import spacy

nlp = spacy.blank("en")
patterns = [
    {
        "label": "VAT_NUMBER",
        "pattern": [{"TEXT": {"REGEX": r"(\d+)(\.|\s)(\d+)?(\.|\s|\.\s)(\d+)"}}],
    }
]
# add the entity ruler to the pipeline, then register the patterns
# (spaCy v3 style; on v2 use ruler = EntityRuler(nlp) and nlp.add_pipe(ruler))
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)
# create the doc from the sample text s defined above
doc = nlp(s)
# extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

This prints nothing.

What's wrong? Could it be the alternation | or the whitespace \s?



Solution 1:[1]

You cannot use a regex that is supposed to match multiple tokens within a single-token REGEX operator pattern: the REGEX operator is applied to the text of one token at a time. You can always check which tokens you get with code like print( [t.text for t in doc] ). You will quickly find out that 0431.568 831 is split into ['0431.568', '831'], 0431 568.833 is split into ['0431', '568.833'], etc.
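The point can be illustrated without spaCy. The sketch below is only an approximation: it uses str.split() as a stand-in for the tokenizer (spaCy's real tokenizer applies further rules, but for this sample the whitespace split gives the same number tokens as listed above), and matches_any_single_token is a hypothetical helper name. It shows the question's regex matching the full string but none of the individual tokens.

```python
import re

# The multi-token regex from the question.
EXPRESSION = re.compile(r"(\d+)(\.|\s)(\d+)?(\.|\s|\.\s)(\d+)")

text = "LOONNEY    BVBA     0431.568 836"

# Assumption: whitespace splitting approximates tokenization for this sample.
tokens = text.split()


def matches_any_single_token(regex, token_texts):
    """Return the tokens that the regex matches when tested one token at a time."""
    return [tok for tok in token_texts if regex.search(tok)]


full_text_matches = [m.group() for m in EXPRESSION.finditer(text)]
per_token_matches = matches_any_single_token(EXPRESSION, tokens)

print(tokens)             # the simulated tokens
print(full_text_matches)  # matches found in the whole string
print(per_token_matches)  # matches found inside a single token
```

The regex finds the number in the full string, but the per-token list comes back empty, which is the situation a single {"TEXT": {"REGEX": ...}} token pattern is in.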

Since you keep the actual requirements to yourself, here is a sample solution for the strings you provided, plus a few more cases that I tested.

import spacy
from spacy.matcher import Matcher
from spacy.lang.en import English

nlp = English()

pattern = [
    [{"TEXT": {"REGEX": r"^\d+\.\d+\.\d+$"}}],  # text line 2: three dot-separated numbers
    [{"TEXT": {"REGEX": r"^\d+\.\d+\.?$"}}, {"TEXT": ".", "OP": "?"}, {"IS_DIGIT": True}],  # text line 1: two first numbers dot-separated + a number
    [{"IS_DIGIT": True}, {"TEXT": {"REGEX": r"^\.?\d+\.\d+$"}}],  # text line 3: a number + two last numbers dot-separated
    [{"IS_DIGIT": True}, {"IS_DIGIT": True}, {"TEXT": ".", "OP": "?"}, {"IS_DIGIT": True}],  # text line 4: all space-separated
]
matcher = Matcher(nlp.vocab)
matcher.add("codez", pattern)

text = r"""
LOONNEY    BVBA     0431.568 831
LOONNEY    BVBA     0431.568.832
LOONNEY    BVBA     0431 568.833
LOONNEY    BVBA     0431 568 834
X   number     1542 222. 325
X   number     1542.222. 326
Y   numberDONOTEXTRACT     23456 682 .12344566
"""
doc = nlp(text)

matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
for span in spacy.util.filter_spans(spans):
    print(span.text)

I get

0431.568 831
0431.568.832
0431 568.833
0431 568 834
1542 222. 325
1542.222. 326
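
The token-level expressions used in the patterns can also be tried with plain re on the token strings mentioned earlier. This is a rough sketch under stated assumptions: the dictionary keys and the matching_patterns helper are hypothetical names, re.search is assumed to mimic how the REGEX operator is applied to a token's text, str.isdigit() stands in for IS_DIGIT, and '0431.568.832' is assumed to stay a single token.

```python
import re

# Token-level regexes copied from the Matcher patterns (keys are illustrative).
TOKEN_REGEXES = {
    "three_dotted": re.compile(r"^\d+\.\d+\.\d+$"),
    "two_dotted_first": re.compile(r"^\d+\.\d+\.?$"),
    "two_dotted_last": re.compile(r"^\.?\d+\.\d+$"),
}


def matching_patterns(token_text):
    """Return the names of the token-level regexes that match this token text."""
    return [name for name, rx in TOKEN_REGEXES.items() if rx.search(token_text)]


for token_text in ["0431.568", "831", "0431", "568.833", "0431.568.832"]:
    # str.isdigit() stands in for the IS_DIGIT token attribute here.
    print(token_text, matching_patterns(token_text), token_text.isdigit())
```

Because every expression is anchored with ^ and $, each one has to describe a whole token, and the plain digit tokens are left to IS_DIGIT.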

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Wiktor Stribiżew