Rule-based matching doesn't return the same result as a plain regex in Python
I run a regex in Python and get the result I want:
import re
s = """
LOONNEY BVBA 0431.568 836
Cock number 1542 222. 325
"""
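expression = r"(\d+)(\.|\s)(\d+)?(\.|\s|\.\s)(\d+)"
# Print every match found in the sample string
for match in re.finditer(expression, s):
    print(match.group())
This returns: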
0431.568 836
1542 222. 325
Then I moved the same regex into a spaCy rule-based pattern (EntityRuler), and nothing is matched:
import spacy
from spacy.pipeline import EntityRuler
nlp = spacy.blank("en")
patterns = [
{
"label": "VAT_NUMBER",
"pattern": [{"TEXT": {"REGEX": r"(\d+)(\.|\s)(\d+)?(\.|\s|\.\s)(\d+)"}}]
}
]
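# Add patterns to the ruler (spaCy v2 API; in v3 use ruler = nlp.add_pipe("entity_ruler"))
ruler = EntityRuler(nlp)
nlp.add_pipe(ruler)
ruler.add_patterns(patterns)
# Create the doc from the sample string s defined above
doc = nlp(s)
# Extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)
This returns nothing.
What is wrong? Could it be the alternation (|) or the whitespace class (\s)?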
Solution 1:[1]
The REGEX operator is applied to one token at a time, so a regex that is meant to span several tokens cannot match inside a single-token pattern. You can always check which tokens you get with code like print([t.text for t in doc]). You will quickly find that 0431.568 831 is split into ['0431.568', '831'], 0431 568.833 is split into ['0431', '568.833'], etc.
Since you have not shared the exact requirements, here is a sample solution for the strings you provided, plus a few more cases that I tested.
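To make the token-level limitation concrete, here is a minimal sketch that compares the regex on the whole string with the same regex applied to each token separately. Applying re.search per token is only an approximation of what the REGEX operator does, and the sample text is taken from the question; treat it as an illustration, not as spaCy's internal implementation.

```python
import re
import spacy

# The question's regex: it requires two separators, so it spans several tokens.
expression = r"(\d+)(\.|\s)(\d+)?(\.|\s|\.\s)(\d+)"
text = "LOONNEY BVBA 0431.568 836"

nlp = spacy.blank("en")  # tokenizer only, no trained model required
doc = nlp(text)

# Inspect the tokens that a token-based pattern actually sees.
tokens = [t.text for t in doc]
print(tokens)

# On the full string, the regex finds the number.
full_match = re.search(expression, text)
print(full_match.group() if full_match else None)

# Checked token by token (assumed approximation of the REGEX operator),
# no single token satisfies the whole expression.
per_token = [bool(re.search(expression, tok)) for tok in tokens]
print(per_token)
```

Because every entry in per_token is False, a one-token pattern built from this regex has nothing to match, which is why doc.ents stays empty in the question.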
import spacy
from spacy.matcher import Matcher
from spacy.lang.en import English
nlp = English()
pattern = [
[{"TEXT": {"REGEX": r"^\d+\.\d+\.\d+$"}}], # Line 2: If three numbers are dot separated
[{"TEXT": {"REGEX": r"^\d+\.\d+\.?$"}}, {"TEXT": ".", "OP": "?"}, {'IS_DIGIT': True}], # Line 1: If two first numbers are dot separated + a number
[{'IS_DIGIT': True}, {"TEXT": {"REGEX": r"^\.?\d+\.\d+$"}}], # Line 3: If a number + two last numbers are dot separated
[{'IS_DIGIT': True}, {'IS_DIGIT': True}, {'TEXT': '.', 'OP': '?'}, {'IS_DIGIT': True}], # Line 4: all spaces
]
matcher = Matcher(nlp.vocab)
matcher.add("codez", pattern)
text = r"""
LOONNEY BVBA 0431.568 831
LOONNEY BVBA 0431.568.832
LOONNEY BVBA 0431 568.833
LOONNEY BVBA 0431 568 834
X number 1542 222. 325
X number 1542.222. 326
Y numberDONOTEXTRACT 23456 682 .12344566
"""
doc = nlp(text)
matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
for span in spacy.util.filter_spans(spans):
    print(span.text)
I get
0431.568 831
0431.568.832
0431 568.833
0431 568 834
1542 222. 325
1542.222. 326
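If the goal is to get labeled entities in doc.ents, as in the question, the same token-level patterns can probably be handed to an EntityRuler instead of a Matcher. The sketch below assumes spaCy v3 (where nlp.add_pipe("entity_ruler") returns the ruler), reuses the VAT_NUMBER label from the question, and runs on a shortened sample sentence made up for illustration.

```python
import spacy

nlp = spacy.blank("en")
# Assumption: spaCy v3 API, where add_pipe takes the factory name.
ruler = nlp.add_pipe("entity_ruler")

# The four token-level patterns from the answer, one ruler entry each.
token_patterns = [
    [{"TEXT": {"REGEX": r"^\d+\.\d+\.\d+$"}}],
    [{"TEXT": {"REGEX": r"^\d+\.\d+\.?$"}}, {"TEXT": ".", "OP": "?"}, {"IS_DIGIT": True}],
    [{"IS_DIGIT": True}, {"TEXT": {"REGEX": r"^\.?\d+\.\d+$"}}],
    [{"IS_DIGIT": True}, {"IS_DIGIT": True}, {"TEXT": ".", "OP": "?"}, {"IS_DIGIT": True}],
]
ruler.add_patterns(
    [{"label": "VAT_NUMBER", "pattern": p} for p in token_patterns]
)

# Hypothetical sample sentence built from two of the answer's test strings.
doc = nlp("LOONNEY BVBA 0431.568.832 and X number 0431 568 834")
ents = [(ent.text, ent.label_) for ent in doc.ents]
for text, label in ents:
    print(text, label)
```

The ruler resolves overlapping matches itself, so the explicit spacy.util.filter_spans step from the Matcher version should not be needed here.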
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Wiktor Stribiżew |
