Creating Rule-based matching with spaCy and Python for detecting addresses

I started learning Python's spaCy library for NLP a few days ago. I want to create rule-based matching for detecting street addresses. These are examples of street names:

Esplanade 12
Fischerinsel 65
Esplanade 1
62 boulevard d'Alsace
80 avenue Ferdinand de Lesseps
73 avenue de Bouvines
41 Avenue des Pr'es
84 rue du Château
44 rue Sadi Carnot
Bernstrasse 324
Güntzelstrasse 6
80 Rue St Ferréol
75 rue des lieutemants Thomazo
87 cours Franklin Roosevelt
51 rue du Paillle en queue
16 Chemin Des Bateliers
65 rue Reine Elisabeth
91 rue Saint Germain
Grolmanstraße 41
Buelowstrasse 46
Waßmannsdorfer Chaussee 41
Sonnenallee 29
Gotthardstrasse 81
Augsburger Straße 65
Gotzkowskystrasse 41
Holstenwall 69
Leopoldstraße 40

So, street names are formed like this:

1st type:

<string ending with 'strasse', 'gasse' or 'platz'> + <number> (a letter can be attached to the number, for example 34a)

2nd type:

<number> + <'rue', 'avenue', 'platz', 'boulevard'> + <multiple strings>

3rd type:

<titled string> + <number>

The first two types cover about 90% of cases. This is the code:

import spacy
from spacy.matcher import Matcher
from spacy import displacy

nlp = spacy.load("en_core_web_trf")
disable = ['ner']
pattern = ['<i do not know how to write contitions for this>']

matcher = Matcher(nlp.vocab)
matcher.add("STREET", [pattern])

text_testing1 = "I live in Güntzelstrasse 16 in Berlin"
text_testing2 = "Send that to 73 rue de Napoleon 56 in Paris"

doc = nlp(text_testing1)
result = matcher(doc)
print(result)

I do not know how to write a pattern for this kind of recognition, so I need help with that. The phrase needs to contain a number, and one of its strings must either be 'rue', 'avenue', 'platz' or 'boulevard', or end with "strasse" or "gasse".



Solution 1:[1]

Here's a very simple example that matches just things like "*strasse [number]":

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [
        {"TEXT": {"REGEX": ".*strasse$"}}, 
        {"IS_DIGIT": True}
        ]
matcher.add("ADDRESS", [pattern])

doc = nlp("I live in Güntzelstrasse 16 in Berlin")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

The key part is the pattern. By changing the pattern you can make it match more things. For example, to match tokens that end not just in "strasse" but also in "platz":

pattern = [
        {"TEXT": {"REGEX": ".*(strasse|platz)$"}}, 
        {"IS_DIGIT": True}
        ]

You can also add multiple patterns with the same label to get very different structures, like for your "rue de Napoleon" example.
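
As a rough sketch of what "multiple patterns with the same label" could look like for the two address types in the question, the snippet below registers one pattern for "*strasse [number]" and one for "[number] rue ...". The list of street words, the helper name find_addresses, and the use of a blank English pipeline (instead of a trained model) are assumptions made for illustration, and it assumes spaCy v3.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough here: the patterns only use lexical attributes
# (TEXT, LOWER, IS_DIGIT, IS_ALPHA), so no trained model is required.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Type 1: "<...strasse|gasse|platz> <number>", e.g. "Güntzelstrasse 16"
suffix_first = [
    {"TEXT": {"REGEX": r"(strasse|gasse|platz)$"}},
    {"IS_DIGIT": True},
]

# Type 2: "<number> <rue|avenue|...> <one or more words>", e.g. "73 rue de Napoleon"
number_first = [
    {"IS_DIGIT": True},
    {"LOWER": {"IN": ["rue", "avenue", "boulevard", "platz"]}},
    {"IS_ALPHA": True, "OP": "+"},
]

# Both patterns share the label "ADDRESS"; greedy="LONGEST" keeps only the
# longest of several overlapping matches produced by the "+" operator.
matcher.add("ADDRESS", [suffix_first, number_first], greedy="LONGEST")


def find_addresses(text):
    """Return the text of every matched span (hypothetical helper)."""
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


print(find_addresses("I live in Güntzelstrasse 16 in Berlin"))
print(find_addresses("Send that to 73 rue de Napoleon 56 in Paris"))
```

Note that the trailing {"IS_ALPHA": True, "OP": "+"} token keeps consuming words until it hits a non-alphabetic token, so in a sentence without a following number it would also swallow words such as "in Paris". A stricter rule for the street-name words is probably needed for real text.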

The Matcher has a lot of features; I really recommend reading through the docs and trying them all out once.

Solution 2:[2]

I agree with polm. The only problem is that you won't find addresses with numbers and letters, like examplestrasse 32a. I think you should experiment with shapes, e.g.:

pattern = [[
        {"TEXT": {"REGEX": ".*(strasse|platz)$"}}, 
        {"SHAPE": {"IN": ["ddx", "dx"]}}
]]

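In a shape, d stands for a digit and x for a lowercase letter (X for uppercase). Note that "ddx" and "dx" only cover one or two digits followed by a letter; add shapes such as "d" and "dd" if plain numbers should match as well. Definitely read the spaCy docs, they are great.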
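
To see which shapes are needed, it can help to print the shape spaCy assigns to a few house numbers and then list those shapes in the pattern. The following is only a sketch: the street names are made up, the shape list is an assumption that covers one to three digits with an optional lowercase letter, and a blank English pipeline is used so that no model download is needed.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# Inspect the shape of some typical house-number tokens
for text in ["6", "32", "324", "32a"]:
    token = nlp(text)[0]
    print(token.text, "->", token.shape_)

# Accept plain numbers (d, dd, ddd) and numbers with a letter suffix
house_shapes = ["d", "dd", "ddd", "dx", "ddx", "dddx"]

matcher = Matcher(nlp.vocab)
pattern = [
    {"TEXT": {"REGEX": r"(strasse|platz)$"}},
    {"SHAPE": {"IN": house_shapes}},
]
matcher.add("ADDRESS", [pattern])

doc = nlp("I live in Examplestrasse 32a, she lives in Exampleplatz 7")
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

House numbers with an uppercase letter ("32A") would need additional shapes such as "ddX".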

Solution 3:[3]

For a more general solution to fetch (German) street names based on @polm23 and @krisograbek, I came up with this pattern:

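# (ss|ß) matches both spellings; a character class like [ssß] would match only a single character
street_labels = r".*(platz|[Ss]tra(ss|ß)e|str)$"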

patterns = [
    {"label": "ADR", 
     "pattern": [
         {"TEXT": {"REGEX": street_labels}}, 
         # here might be a punct or not: Müllerstr. 26 or Müllerstr 26
         {"IS_PUNCT": True, "OP": "?"}, 
         # house number can have several formats: 2, 26, 266, 2a, 22a, 222a, 
         # last six ones catch cases at end of sentence. there might be a better solution out there... 
         {"SHAPE": {"IN": ["d", "dd", "ddd", "dddx", "ddx", "dx", "d.", "dd.", "ddd.", "dx.", "ddx.", "dddx."]}, "OP": "?"}
     ]},
    # if the street name has two parts: Müller Straße
     {"label": "ADRddd", 
      "pattern": [
          {"SHAPE": "Xxxxx", "OP": "?"}, 
          {"TEXT": "Straße"}, 
          {"IS_PUNCT": True, "OP": "?"}, 
          {"SHAPE": {"IN": ["d", "dd", "ddd", "dddx", "ddx", "dx", "d.", "dd.", "ddd."]}, "OP": "?"}
      ]}
    ]

It matches:

Müllerstr. 26
Müllerstr 26
Müllerstraße
Müllerstraße 26
Müllerplatz
Müllerstraße 26a
Müller Straße 26
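
The answer does not show how the patterns are applied. Dictionaries with "label" and "pattern" keys are the format spaCy's EntityRuler accepts, so one plausible way to use them (an assumption, not necessarily what the author did) is sketched below with a trimmed-down copy of the first pattern, a blank German pipeline, and spaCy v3's add_pipe API.

```python
import spacy

# Blank German pipeline: tokenizer only, no trained components needed
nlp = spacy.blank("de")
ruler = nlp.add_pipe("entity_ruler")

street_labels = r".*(platz|[Ss]tra(ss|ß)e|str)$"
house_shapes = ["d", "dd", "ddd", "dx", "ddx", "dddx"]

# Trimmed-down copy of the first pattern (end-of-sentence shapes omitted)
patterns = [
    {"label": "ADR",
     "pattern": [
         {"TEXT": {"REGEX": street_labels}},
         {"IS_PUNCT": True, "OP": "?"},
         {"SHAPE": {"IN": house_shapes}, "OP": "?"},
     ]},
]
ruler.add_patterns(patterns)

doc = nlp("Ich wohne in der Müllerstraße 26a in Berlin")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

When several candidate spans overlap (here "Müllerstraße" and "Müllerstraße 26a"), the EntityRuler keeps the longest one, so the house number ends up inside the entity.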

One thing is odd: if the house number is at the end of a sentence, spaCy attaches the punctuation to the token (for example "26." stays a single token). So that case needs to be handled as well.

To cover cases where the house number comes before the street name, add an optional SHAPE token at the start of the pattern.
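
A minimal sketch of that idea, assuming a plain Matcher on a blank German pipeline and a shortened shape list (both choices are illustrative, not taken from the answer):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("de")
matcher = Matcher(nlp.vocab)

street_labels = r"(platz|[Ss]tra(ss|ß)e|str)$"

# Optional house number token, reused before and after the street name
house_number = {
    "SHAPE": {"IN": ["d", "dd", "ddd", "dx", "ddx", "dddx"]},
    "OP": "?",
}

pattern = [house_number, {"TEXT": {"REGEX": street_labels}}, house_number]

# greedy="LONGEST" prefers the match that includes the house number
matcher.add("ADR", [pattern], greedy="LONGEST")


def find_streets(text):
    """Return the text of every matched span (hypothetical helper)."""
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


print(find_streets("Lieferung an 26 Müllerstraße in Berlin"))
print(find_streets("Lieferung an Müllerstraße 26 in Berlin"))
```

Because both number slots are optional, a street name surrounded by two numbers would take both of them into the match; whether that is acceptable depends on the data.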

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 polm23
Solution 2 krisograbek
Solution 3