Regex optional groups and digit length

Maybe some regex master can solve my problem.

I have a big list of addresses with no separators (, ;). Each address string contains the following information:

  • The first group is the street name
  • The second group is the street number
  • The third group is the zipcode (optional)
  • The last group is the town name (optional)

regex_png

As you can see in the image above, the last two test strings do not match. I need the last two regex groups to be optional, and the third group should be either 4 or 5 digits.

I tried (\d{4,5}) to allow both 4 and 5 digits, but this only works halfway, as you can see here: https://regex101.com/r/ZurqHh/1
regex_4_5_digits (This sometimes mixes the street number and zipcode together)
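
A small sketch of why the mixing can happen, using Python's `re` module. The address is a made-up example, not one of the test strings from the links, and the "loose" pattern assumes that only the zip group was changed to \d{4,5}. The second group contains [\d\s]+, which may run across the gap into the zip code. Once the zip group accepts four digits, the engine seems free to hand the first zip digit to the street number.

```python
import re

# Pattern from the question with only the zip group relaxed (assumed edit).
LOOSE = re.compile(
    r"^([a-zäöüÄÖÜß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+\/]\s?\d+)?\s*[a-z]?)?\s*(\d{4,5})\s*(.+)?$",
    re.IGNORECASE | re.MULTILINE,
)
# Pattern exactly as posted, with a fixed five-digit zip group.
STRICT = re.compile(
    r"^([a-zäöüÄÖÜß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+\/]\s?\d+)?\s*[a-z]?)?\s*(\d{5})\s*(.+)?$",
    re.IGNORECASE | re.MULTILINE,
)

sample = "Hauptstraße 12 12345 Berlin"  # hypothetical address

loose = LOOSE.match(sample)
strict = STRICT.match(sample)

# Group 2 is the street number, group 3 the zip code.
print(repr(loose.group(2)), repr(loose.group(3)))    # number swallows a zip digit
print(repr(strict.group(2)), repr(strict.group(3)))  # zip stays intact
```

With the five-digit version the greedy [\d\s]+ has to back off until exactly five digits remain, so the split lands on the space. With {4,5} it can stop one character earlier.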

I also tried (?:\d{5})? to make the third and fourth groups optional, but this destroys my whole group layout... https://regex101.com/r/EgxeMy/1

regex_optional

This is my current regex:

/^([a-zäöüÄÖÜß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+\/]\s?\d+)?\s*[a-z]?)?\s*(\d{5})\s*(.+)?$/im

Try it out yourself: https://regex101.com/r/zC8NCP/1

My brain is only farting at this moment and I can't think straight anymore.

Please help me fix this problem so I can die in peace.



Solution 1:[1]

It is difficult to parse addresses because they sit halfway between formatted text and natural language. Here is a pattern that keeps the number of optional parts as small as possible, so that it succeeds on the examples given without asking too much of the regex engine. To do this, I rely mainly on character classes, atomic groups, and a relatively precise description of street names. Obviously, the examples in the question cannot be representative of every real address, and characters may need to be added to or removed from the classes to handle new cases. Nevertheless, the structure of this pattern is a good starting point.

~
^
(?<strasse> [\pL\d-]+ \.? (?> \h+ [\pL\d-]+ \.? )*? ) \h*
(?<nummer> \b (?> \d+ | [-+/\h]+ | [a-z] \b )*? )
(?: \h+ (?<plz> \d{4,5} )
    \h+ (?<stadt> .+ ) )?
$
~mxui

demo

Note that in the above link you can also see a previous version of this pattern with a more accurate description of the street number (a bit more efficient but longer).
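
For readers who want to try the same structure outside PCRE, here is a rough port to Python's standard `re` module. It is a sketch under several assumptions, not an exact equivalent: \pL becomes \w (which also admits "_"), \h becomes [ \t], and the atomic groups are written as plain non-capturing groups because older Python versions do not support (?>...). The function name `parse` and the sample addresses are made up for illustration.

```python
import re

# Approximate port of the PCRE pattern above:
#   \pL   -> \w      (also matches "_", unlike \pL)
#   \h    -> [ \t]
#   (?>…) -> (?:…)   (atomic groups need Python 3.11+, so plain groups here)
ADDRESS = re.compile(
    r"""
    ^
    (?P<strasse> [\w-]+ \.? (?: [ \t]+ [\w-]+ \.? )*? ) [ \t]*
    (?P<nummer>  \b (?: \d+ | [-+/ \t]+ | [a-z] \b )*? )
    (?: [ \t]+ (?P<plz> \d{4,5} )
        [ \t]+ (?P<stadt> .+ ) )?
    $
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = ADDRESS.match(line.strip())
    return match.groupdict() if match else None


# Hypothetical sample addresses, not taken from the question's test data.
for line in (
    "Hauptstraße 12 12345 Berlin",
    "Am Alten Markt 5a 1234 Wien",
    "Hauptstraße 12",
):
    print(parse(line))
```

Because the zip code and town sit together in one optional group, a line that ends after the street number still matches and simply leaves `plz` and `stadt` as None. Dropping the atomic groups should not matter for short inputs like these, but it may cost more backtracking on long lines that fail to match.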

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1