Difference Between Local Nominatim and OpenStreetMap Website While Geocoding

I installed Nominatim by following the instructions in the Nominatim documentation, and it's a really great project. My local instance runs in Docker, using this GitHub repository as suggested in the documentation.

I have almost 13k addresses to geocode. My first attempt with the local instance geocoded only about 6k of them successfully.

When I sent some of the failed addresses to the Nominatim website, it returned results that my local instance had not.

Consider the following example, an address in Slovakia: Ruzová 173/35 Borsa 7632 SK

The Nominatim website returns a result for this query, but my local instance does not.

Attempt #1 I checked the details pages and found that my local instance's software version is 4.0.1, whereas the website's version is 4.0.99-4.

So I built Nominatim from the master branch of the repository, and my local instance is now 4.0.99-5. Unfortunately this did not solve the issue, but I kept using 4.0.99-5 for the following attempts.

Attempt #2 I added the Wikipedia data while building the local instance, thinking it would help, but it didn't.

Attempt #3 I changed the locale of the container to sk_SK.UTF-8 in case it was a language issue, but that didn't work either.

What I discovered is that fixing the postal code with a leading 0 (in this case changing 7632 to 07632) solves the problem. In some cases it also works when I delete the postal code completely, or when I separate it like 85 107, but I really don't know why. This trick improved my coverage considerably: almost 9k addresses are now geocoded (a ~3k improvement).
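To make the postal code workaround concrete, here is a minimal Python sketch of how the retry queries could be generated. The function names (postcode_variants, candidate_queries) are hypothetical helpers, not part of Nominatim. The 3+2 spacing is an assumption about how Slovak postal codes are usually written; my own working example above was split as 85 107, so adjust the split if that form works better for your data.

```python
import re


def postcode_variants(raw: str, width: int = 5) -> list[str]:
    """Return postal code spellings to try, most specific first.

    Assumption: codes are `width` digits long and a leading zero was lost
    somewhere upstream (e.g. the column was stored as an integer).
    """
    digits = re.sub(r"\D", "", raw)
    if not digits:
        return [""]  # nothing usable: query without a postal code
    padded = digits.zfill(width)  # "7632" -> "07632"
    variants = [padded]
    if len(padded) == 5:
        # Assumed 3+2 spacing, e.g. "076 32"
        variants.append(f"{padded[:3]} {padded[3:]}")
    variants.append("")  # last resort: drop the postal code entirely
    return variants


def candidate_queries(street: str, city: str, postcode: str, country: str) -> list[str]:
    """Build free-form query strings, one per postal code variant."""
    queries = []
    for pc in postcode_variants(postcode):
        parts = [street, city, pc, country]
        queries.append(" ".join(p for p in parts if p))
    return queries


if __name__ == "__main__":
    for q in candidate_queries("Ruzová 173/35", "Borsa", "7632", "SK"):
        print(q)
```

Each candidate would be sent to the local instance in order, stopping at the first one that returns a result.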

I hope this is clear. Even a small piece of advice or a comment would be very helpful. Here are the possible next attempts I am considering:

Possible Attempt #4

  • Should I change my tokenizer? Does the Nominatim website use NOMINATIM_TOKENIZER=icu or NOMINATIM_TOKENIZER=legacy? Right now I'm using legacy, so maybe Attempt #4 should be switching to icu, because in debug mode I see the following output:

Nominatim Site

Tokenization
Phrase:  'ruzova 173 35 borsa 7632 sk'
Tokens:  'ruzova' => 'ruzova'
         'ruzova 173' => 'ruzova 173'
         'ruzova 173 35' => 'ruzova 173 35'
         'ruzova 173 35 borsa' => 'ruzova 173 35 borsa'
         'ruzova 173 35 borsa 7632' => 'ruzova 173 35 borsa 7632'
         'ruzova 173 35 borsa 7632 sk' => 'ruzova 173 35 borsa 7632 sk'
         0 => '173'
         '173 35' => '173 35'
         '173 35 borsa' => '173 35 borsa'
         '173 35 borsa 7632' => '173 35 borsa 7632'
         '173 35 borsa 7632 sk' => '173 35 borsa 7632 sk'
         1 => '35'
         '35 borsa' => '35 borsa'
         '35 borsa 7632' => '35 borsa 7632'
         '35 borsa 7632 sk' => '35 borsa 7632 sk'
         'borsa' => 'borsa'
         'borsa 7632' => 'borsa 7632'
         'borsa 7632 sk' => 'borsa 7632 sk'
         2 => '7632'
         '7632 sk' => '7632 sk'
         'sk' => 'sk'
WordLists:  0 => 0 => 'ruzova'
                 1 => '173'
                 2 => '35'
                 3 => 'borsa'
                 4 => '7632'
                 5 => 'sk'
SELECT word_id, word_token, type, word, info->>'op' as operator, info->>'class' as class, info->>'type' as ctype, info->>'count' as count FROM word WHERE word_token in ('ruzova','ruzova 173','ruzova 173 35','ruzova 173 35 borsa','ruzova 173 35 borsa 7632','ruzova 173 35 borsa 7632 sk','173','173 35','173 35 borsa','173 35 borsa 7632','173 35 borsa 7632 sk','35','35 borsa','35 borsa 7632','35 borsa 7632 sk','borsa','borsa 7632','borsa 7632 sk','7632','7632 sk','sk')

Local

Tokenization
SELECT make_standard_name(:0) as p0

SQL parameters:  ':0' => 'Ruzová 173/35 Borsa 7632 SK'
SQL result:  'p0' => 'ruzova 173 35 borsa 7632 sk'
Tokens:  ' ruzova' => ' ruzova'
         'ruzova' => 'ruzova'
         ' ruzova 173' => ' ruzova 173'
         'ruzova 173' => 'ruzova 173'
         ' ruzova 173 35' => ' ruzova 173 35'
         'ruzova 173 35' => 'ruzova 173 35'
         ' ruzova 173 35 borsa' => ' ruzova 173 35 borsa'
         'ruzova 173 35 borsa' => 'ruzova 173 35 borsa'
         ' ruzova 173 35 borsa 7632' => ' ruzova 173 35 borsa 7632'
         'ruzova 173 35 borsa 7632' => 'ruzova 173 35 borsa 7632'
         ' ruzova 173 35 borsa 7632 sk' => ' ruzova 173 35 borsa 7632 sk'
         'ruzova 173 35 borsa 7632 sk' => 'ruzova 173 35 borsa 7632 sk'
         ' 173' => ' 173'
         173 => '173'
         ' 173 35' => ' 173 35'
         '173 35' => '173 35'
         ' 173 35 borsa' => ' 173 35 borsa'
         '173 35 borsa' => '173 35 borsa'
         ' 173 35 borsa 7632' => ' 173 35 borsa 7632'
         '173 35 borsa 7632' => '173 35 borsa 7632'
         ' 173 35 borsa 7632 sk' => ' 173 35 borsa 7632 sk'
         '173 35 borsa 7632 sk' => '173 35 borsa 7632 sk'
         ' 35' => ' 35'
         35 => '35'
         ' 35 borsa' => ' 35 borsa'
         '35 borsa' => '35 borsa'
         ' 35 borsa 7632' => ' 35 borsa 7632'
         '35 borsa 7632' => '35 borsa 7632'
         ' 35 borsa 7632 sk' => ' 35 borsa 7632 sk'
         '35 borsa 7632 sk' => '35 borsa 7632 sk'
         ' borsa' => ' borsa'
         'borsa' => 'borsa'
         ' borsa 7632' => ' borsa 7632'
         'borsa 7632' => 'borsa 7632'
         ' borsa 7632 sk' => ' borsa 7632 sk'
         'borsa 7632 sk' => 'borsa 7632 sk'
         ' 7632' => ' 7632'
         7632 => '7632'
         ' 7632 sk' => ' 7632 sk'
         '7632 sk' => '7632 sk'
         ' sk' => ' sk'
         'sk' => 'sk'
WordLists:  0 => 0 => 'ruzova'
                 1 => '173'
                 2 => '35'
                 3 => 'borsa'
                 4 => '7632'
                 5 => 'sk'
SELECT word_id, word_token, word, class, type, country_code, operator, coalesce(search_name_count, 0) as count FROM word WHERE word_token in (' ruzova','ruzova',' ruzova 173','ruzova 173',' ruzova 173 35','ruzova 173 35',' ruzova 173 35 borsa','ruzova 173 35 borsa',' ruzova 173 35 borsa 7632','ruzova 173 35 borsa 7632',' ruzova 173 35 borsa 7632 sk','ruzova 173 35 borsa 7632 sk',' 173','173',' 173 35','173 35',' 173 35 borsa','173 35 borsa',' 173 35 borsa 7632','173 35 borsa 7632',' 173 35 borsa 7632 sk','173 35 borsa 7632 sk',' 35','35',' 35 borsa','35 borsa',' 35 borsa 7632','35 borsa 7632',' 35 borsa 7632 sk','35 borsa 7632 sk',' borsa','borsa',' borsa 7632','borsa 7632',' borsa 7632 sk','borsa 7632 sk',' 7632','7632',' 7632 sk','7632 sk',' sk','sk')
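Comparing the two dumps: both list every contiguous run of words from the normalized phrase, and the local dump additionally lists each run with a leading space. The Python sketch below only mimics the lists printed in the debug output. It is not Nominatim's code, and the function names are hypothetical.

```python
def site_style_tokens(words: list[str]) -> list[str]:
    """Every contiguous run of words, in the order shown in the site dump."""
    tokens = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            tokens.append(" ".join(words[start:end]))
    return tokens


def local_style_tokens(words: list[str]) -> list[str]:
    """Same runs, each preceded by its space-prefixed twin, as in the local dump."""
    tokens = []
    for phrase in site_style_tokens(words):
        tokens.append(" " + phrase)
        tokens.append(phrase)
    return tokens


if __name__ == "__main__":
    words = "ruzova 173 35 borsa 7632 sk".split()
    print(len(site_style_tokens(words)), "tokens in the site-style list")
    print(len(local_style_tokens(words)), "tokens in the local-style list")
```

For the six-word example phrase this gives 21 lookups for the website and 42 for the local instance, matching the two SELECT statements above.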

Possible Attempt #5

  • Should I change my PostgreSQL instance? Currently using: starting PostgreSQL 12.9 (Ubuntu 12.9-0ubuntu0.20.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit

Your feedback really matters to me, and any help is appreciated. Thank you!



Solution 1:[1]

I had a very similar problem and solved it by running nominatim special-phrases --import-from-wiki, as suggested (though without much emphasis) in the documentation. Now my local instance returns the very same results as the online one.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Tommaso De Lorenzo