'How to deal with edge cases where fuzzy matching does not return a result?
I have two tables in a MySQL database:
addressthat includes adescriptionfield which at most has a city name and a country buried somewhere inside it. There are alsolatitudeandlongitudefieldscitythat includesnameandlatitudeandlongitude(sourced from geonames.org)
For each record in address I need to assign the city.id of a matching record in city.
My current solution involves using Python to cycle through each unassigned record in address and query city for all records where city.latitude and city.longitude are close to address.latitude and address.longitude. I then use a fuzzy matching function to compare address.description with the city.name of each city record and add the city.id of the highest scoring city record to the address record.
This works fine for 99.9% of the records but there are always some exceptions where address.description is just too different to city.name. For example "Cologne" vs "Koln".
I could include a dictionary of alternative city names inside the Python code but this isn't a scalable solution. I'm thinking that I need to have some kind of option that if Python can't find a fuzzy match on address.city_name then I build in some functionality so I can manually identify the city name in address.description and then add this to a city.alternative_name field. I'll then adapt the Python code to search the city.alternative_name field if a match can't be found against city.name field. Geonames actually includes a alternative city names column in their data extract so I could load this in as some base data.
Is a city.alternative_name field the best approach to deal with unmatchable records? Worth me mentioning that I'm not looking for advice on alternative Python packages for matching city names, e.g. geodict or geotxt. These have the same issue as my code - at some point they won't be able to find a match. I need to develop a solution for these edge cases.
I also need to think about the possibility of having multiple alternative city names and I could see things getting messy at that point. I guess I could create a city_alternative_name table where I could add multiple records with different city_alternative_name.name fields that were linked to the same city record...?
Solution 1:[1]
Your 99.9% is absolutely great! Since you wrote that when it fails it retrieves nothing, you may use a different fuzzy approach for this part. The idea of looking for alternative city names in a table is good, but it would fail if there's a typo in the city name.
You can use FuzzyWuzzy to do this. Pass to it a json dictionary or even a list of exceptional city names that are not available in geodict. Then, use your preferred ratio type. This will help you to extract the city and get the matching score to decide if you need further processing. Here's an example for the code you can use:
choices = ["Atlanta", "New York", "Florida", "Dallas"]
process.extractOne(" Villa 90, Soho Square, Florida, USA.", choices, scorer=fuzz.partial_ratio)
The result will be ("Florida", 100)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
