'How to ensure correct column linkin if names are fuzzy? In python
I have .txt file of 400k records - read in OCR from receipts. I focus on 2 columns: store_id and address_store (table in attached pic). Inthe real world each store_id should be linked to a single store address, but there were some OCR errors.
What, I have noticed:
- most ids are correctly linked;
- there are 3 types of errors in id linkin (mispell,999, and blank)
- additionally addresses are fuzzy names. table
What kind of algorithm / model whould be the best solution here? Unfortunately, I dont have any dcitionary for correct names.
If I used wrong terminology, please correct me.
Solution 1:[1]
I don't think there is a simple answer that will allow you to clean up this data, and I can't think of a method that is not somewhat iterative. Given the table you have provided I am assuming all 8 of these should be
- store_id = 156
- store_addr = 1 Banana Way, Heaven
I am also assuming the important information is the store_id - that will then be matched to something else - so the key outcome is to map each record to the correct store_id.
Given those assumptions are correct my approach would be iterative, and to be honest a number of these techniques I would find easier in SQL (which I have decades experience using) rather than Python (that I have only been using for a few years) - but that may be a function of familiarity.
The absolute first thing I would do is scan for whitespace anomolies, particularly if OCR is your datasource. You will almost certainly have double spaces, wide spaces etc (i.e. all of the unicode characters between 2000 and 2016) convert these all to ascii 32. Likewise with dashes and underscores. OCR is trying to be accurate but for this exercise it is just noise.
From there:
- Find a unique list of store_address, with a count of records and distinct store_id for each store_address
- If your OCR is of reasonable quality a small number of addresses will cover the majority of the records (e.g. on this list of eight 50% are '1 Banana Way, Heaven' no more than 1 other record has any of the other store_address values - not quite pareto but heading that way)
- Sort the list of addresses with high hit rate and look for minor mispellings (e.g. '1 Banana Way Heaven' is only 1 character away [the missing comma] so match that as a misspelling).
- Decide on the correct store_id for each of the addresses you have identified.
- The correct store_id can then be assigned to the 126, 999& blank record
- Use the list from 4 to match the store_id of unmatched addresses ? should be fairly easy to see 'Fruit store, 1 Banana hay' is a modification of '1 Banana Way, Heaven'
- Use the cross match list from 6 to find all the variations of 'Fruit store, 1 Banana hay' and assign them to store 156
If the OCR source for this is of fairly high quality, you should find between about 90% and 99%+ are now mapped. On your dataset only '1 Banana Hay, Heaven' and '1 Bananaway Heaven' are not matched.
If it is above 99% then you will probably find the remainder are best done by manual inspection (tedious, but probably quicker than most of the code approaches).
If it is closer to 90% you need to do one or more of the following:
- Shift everything to upercase and repeat the above
- Eliminate the punctuation and repeat the above
- Eliminate the white space and repeat the above
By this stage your list above would be clean, however if you are still not down to a list where it is manageable by inspection, you are needing to look at one of the phonetic algoritms (Soundex https://en.wikipedia.org/wiki/Soundex) is widely implemented Is there a soundex function for python? although there are other more sophisticated ones.) My experience is using Soundex before you are to this stage is probably going to be of limited help - Soundex does a good job finding Lochlan versus Lachlan (which sound basically the same), but is not good where you have OCR where it is the written characters that vary.
From a technical perspective (in Python):
txt.replace()andtxt.translate()are powerful tools to fix the whitespace problemlst.sort(key=store_addr)to allow you to sequence- From about step 3 above, you will need to start maintaining 2 dictionaries
- link store_number and valid addresses
- link OCR address and valid address You could link OCR to store number but this will be harder to inspect.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | apc |
