'How to merge on approximate strings?
I would like to merge 2 Data Frames on approximate country name with merge as of but i am getting the following error:
TypeError:'NoneType' object is not callable
Please see the illustrative code below:
cl = {'Country' : ["Brazil", "US", "Russia"], 'BL?':['No', 'No','Yes']}
clist = pd.DataFrame.from_dict(cl)
cd = {'Country' : ["Braizl", "us", "Rusia"]}
cdata = pd.DataFrame.from_dict(cd)
clist = clist.sort_values('Country')
cdata = cdata.sort_values('Country')
cdata = pd.merge_asof(cdata,clist,on='Country')
The expected result would merge the two dfs and the cdata df would have the 'BL?' column with YES/NO values.
Thank you in advance!
Solution 1:[1]
This should get you close but it will not be 100% accurate. You can use fuzzywuzzy. fuzzywuzzy uses Levenshtein distance to calculate the difference between two strings:
from fuzzywuzzy import process
# create a choice list
choices = clist['Country'].values.tolist()
# apply fuzzywuzzy to each row using lambda expression
cdata['Close Country'] = cdata['Country'].apply(lambda x: process.extractOne(x, choices)[0])
# merge
cdata.merge(clist, left_on='Close Country', right_on='Country')
Country_x Close Country Country_y BL?
0 Braizl Brazil Brazil No
1 Rusia Russia Russia Yes
2 us US US No
You can even return the percent match and only keep values > n if you want to only keep matches say greater than 85%
add percent match
from fuzzywuzzy import process
# create a choice list
choices = clist['Country'].values.tolist()
# apply fuzzywuzzy to each row using lambda expression
cdata['Close Country'] = cdata['Country'].apply(lambda x: process.extractOne(x, choices))
# add percent match wiht apply
cdata[['Close Country', 'Percent Match']] = cdata['Close Country'].apply(pd.Series)
# merge
cdata.merge(clist, left_on='Close Country', right_on='Country')
Country_x Close Country Percent Match Country_y BL?
0 Braizl Brazil 83 Brazil No
1 Rusia Russia 91 Russia Yes
2 us US 100 US No
you can do boolean indexing either before the merge to remove bad match then merge:
cdata[['Close Country', 'Percent Match']] = cdata['Close Country'].apply(pd.Series)
cdata = cdata[cdata['Percent Match']>85]
or you can do it after the merge:
merge = cdata.merge(clist, left_on='Close Country', right_on='Country')
merge[merge['Percent Match'] > 85]
fuzzywuzzy returns the percent match as part of the process function. In the first example I removed it by calling the first element of the tuple: process.extractOne(x, choices)[0]
Solution 2:[2]
Given your example I came to a solution. This is not very pythonic but it works! (assuming you have a matching country name in clist for every cdata mispelled country)
def get_closest(x, column):
tmp = 1000
for i2, r2 in clist.iterrows():
levenshtein = editdistance.eval(x,r2['Country'])
if levenshtein <= tmp:
tmp = levenshtein
res = r2
return res['BL?']
cdata['BL'] = cdata['Country'].apply(lambda x: get_closest(x, clist))
Output :
Country BL
0 Braizl No
1 us No
2 Rusia Yes
I am using the editdistance library to compute the levenshtein distance. You can install it with pip :
pip install editdistance
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | ldoe |
