Fuzzy merge with Python pandas
I would like to extend the solution below, taken from this question, so that it not only returns the closest match but also merges in the corresponding columns from df3.
I have been looking at the function, but I do not fully understand how I should alter it to include all columns from df3.
Example dataframes
import pandas as pd
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
df3 = {
'Key' : ['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry'],'second_col':['A', 'B', 'C', 'D', 'E', 'F']
}
df3 = pd.DataFrame(df3)
print(df3)
# df1
Key
0 Apple
1 Banana
2 Orange
3 Strawberry
# df2
Key
0 Aple
1 Mango
2 Orag
3 Straw
4 Bannanna
5 Berry
# df3
Key second_col
0 Aple A
1 Mango B
2 Orag C
3 Straw D
4 Bannanna E
5 Berry F
Function for fuzzy matching
from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=1):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: minimum similarity score (0-100, based on Levenshtein distance) a candidate needs to be returned as a match
    :param limit: the number of matches that will get returned, these are sorted high to low
    :return: dataframe with both keys and matches
    """
    # Candidate strings taken from the right table
    s = df_2[key2].tolist()
    # For each left key, collect up to `limit` (candidate, score) tuples
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    # Keep only candidates scoring at or above the threshold, joined into one string
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    # Note: the 'matches' column is added to df_1 in place
    return df_1
Using our function on the dataframes: #1
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80, limit=2)
Key matches
0 Apple Aple
1 Banana Bannanna
2 Orange Orag
3 Strawberry Straw, Berry
Using our function on the dataframes: #2
df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})
fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)
Col1 matches
0 Microsoft Mcrsoft
1 Google gogle
2 Amazon Amason
3 IBM
Using our function on the dataframes: #3
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# df1 must be the original fruit dataframe from the example above, not the one redefined in #2
fuzzy_merge(df1, df3, 'Key', 'Key', threshold=80)
Key matches
0 Apple Aple
1 Banana Bannanna
2 Orange Orag
3 Strawberry Berry
Installation:
Pip
pip install fuzzywuzzy
Anaconda
conda install -c conda-forge fuzzywuzzy
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
