Fuzzy merge with Python pandas

I would like to expand the solution below, taken from this question, so that it not only returns the closest match but also merges in the corresponding columns from df3.

I have been looking at the function, but I do not fully understand how to alter it so that the result includes all columns from df3.


Example dataframes

import pandas as pd 
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

df3 = pd.DataFrame({
    'Key': ['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry'],
    'second_col': ['A', 'B', 'C', 'D', 'E', 'F'],
})

print(df3)

# df1
          Key
0       Apple
1      Banana
2      Orange
3  Strawberry

# df2
        Key
0      Aple
1     Mango
2      Orag
3     Straw
4  Bannanna
5     Berry

# df3

        Key second_col
0      Aple          A
1     Mango          B
2      Orag          C
3     Straw          D
4  Bannanna          E
5     Berry          F

Function for fuzzy matching

from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=1):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: minimum score (0-100) a candidate needs to count as a match; the score is based on Levenshtein distance
    :param limit: the number of matches to return; these are sorted from high to low
    :return: df_1 with an added 'matches' column (note: df_1 is modified in place)
    """
    # Candidate strings from the right table
    s = df_2[key2].tolist()

    # For each left key, get the top `limit` candidates as (match, score) tuples
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m

    # Keep only candidates scoring at least `threshold` and join them into one string
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1

Using our function on the dataframes: #1

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80, limit=2)

          Key       matches
0       Apple          Aple
1      Banana      Bannanna
2      Orange          Orag
3  Strawberry  Straw, Berry
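
The comma-joined matches string above can be split back into one row per match and then joined to df3 with an ordinary merge. This is only a minimal sketch: it assumes the output of example #1 (rebuilt by hand here as `matched`, so the snippet runs without fuzzywuzzy) and assumes that no key itself contains ', '. The names `matched`, `long_form` and `merged` are illustrative and not part of the original solution.

```python
import pandas as pd

# Output of fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80, limit=2), rebuilt by hand
matched = pd.DataFrame({
    'Key': ['Apple', 'Banana', 'Orange', 'Strawberry'],
    'matches': ['Aple', 'Bannanna', 'Orag', 'Straw, Berry'],
})
df3 = pd.DataFrame({
    'Key': ['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry'],
    'second_col': ['A', 'B', 'C', 'D', 'E', 'F'],
})

# Turn the joined string into a list, then into one row per individual match
long_form = matched.assign(matches=matched['matches'].str.split(', ')).explode('matches')

# Rename df3's key so it lines up with 'matches'; a left merge keeps unmatched rows
merged = long_form.merge(df3.rename(columns={'Key': 'matches'}), on='matches', how='left')
print(merged)
```

With limit=2, 'Strawberry' ends up on two rows, one for each of its matches, and each row carries the second_col value of the df3 row it matched.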

Using our function on the dataframes: #2

df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})

fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)

        Col1  matches
0  Microsoft  Mcrsoft
1     Google    gogle
2     Amazon   Amason
3        IBM         

Using our function on the dataframes: #3

# df1 was redefined in example #2, so rebuild the fruit dataframe first
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})

fuzzy_merge(df1, df3, 'Key', 'Key', threshold=80)

          Key       matches
0       Apple          Aple
1      Banana      Bannanna
2      Orange          Orag
3  Strawberry         Berry
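
One possible way to get the df3 columns in a single call is to keep only the best match per row and then left-merge on it. The sketch below is an assumption-laden variant, not the original solution: `fuzzy_merge_columns` and `simple_ratio` are hypothetical names, the function returns a copy instead of modifying its input, and the default scorer uses the standard library's difflib rather than fuzzywuzzy. That plain ratio does no partial matching, so 'Strawberry' finds no match at threshold 80 here; passing a fuzzywuzzy scorer such as `fuzz.WRatio` as `scorer` should behave more like the output above.

```python
import difflib
import pandas as pd

def simple_ratio(a, b):
    """Similarity score 0-100 from the standard library (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * difflib.SequenceMatcher(None, a, b).ratio())

def fuzzy_merge_columns(left, right, key_left, key_right, threshold=80, scorer=simple_ratio):
    """Return a copy of `left` plus the best match and that match's columns from `right`.

    Assumes `right` is non-empty and `left` has no column called 'matches' yet.
    """
    choices = right[key_right].tolist()
    best = []
    for value in left[key_left]:
        # Highest-scoring candidate; ties go to the first one in `choices`
        score, match = max(((scorer(value, c), c) for c in choices), key=lambda t: t[0])
        best.append(match if score >= threshold else None)

    out = left.copy()
    out['matches'] = best
    # Rename the right key so the merge produces a single 'matches' column
    renamed = right.rename(columns={key_right: 'matches'})
    return out.merge(renamed, on='matches', how='left')

df1 = pd.DataFrame({'Key': ['Apple', 'Banana', 'Orange', 'Strawberry']})
df3 = pd.DataFrame({
    'Key': ['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry'],
    'second_col': ['A', 'B', 'C', 'D', 'E', 'F'],
})

result = fuzzy_merge_columns(df1, df3, 'Key', 'Key', threshold=80)
print(result)
```

Because the merge is a left join, rows without a match above the threshold are kept and simply get missing values in the columns coming from df3.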

Installation:

Pip

pip install fuzzywuzzy

Anaconda

conda install -c conda-forge fuzzywuzzy


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source