'Fuzzy matching for groups in pandas

I have the following dataset:

import pandas as pd
from fuzzywuzzy import fuzz 

adf = pd.DataFrame({'my_addresses':['234 Rue Morgue','234 R. Morgue','234 La rue morgue, 234','312 La rue moulin','24 Rue Marbeau','24 La Rue Marbeau', '28 Boulevard Suchet','28 Blvd Suchet'],'my_group':['A1','A1','A1','A1','B2','B2','B2','B2']})
adf

    my_addresses            my_group
0   234 Rue Morgue          A1
1   234 R. Morgue           A1
2   234 La rue morgue, 234  A1
3   312 La rue moulin       A1
4   24 Rue Marbeau          B2
5   24 La Rue Marbeau       B2
6   28 Boulevard Suchet     B2
7   28 Blvd Suchet          B2

And I want to apply a custom function. This function creates subgroups given a pandas object, using fuzzy match.

def match_groups(grp_obj, threshold = 67):
    groups = []
    for i in range(len(grp_obj.to_list())):
        groups.append(i+1)
    for i, val_i in enumerate(grp_obj.to_list()):
        for j, val_j in enumerate(grp_obj.to_list()):
            if j>i:
                ratio = fuzz.ratio(val_i, val_j)
                if ratio >= threshold:
                    groups[j] = groups[i]
    return groups

For example, let's calculate subgroups for the group B2:

match_groups(adf[adf['my_group']=='B2'].my_addresses, 67)

It returns this list:

[1, 1, 3, 3]

Ok. Now I want to apply this function to each group. This is the expected result:

    my_addresses            my_group    my_subgroup
0   234 Rue Morgue          A1          1
1   234 R. Morgue           A1          1
2   234 La rue morgue, 234  A1          1
3   312 La rue moulin       A1          4
4   24 Rue Marbeau          B2          1
5   24 La Rue Marbeau       B2          1
6   28 Boulevard Suchet     B2          3
7   28 Blvd Suchet          B2          3

So I tried this:

adf['my_subgroup'] = adf.groupby('my_group')['my_addresses'].apply(match_groups)

But it returns me:


    my_addresses            my_group    my_subgroup
0   234 Rue Morgue          A1          NaN
1   234 R. Morgue           A1          NaN
2   234 La rue morgue, 234  A1          NaN
3   312 La rue moulin       A1          NaN
4   24 Rue Marbeau          B2          NaN
5   24 La Rue Marbeau       B2          NaN
6   28 Boulevard Suchet     B2          NaN
7   28 Blvd Suchet          B2          NaN

I think my problem is that the function returns a list, but I'm not sure how to pass this to the data frame as the column required. Please, any help will be appreciated.



Solution 1:[1]

One way might be to create a parallel DataFrame, then join. Here are a couple of variations on that approach. There may well be a better way.

Here's a slightly modified match_groups function, so that it takes a Series rather than a DataFrame:

def match_groups(addresses, threshold):
    subgroups = [i for i in range(1, len(addresses)+1)]
    for i, val_i in enumerate(addresses):
        for j, val_j in enumerate(addresses):
            if j>i:
                ratio = fuzz.ratio(val_i, val_j)
                if ratio >= threshold:
                    subgroups[j] = subgroups[i]
    return subgroups

This still returns a list.

By the way, computing subgroups this way can give different results depending on the order it sees the input, so be careful.

Using concat and join

We make one DataFrame for each group, computing our subgroups, and preserving the index values from the original DataFrame, adf, then paste them back together with concat:

df_subgroups = pd.concat(
    pd.DataFrame({
        "my_subgroup": match_groups(
            adf.my_addresses[adf.my_group==grp],
            threshold=67)},
        index=adf.index[adf.my_group==grp])
    for grp in adf.my_group.unique()
)

The result looks like this:

   my_subgroup
0            1
1            1
2            1
3            4
4            1
5            1
6            3
7            3

Now, join on the shared index:

adf.join(df_subgroups)

             my_addresses my_group  my_subgroup
0          234 Rue Morgue       A1            1
1           234 R. Morgue       A1            1
2  234 La rue morgue, 234       A1            1
3       312 La rue moulin       A1            4
4          24 Rue Marbeau       B2            1
5       24 La Rue Marbeau       B2            1
6     28 Boulevard Suchet       B2            3
7          28 Blvd Suchet       B2            3

Using groupby, transform, and join

We can do a little better with groupby and transform. Transform takes care of the concat-ing and and index preserving step above.

df_subgroups = adf.groupby('my_group').transform(match_groups, threshold=67)
df_subgroups.columns = ["my_subgroup"]

We get the same result as before - a DataFrame with the original index preserved so we can join. Annoyingly, it names the new column "my_addresses", so we give it the name we want.

   my_subgroup
0            1
1            1
2            1
3            4
4            1
5            1
6            3
7            3

Then, apply the same join as before:

adf.join(df_subgroups)

             my_addresses my_group  my_subgroup
0          234 Rue Morgue       A1            1
1           234 R. Morgue       A1            1
2  234 La rue morgue, 234       A1            1
3       312 La rue moulin       A1            4
4          24 Rue Marbeau       B2            1
5       24 La Rue Marbeau       B2            1
6     28 Boulevard Suchet       B2            3
7          28 Blvd Suchet       B2            3

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1