'Fuzzy matching for groups in pandas
I have the following dataset:
import pandas as pd
from fuzzywuzzy import fuzz
adf = pd.DataFrame({'my_addresses':['234 Rue Morgue','234 R. Morgue','234 La rue morgue, 234','312 La rue moulin','24 Rue Marbeau','24 La Rue Marbeau', '28 Boulevard Suchet','28 Blvd Suchet'],'my_group':['A1','A1','A1','A1','B2','B2','B2','B2']})
adf
my_addresses my_group
0 234 Rue Morgue A1
1 234 R. Morgue A1
2 234 La rue morgue, 234 A1
3 312 La rue moulin A1
4 24 Rue Marbeau B2
5 24 La Rue Marbeau B2
6 28 Boulevard Suchet B2
7 28 Blvd Suchet B2
And I want to apply a custom function. This function creates subgroups given a pandas object, using fuzzy match.
def match_groups(grp_obj, threshold = 67):
groups = []
for i in range(len(grp_obj.to_list())):
groups.append(i+1)
for i, val_i in enumerate(grp_obj.to_list()):
for j, val_j in enumerate(grp_obj.to_list()):
if j>i:
ratio = fuzz.ratio(val_i, val_j)
if ratio >= threshold:
groups[j] = groups[i]
return groups
For example, let's calculate subgroups for the group B2:
match_groups(adf[adf['my_group']=='B2'].my_addresses, 67)
It returns this list:
[1, 1, 3, 3]
Ok. Now I want to apply this function to each group. This is the expected result:
my_addresses my_group my_subgroup
0 234 Rue Morgue A1 1
1 234 R. Morgue A1 1
2 234 La rue morgue, 234 A1 1
3 312 La rue moulin A1 4
4 24 Rue Marbeau B2 1
5 24 La Rue Marbeau B2 1
6 28 Boulevard Suchet B2 3
7 28 Blvd Suchet B2 3
So I tried this:
adf['my_subgroup'] = adf.groupby('my_group')['my_addresses'].apply(match_groups)
But it returns me:
my_addresses my_group my_subgroup
0 234 Rue Morgue A1 NaN
1 234 R. Morgue A1 NaN
2 234 La rue morgue, 234 A1 NaN
3 312 La rue moulin A1 NaN
4 24 Rue Marbeau B2 NaN
5 24 La Rue Marbeau B2 NaN
6 28 Boulevard Suchet B2 NaN
7 28 Blvd Suchet B2 NaN
I think my problem is that the function returns a list, but I'm not sure how to pass this to the data frame as the column required. Please, any help will be appreciated.
Solution 1:[1]
One way might be to create a parallel DataFrame, then join. Here are a couple of variations on that approach. There may well be a better way.
Here's a slightly modified match_groups function, so that it takes a Series rather than a DataFrame:
def match_groups(addresses, threshold):
subgroups = [i for i in range(1, len(addresses)+1)]
for i, val_i in enumerate(addresses):
for j, val_j in enumerate(addresses):
if j>i:
ratio = fuzz.ratio(val_i, val_j)
if ratio >= threshold:
subgroups[j] = subgroups[i]
return subgroups
This still returns a list.
By the way, computing subgroups this way can give different results depending on the order it sees the input, so be careful.
Using concat and join
We make one DataFrame for each group, computing our subgroups, and preserving the index values from the original DataFrame, adf, then paste them back together with concat:
df_subgroups = pd.concat(
pd.DataFrame({
"my_subgroup": match_groups(
adf.my_addresses[adf.my_group==grp],
threshold=67)},
index=adf.index[adf.my_group==grp])
for grp in adf.my_group.unique()
)
The result looks like this:
my_subgroup
0 1
1 1
2 1
3 4
4 1
5 1
6 3
7 3
Now, join on the shared index:
adf.join(df_subgroups)
my_addresses my_group my_subgroup
0 234 Rue Morgue A1 1
1 234 R. Morgue A1 1
2 234 La rue morgue, 234 A1 1
3 312 La rue moulin A1 4
4 24 Rue Marbeau B2 1
5 24 La Rue Marbeau B2 1
6 28 Boulevard Suchet B2 3
7 28 Blvd Suchet B2 3
Using groupby, transform, and join
We can do a little better with groupby and transform. Transform takes care of the concat-ing and and index preserving step above.
df_subgroups = adf.groupby('my_group').transform(match_groups, threshold=67)
df_subgroups.columns = ["my_subgroup"]
We get the same result as before - a DataFrame with the original index preserved so we can join. Annoyingly, it names the new column "my_addresses", so we give it the name we want.
my_subgroup
0 1
1 1
2 1
3 4
4 1
5 1
6 3
7 3
Then, apply the same join as before:
adf.join(df_subgroups)
my_addresses my_group my_subgroup
0 234 Rue Morgue A1 1
1 234 R. Morgue A1 1
2 234 La rue morgue, 234 A1 1
3 312 La rue moulin A1 4
4 24 Rue Marbeau B2 1
5 24 La Rue Marbeau B2 1
6 28 Boulevard Suchet B2 3
7 28 Blvd Suchet B2 3
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
