Create non-duplicate cross product with custom column name
Consider this dataframe:
  Animal       Name  Months
0    dog      Rosie      45
1    cat     Archie      12
2  snake  Archibald      14
Using this code:
from itertools import product
new_df = pd.DataFrame(product(df1['Name'], df1['Name']), columns=["id1","id2"])
new_df["LevScore"] = new_df.apply(lambda x: levenshtein(x["id1"], x["id2"]), axis=1)
I'm able to create this table:
         id1        id2  LevScore
0      Rosie      Rosie         0
1      Rosie     Archie         4
2      Rosie  Archibald         8
3     Archie      Rosie         4
4     Archie     Archie         0
5     Archie  Archibald         4
6  Archibald      Rosie         8
7  Archibald     Archie         4
8  Archibald  Archibald         0
This table contains duplicate pairs (e.g. Rosie-Archie and Archie-Rosie).
I can remove the duplicate pairs using:
new_df = pd.DataFrame(combinations_with_replacement(df1['Name'], 2), columns=["id1","id2"])
new_df["LevScore"] = new_df.apply(lambda x: levenshtein(x["id1"], x["id2"]), axis=1)
and I get this:
         id1        id2  LevScore
0      Rosie      Rosie         0
1      Rosie     Archie         4
2      Rosie  Archibald         8
4     Archie     Archie         0
5     Archie  Archibald         4
8  Archibald  Archibald         0
Next, instead of writing the full name of each animal in id1 and id2, I would like to store just the row index. Rosie's row is at index 0, Archie's at index 1, and Archibald's at index 2. Thus, I would like to create the following table:
   id1  id2  LevScore
0    0    0         0
1    0    1         4
2    0    2         8
4    1    1         0
5    1    2         4
8    2    2         0
Now, I would like to write to multiple columns. Suppose the levenshtein function returns two values (distance and similarity score). I would like to write these two values to two separate columns, like this:
   id1  id2  LevScore  Similarity
0    0    0         0        1.00
1    0    1         4        0.45
2    0    2         8        0.32
4    1    1         0        1.00
5    1    2         4        0.89
8    2    2         0        1.00
Next, I would like a row to be written only when a certain condition holds, e.g. 'LevScore' < 6, in which case I would get the following dataframe:
   id1  id2  LevScore  Similarity
0    0    0         0        1.00
1    0    1         4        0.45
4    1    1         0        1.00
5    1    2         4        0.89
8    2    2         0        1.00
Lastly, I would like to exclude self-comparisons (a value in the 'Name' column compared with itself), because that Levenshtein distance is always 0, which is useless information. I would then be left with only:
   id1  id2  LevScore  Similarity
1    0    1         4        0.45
5    1    2         4        0.89
Is there any way to combine all of these into one efficient algorithm? I have a large dataset, so efficiency is crucial. Thank you in advance!
Solution 1:[1]
Using your approach, you can directly use the index in combinations and reference the first dataframe in your apply:
import pandas as pd
from itertools import combinations_with_replacement
from Levenshtein import distance as levenshtein

# Pair up index labels instead of names, then look the names up in df1
(pd.DataFrame(combinations_with_replacement(df1.index, 2),
              columns=["id1", "id2"])
   .assign(distance=lambda d: d.apply(
       lambda r: levenshtein(df1.loc[r['id1'], 'Name'],
                             df1.loc[r['id2'], 'Name']),
       axis=1))
)
output:
   id1  id2  distance
0    0    0         0
1    0    1         4
2    0    2         8
3    1    1         0
4    1    2         4
5    2    2         0
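The remaining requirements from the question (no self-pairs, two result columns, a threshold) could be folded into the same idea. The following is only a sketch: it swaps combinations_with_replacement for itertools.combinations, which never pairs an item with itself, and builds the rows in a plain loop instead of apply. The helper lev_with_similarity is hypothetical, a pure-Python stand-in for whatever function returns both values, and its similarity formula (1 - distance / longer length) is an assumption, so the similarity values will not match the illustrative ones in the question.

```python
from itertools import combinations

import pandas as pd


def lev_with_similarity(a, b):
    """Hypothetical helper: return (edit distance, similarity in [0, 1])."""
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    dist = prev[-1]
    longest = max(len(a), len(b)) or 1  # avoid dividing by zero for two empty strings
    return dist, 1 - dist / longest     # assumed similarity definition


df1 = pd.DataFrame({"Animal": ["dog", "cat", "snake"],
                    "Name": ["Rosie", "Archie", "Archibald"],
                    "Months": [45, 12, 14]})

names = df1["Name"].to_dict()  # index label -> name, cheaper than df1.loc in a loop
max_distance = 6               # threshold from the question (LevScore < 6)

rows = []
for i, j in combinations(df1.index, 2):  # each unordered pair once, never i == j
    dist, sim = lev_with_similarity(names[i], names[j])
    if dist < max_distance:              # keep only pairs under the threshold
        rows.append((i, j, dist, sim))

new_df = pd.DataFrame(rows, columns=["id1", "id2", "LevScore", "Similarity"])
print(new_df)
```

For the sample data this keeps the pairs (0, 1) and (1, 2), each with a distance of 4, and drops (0, 2) because its distance of 8 is not below the threshold. The number of pairs still grows quadratically with the number of rows, so for a large dataset a compiled distance function (such as the Levenshtein package used above) would likely matter more than the pandas plumbing.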
Older answer
You could apply a cross merge and filter out the self-pairs before computing anything:
new_df = (
    df1.rename(columns={'Name': 'id1'})
       .merge(df1['Name'].rename('id2'), how='cross')
       .loc[lambda d: d['id1'].ne(d['id2'])]
)
output:
  Animal        id1  Months        id2
1    dog      Rosie      45     Archie
2    dog      Rosie      45  Archibald
3    cat     Archie      12      Rosie
5    cat     Archie      12  Archibald
6  snake  Archibald      14      Rosie
7  snake  Archibald      14     Archie
Then you can apply your levenshtein function.
For example, here is how to calculate the distance (you can adapt this to any metric you want):
from Levenshtein import distance as levenshtein
new_df = (
    df1.rename(columns={'Name': 'id1'})
       .merge(df1['Name'].rename('id2'), how='cross')
       .loc[lambda d: d['id1'].ne(d['id2'])]
       .assign(distance=lambda d: d.apply(lambda r: levenshtein(r['id1'], r['id2']), axis=1))
)
output:
  Animal        id1  Months        id2  distance
1    dog      Rosie      45     Archie         4
2    dog      Rosie      45  Archibald         8
3    cat     Archie      12      Rosie         4
5    cat     Archie      12  Archibald         4
6  snake  Archibald      14      Rosie         8
7  snake  Archibald      14     Archie         4
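Note that the cross merge still contains mirrored pairs (Rosie/Archie and Archie/Rosie). If only one of each pair is wanted, with row indices in id1 and id2 as the question asks, one possible sketch is to carry the index through the merge and keep the rows where id1 < id2. The column names name1 and name2 are illustrative only, and the sketch assumes the index is unique and orderable and that pandas is recent enough to support how='cross'.

```python
import pandas as pd

df1 = pd.DataFrame({"Animal": ["dog", "cat", "snake"],
                    "Name": ["Rosie", "Archie", "Archibald"],
                    "Months": [45, 12, 14]})

# Keep only the index and the name so the cross product stays as small as possible.
left = df1["Name"].rename_axis("id1").reset_index(name="name1")
right = df1["Name"].rename_axis("id2").reset_index(name="name2")

pairs = (
    left.merge(right, how="cross")
        .loc[lambda d: d["id1"] < d["id2"]]  # drops self-pairs and mirrored duplicates
        .reset_index(drop=True)
)
print(pairs)
```

The distance can then be assigned on pairs in the same way as above, using name1 and name2. The cross merge materializes all n * n rows before the filter runs, so for a large dataset generating the pairs with itertools.combinations is likely to use less memory.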
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
