Create non-duplicate cross product with custom column name

Consider this dataframe:

  Animal       Name Months
0    dog      Rosie     45
1    cat     Archie     12
2  snake  Archibald     14

Using this code

new_df = pd.DataFrame(product(df1['Name'], df1['Name']), columns=["id1","id2"])
new_df["LevScore"] = new_df.apply(lambda x: levenshtein(x["id1"], x["id2"]), axis=1)

I'm able to create this table:

         id1        id2  LevScore
0      Rosie      Rosie         0
1      Rosie     Archie         4
2      Rosie  Archibald         8
3     Archie      Rosie         4
4     Archie     Archie         0
5     Archie  Archibald         4
6  Archibald      Rosie         8
7  Archibald     Archie         4
8  Archibald  Archibald         0

This table contains mirrored duplicate pairs (e.g. Rosie-Archie and Archie-Rosie).

I'm able to remove the duplicate pairs using

new_df = pd.DataFrame(combinations_with_replacement(df1['Name'], 2), columns=["id1","id2"])
new_df["LevScore"] = new_df.apply(lambda x: levenshtein(x["id1"], x["id2"]), axis=1)

and I get this:

         id1        id2  LevScore
0      Rosie      Rosie         0
1      Rosie     Archie         4
2      Rosie  Archibald         8
4     Archie     Archie         0
5     Archie  Archibald         4
8  Archibald  Archibald         0

Next, instead of writing the full name of each animal in id1 and id2, I would like to just have their row index. Rosie's row is on index 0, Archie's is on index 1, and Archibald's is on index 2. Thus, I would like to create the following table

    id1  id2  LevScore
0   0    0    0
1   0    1    4
2   0    2    8
4   1    1    0
5   1    2    4
8   2    2    0

Now, I would like to write to multiple columns. Say that the levenshtein function returns two values (a distance and a similarity score); I would like to write these two values to two separate columns, like this:

    id1  id2  LevScore  Similarity
0   0    0    0         1.00
1   0    1    4         0.45
2   0    2    8         0.32
4   1    1    0         1.00
5   1    2    4         0.89
8   2    2    0         1.00

Next, I would like a row to be written to this dataframe only when a condition holds, for example 'LevScore' < 6, in which case I would get the following dataframe:

    id1  id2  LevScore  Similarity
0   0    0    0         1.00
1   0    1    4         0.45
4   1    1    0         1.00
5   1    2    4         0.89
8   2    2    0         1.00

Lastly, I would like to avoid comparing a text (in this case, a value in the 'Name' column) with itself, because the Levenshtein distance will always be 0, which is useless information. I would then be left with only:

    id1  id2  LevScore  Similarity
1   0    1    4         0.45
5   1    2    4         0.89

Is there any way to combine all of these into one efficient algorithm? I have a large dataset, where efficiency is really crucial. Thank you in advance!



Solution 1:[1]

Using your approach, you can directly use the index in combinations and reference the first dataframe in your apply:

from itertools import combinations_with_replacement

import pandas as pd
from Levenshtein import distance as levenshtein

(pd.DataFrame(combinations_with_replacement(df1.index, 2),
             columns=["id1","id2"])
   .assign(distance=lambda d:d.apply(lambda r:
                                     levenshtein(df1.loc[r['id1'], 'Name'],
                                                 df1.loc[r['id2'], 'Name']),
                                     axis=1))
)

output:

   id1  id2  distance
0    0    0         0
1    0    1         4
2    0    2         8
3    1    1         0
4    1    2         4
5    2    2         0
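
The remaining requirements from the question (two metrics, a threshold, no self-pairs) could be combined along the following lines. This is only a sketch in plain Python so that it runs without extra packages: the `levenshtein` helper is a stand-in for `Levenshtein.distance`, `close_pairs` is a hypothetical name, and the similarity formula (1 minus the distance divided by the longer length) is an assumption, since the question does not define it. `combinations` (rather than `combinations_with_replacement`) never pairs a row with itself, and `enumerate` supplies positions that match the default 0, 1, 2 index of df1.

```python
from itertools import combinations


def levenshtein(a, b):
    """Plain-Python edit distance (stand-in for Levenshtein.distance)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def close_pairs(names, max_distance):
    """Yield (id1, id2, distance, similarity) for pairs with distance < max_distance."""
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        # The length difference is a lower bound on the edit distance,
        # so these pairs can be skipped without computing it.
        if abs(len(a) - len(b)) >= max_distance:
            continue
        distance = levenshtein(a, b)
        if distance < max_distance:
            # Assumed similarity definition: 1 - distance / longer length.
            longest = max(len(a), len(b)) or 1  # guard against two empty strings
            yield i, j, distance, 1 - distance / longest


names = ["Rosie", "Archie", "Archibald"]  # stands in for df1["Name"]
rows = list(close_pairs(names, max_distance=6))
```

For the sample names this keeps the pairs (0, 1) and (1, 2), each with distance 4. Passing the result to `pd.DataFrame(rows, columns=["id1", "id2", "LevScore", "Similarity"])` should give a table in the shape the question asks for. On a large dataset, swapping the helper for the C-backed `Levenshtein.distance` would likely be much faster.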

Older answer

You could apply a cross merge and filter your new dataframe in the beginning:

new_df = (
 df1.rename(columns={'Name': 'id1'})
    .merge(df1['Name'].rename('id2'), how='cross')
    .loc[lambda d: d['id1'].ne(d['id2'])]
)

output:

  Animal        id1  Months        id2
1    dog      Rosie      45     Archie
2    dog      Rosie      45  Archibald
3    cat     Archie      12      Rosie
5    cat     Archie      12  Archibald
6  snake  Archibald      14      Rosie
7  snake  Archibald      14     Archie

Then you can apply your levenshtein function.

For example here calculating the distance (but you can adapt to any metric you want):

from Levenshtein import distance as levenshtein
new_df = (
 df1.rename(columns={'Name': 'id1'})
    .merge(df1['Name'].rename('id2'), how='cross')
    .loc[lambda d: d['id1'].ne(d['id2'])]
    .assign(distance=lambda d: d.apply(lambda r: levenshtein(r['id1'], r['id2']), axis=1))
)

output:

  Animal        id1  Months        id2  distance
1    dog      Rosie      45     Archie         4
2    dog      Rosie      45  Archibald         8
3    cat     Archie      12      Rosie         4
5    cat     Archie      12  Archibald         4
6  snake  Archibald      14      Rosie         8
7  snake  Archibald      14     Archie         4
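
Note that this output still holds every pair twice (Rosie-Archie and Archie-Rosie), and since the Levenshtein distance is symmetric, half of those computations are redundant. A minimal sketch of the idea in plain Python, assuming the rows are identified by their positions 0, 1, 2: comparing ids with `<` instead of "not equal" keeps one ordering per pair and still excludes self-pairs.

```python
from itertools import combinations, product

ids = range(3)  # positions of Rosie, Archie, Archibald in df1

# What a cross join followed by an "id1 != id2" filter keeps: both orderings.
cross_without_self = [(i, j) for i, j in product(ids, repeat=2) if i != j]

# Keeping only id1 < id2 drops the mirrored half of the pairs.
one_direction = [(i, j) for i, j in cross_without_self if i < j]

print(len(cross_without_self), len(one_direction))
```

The filtered list is the same set of pairs that `itertools.combinations(ids, 2)` produces directly, which avoids materialising the full cross product in the first place.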

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1