Create a fast custom similarity matrix in Python

I am trying to build a similarity matrix using a custom similarity function, but the code runs very slowly.

I have a dataframe which looks like this:

 col1    col2   col3
 'car'   'A'   'cat'
 'car'   'C'   'dog'
 'bike'  'A'   'cat'
 ...

I also have a series of weights, one per column, that sets how important each column is: [0.1, 0.5, 0.4].

I want to compute the similarity between rows and store it in a similarity matrix. Two rows are similar when they share the same values, with the weights making some columns count more than others.

My current similarity function takes two arrays as input and sums the weights of the positions where their elements are identical (weights is an array with the same length as x and y):

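import numpy as np

def custom_similarity(x, y, weights):
    # x and y are rows of df (pandas Series); (x == y) flags the matching positions
    similarity = np.dot((x == y).values * 1, weights)
    return similarity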
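
As a quick sanity check, the function can be applied to the three sample rows shown above. This is only a sketch: it assumes the quoted values in the table are plain strings and that weights is a NumPy array, and the DataFrame construction is added here for illustration.

```python
import numpy as np
import pandas as pd


def custom_similarity(x, y, weights):
    # Sum of the weights at the positions where the two rows agree.
    return np.dot((x == y).values * 1, weights)


# The three sample rows from the question (assumed to be plain strings).
df = pd.DataFrame({
    "col1": ["car", "car", "bike"],
    "col2": ["A", "C", "A"],
    "col3": ["cat", "dog", "cat"],
})
weights = np.array([0.1, 0.5, 0.4])

sim_01 = custom_similarity(df.iloc[0], df.iloc[1], weights)  # only col1 matches
sim_02 = custom_similarity(df.iloc[0], df.iloc[2], weights)  # col2 and col3 match
sim_00 = custom_similarity(df.iloc[0], df.iloc[0], weights)  # every column matches
print(sim_01, sim_02, sim_00)
```

With these weights, rows 0 and 1 score about 0.1, rows 0 and 2 score about 0.9, and a row compared with itself scores the sum of the weights, about 1.0.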

Given a dataframe where each row is one of the arrays to compare, I would like to generate the similarity matrix of the dataframe using this function.

At the moment I am filling an empty matrix like this, which works but is very slow:

from tqdm import tqdm

sim_matrix = np.zeros((len(df), len(df)))

for i in tqdm(range(len(df))):
    obs_i = df.iloc[i, :]
    # only the upper triangle is computed; the matrix is symmetric
    for j in range(i, len(df)):
        obs_j = df.iloc[j, :]
        sim_matrix[i, j] = sim_matrix[j, i] = custom_similarity(obs_i, obs_j, weights)

How can I make this more efficient?



Solution 1:[1]

One way is to use scipy.spatial. That is already a lot more efficient than what you have rolled yourself. In particular, you could do the following, using pdist and a custom metric function:

import numpy as np
from scipy.spatial.distance import pdist, squareform


def sim_mat(df, weights):
    mat = squareform(pdist(df.values, metric=lambda x, y: (x == y) @ weights))
    np.fill_diagonal(mat, sum(weights))

    return mat
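
A minimal usage sketch on the three sample rows from the question, assuming string values and a NumPy weights array (the DataFrame construction is added here for illustration). Note that squareform puts zeros on the diagonal, which is why sim_mat fills it with sum(weights), the score of a row compared with itself.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform


def sim_mat(df, weights):
    # pdist calls the metric once per unordered pair of rows;
    # squareform expands the condensed result into a symmetric matrix.
    mat = squareform(pdist(df.values, metric=lambda x, y: (x == y) @ weights))
    # squareform leaves zeros on the diagonal, so set the self-similarity.
    np.fill_diagonal(mat, sum(weights))
    return mat


# The three sample rows from the question (assumed to be plain strings).
df = pd.DataFrame({
    "col1": ["car", "car", "bike"],
    "col2": ["A", "C", "A"],
    "col3": ["cat", "dog", "cat"],
})
weights = np.array([0.1, 0.5, 0.4])

result = sim_mat(df, weights)
print(result)
```

For this input the off-diagonal entries should come out at roughly 0.1 for rows 0 and 1, 0.9 for rows 0 and 2, and 0.0 for rows 1 and 2, with about 1.0 on the diagonal.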

Comparing this approach to your original method on datasets of increasing size, I obtain the following results:

[Figure: comparison of the two approaches on datasets of increasing size]
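
A comparison of this kind could be rerun with a small timing harness along the following lines. This is a sketch under assumptions: make_df is a hypothetical helper that generates random categorical data shaped like the question's table, loop_sim_mat wraps the question's nested loop (without tqdm), and the dataset sizes are arbitrary. No particular timings are claimed here; they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform


def custom_similarity(x, y, weights):
    return np.dot((x == y).values * 1, weights)


def loop_sim_mat(df, weights):
    # The question's approach: fill the matrix pair by pair.
    n = len(df)
    sim_matrix = np.zeros((n, n))
    for i in range(n):
        obs_i = df.iloc[i, :]
        for j in range(i, n):
            obs_j = df.iloc[j, :]
            sim_matrix[i, j] = sim_matrix[j, i] = custom_similarity(obs_i, obs_j, weights)
    return sim_matrix


def sim_mat(df, weights):
    # The answer's approach: pdist with a custom metric.
    mat = squareform(pdist(df.values, metric=lambda x, y: (x == y) @ weights))
    np.fill_diagonal(mat, sum(weights))
    return mat


def make_df(n_rows, seed=0):
    # Hypothetical helper: random categorical data for benchmarking only.
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "col1": rng.choice(["car", "bike", "bus"], size=n_rows),
        "col2": rng.choice(["A", "B", "C"], size=n_rows),
        "col3": rng.choice(["cat", "dog"], size=n_rows),
    })


weights = np.array([0.1, 0.5, 0.4])

for n_rows in (20, 40, 80):
    df = make_df(n_rows)
    # Check that both methods agree before timing them.
    assert np.allclose(loop_sim_mat(df, weights), sim_mat(df, weights))
    t_loop = timeit.timeit(lambda: loop_sim_mat(df, weights), number=1)
    t_pdist = timeit.timeit(lambda: sim_mat(df, weights), number=1)
    print(f"n={n_rows}: loop {t_loop:.3f}s, pdist {t_pdist:.3f}s")
```

The equality check matters as much as the timings: it confirms that the pdist version, including the filled diagonal, reproduces the matrix from the original loop.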

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1