python - Finding cosine similarity between two groups of vectors in the most efficient way

I am trying to calculate the average cosine similarity between two groups of sentences. After converting the sentences to embeddings, I need to compute the average similarities as efficiently as possible. Below is what I have tried, along with the time each approach takes. Is there any way to speed up this calculation?

I also include the slower approaches to help illustrate the problem.

X_num is a pandas.Series of embedding vectors; each vector is typically 512, 768, 1024, or 2048 elements long.

class1_indexes: the indexes of all instances belonging to class 1.
class2_indexes: the indexes of all instances belonging to class 2.

I need to calculate the cosine similarity between every vector pair drawn from class 1 and class 2, so the output should be a similarity vector of length len(class1_indexes) * len(class2_indexes).
I edited the code to include a test case, and you can see that the run times of the approaches rank as:

t1 > t2 > t3

The third approach is about 20x faster, but I'm looking for something much faster still.

Thanks in advance.

import numpy as np
import pandas as pd

sample_in_each_class = 1000
X_num = pd.Series([np.random.random(10) for i in range(2*sample_in_each_class)])
class1_indexes = list(range(sample_in_each_class))
class2_indexes = list(range(sample_in_each_class, 2*sample_in_each_class))

approach 1

def cosine_similarity(vec1, vec2):
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    # guard against division by zero for all-zero vectors
    if norm1 == 0:
        norm1 += 0.00001
    if norm2 == 0:
        norm2 += 0.00001
    return np.dot(vec1, vec2) / (norm1 * norm2)
    
approach1 = []
for idx1 in class1_indexes:
    for idx2 in class2_indexes:
        approach1.append(cosine_similarity(X_num.loc[idx1], X_num.loc[idx2]))

approach 2

import itertools
vectors_product = itertools.product(X_num[class1_indexes], X_num[class2_indexes])
vectors_product = pd.Series(list(vectors_product))
approach2 = vectors_product.apply(lambda x: cosine_similarity(x[0], x[1]))

approach 3

vectors_product = itertools.product(X_num[class1_indexes], X_num[class2_indexes])
vectors_product = np.array(list(vectors_product))
first_part = vectors_product[:,0,:]
second_part = vectors_product[:,1,:]

numerator = np.sum(np.multiply(first_part, second_part), axis=1)
denominator = (np.multiply(np.linalg.norm(first_part, axis=1), 
                           np.linalg.norm(second_part, axis=1)))
approach3 = numerator / denominator
                  


Solution 1

I think the most efficient way is:

  1. Convert your data to two numpy ndarrays, one array per group. I assume rows are embedding vectors and the arrays are called X1 and X2.
  2. Do X / np.linalg.norm(X, axis=1, keepdims=True) on each array (without keepdims the (n,) norm vector broadcasts along the wrong axis).
  3. Then do np.dot(X1, X2.T).
  4. Then flatten the resulting matrix if you want a 1-D output.
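The steps above can be sketched as follows, reusing the names from the question's setup (np.stack to build the arrays is my assumption; any way of producing two 2-D ndarrays works):

```python
import numpy as np
import pandas as pd

# Same setup as in the question.
sample_in_each_class = 1000
X_num = pd.Series([np.random.random(10) for i in range(2 * sample_in_each_class)])
class1_indexes = list(range(sample_in_each_class))
class2_indexes = list(range(sample_in_each_class, 2 * sample_in_each_class))

# 1. Stack each group's vectors into a 2-D ndarray (rows = embeddings).
X1 = np.stack(X_num[class1_indexes].to_numpy())
X2 = np.stack(X_num[class2_indexes].to_numpy())

# 2. L2-normalize every row; keepdims=True keeps the broadcast row-wise.
X1 = X1 / np.linalg.norm(X1, axis=1, keepdims=True)
X2 = X2 / np.linalg.norm(X2, axis=1, keepdims=True)

# 3. One matrix product yields every pairwise cosine similarity at once.
sims = np.dot(X1, X2.T)

# 4. Flatten to the len(class1_indexes) * len(class2_indexes) vector.
approach4 = sims.ravel()
```

The single BLAS-backed matrix multiply replaces the million-element Python-level product, which is where the speedup over approach 3 comes from.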

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Viacheslav Prokopev