python - finding cosine similarity between two groups of vectors in the most efficient way
I am trying to calculate the average cosine similarity between two groups of sentences. After converting the sentences to embeddings, I need to calculate the average similarities as efficiently as possible. Below is what I have tried, along with the time each approach takes. Is there any way to improve this calculation?
I also include the slower approaches to help you understand the case.
X_num is a pandas.Series of embedding vectors; each vector is typically 512, 768, 1024 or 2048 elements long.
class1_indexes: all the indexes of the instances that belong to class 1.
class2_indexes: all the indexes of the instances that belong to class 2.
I need to calculate the cosine similarity between each vector pair from class 1 and class 2, so the output should be a vector of length len(class1_indexes) * len(class2_indexes).
I edited the code to include a test case, and the run times for the approaches are ordered:
t1 > t2 > t3
The third approach is about 20x faster, but I'm looking for something much faster still.
Thanks in advance.
import numpy as np
import pandas as pd

sample_in_each_class = 1000
X_num = pd.Series([np.random.random(10) for i in range(2*sample_in_each_class)])
class1_indexes = list(range(sample_in_each_class))
class2_indexes = list(range(sample_in_each_class,2*sample_in_each_class))
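As an aside, the vectorized approaches below become much simpler if the Series of per-row vectors is stacked into dense 2D arrays once, up front. A minimal sketch under the setup above (np.stack turns the sequence of 1-D vectors into an (n, d) array):

```python
import numpy as np
import pandas as pd

sample_in_each_class = 1000
X_num = pd.Series([np.random.random(10) for i in range(2 * sample_in_each_class)])
class1_indexes = list(range(sample_in_each_class))
class2_indexes = list(range(sample_in_each_class, 2 * sample_in_each_class))

# Stack each group's vectors into a dense (n, d) ndarray once.
X1 = np.stack(X_num[class1_indexes].to_numpy())
X2 = np.stack(X_num[class2_indexes].to_numpy())
```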
approach 1
def cosine_similarity(vec1, vec2):
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    if norm1 == 0:
        norm1 += 0.00001
    if norm2 == 0:
        norm2 += 0.00001
    return np.dot(vec1, vec2) / (norm1 * norm2)

approach1 = []
for idx1 in class1_indexes:
    for idx2 in class2_indexes:
        approach1.append(cosine_similarity(X_num.loc[idx1], X_num.loc[idx2]))
approach 2
import itertools
vectors_product = itertools.product(X_num[class1_indexes], X_num[class2_indexes])
vectors_product = pd.Series(list(vectors_product))
approach2 = vectors_product.apply(lambda x: cosine_similarity(x[0], x[1]))
approach 3
vectors_product = itertools.product(X_num[class1_indexes], X_num[class2_indexes])
vectors_product = np.array(list(vectors_product))
first_part = vectors_product[:,0,:]
second_part = vectors_product[:,1,:]
numerator = np.sum(np.multiply(first_part, second_part), axis=1)
denominator = (np.multiply(np.linalg.norm(first_part, axis=1),
                           np.linalg.norm(second_part, axis=1)))
approach3 = numerator / denominator
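One way to sanity-check approach 3 against the naive nested loop is to run both on shrunken inputs and compare; this also confirms the pair ordering matches (class 1 outer, class 2 inner). A sketch with small sizes so the loop is cheap:

```python
import itertools
import numpy as np
import pandas as pd

# Small sizes so the naive loop is cheap to verify against.
n, d = 5, 4
rng = np.random.default_rng(1)
X_num = pd.Series([rng.random(d) for _ in range(2 * n)])
class1_indexes = list(range(n))
class2_indexes = list(range(n, 2 * n))

def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Naive nested loop (approach 1 ordering: class 1 outer, class 2 inner).
naive = [cosine_similarity(X_num[i], X_num[j])
         for i in class1_indexes for j in class2_indexes]

# Vectorized over all materialized pairs (approach 3).
pairs = np.array(list(itertools.product(X_num[class1_indexes], X_num[class2_indexes])))
first, second = pairs[:, 0, :], pairs[:, 1, :]
approach3 = (first * second).sum(axis=1) / (
    np.linalg.norm(first, axis=1) * np.linalg.norm(second, axis=1))
```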
Solution 1:[1]
I think the most efficient way is:
- Convert your data to two numpy ndarrays, one per group. I assume rows are embedding vectors and the arrays are called X1 and X2.
- Normalize the rows of each array: X = X / np.linalg.norm(X, axis=1, keepdims=True) (keepdims is needed so the (n, 1) norms broadcast across the rows).
- Then do np.dot(X1, X2.T).
- Then flatten the resulting matrix if you want.
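A minimal sketch of these steps, with random data standing in for the embeddings; keepdims=True is the detail that makes the row-wise normalization broadcast correctly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10

# Two dense (n, d) arrays, one per group (stand-ins for the embeddings).
X1 = rng.random((n, d))
X2 = rng.random((n, d))

# Normalize each row to unit length; keepdims=True keeps shape (n, 1)
# so the division broadcasts across rows.
X1n = X1 / np.linalg.norm(X1, axis=1, keepdims=True)
X2n = X2 / np.linalg.norm(X2, axis=1, keepdims=True)

# A single matrix product yields every pairwise cosine similarity.
sims = X1n @ X2n.T    # shape (n, n); sims[i, j] = cos(X1[i], X2[j])
flat = sims.ravel()   # length n * n, same ordering as the nested loop
```

Because the work collapses into one BLAS-backed matrix multiply, this avoids materializing the n * n * d array of pairs that approach 3 builds with itertools.product.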
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Viacheslav Prokopev |
