Calculate cosine similarity and output without duplicates?
I have the following vectors in my toy example:
import pandas as pd

data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': [21, -0.1, 0.003, 4, 2.1]
})
I have calculated the similarity matrix (excluding the id column):
from sklearn.metrics.pairwise import cosine_similarity
# Calculate the pairwise cosine similarities
S = cosine_similarity(data.drop('id', axis=1))
# Wrap the NumPy array in a DataFrame directly
df = pd.DataFrame(S)
This returns a matrix/DataFrame with every combination, including self-similarities and duplicates. Is there an efficient way to calculate the similarities without self-similarities (a vector is 100% similar to itself) and without duplicates (if vectors 1 and 2 have 89% similarity, I don't need the similarity of vectors 2 and 1, as it's the same)?
Solution 1:[1]
I am struggling with the same problem, and the best solution I have found so far is to take the upper triangle above the diagonal (assuming numpy is imported as np):
[In] S[np.triu_indices_from(S, k=1)]
[Out] array([ 0.93420158, -0.93416293, 0.99856978, -0.81303909, -0.99999999,
0.91379242, -0.96724292, -0.91374841, 0.96727042, -0.78074903])
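What this does is take only the values above the main diagonal (k=1 shifts the starting point one step off the diagonal), so it excludes both the ones on the diagonal and the mirrored, repeating values. Because the matrix is symmetric, the lower triangle would hold the same values. The result is a 1-D numpy array ordered row by row, i.e. the pairs (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), and so on. However, I am still looking for a more efficient solution, since the full matrix has to be computed first, and with big data this extra step feels like it takes a lot of memory and computation. Did you find any other solution to this? Cheers!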
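To see which pair each value belongs to, the index arrays returned by np.triu_indices_from can be mapped back to the id column. The following is only a minimal sketch under a few assumptions: the column names id_1, id_2 and similarity are made up for illustration, and S is computed here with plain NumPy (row-normalise, then dot product) as a stand-in for sklearn's cosine_similarity so that the snippet only needs NumPy and pandas.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': [21, -0.1, 0.003, 4, 2.1]
})

# Feature matrix without the id column
X = data.drop('id', axis=1).to_numpy(dtype=float)

# Stand-in for sklearn's cosine_similarity (assumption: no all-zero rows):
# scale every row to unit length, then take all pairwise dot products.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
S = X_unit @ X_unit.T

# Row and column positions of the entries strictly above the diagonal
i, j = np.triu_indices_from(S, k=1)

# Long-format table: one row per unordered pair, no self-pairs.
# The column names are hypothetical, chosen only for this example.
ids = data['id'].to_numpy()
pairs = pd.DataFrame({
    'id_1': ids[i],
    'id_2': ids[j],
    'similarity': S[i, j],
})

print(pairs)
```

With 5 vectors this gives 5 * 4 / 2 = 10 rows, in the same order as the array shown above. It still builds the full matrix first, so it makes the output easier to read but does not solve the efficiency concern.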
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | tryingtoprogram |
