Calculate cosine similarity and output without duplicates?

I have the following vectors in my toy example:

import pandas as pd

data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': [21, -0.1, 0.003, 4, 2.1]
})

I have calculated the similarity matrix (excluding the id column):

from sklearn.metrics.pairwise import cosine_similarity

# Calculate the pairwise cosine similarities 
S = cosine_similarity(data.drop('id', axis=1))

# Wrap the similarity matrix in a DataFrame
df = pd.DataFrame(S)

This returns a matrix/DataFrame with every combination, including self-similarities and duplicates. Is there an efficient way to calculate the similarities without self-similarities (a vector is 100% similar to itself) and without duplicates (if vectors 1 and 2 have 89% similarity, I don't need the similarity of vectors 2 and 1, since it is the same)?

Solution 1:^[1]

I am struggling with the same problem, and the best solution I have found so far is to take the upper triangle above the main diagonal:

[In] S[np.triu_indices_from(S, k=1)]

[Out] array([ 0.93420158, -0.93416293,  0.99856978, -0.81303909, -0.99999999,
    0.91379242, -0.96724292, -0.91374841,  0.96727042, -0.78074903])

What this does is take only the values above the main diagonal (k=1), so it excludes the ones on the diagonal and the repeated values. The result is a NumPy array, too. However, I am still looking for a more efficient solution, since with big data this feels like an extra step that uses a lot of resources... Did you find any other solution to this? Cheers!
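To see which pair each value belongs to, the indices returned by np.triu_indices_from can be mapped back to the id column. The following is only a minimal sketch using the toy data from the question. It assumes the similarity matrix is built with plain NumPy (the dot product of L2-normalised rows, which should give the same matrix as cosine_similarity), and the names ids, X and pairs are made up for illustration:

```python
import numpy as np

# Toy data from the question: the ids and the (a, b) columns.
ids = np.array([1, 2, 3, 4, 5])
X = np.array([
    [55, 21],
    [2123, -0.1],
    [-19.3, 0.003],
    [9, 4],
    [-8, 2.1],
])

# Cosine similarity as the dot product of L2-normalised rows
# (assumption: used here instead of sklearn's cosine_similarity).
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
S = Xn @ Xn.T

# Row and column indices of the entries strictly above the main diagonal.
i, j = np.triu_indices_from(S, k=1)

# One (id_a, id_b, similarity) record per unordered pair, no self-pairs.
pairs = [(int(ids[a]), int(ids[b]), float(S[a, b])) for a, b in zip(i, j)]

for id_a, id_b, sim in pairs:
    print(id_a, id_b, round(sim, 4))
```

The records come out in the same order as the array above, so the first one is the pair (1, 2). Note that this still computes the full matrix first, so it only labels the result and does not remove the extra step.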

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	tryingtoprogram
