Calculating similarity using the Jaccard index in Python

I want to use Jaccard Index to find the similarity among elements of the dataframe (user_choices).

import scipy.spatial
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

user_choices = [[1, 0, 0, 1, 0, 1], 
                [0, 1, 0, 0, 0, 0], 
                [0, 0, 1, 0, 0, 0],
                [1, 0, 0, 1, 0, 1],
                [0, 0, 0, 0, 1, 0],
                [1, 0, 0, 1, 0, 1]]
df_choices = pd.DataFrame(user_choices, columns=["User A", "User B", "User C", "User D", "User E", "User F"], 
                          index=(["User A", "User B", "User C", "User D", "User E", "User F"]))

df_choices


I wrote this code to calculate a Jaccard Index for my data:

jaccard = (1-scipy.spatial.distance.cdist(df_choices, df_choices,  
                                       metric='jaccard'))
user_distance = pd.DataFrame(jaccard, columns=df_choices.index.values,  
                             index=df_choices.index.values)

user_distance

But the output is identical to my input data!




Solution 1:[1]

If I understand correctly, you want user_distance[i, j] = jaccard-distance(df_choices[i], df_choices[j]).

You can get this in two steps: (1) compute the pairwise distances with pdist, which returns one distance per unordered pair of rows (i < j) in condensed form; (2) expand that condensed distance matrix into its square form with squareform.

jaccard = scipy.spatial.distance.pdist(df_choices, 'jaccard')
user_distances = pd.DataFrame(1-scipy.spatial.distance.squareform(jaccard), 
                              columns=df_choices.index.values,  
                              index=df_choices.index.values)
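
To make the two steps concrete, here is a minimal sketch that prints the shape of the condensed and square forms and checks them against the cdist call from the question. The names X, condensed, square and same_as_cdist are illustrative, and the cast to bool is an assumption made here so that the rows are treated as binary vectors.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist, pdist, squareform

user_choices = [[1, 0, 0, 1, 0, 1],
                [0, 1, 0, 0, 0, 0],
                [0, 0, 1, 0, 0, 0],
                [1, 0, 0, 1, 0, 1],
                [0, 0, 0, 0, 1, 0],
                [1, 0, 0, 1, 0, 1]]
users = ["User A", "User B", "User C", "User D", "User E", "User F"]
df_choices = pd.DataFrame(user_choices, columns=users, index=users)

# Assumption: cast to bool so the rows are handled as binary vectors.
X = df_choices.to_numpy().astype(bool)

# Step 1: condensed form, one distance per unordered pair of rows (i < j).
condensed = pdist(X, metric="jaccard")

# Step 2: full symmetric matrix with zeros on the diagonal.
square = squareform(condensed)

print(condensed.shape)  # n * (n - 1) / 2 entries for n rows
print(square.shape)     # n x n

# The square form should agree with the all-pairs cdist result.
same_as_cdist = np.allclose(square, cdist(X, X, metric="jaccard"))
print(same_as_cdist)
```

Both routes give the same distances, so switching from cdist to pdist plus squareform changes how the matrix is built but not its values.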

Your input matrix is symmetric, and a pairwise distance matrix is always symmetric.

Any two rows in your matrix are either identical or have no nonzero positions in common, so the output matrix contains only ones and zeros.
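
The reasoning above can be checked by hand with the set definition of the Jaccard index, |A ∩ B| / |A ∪ B|. The sketch below does not use SciPy; jaccard_similarity is a hypothetical helper written for this illustration, and returning 1.0 for two all-zero vectors is a convention chosen here, not something the question specifies.

```python
import numpy as np

user_choices = np.array([[1, 0, 0, 1, 0, 1],
                         [0, 1, 0, 0, 0, 0],
                         [0, 0, 1, 0, 0, 0],
                         [1, 0, 0, 1, 0, 1],
                         [0, 0, 0, 0, 1, 0],
                         [1, 0, 0, 1, 0, 1]])


def jaccard_similarity(u, v):
    """Return |u AND v| / |u OR v| for two binary vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    union = np.logical_or(u, v).sum()
    if union == 0:
        return 1.0  # convention chosen here for two all-zero vectors
    return np.logical_and(u, v).sum() / union


# Similarity of every row with every other row.
similarity = np.array([[jaccard_similarity(u, v) for v in user_choices]
                       for u in user_choices])

# Identical rows give 1, rows with no shared nonzero position give 0,
# so the similarity matrix coincides with the input for this data.
matches_input = np.array_equal(similarity, user_choices)
print(similarity)
print(matches_input)
```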

If you try the same code with the following example:

user_choices = [[1, 0, 0, 3, 0, 4], 
                [0, 1, 0, 0, 0, 0], 
                [0, 0, 1, 0, 0, 0],
                [1, 0, 0, 1, 0, 1],
                [0, 0, 0, 0, 1, 0],
                [1, 0, 0, 1, 0, 1]]

You will have output different from the input.
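
As a complementary sketch with strictly binary data, rows that overlap only partially produce fractional similarities. The three rows below are hypothetical and not taken from the question; they are chosen so that some pairs share exactly one of three nonzero positions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical binary choices: rows 0 and 1 share one position,
# rows 1 and 2 share one position, rows 0 and 2 share none.
choices = np.array([[1, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 0, 1, 1]], dtype=bool)

distance = squareform(pdist(choices, metric="jaccard"))
similarity = 1 - distance

print(similarity)
```

Here the similarity between rows 0 and 1 is one shared position out of three in the union, so the result is no longer a copy of the input.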

Solution 2:[2]

  • The Jaccard distance from, e.g., user F with row vector (1, 0, 0, 1, 0, 1) to user A is zero, so you compute 1 - scipy.spatial.distance.cdist(...) = 1.

  • The Jaccard distance from, e.g., user E with row vector (0, 0, 0, 0, 1, 0) to user A is one, so you compute 1 - 1 = 0.

>>> print(scipy.spatial.distance.jaccard(user_choices[0], user_choices[5]))
0.0
>>> print(scipy.spatial.distance.jaccard(user_choices[0], user_choices[4]))
1.0

You have perhaps accidentally arrived at an input that is identical to one minus its own Jaccard distance matrix, that is, to its own Jaccard similarity matrix.

Maybe you don't want that (1-...) there?
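
To see the difference between the two quantities side by side, the following sketch builds both the distance matrix (without the 1 - ...) and the similarity matrix as labelled DataFrames. The variable names user_distance and user_similarity are illustrative, and the cast to bool is an assumption made here so that the rows are treated as binary vectors.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

user_choices = [[1, 0, 0, 1, 0, 1],
                [0, 1, 0, 0, 0, 0],
                [0, 0, 1, 0, 0, 0],
                [1, 0, 0, 1, 0, 1],
                [0, 0, 0, 0, 1, 0],
                [1, 0, 0, 1, 0, 1]]
users = ["User A", "User B", "User C", "User D", "User E", "User F"]

# Assumption: treat the choices as binary vectors.
X = np.array(user_choices, dtype=bool)

# Jaccard distance: 0 for identical rows, 1 for rows with nothing in common.
user_distance = pd.DataFrame(cdist(X, X, metric="jaccard"),
                             index=users, columns=users)

# Jaccard similarity (index) is the complement of the distance.
user_similarity = 1 - user_distance

print(user_distance)
print(user_similarity)
```

For this particular data the similarity matrix reproduces the input, while the distance matrix is its complement, so which one you want depends on whether you are after similarity or distance.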

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2