R: Correct Way to Calculate Cosine Similarity?

I am working with the R programming language.

I have the following data:

text = structure(list(id = 1:8, reviews = c("I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.", 
"I went to McDonald's and they charge me 50 for Big Mac when I only came with 49. The casher told me that I can't read correctly and told me to get glasses. I am file a report on your casher and now I'm mad.", 
"I really think that if you can buy breakfast anytime then I should be able to get a cheeseburger anytime especially since I really don't care for breakfast food. I really like McDonald's food but I preferred tree lunch rather than breakfast. Thank you thank you thank you.", 
"I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.", 
"Never order McDonald's from Uber or Skip or any delivery service for that matter, most particularly one on Elgin Street and Rideau Street, they never get the order right. Workers at either of these locations don't know how to follow simple instructions. Don't waste your money at these two locations.", 
"Employees left me out in the snow and wouldn’t answer the drive through. They locked the doors and it was freezing. I asked the employee a simple question and they were so stupid they answered a completely different question. Dumb employees and bad food.", 
"McDonalds food was always so good but ever since they add new/more crispy chicken sandwiches it has come out bad. At first I thought oh they must haven't had a good day but every time I go there now it's always soggy, and has no flavor. They need to fix this!!!", 
"I just ordered the new crispy chicken sandwich and I'm very disappointed. Not only did it taste horrible, but it was more bun than chicken. Not at all like the commercial shows. I hate sweet pickles and there were two slices on my sandwich. I wish I could add a photo to show the huge bun and tiny chicken."
)), class = "data.frame", row.names = c(NA, -8L))

I would like to calculate a matrix of cosine similarities between each pair of elements:

library(lsa)
library(proxy)
library(tm)

text = text[,2]  # keep only the reviews column (a character vector)

corpus <- VCorpus(VectorSource(text))
tdm <- TermDocumentMatrix(corpus, 
    control = list(wordLengths = c(1, Inf)))
# fraction of documents in which each term occurs
occurrence <- apply(X = tdm, 
    MARGIN = 1, 
    FUN = function(x) sum(x > 0) / ncol(tdm))

# keep only terms that appear in at least half of the documents
tdm_mat <- as.matrix(tdm[names(occurrence)[occurrence >= 0.5], ])

lsaSpace <- lsa(tdm_mat)

# lsaMatrix is a k x (number of docs) matrix in k-dimensional LSA space
lsaMatrix <- diag(lsaSpace$sk) %*% t(lsaSpace$dk)

# Use the `cosine` function in `lsa` package to get cosine similarities matrix
# (subtract from 1 to get dissimilarity matrix)
distMatrix <- 1 - cosine(lsaMatrix)

When looking at the resulting matrix:

 distMatrix
             1           2         3            4          5          6            7           8
1 0.000000e+00 0.006362649 0.2616818 0.000000e+00 0.06794855 0.25138506 3.107289e-05 0.003658840
2 6.362649e-03 0.000000000 0.1904180 6.362649e-03 0.11468650 0.33082042 5.505664e-03 0.019623883
3 2.616818e-01 0.190417963 0.0000000 2.616818e-01 0.55622109 0.89444938 2.563879e-01 0.322025370
4 0.000000e+00 0.006362649 0.2616818 0.000000e+00 0.06794855 0.25138506 3.107289e-05 0.003658840
5 6.794855e-02 0.114686503 0.5562211 6.794855e-02 0.00000000 0.06202843 7.083380e-02 0.040392530
6 2.513851e-01 0.330820421 0.8944494 2.513851e-01 0.06202843 0.00000000 2.566349e-01 0.197460291
7 3.107289e-05 0.005505664 0.2563879 3.107289e-05 0.07083380 0.25663492 0.000000e+00 0.004363538
8 3.658840e-03 0.019623883 0.3220254 3.658840e-03 0.04039253 0.19746029 4.363538e-03 0.000000000

My Question: Have I calculated the cosine similarity correctly? Is there another way to do this?

Thank you!


Solution 1:[1]

First of all, I would suggest using cosine instead of 1 - cosine, because a similarity matrix is easier to read. Using your code to calculate the cosine similarity:

library(lsa)
library(proxy)
library(tm)

text = text[,2]

corpus <- VCorpus(VectorSource(text))
tdm <- TermDocumentMatrix(corpus, 
                          control = list(wordLengths = c(1, Inf)))
occurrence <- apply(X = tdm, 
                    MARGIN = 1, 
                    FUN = function(x) sum(x > 0) / ncol(tdm))

tdm_mat <- as.matrix(tdm[names(occurrence)[occurrence >= 0.5], ])

lsaSpace <- lsa(tdm_mat)

lsaMatrix <- diag(lsaSpace$sk) %*% t(lsaSpace$dk)

distMatrix <- cosine(lsaMatrix)
round(distMatrix, 3)

Output:

      1     2     3     4     5     6     7     8
1 1.000 0.994 0.738 1.000 0.932 0.749 1.000 0.996
2 0.994 1.000 0.810 0.994 0.885 0.669 0.994 0.980
3 0.738 0.810 1.000 0.738 0.444 0.106 0.744 0.678
4 1.000 0.994 0.738 1.000 0.932 0.749 1.000 0.996
5 0.932 0.885 0.444 0.932 1.000 0.938 0.929 0.960
6 0.749 0.669 0.106 0.749 0.938 1.000 0.743 0.803
7 1.000 0.994 0.744 1.000 0.929 0.743 1.000 0.996
8 0.996 0.980 0.678 0.996 0.960 0.803 0.996 1.000

Your matrix looks good. Values close to 1 indicate similar documents, and values close to 0 dissimilar ones. To check whether your similarities are right, you can use the stringsim function from the stringdist package. Let's compare reviews 1 and 4 from your dataset using this code:

library(stringdist)
# `text` was reassigned to the reviews column above, so index it as a vector
stringsim(text[1], text[4], method = "cosine")

Output:

[1] 1

As you can see, the output is 1, which matches your matrix (reviews 1 and 4 are identical).
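If you want that check for every pair at once, here is a minimal sketch using outer(), assuming the original text data frame (from before the text = text[,2] reassignment) is still available:

library(stringdist)

# Pairwise string-based cosine similarity over all 8 reviews.
# Assumes the original `text` data frame, so `text$reviews` still exists.
sim_all <- outer(text$reviews, text$reviews,
                 function(a, b) stringsim(a, b, method = "cosine"))
round(sim_all, 3)

Note that stringsim compares character q-grams rather than word counts, so these values will not match the LSA-based matrix exactly; it is only a rough consistency check.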

Solution 2:[2]

You could use the text package in R (http://www.r-text.org/) and use contextualised word embeddings from transformer-based language models such as BERT.

For installation guidelines see http://www.r-text.org/articles/Extended_Installation_Guide.html

library(text)
# Embed the text
text_embeddings <- text::textEmbed(text)

# Compute cosine based semantic similarity between first and second row
textSimilarity(text_embeddings$reviews[1,],
               text_embeddings$reviews[2,])


# A function that computes semantic similarity between all combinations
# and returns a matrix of semantic similarity scores.
textSimilarityMatrix <- function(embedding){
  ss_matrix <- matrix(nrow = nrow(embedding), ncol = nrow(embedding))
  for (i in seq_len(nrow(embedding))){
    for (j in seq_len(nrow(embedding))){
      ss_matrix[i, j] <- text::textSimilarity(embedding[i, ],
                                              embedding[j, ])
    }
  }
  ss_matrix
}
# Run the function on the embeddings computed above
ss_matrix_fun <- textSimilarityMatrix(text_embeddings$reviews)
round(ss_matrix_fun, 3)
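Since the embeddings are just a numeric matrix, you can also get the full pairwise matrix in one call with lsa::cosine, which computes cosines column-wise. A minimal sketch, assuming text_embeddings$reviews holds one numeric embedding per review (the exact structure returned by textEmbed can vary between versions of the text package):

# Transpose so documents are in columns and dimensions in rows, as
# lsa::cosine computes all pairwise cosines between columns.
emb_mat <- t(as.matrix(text_embeddings$reviews))
round(lsa::cosine(emb_mat), 3)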

Solution 3:[3]

I don't think it's weird to get some dissimilarity entries greater than 1 (that is, negative cosine similarities). I've found one source that is relevant to this question, and there are some useful links within it.

If you directly calculate the cosine similarity using cosine() between every two texts, you get similarity scores in the range [0, 1] ("In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative."). This is because the word frequency counts in each text are greater than or equal to 0, so the vectors formed by those counts all lie in the first 'quadrant' (orthant) of the hyperspace.

However, when you first apply Latent Semantic Analysis (see page 11 of the linked source for details) and then calculate the cosine similarity, you are compressing the high-dimensional information into a lower-dimensional space, and you can't guarantee that the resulting vectors stay in the same 'quadrant'. In your case, you are compressing the frequency-count vectors down to vectors of length two.

To verify this, you can inspect lsaMatrix: the vectors formed by its columns do not all lie in the same quadrant. The reason you still get positive cosine similarities is that the angles between all those vectors are less than 90 degrees.

> lsaMatrix 
              1          2         3          4          5         6          7          8
[1,] -8.9936599 -5.7764007 -3.835576 -8.9936599 -1.8369879 -5.311483 -3.7632129 -5.6634011
[2,] -0.3266251 -0.8681005 -3.768100 -0.3266251  0.6383841  4.370599 -0.1663843  0.2792518
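To check this programmatically, a small sketch, assuming lsaMatrix from the code above:

# Each column of lsaMatrix is one document in 2-dimensional LSA space.
# A row with both positive and negative entries means the document
# vectors do not all lie in the same quadrant.
apply(lsaMatrix, 1, function(x) any(x > 0) && any(x < 0))

Here the second row returns TRUE, confirming the vectors span more than one quadrant.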

Therefore, if you want cosine similarities within the range of 0 and 1, I suggest running the code below, which skips the LSA step and computes the cosine directly on the raw term-document matrix:

text = text[,2]

corpus <- VCorpus(VectorSource(text))
tdm <- TermDocumentMatrix(corpus, 
                          control = list(wordLengths = c(1, Inf)))
occurrence <- apply(X = tdm, 
                    MARGIN = 1, 
                    FUN = function(x) sum(x > 0) / ncol(tdm))

tdm_mat <- as.matrix(tdm[names(occurrence)[occurrence >= 0.5], ])

# Use the `cosine` function in `lsa` package to get cosine similarities matrix
# (subtract from 1 to get dissimilarity matrix)
distMatrix <- 1 - cosine(tdm_mat)
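As a quick sanity check, every entry of this dissimilarity matrix should now fall in [0, 1], because the raw term counts are non-negative:

# Non-negative count vectors give cosine similarities in [0, 1], so the
# dissimilarities 1 - cosine stay in [0, 1] too (up to floating point noise).
range(distMatrix)
stopifnot(all(distMatrix >= -1e-12), all(distMatrix <= 1 + 1e-12))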

Another thing worth mentioning is that you have dropped the less frequent words (via the occurrence >= 0.5 filter), which also affects the resulting similarities.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Quinten
[2] Solution 2: Gorp
[3] Solution 3