Pairwise similarity with consecutive points
I have a large document-similarity matrix created with paragraph2vec_similarity from the doc2vec package. I converted it to a data frame and added a Title column at the beginning so that I can sort or group it later.
Current Dummy Output:
| Title | Header | DocName_1900.txt_1 | DocName_1900.txt_2 | DocName_1900.txt_3 | DocName_1901.txt_1 | DocName_1901.txt_2 |
|---|---|---|---|---|---|---|
| Doc1 | DocName_1900.txt_1 | 1.000000 | **0.7369358** | 0.6418045 | 0.6268959 | 0.6823404 |
| Doc1 | DocName_1900.txt_2 | 0.7369358 | 1.000000 | **0.6544884** | 0.7418507 | 0.5174367 |
| Doc1 | DocName_1900.txt_3 | 0.6418045 | 0.6544884 | 1.000000 | 0.6180578 | 0.5274650 |
| Doc2 | DocName_1901.txt_1 | 0.6268959 | 0.7418507 | 0.6180578 | 1.000000 | **0.5755243** |
| Doc2 | DocName_1901.txt_2 | 0.6823404 | 0.5174367 | 0.5274650 | 0.5755243 | 1.000000 |
What I want is a data frame giving the similarity between consecutive paragraphs of each document: that is, the score for Doc1.1 and Doc1.2, then for Doc1.2 and Doc1.3, and so on. I am only interested in the similarity scores inside each individual document, i.e. the entries next to the diagonal, shown in bold above.
Expected Output:
| Title | Similarity for 1-2 | Similarity for 2-3 | Similarity for 3-4 |
|---|---|---|---|
| Doc1 | 0.7369358 | 0.6544884 | NA |
| Doc2 | 0.5755243 | NA | NA |
| Doc3 | 0.6049844 | 0.5250659 | 0.5113757 |
The closest I could get is a long data frame giving the similarity of each paragraph with all the others: x <- data.frame(col = colnames(m)[col(m)], row = rownames(m)[row(m)], similarity = c(m)). Is there a better way? I am dealing with more than 500 titles of varying lengths. There is still the option of using diag, but it runs across the whole matrix, so I lose the document grouping.
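To make the diag idea concrete, here is a minimal base R sketch that reads the entries just above the main diagonal and keeps the grouping by dropping pairs that span two documents. It assumes the rows of m are ordered so that the paragraphs of each document sit next to each other; the names hdr, title, superdiag, same_doc and pairs are only illustrative, and the matrix is retyped from the dummy table above.

```r
# Dummy similarity matrix retyped from the table in the question.
hdr <- c("DocName_1900.txt_1", "DocName_1900.txt_2", "DocName_1900.txt_3",
         "DocName_1901.txt_1", "DocName_1901.txt_2")
m <- matrix(c(1.0000000, 0.7369358, 0.6418045, 0.6268959, 0.6823404,
              0.7369358, 1.0000000, 0.6544884, 0.7418507, 0.5174367,
              0.6418045, 0.6544884, 1.0000000, 0.6180578, 0.5274650,
              0.6268959, 0.7418507, 0.6180578, 1.0000000, 0.5755243,
              0.6823404, 0.5174367, 0.5274650, 0.5755243, 1.0000000),
            nrow = 5, byrow = TRUE, dimnames = list(hdr, hdr))
# Assumption: one title per row of m, in the same order as the rows.
title <- c("Doc1", "Doc1", "Doc1", "Doc2", "Doc2")

i <- seq_len(nrow(m) - 1)
superdiag <- m[cbind(i, i + 1)]        # similarity of row i with row i + 1
same_doc  <- title[i] == title[i + 1]  # FALSE where a pair crosses documents

pairs <- data.frame(Title = title[i], from = hdr[i], to = hdr[i + 1],
                    similarity = superdiag)[same_doc, ]
print(pairs)
```

Indexing with a two-column matrix picks one cell per row of the index, so no full long table has to be built first.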
Solution 1:[1]
Another solution:
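library(dplyr)
library(tidyr)
library(stringr)
df %>%
  group_by(Title) %>%
  # embed(Header, 2) pairs each header (column 1) with the one before it (column 2)
  summarize(name = embed(Header, 2), .groups = 'drop') %>%
  # look up each pair in df, then shorten the label (Doc1.1_Doc1.2 becomes 1_2)
  mutate(value = transform(df, row.names = Header)[name],
         name = str_remove_all(paste(name[,2], name[,1], sep = '_'), '[^_]+[.]')) %>%
  pivot_wider()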
# A tibble: 2 x 3
Title `1_2` `2_3`
<chr> <chr> <chr>
1 Doc1 0.7369358 0.6544884
2 Doc2 0.5755243 NA
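If the tidyverse is not available, or numeric columns are preferred over the character columns above, the same wide table can be sketched in base R. This is only an illustration under stated assumptions: consecutive_wide is a hypothetical helper, not part of doc2vec or of the answer above. It expects a square similarity matrix m whose rows are ordered by document, a title vector with one entry per row, and at least one title with two or more paragraphs.

```r
# Hypothetical helper: one row per title, one column per consecutive pair.
consecutive_wide <- function(m, title) {
  # Row indices of m for each title, in order of first appearance.
  idx <- split(seq_along(title), factor(title, levels = unique(title)))
  width <- max(lengths(idx)) - 1  # largest number of consecutive pairs
  rows <- lapply(idx, function(k) {
    # Cells (k[1], k[2]), (k[2], k[3]), ... inside this document's block.
    v <- if (length(k) > 1) m[cbind(k[-length(k)], k[-1])] else numeric(0)
    c(v, rep(NA_real_, width - length(v)))  # pad shorter documents with NA
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  colnames(out) <- paste(seq_len(width), seq_len(width) + 1, sep = "_")
  data.frame(Title = names(idx), out, check.names = FALSE)
}

# Dummy data retyped from the question's table.
hdr <- c("DocName_1900.txt_1", "DocName_1900.txt_2", "DocName_1900.txt_3",
         "DocName_1901.txt_1", "DocName_1901.txt_2")
m <- matrix(c(1.0000000, 0.7369358, 0.6418045, 0.6268959, 0.6823404,
              0.7369358, 1.0000000, 0.6544884, 0.7418507, 0.5174367,
              0.6418045, 0.6544884, 1.0000000, 0.6180578, 0.5274650,
              0.6268959, 0.7418507, 0.6180578, 1.0000000, 0.5755243,
              0.6823404, 0.5174367, 0.5274650, 0.5755243, 1.0000000),
            nrow = 5, byrow = TRUE, dimnames = list(hdr, hdr))
title <- c("Doc1", "Doc1", "Doc1", "Doc2", "Doc2")

res <- consecutive_wide(m, title)
print(res)
```

Because the column labels come from the position of each paragraph within its title rather than from the header text, this version does not depend on how the headers are named.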
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | onyambu |
