R: LSA and document similarity with a very large tdm
I have a term-document matrix (tdm) of dimension 16k x 350k, for which I am trying to compute document-document (cosine) similarity. With RSpectra I have found a way to compute a truncated SVD with the top 1500 singular values, yet I cannot proceed to compute the cosine similarities from svt_mat because of memory (about 1,150 GB needed).
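For scale, a quick back-of-the-envelope check (assuming 8-byte doubles) shows why the dense result cannot fit in memory: a full 350k x 350k similarity matrix alone needs roughly a terabyte before any intermediates.

```r
# Approximate memory needed for a dense n x n matrix of 8-byte doubles
n <- 350000
gb <- n^2 * 8 / 1e9
gb  # ~980 GB for the similarity matrix alone, before any temporaries
```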
```r
library(RSpectra)

# Truncated SVD of the sparse 16k x 350k term-document matrix
space <- svds(sparse_tdm, 1500)
names(space)

# Scaled term and document vectors in the latent space
# (scale each column k of u by d[k]; plain space$d * space$u would
# recycle d down the rows of u instead)
su_mat  <- t(space$d * t(space$u))        # terms: 16k x 1500
svt_mat <- space$d * Matrix::t(space$v)   # documents: 1500 x 350k

# cosine() tries to allocate a dense 350k x 350k similarity matrix
(distMatrix <- cosine(svt_mat)) ### error: cannot allocate memory
```
I thought of modifying svt_mat in the following way:
```r
library(Matrix)
library(data.table)

# Zero out near-zero entries, then convert to a sparse matrix
svt_mat[svt_mat < .01 & svt_mat > -.01] <- 0
sparsematrix <- as(svt_mat, "sparseMatrix")
sparsematrix <- triu(sparsematrix)
summ <- summary(sparsematrix)

# Normalize each document (column) by its Euclidean norm
dt <- data.table(i = summ$i,
                 j = summ$j,
                 v = as.numeric(summ$x))[, v2 := v / sqrt(sum(v^2)), by = j]
```
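On a toy matrix the same threshold-sparsify-normalize pipeline can be checked end to end (the 5 x 4 matrix here is an illustrative stand-in for svt_mat, not data from the post):

```r
library(Matrix)
library(data.table)

# Toy stand-in for svt_mat: 5 latent dimensions x 4 documents
set.seed(1)
m <- matrix(rnorm(20), nrow = 5)

# Threshold near-zero entries, then sparsify
m[abs(m) < .01] <- 0
sm <- as(m, "CsparseMatrix")

# Per-column (per-document) Euclidean-norm normalization via data.table
summ <- summary(sm)
dt <- data.table(i = summ$i, j = summ$j, v = as.numeric(summ$x))
dt[, v2 := v / sqrt(sum(v^2)), by = j]

# Each normalized document vector should now have unit length
dt[, sum(v2^2), by = j]  # all columns ~1
```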
And then run the function written by @jblood94 here on the result.
I have three questions:

- Would the transformation `svt_mat[svt_mat < .01 & svt_mat > -.01] <- 0` drastically alter the quality of the LSA?
- Is it expected that entries in svt_mat have absolute values larger than 1?
- Is there a way (maybe processing the cosine in chunks and storing temporary results) to compute the cosine product (and then eventually filter the values below a certain threshold)?
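A sketch of the chunked idea from the third question: normalize the columns once, then compute `crossprod` block by block and keep only similarities above a threshold as triplets. The function name `chunked_cosine`, the chunk size, and the threshold are illustrative choices, not from the original post, and the sketch assumes documents are the columns of an ordinary dense matrix.

```r
# Chunked column-wise cosine similarity, keeping only values >= threshold.
# Peak memory per step is roughly chunk * ncol(m) doubles instead of n x n.
chunked_cosine <- function(m, chunk = 1000L, threshold = 0.5) {
  # Normalize each column (document) to unit Euclidean norm
  norms <- sqrt(colSums(m^2))
  mn <- sweep(m, 2, norms, "/")
  n <- ncol(mn)
  out <- list()
  for (start in seq(1L, n, by = chunk)) {
    end <- min(start + chunk - 1L, n)
    # crossprod(A, B) = t(A) %*% B: a (chunk x n) block of similarities
    sims <- crossprod(mn[, start:end, drop = FALSE], mn)
    keep <- which(sims >= threshold, arr.ind = TRUE)
    if (nrow(keep)) {
      out[[length(out) + 1L]] <- data.frame(
        i = keep[, 1] + start - 1L,  # map back to global column index
        j = keep[, 2],
        sim = sims[keep]
      )
    }
  }
  do.call(rbind, out)
}
```

For example, `chunked_cosine(as.matrix(svt_mat), chunk = 1000L, threshold = .5)` would return an (i, j, sim) data frame of document pairs above the threshold; each block could also be written to disk instead of kept in a list.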
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
