R: LSA and document similarity with a very large TDM

I have a TDM with dimensions 16k x 350k, for which I am trying to compute document-document (cosine) similarity. With RSpectra I have found a way to compute the truncated SVD with 1500 singular values, yet I cannot proceed to compute the cosine similarity of svt_mat because of memory (1150 GB needed).

library(RSpectra)
library(Matrix)
library(lsa)  # for cosine()

space <- svds(sparse_tdm, 1500)
names(space)
su_mat <- t(space$d * t(space$u))        # scale the columns of u by the singular values
svt_mat <- space$d * Matrix::t(space$v)  # scale the rows of t(v) by the singular values
(distMatrix <- cosine(svt_mat)) ### error: cannot allocate the dense result
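For context, a back-of-the-envelope check (assuming 8-byte doubles) shows the failure is driven by the size of the output alone: the cosine of a 1500 x 350k matrix is a dense 350k x 350k matrix, before counting any working copies.

```r
# Why cosine(svt_mat) fails: the result alone is a dense n_docs x n_docs matrix
n_docs <- 350000
bytes  <- n_docs^2 * 8  # 8 bytes per double entry
bytes / 2^30            # on the order of 900+ GiB
```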

I thought of modifying svt_mat in the following way:

library(Matrix)
library(data.table)

# zero out near-zero loadings so the matrix becomes sparse
svt_mat[svt_mat < .01 & svt_mat > -.01] <- 0
sparsematrix <- as(svt_mat, "sparseMatrix")
sparsematrix <- triu(sparsematrix)
summ <- summary(sparsematrix)
# normalize each column (document) by its Euclidean norm
dt <- data.table(i = summ$i,
                 j = summ$j,
                 v = as.numeric(summ$x))[, v2 := v / sqrt(sum(v^2)), j]
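As a side note, the thresholding and column-wise normalization can also be done directly with Matrix operations, which keeps everything sparse and avoids the triplet round-trip. A minimal sketch on a toy matrix (svt_small and the 0.3 cutoff are illustrative stand-ins, not the real data or threshold):

```r
library(Matrix)

set.seed(1)
# toy stand-in for svt_mat (5 concepts x 8 documents)
svt_small <- Matrix(rnorm(40), nrow = 5, sparse = TRUE)
svt_small@x[abs(svt_small@x) < 0.3] <- 0  # drop tiny loadings
svt_small <- drop0(svt_small)             # remove explicit zeros

# column-wise Euclidean normalization, equivalent to the data.table step
norms    <- sqrt(colSums(svt_small^2))
svt_norm <- svt_small %*% Diagonal(x = 1 / norms)

# with normalized columns, cosine similarity is just a crossproduct
sim <- crossprod(svt_norm)
diag(sim)  # each entry approximately 1
```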

And then run the function written by @jblood94 here on the result.

I have three questions:

  1. Would the transformation svt_mat[svt_mat < .01 & svt_mat > -.01] <- 0 drastically degrade the quality of the LSA?
  2. Is it expected that entries in svt_mat have absolute values larger than 1?
  3. Is there a way (perhaps computing the cosine similarities in chunks and storing intermediate results) to obtain the cosine product (and then filter out values below a certain threshold)?
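Regarding question 3, here is one hedged sketch of the kind of chunked approach I have in mind: take a block of columns at a time, compute its similarities against all documents, filter, and keep only the surviving entries as a sparse matrix. The function name chunked_cosine and the parameters chunk and keep_above are hypothetical, and the input is assumed to already have L2-normalized columns:

```r
library(Matrix)

# Hypothetical sketch: chunked cosine similarity with thresholding.
# `m` is a (concepts x documents) matrix with L2-normalized columns.
chunked_cosine <- function(m, chunk = 1000, keep_above = 0.5) {
  n <- ncol(m)
  pieces <- list()
  for (start in seq(1, n, by = chunk)) {
    cols <- start:min(start + chunk - 1, n)
    # dense block of similarities: length(cols) x n
    block <- as.matrix(crossprod(m[, cols, drop = FALSE], m))
    block[block < keep_above] <- 0            # filter before storing
    pieces[[length(pieces) + 1]] <- Matrix(block, sparse = TRUE)
  }
  do.call(rbind, pieces)                      # sparse n x n similarity matrix
}

# toy usage
set.seed(42)
m <- matrix(rnorm(5 * 20), 5, 20)
m <- m %*% diag(1 / sqrt(colSums(m^2)))       # normalize columns
sim <- chunked_cosine(Matrix(m, sparse = TRUE), chunk = 7, keep_above = 0.3)
dim(sim)  # 20 x 20
```

The peak memory is then bounded by one chunk x n_docs dense block rather than the full n_docs x n_docs matrix, at the cost of writing the thresholded blocks out (or row-binding them) as you go.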


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
