'Scipy: Sparse matrix multiplication memory error

I want to perform matrix multiplication between a sparse matrix and its transpose, (their are big matrices). Specifically, I have:

C = csc_matrix(...)
Ct = csc_matrix.transpose(C)
L = Ct*C

and shapes:

C.shape
(1791489, 28508141)
Ct.shape
(28508141, 1791489)

And I am getting the following error:

Traceback (most recent call last):

  File "C:\...\modularity.py", line 373, in <module>
    L = Ct*C

  File "C:\...\anaconda3\lib\site-packages\scipy\sparse\base.py", line 480, in __mul__
    return self._mul_sparse_matrix(other)

  File "C:\...\anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 518, in _mul_sparse_matrix
    indices = np.empty(nnz, dtype=idx_dtype)

MemoryError: Unable to allocate 1.11 TiB for an array with shape (152087117507,) and data type int64

I cannot figure out why, why does it try to allocate memory for such a huge array ?

Update: Currently I am trying to do the multiplication in chunks like this

chunksize=1000
numiter = Ct.shape[0]//chunksize
blocks=[]
for i in range(numiter):
    A = Ct[i*chunksize:(i+1)*chunksize].dot(C)
    blocks.append(A)

But I get:

MemoryError: Unable to allocate 217. MiB for an array with shape (57012620,) and data type int32


Solution 1:[1]

For future viewers who want to multiply huge sparse matrices I solved my problem using PyTables and saved the result of the multiplication in chunks. Still it creates a big file but at least is compressed. The code I used goes like this:

import tables as tb

f = tb.open_file('D:\dot.h5', 'w')
l, m, n = Ct.shape[0], Ct.shape[1], C.shape[1]
filters = tb.Filters(complevel=8, complib='blosc')
out_data = f.create_earray(f.root, 'data', tb.Int32Atom(), shape=(0,), filters=filters)
out_indices = f.create_earray(f.root, 'indices', tb.Int32Atom(),shape=(0,), filters=filters)
out_indptr = f.create_earray(f.root, 'indptr', tb.Int32Atom(), shape=(0,), filters=filters)
out_indptr.append(np.array([0])) #this is needed as a first indptr
max_indptr = 0
#buffersize
bl = 10000
for i in range(0, l, bl):
 res = Ct[i:min(i+bl, l),:].dot(C)
 out_data.append(res.data)
 indices = res.indices
 indptr = res.indptr
 out_indices.append(indices)
 out_indptr.append(max_indptr+indptr[1:])
 max_indptr += indices.shape[0]

So if for example you want access to the 2nd row of your final matrix you simply can:

L2 = csr_matrix((a.data[a.indptr[2]:a.indptr[2+1]], a.indices[a.indptr[2]:a.indptr[2+1]], np.array([0,len(a.indices[a.indptr[2]:a.indptr[2+1]])])), shape=(1,n))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 George