Dask: store/read a sparse matrix that doesn't fit in memory
I'm using sparse to construct, store, and read a large sparse matrix. I'd like to wrap it in a Dask array to take advantage of its blocked algorithms.
Here's a simplified version of what I'm trying to do:
```python
import os

import dask.array as da
import sparse

def get_matrix(file_path='./myfile.npz'):
    if os.path.isfile(file_path):
        # Load file with sparse matrix
        X_sparse = sparse.load_npz(file_path)
    else:
        # All matrix elements are initially equal to 0
        coords, data = [], []
        X_sparse = sparse.COO(coords, data, shape=(88506, 1440000))
        # Create file for later retrieval
        sparse.save_npz(file_path, X_sparse)
    # Create Dask array from matrix to allow usage of blocked algorithms
    X = da.from_array(X_sparse, chunks='auto').map_blocks(sparse.COO)
    return X
```
Unfortunately, the code above throws the following error when I call compute() on X: `Cannot convert a sparse array to dense automatically. To manually densify, use the todense method.` But I cannot convert the sparse matrix to a dense one in memory, since that would raise an error.
Any ideas on how to accomplish this?
Solution 1:[1]
You can have a look at the following issue: https://github.com/dask/dask/issues/4523
Basically, sparse by intention prevents automatic conversion into a dense matrix.
However, by setting the environment variable SPARSE_AUTO_DENSIFY=1 you can override this behavior. Nevertheless, this only suppresses the error; it does not accomplish your main goal.
What you would need to do is to split your file into multiple *.npz sparse matrices, load these with sparse in a delayed manner (see dask.delayed) and concatenate those into one large sparse Dask array.
I will have to implement something like this in the near future. IMHO this should be supported by Dask more natively...
Solution 2:[2]
dask.array.from_array now supports COO and GCXS sparse arrays natively.
Using dask version '2022.01.0':
```python
In [18]: # All matrix elements are initially equal to 0
    ...: coords, data = [], []
    ...: X_sparse = sparse.COO(coords, data, shape=(88506, 1440000))
    ...:
    ...: # Create Dask array from matrix to allow usage of blocked algorithms
    ...: X = dask.array.from_array(X_sparse, chunks="auto").map_blocks(sparse.COO)

In [19]: X
Out[19]: dask.array<COO, shape=(88506, 1440000), dtype=float64, chunksize=(4023, 4000), chunktype=sparse.COO>
```
See the dask docs on Sparse Arrays for more information.
Support for sparse arrays was added back in 2017, and stability and API coverage have been improving steadily ever since.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Hoeze |
| Solution 2 | Michael Delgado |
