Identity matrix using Dask in a memory-efficient way
Does anyone know how to create an identity matrix using Dask, without using numpy.identity()? I really care about memory consumption.
Solution 1:[1]
Dask supports sparse arrays backed by the sparse library, a pydata project implementing an ND sparse array. From the project description:
This implements sparse arrays of arbitrary dimension on top of numpy and scipy.sparse. It generalizes the scipy.sparse.coo_matrix and scipy.sparse.dok_matrix layouts, but extends beyond just rows and columns to an arbitrary number of dimensions.
Additionally, this project maintains compatibility with the numpy.ndarray interface rather than the numpy.matrix interface used in scipy.sparse.
These differences make this project useful in certain situations where scipy.sparse matrices are not well suited, but it should not be considered a full replacement. The data structures in pydata/sparse complement and can be used in conjunction with the fast linear algebra routines inside scipy.sparse. A format conversion or copy may be required.
You can create a sparse.COO identity array easily with sparse.eye. For example, here we create a 1e6 x 1e6 identity matrix, which would occupy ~1 TB as a dense int8 array but here occupies only 17 MB:
In [11]: ident = sparse.eye(N=int(1e6), dtype='int8')
In [12]: ident.nbytes
Out[12]: 17000000
This can be indexed like a normal array:
In [13]: ident[-8:-1, -8:-1]
Out[13]: <COO: shape=(7, 7), dtype=int8, nnz=7, fill_value=0>
In [14]: ident[-8:-1, -8:-1].todense()
Out[14]:
array([[1, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 1]], dtype=int8)
In [15]: ident.shape
Out[15]: (1000000, 1000000)
Dask can work with this array type seamlessly - see the dask docs on working with sparse arrays. To convert the above array to a dask array, simply:
In [16]: da = dask.array.from_array(ident)
In [17]: da
Out[17]: dask.array<array, shape=(1000000, 1000000), dtype=int8, chunksize=(10000, 10000), chunktype=sparse.COO>
Note, however, that some operations would cause this array to become dense, and potentially explode. The sparse library attempts to keep you safe by raising a ValueError for any operations that would cause "auto-densification", but it's probably worth giving the sparse docs on Operations on COO and GCXS arrays a close read.
Additionally, note that Dask arrays do not take up less space than their in-memory counterparts - in fact, they carry a small amount of overhead. Dask lets you leverage multiple threads, processes, or servers to work with partitioned data stored on disk or under a distributed processing model, but it cannot magically save you memory.
Tl;dr: you still have to be careful when performing large, and especially sparse, operations with Dask. It may finish sooner in wall-clock time for you, but it does not cost less in CPU time or memory.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow