Dask map_blocks is running early with a bad result for overlap and nested procedures
I'm using Dask to build a simple data-manipulation pipeline out of 3 functions. The first two use a plain map_blocks, and the third also uses map_blocks, but on overlapped data.
For some reason, the third map_blocks executes earlier than I want. See the code and the IPython output below (without calling compute()):
```python
import numpy as np
import dask.array as da

data = np.arange(2000)
data_da = da.from_array(data, chunks=(500,))  # 4 blocks of 500

def func1(block, block_info=None):
    return block + 1

def func2(block, block_info=None):
    return block * 2

def func3(block, block_info=None):
    print("func3", block_info)
    return block

data_da_1 = data_da.map_blocks(func1)
data_da_2 = data_da_1.map_blocks(func2)
data_da_over = da.overlap.overlap(data_da_2, depth=1, boundary='periodic')
data_da_map = data_da_over.map_blocks(func3)
data_da_3 = da.overlap.trim_internal(data_da_map, {0: 1})
```
The output is:
```
func3 None
func3 None
```
It also doesn't respect the number of blocks, which is 4 here.
I really don't know what is wrong with this code, especially because if I use visualize() to inspect the task graph, it builds exactly the sequence I want.
Initially, I thought that overlap requires a compute() first, like rechunk does, but I've already tested that too.
Solution 1:
Like many dask operations, da.overlap operations can either be passed a meta argument specifying the output type and dimensions, or dask will execute the function on a small (or length-zero) subset of the data to infer them. That trial call at graph-construction time is what produces the early `func3 None` output; it is not a real block computation, so block_info is None.
From the dask.array.map_overlap docs:
Note that this function will attempt to automatically determine the output array type before computing it, please refer to the meta keyword argument in map_blocks if you expect that the function will not succeed when operating on 0-d arrays.
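To illustrate, here is a minimal sketch of the suggested fix: passing `meta` to `map_blocks` so dask knows the output type up front and skips the trial call on a zero-length block. It uses a stripped-down version of the question's pipeline (just the overlap step) for brevity:

```python
import numpy as np
import dask.array as da

def func3(block, block_info=None):
    # With meta provided, this runs only during the real compute(),
    # once per overlapped block, with block_info populated.
    return block

data = np.arange(2000)
data_da = da.from_array(data, chunks=(500,))
data_da_over = da.overlap.overlap(data_da, depth=1, boundary='periodic')

# meta declares the output array type and dtype, so dask does not need
# to call func3 on a 0-length sample at graph-construction time.
data_da_map = data_da_over.map_blocks(
    func3, meta=np.array((), dtype=data_da.dtype)
)
result = da.overlap.trim_internal(data_da_map, {0: 1}).compute()
```

Since `func3` returns its block unchanged, trimming the overlap recovers the original array; the point is that no `func3 None` line is printed before `compute()`.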
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Michael Delgado |
