Dask map_blocks runs early with a bad result for overlap and nested procedures

I'm using Dask to build a simple data-manipulation pipeline out of three functions. The first two use a plain map_blocks, and the third also uses map_blocks but on overlapped data.

For some reason, the third map_blocks executes earlier than I want. See the code and the IPython output (without calling compute()):

import numpy as np
import dask.array as da

data = np.arange(2000)

data_da = da.from_array(data, chunks=(500,))  # four blocks of 500 elements

def func1(block, block_info=None):
    return block + 1

def func2(block, block_info=None):
    return block * 2

def func3(block, block_info=None):
    print("func3", block_info)
    return block

data_da_1 = data_da.map_blocks(func1)

data_da_2 = data_da_1.map_blocks(func2)

data_da_over = da.overlap.overlap(data_da_2, depth=1, boundary='periodic')  # 1-element halo on each side of every block

data_da_map = data_da_over.map_blocks(func3)

data_da_3 = da.overlap.trim_internal(data_da_map, {0: 1})

The output is:

func3 None
func3 None

It also doesn't respect the number of blocks, which is 4 here.

I really don't know what is wrong with this code, especially because if I use visualize() to inspect the task graph, it builds exactly the sequence I want.

Initially, I thought that overlap requires a compute() first, like rechunk does, but I've already tested that too.



Solution 1:

Like many dask operations, da.overlap operations can either be passed a meta argument specifying the output type and dimensions, or dask will execute the function on a small (or zero-length) subset of the data to infer them itself. That inference step is what you are seeing: func3 is called at graph-construction time with block_info=None, which is why the prints appear before compute() and don't match the number of blocks.

From the dask.array.map_overlap docs:

Note that this function will attempt to automatically determine the output array type before computing it, please refer to the meta keyword argument in map_blocks if you expect that the function will not succeed when operating on 0-d arrays.
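A minimal sketch of both workarounds, assuming the arrays from the question (the meta value shown follows the pattern from the map_blocks docs):

import numpy as np
import dask.array as da

def func3(block, block_info=None):
    # block_info is None when dask probes the function with a dummy
    # array to infer the output meta; skip the print in that case.
    if block_info is not None:
        print("func3", block_info)
    return block

# Passing meta tells dask the output type and dtype up front, so it
# never needs to call func3 at graph-construction time.
data_da_map = data_da_over.map_blocks(
    func3, meta=np.array((), dtype=data_da_over.dtype)
)

With either change, func3 only prints when you actually call compute(), once per overlapped block and with a real block_info dict.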

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Michael Delgado