Numba - Shared memory in CUDA kernel not updating correctly
Consider the following kernel, which counts the number of elements in x which are less than or equal to the corresponding element in y.
@cuda.jit
def count_leq(x, y, out):
    i = cuda.grid(1)
    shared = cuda.shared.array(1, dtype=DTYPE)
    if i < len(x):
        shared[0] += x[i] <= y[i]
    cuda.syncthreads()
    out[0] = shared[0]
However, the increments from each thread are not being accumulated correctly in the shared array.
a = cuda.to_device(np.arange(5))    # [0 1 2 3 4]
b = cuda.to_device(np.arange(5))    # [0 1 2 3 4]
out = cuda.to_device(np.zeros(1))   # [0]
count_leq[1, len(a)](a, b, out)
print(out[0])  # 1.0, but should be 5.0
What am I doing wrong here? I'm confused because cuda.shared.array is shared by all threads in a given block, right? How do I accumulate the increments using the same 1-element array?
I also tried the following, which failed with the same behavior as the above version.
@cuda.jit
def count_leq(x, y, out):
    i = cuda.grid(1)
    if i < len(x):
        out[0] += x[i] <= y[i]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow