Numba - Shared memory in CUDA kernel not updating correctly
Consider the following kernel, which counts the number of elements in x which are less than or equal to the corresponding element in y.
@cuda.jit
def count_leq(x, y, out):
    i = cuda.grid(1)
    shared = cuda.shared.array(1, dtype=DTYPE)
    if i < len(x):
        shared[0] += x[i] <= y[i]
    cuda.syncthreads()
    out[0] = shared[0]
However, the increments from each thread are not being accumulated correctly in the shared array.
a = cuda.to_device(np.arange(5))    # [0 1 2 3 4]
b = cuda.to_device(np.arange(5))    # [0 1 2 3 4]
out = cuda.to_device(np.zeros(1))   # [0]
count_leq[1, len(a)](a, b, out)
print(out[0])  # 1.0, but should be 5.0
What am I doing wrong here? I'm confused because cuda.shared.array is shared by all threads in a given block, right? How do I accumulate the increments using the same 1-element array?
I also tried the following, which failed with the same behavior as the above version.
@cuda.jit
def count_leq(x, y, out):
    i = cuda.grid(1)
    if i < len(x):
        out[0] += x[i] <= y[i]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow