CUDA local variable behavior confuses me
I thought a local variable in a kernel belongs only to its own thread, but the results below confuse me. Could you tell me why this happens?
#include <stdio.h>
__global__ void bcast(int arg) {
    int laneId = threadIdx.x & 0x1f;
    int value; // I expected the default value of int to be 0
    if (laneId == 0) // only thread 0 sets value to arg
        value = arg;
    printf("thread id %d: %d\n", threadIdx.x, value);
}
int main() {
    bcast<<< 1, 32 >>>(1234);
    cudaDeviceSynchronize();
    return 0;
}
I think only thread 0 changes "value", so the output should be as below:
thread id 0: 1234
thread id 1: 0
...
thread id 31: 0
However, "value" is 1234 in all 32 threads.
Then I initialized "value" to 0 in its declaration, as below. Now I get the result I wanted: only thread 0's "value" is 1234, and all the others are 0.
#include <stdio.h>
__global__ void bcast(int arg) {
    int laneId = threadIdx.x & 0x1f;
    int value = 0; // initialized to 0
    if (laneId == 0) // only thread 0 sets value to arg
        value = arg;
    printf("thread id %d: %d\n", threadIdx.x, value);
}
int main() {
    bcast<<< 1, 32 >>>(1234);
    cudaDeviceSynchronize();
    return 0;
}
My questions are:
- Why does "value" become "arg" in every thread in the first code, when only thread 0 sets it?
- Why does thread 0 set only its own "value" in the second code, after "value" is initialized to 0?
Solution 1:[1]
Why does "value" become "arg" in every thread in the first code, when only thread 0 sets it?
Because your code has undefined behaviour. value is not initialized to 0, as pointed out by @talonmies; it is left uninitialized, and every thread except thread 0 reads it without ever writing it. The compiler is therefore free to assume value can contain anything, including arg. In fact, it does exactly that to simplify the code and remove the condition. You can see on Godbolt that your kernel is compiled as if it had been written like this:
__global__ void bcast(int arg) {
    int value = arg;
    printf("thread id %d: %d\n", threadIdx.x, value);
}
You can fix this by initializing value to 0. This is what the second code does.
Why does thread 0 set only its own "value" in the second code, after "value" is initialized to 0?
Your code executes a kernel with 1 block of 32 threads. threadIdx.x is the thread index within the block. threadIdx.x & 0x1f is equal to 0 only when threadIdx.x % 32 is zero. Because there are only 32 threads, only the first one (threadIdx.x == 0) takes the branch and sets value. In every other thread value is left untouched, so it keeps its initial value of 0. Each thread has its own private copy of value, which is the behaviour you expected.
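If you want to check the per-thread behaviour without relying on printf, here is a minimal sketch of one way to do it. It is not from the original code: the kernel name bcast_checked, the out buffer, and the 64-thread launch are my own assumptions, chosen so that two warps run and the lane mask matches twice (threads 0 and 32). Each thread stores its private copy of value, and the host compares the results with i % 32 == 0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of the kernel: each thread stores its own private
// copy of `value` into out[threadIdx.x] so the host can inspect it.
__global__ void bcast_checked(int arg, int *out) {
    int laneId = threadIdx.x & 0x1f;  // lane index within the warp (0..31)
    int value = 0;                    // initialized, so every read is well defined
    if (laneId == 0)                  // true for threads 0 and 32 in a 64-thread block
        value = arg;
    out[threadIdx.x] = value;
}

int main() {
    const int kThreads = 64;  // two warps; an assumption made for illustration
    const int kArg = 1234;

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, kThreads * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    bcast_checked<<<1, kThreads>>>(kArg, d_out);

    int h_out[kThreads];
    // A blocking cudaMemcpy on the default stream runs after the kernel
    // has finished, and it also reports earlier asynchronous errors.
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Only the threads whose lane index is 0 should hold kArg.
    for (int i = 0; i < kThreads; ++i) {
        int expected = (i % 32 == 0) ? kArg : 0;
        if (h_out[i] != expected) {
            printf("thread %d: got %d, expected %d\n", i, h_out[i], expected);
            return 1;
        }
    }
    printf("all %d threads hold the expected value\n", kThreads);
    return 0;
}
```

With 64 threads, the mask matches for threadIdx.x == 0 and threadIdx.x == 32, so only those two entries of out should hold 1234. If you remove the = 0 initializer, the check is no longer meaningful, because reading the uninitialized value is undefined behaviour again.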
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Jérôme Richard |
