Monitor active warps and threads during a divergent CUDA run
I implemented some CUDA code. It runs fine, but the algorithm inherently produces strong thread divergence. This is expected.
I will try to reduce the divergence later, but for the moment I would simply like to measure it.
Is there an easy way (preferably a runtime API call or a CLI tool) to check how many of my initially scheduled warps and/or threads are still active?
Solution 1:[1]
Besides the solutions given in the comments, you can use Nsight Compute to profile your kernels: run its CLI and then inspect the results in its GUI, e.g.:
ncu --export output --force-overwrite --target-processes application-only \
--replay-mode kernel --kernel-regex-base function --launch-skip-before-match 0 \
--section InstructionStats \
--section Occupancy \
--section SchedulerStats \
--section SourceCounters \
--section WarpStateStats \
--sampling-interval auto \
--sampling-max-passes 5 \
--profile-from-start 1 --cache-control all --clock-control base \
--apply-rules yes --import-source no --check-exit-code yes \
your-application [arguments]
Then, in its GUI you can see some useful information; the Source Counters section, for example, breaks the counters down per source line, which includes divergence-related statistics such as divergent branches.
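If a single number is enough, ncu can also collect individual metrics instead of whole sections. A minimal sketch is shown below; the metric name is an assumption and may differ between Nsight Compute versions, so check what your version supports with ncu --query-metrics:
ncu --metrics smsp__thread_inst_executed_per_inst_executed.ratio \
your-application [arguments]
This metric reports the average number of threads per warp that executed an instruction: a value close to 32 indicates little divergence, while lower values mean fewer threads per warp were active on average.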
Solution 2:[2]
I found a solution that gives pretty nice results. Calling the following function from selected lines of a kernel (with the filter condition adapted as needed) prints the number of active threads in the current warp:
__device__ void printConvergentThreadCount(int line) // Pass __LINE__
{
    const int count = __popc(__activemask()); // set bits in the active mask = active threads in this warp
    const int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadId == 0) // Filter: adapt as needed (as written, only global thread 0 reports)
    {
        printf("Line %i: %i\n", line, count);
    }
}
Note that this does not produce any output while the kernel is still running: device-side printf results are only flushed once the kernel completes or at a synchronization point such as cudaDeviceSynchronize().
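For illustration, here is a minimal sketch of how the helper above could be used; the kernel name, the data-dependent branch, and the launch shown in the comments are assumptions for demonstration, not part of the original answer. Remember that, as written, the helper's filter limits output to global thread 0, so you would typically adapt it (for example, to lane 0 of each warp) before using it like this:
__global__ void divergentKernel(const int *data, int n) // hypothetical example kernel
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    printConvergentThreadCount(__LINE__); // before the branch: typically reports 32

    if (data[i] % 2 == 0) // data-dependent branch causes divergence
    {
        printConvergentThreadCount(__LINE__); // inside the branch: often fewer than 32
    }
}
// Host side (sketch): output only appears once the kernel has finished, e.g.
//   divergentKernel<<<numBlocks, 256>>>(d_data, n);
//   cudaDeviceSynchronize(); // flushes the device-side printf buffer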
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | AmirSojoodi |
| Solution 2 | |

