The proper way to do multi-block CUDA synchronization

I would like to synchronize multiple CUDA blocks. This is not per-block synchronization, which __syncthreads() can easily handle.

I saw there are existing discussions on this topic, for example cuda block synchronization, and I like the simple solution brought up by @johan, https://stackoverflow.com/a/67252761/3188690, which essentially uses a 64-bit counter to track the synchronized blocks.

However, I wrote the following code to accomplish a similar job and ran into a problem. Here I use the term environment, meaning that the wkBlocksPerEnv blocks within an environment shall be synchronized. There is a counter, and I use atomicAdd() to count how many blocks have reached the synchronization point; once the number of synced blocks == wkBlocksPerEnv, I know all blocks have finished syncing and are free to go. However, it has a strange outcome that I am not sure about.

The problem comes from this while loop. Since the first thread of every block does the atomicAdd, there is a while loop that spins until the condition is met. But I find that some blocks get stuck in an endless loop, and I am not sure why the condition is never met eventually. And if I printf some messages either at *** I can print here 1 or *** I can print here 2, there is no endless loop and everything works perfectly. I do not see anything obvious.

const int wkBlocksPerEnv = 2;

// (wkNumberEnvs is defined elsewhere)
__device__ int env_sync_block_count[wkNumberEnvs];

__device__ void syncthreads_for_env(){
    // sync threads within this block so all of its threads have finished the previous tasks
    __syncthreads();

    // sync the wkBlocksPerEnv blocks of each environment
    if (wkBlocksPerEnv > 1){
        const int kThisEnvId = get_env_scope_block_id(blockIdx.x);

        if (threadIdx.x == 0){
            // increment env_sync_block_count by 1
            atomicAdd(&env_sync_block_count[kThisEnvId], 1);
            // *** I can print here 1
            while (env_sync_block_count[kThisEnvId] != wkBlocksPerEnv){
                // *** I can print here 2
            }
        }
    }

    // Do the next job ...
}


Solution 1:[1]

The atomic add goes to global memory, but in the while-loop you read the counter with a plain load, which can be served from a cache that is not automatically kept coherent between blocks (coherence is only guaranteed through explicit mechanisms such as __threadfence(), volatile accesses, or atomics). The thread that performed the update sees its own write, but threads in other blocks may not.

Even if you use __threadfence(), threads in the same warp could deadlock, waiting forever, if they were the first to check the value before any other thread updated it. It should work on newer GPUs that support independent thread scheduling (Volta and later).
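Putting those two points together, one possible repair is to publish this block's prior writes with __threadfence() before signalling arrival, and to spin on an atomic read so the load cannot be satisfied from a stale cache. A minimal sketch, reusing the question's names (wkNumberEnvs, wkBlocksPerEnv, and the asker's get_env_scope_block_id helper are assumed to be defined as in the question) and assuming all wkBlocksPerEnv blocks of an environment are resident on the GPU at the same time:

```cuda
__device__ int env_sync_block_count[wkNumberEnvs];

__device__ void syncthreads_for_env()
{
    __syncthreads();

    if (wkBlocksPerEnv > 1 && threadIdx.x == 0) {
        const int kThisEnvId = get_env_scope_block_id(blockIdx.x);

        // Make this block's prior global-memory writes visible
        // before signalling arrival.
        __threadfence();
        atomicAdd(&env_sync_block_count[kThisEnvId], 1);

        // Spin using an atomic read; atomicAdd(ptr, 0) is a common
        // idiom for an atomic load that bypasses stale cached values.
        while (atomicAdd(&env_sync_block_count[kThisEnvId], 0) != wkBlocksPerEnv) {
            // busy-wait; only safe if all wkBlocksPerEnv blocks are resident
        }
    }
    __syncthreads(); // release the remaining threads of this block
}
```

Note that the counter is never reset here, so this sketch only works for a single synchronization point per kernel launch; a reusable barrier needs extra bookkeeping.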

Solution 2:[2]

I like to do CUDA synchronization for multiple blocks.

You should learn to dis-like it. Synchronization is always costly, even when implemented just right, and inter-core synchronization all the more so.

if (threadIdx.x == 0){
    // incrementing env_sync_block_count by 1
    atomicAdd(&env_sync_block_count[kThisEnvId], 1);
    while(env_sync_block_count[kThisEnvId] != wkBlocksPerEnv)
    {
        // OH NO!!
    }
}

This is bad. With this code, the first warp of each block will perform repeated reads of env_sync_block_count[kThisEnvId]. First, and as @AbatorAbetor mentioned, you will face the problem of cache incoherence, causing your blocks to potentially read a stale value from a local cache long after the global value has changed.

Also, your blocks will hog the multiprocessors. Blocks will stay resident with at least one active warp, indefinitely. Who's to say they will be evicted from their multiprocessor so that additional blocks can be scheduled to execute? If I were the GPU, I wouldn't allow more and more active blocks to pile up. Even if you don't deadlock - you'll be wasting a lot of time.

Now, @AbatorAbetor's answer avoids the deadlock by limiting the grid size. And I guess that works. But unless you have a very good reason to write your kernels this way - the real solution is to just break up your algorithm into consecutive kernels (or better yet, figure out how to avoid the need to synchronize altogether).
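The consecutive-kernels approach relies on the fact that kernel launches in the same stream execute in order, and a kernel boundary makes all global-memory writes of the previous kernel visible to the next one - a free device-wide synchronization point. A sketch of the idea (the kernel names and the run helper are hypothetical, standing in for the work before and after the intended sync point):

```cuda
// Instead of synchronizing blocks inside one kernel, split the work:
__global__ void phase_one(float* data) { /* work before the sync point */ }
__global__ void phase_two(float* data) { /* work after the sync point  */ }

void run(float* d_data, int numEnvs, int blocksPerEnv, int threadsPerBlock)
{
    const int blocks = numEnvs * blocksPerEnv;
    phase_one<<<blocks, threadsPerBlock>>>(d_data);
    // No explicit sync needed: kernels in the same stream run in order,
    // and phase_one's global-memory writes are visible to phase_two.
    phase_two<<<blocks, threadsPerBlock>>>(d_data);
}
```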


A mid-way approach is to only have some blocks get past the synchronization point. You could do that by not waiting except on a condition that holds for a very limited number of blocks (say you had a single workgroup - then only the blocks which got the last K possible counter values wait).
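The extreme case of this idea (K = 1) is the classic "last block" pattern, where no block ever waits: each block signals its arrival and exits, and only the single block that observes the final counter value continues with the follow-up work. A sketch, assuming a single environment spanning the whole grid and a counter reset to zero before launch:

```cuda
__device__ unsigned int arrived_count = 0; // reset to 0 before each launch

__global__ void kernel_with_tail_work(float* data)
{
    // ... per-block work on data ...
    __syncthreads();

    __shared__ bool is_last_block;
    if (threadIdx.x == 0) {
        // Publish this block's writes, then take a ticket; the block that
        // draws the final value is the last one to arrive.
        __threadfence();
        is_last_block = (atomicAdd(&arrived_count, 1) == gridDim.x - 1);
    }
    __syncthreads();

    if (is_last_block) {
        // Only this block runs the follow-up work; nobody busy-waits.
    }
}
```

This is essentially the scheme used by the threadFenceReduction example in the CUDA samples.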

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 einpoklum