'Atomic addition to floating point values in OpenCL for NVIDIA GPUs?

The OpenCL 3.0 specification does not seem to have intrinsics/builtins for atomic addition to floating-point values, only for integral values (and that seems to have been the case in OpenCL 1.x and 2.x as well). CUDA, however, has offered floating-point atomics for a while now:

float atomicAdd(float* address, float val); // since Fermi
double atomicAdd(double* address, double val); // since Pascal
__half atomicAdd(__half *address, __half val); // ?

Naturally, any straightforward atomic operation can be simulated with compare-and-exchange, and this is available in OpenCL. But my questions are:

  1. Does NVIDIA expose floating-point atomics in OpenCL somehow? e.g. via a vendor extension? using pragmas? implicitly?
  2. Is there a more efficient mechanism than simulation with compare-exchange, which I could consider as a substitute for floating-point atomics? For NVIDIA GPUs or generally?


Solution 1:[1]

Native foating-point atomics are a much desired extension for OpenCL 3.0. As of right now, they are still not available.

  1. The only possible way would be to use inline PTX.
  2. No. The implementation with atomic compare-exchange for FP32 and FP64 is currently state-of-the-art and there is no known better way.

Solution 2:[2]

As @ProjectPhysX implied in their answer, when you compile OpenCL with NVIDIA's driver, it accepts inline PTX assembly (which is of course not at all part of OpenCL nor a recognized vendor extension). This lets you basically do anything CUDA offers you - in OpenCL; and that includes atomically adding to floating point values.

So, here are wrapper functions for atomically adding to single-precision (32-bit) floating point values in global and in local memory:

float atomic_add_float_global(__global float* p, float val)
{
    float prev;
    asm volatile(
        "atom.global.add.f32 %0, [%1], %2;" 
        : "=f"(prev) 
        : "l"(p) , "f"(val) 
        : "memory" 
    );
    return prev;
}

float atomic_add_float_local(__local float* p, float val)
{
    float prev;
    // Remember "local" in OpenCL means the same as "shared" in CUDA.
    asm volatile(
        "atom.shared.add.f32 %0, [%1], %2;"
        : "=f"(prev) 
        : "l"(p) , "f"(val) 
        : "memory" 
    );
    return prev;
}

One could also perhaps tweak it by checking whether the OpenCL driver is NVIDIA's, in which case the inline assembly is used, or non-NVIDIA, in which the atomic-compare-exchange implementation is used.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 ProjectPhysX
Solution 2