'Atomic addition to floating point values in OpenCL for NVIDIA GPUs?
The OpenCL 3.0 specification does not seem to have intrinsics/builtins for atomic addition to floating-point values, only for integral values (and that seems to have been the case in OpenCL 1.x and 2.x as well). CUDA, however, has offered floating-point atomics for a while now:
float atomicAdd(float* address, float val); // since Fermi
double atomicAdd(double* address, double val); // since Pascal
__half atomicAdd(__half *address, __half val); // ?
Naturally, any straightforward atomic operation can be simulated with compare-and-exchange, and this is available in OpenCL. But my questions are:
- Does NVIDIA expose floating-point atomics in OpenCL somehow? e.g. via a vendor extension? using pragmas? implicitly?
- Is there a more efficient mechanism than simulation with compare-exchange, which I could consider as a substitute for floating-point atomics? For NVIDIA GPUs or generally?
Solution 1:[1]
Native foating-point atomics are a much desired extension for OpenCL 3.0. As of right now, they are still not available.
- The only possible way would be to use inline PTX.
- No. The implementation with atomic compare-exchange for FP32 and FP64 is currently state-of-the-art and there is no known better way.
Solution 2:[2]
As @ProjectPhysX implied in their answer, when you compile OpenCL with NVIDIA's driver, it accepts inline PTX assembly (which is of course not at all part of OpenCL nor a recognized vendor extension). This lets you basically do anything CUDA offers you - in OpenCL; and that includes atomically adding to floating point values.
So, here are wrapper functions for atomically adding to single-precision (32-bit) floating point values in global and in local memory:
float atomic_add_float_global(__global float* p, float val)
{
float prev;
asm volatile(
"atom.global.add.f32 %0, [%1], %2;"
: "=f"(prev)
: "l"(p) , "f"(val)
: "memory"
);
return prev;
}
float atomic_add_float_local(__local float* p, float val)
{
float prev;
// Remember "local" in OpenCL means the same as "shared" in CUDA.
asm volatile(
"atom.shared.add.f32 %0, [%1], %2;"
: "=f"(prev)
: "l"(p) , "f"(val)
: "memory"
);
return prev;
}
One could also perhaps tweak it by checking whether the OpenCL driver is NVIDIA's, in which case the inline assembly is used, or non-NVIDIA, in which the atomic-compare-exchange implementation is used.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ProjectPhysX |
| Solution 2 |
