OpenCL: trying to use a semaphore crashes the drivers


While writing a simple OpenCL kernel I tried to use semaphores, and it crashed my GPU drivers (AMD 12.10). After checking out examples I found out that the crash happens only when the local work size is not equal to 1. This code is taken from the example:

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
    #pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
    #pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
    #pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable

    void GetSemaphor(__global int * semaphor)
    {
      int occupied = atom_xchg(semaphor, 1);
      while(occupied > 0)
      {
          occupied = atom_xchg(semaphor, 1);
      }
    }

    void ReleaseSemaphor(__global int * semaphor)
    {
       int prevVal = atom_xchg(semaphor, 0);
    }

    __kernel void kernelNoAtomInc(__global int * num,
                __global int * semaphor)
    {
      int i = get_global_id(0);
      GetSemaphor(&semaphor[0]);
      {
        num[0]++;
      }
      ReleaseSemaphor(&semaphor[0]);
    }

In the example the author uses

    CQ.Execute(kernelNoAtomInc, null, new long[1] { N }, new long[1] { 1 }, null);

where N is the global work size and the local work size is 1. Now, if I change that 1 to null, 2, 4, or any other number I tried, the AMD drivers crash:

    CQ.Execute(kernelNoAtomInc, null, new long[1] { N }, new long[1] { 2 }, null);

I do not have another PC to test on at the moment. However, it seems strange that the author deliberately left local_group_size at 1, which is why I think I am missing something here. Can someone please explain this to me? Also, as far as I understand it, leaving local_group_size at 1 will hurt performance greatly, won't it? Thanks.

Host: Win8 x64, HD6870



Solution 1:[1]

Your problem is not reproducible for me, and I furthermore cannot find your source from the link, but here are a few ideas on why it could crash, which should still be helpful nine years later.

It probably crashes because...

  • ... the driver thinks you want the local version of the atom_xchg() function to be executed, when you actually want the global one.
  • ... your spin loop slows down execution of the kernel so drastically on an old machine that it exceeds an internal limit on execution time, causing the driver to terminate the kernel.

What I can suggest as a possible fix:

  • do not enable the local version of the atomic extensions in your kernel (a sketch of this follows after the list)
  • try running it on the CPU
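
As a minimal sketch of the first suggestion, assuming the same helper functions as in the question: only the global atomics extensions are enabled, so for a __global int * argument the global atom_xchg() is the only candidate.

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
    #pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable

    // Spin until the previous value was 0, i.e. until we acquired the lock.
    void GetSemaphor(__global int * semaphor)
    {
      while(atom_xchg(semaphor, 1) > 0)
      {
      }
    }

    // Write 0 back to release the lock.
    void ReleaseSemaphor(__global int * semaphor)
    {
      atom_xchg(semaphor, 0);
    }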

There is no way to fix this with certainty unless we could access your computer and debug on it.


You were also asking why the author chose a local_group_size of one. This is because the global work size needs to be divisible by the local work size, such that the division results in a natural number. Dividing a natural number by one always results in a natural number, therefore this is perfect for experimenting. For example, a global size of 1000 works with a local size of 2 (1000 / 2 = 500), but not with a local size of 3, in which case the enqueue call fails with CL_INVALID_WORK_GROUP_SIZE. You are completely correct in saying that it will affect performance greatly. (So maybe the maths simply didn't add up, and the kernel did not even start rather than crash.)
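
To make the divisibility point concrete with the plain C host API (the question uses a C# wrapper; the names and the value 1000 below are purely illustrative): under OpenCL 1.x, clEnqueueNDRangeKernel rejects a local work size that does not evenly divide the global work size.

    #include <CL/cl.h>
    #include <stdio.h>

    /* Assumes 'queue' and 'kernel' were created elsewhere. */
    void enqueue_example(cl_command_queue queue, cl_kernel kernel)
    {
        size_t global = 1000;  /* total number of work-items */
        size_t local  = 2;     /* 1000 % 2 == 0 -> accepted */
        /* size_t local = 3;      1000 % 3 != 0 -> CL_INVALID_WORK_GROUP_SIZE */

        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                            &global, &local, 0, NULL, NULL);
        if (err != CL_SUCCESS)
            printf("clEnqueueNDRangeKernel failed: %d\n", err);
    }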


A few further notes:

  • To make the increment functionally correct, you should use atom_inc() on your num buffer. I don't see how this could lead to a crash, but it definitely makes your program not work as intended (see the first sketch below).
  • I would go and use the atomic functions from the OpenCL 2.0 standard, since they already feature semaphore-like functions: bool atomic_flag_test_and_set(volatile atomic_flag *object) and void atomic_flag_clear(volatile atomic_flag *object) (see the second sketch below).
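
For the first note, a minimal sketch assuming the same num buffer as in the question; with atom_inc() the increment itself is atomic, so no semaphore is needed just to count:

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

    __kernel void kernelAtomInc(__global int * num)
    {
      atom_inc(&num[0]);  // atomic read-modify-write on the global counter
    }

For the second note, a rough sketch of how the OpenCL 2.0 functions could be used (this needs a device and a program built with -cl-std=CL2.0; the kernel and argument names are made up, and the host must initialize the flag buffer to the clear state):

    __kernel void kernelFlagLock(__global int * num,
                                 __global atomic_flag * lock)
    {
      // Spin until the flag was previously clear, i.e. until we hold the lock.
      while(atomic_flag_test_and_set(lock))
      {
      }
      num[0]++;                  // critical section
      atomic_flag_clear(lock);   // release the lock
    }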

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
