How do vector_length and num_workers work in an OpenACC routine?

When using an OpenACC "#pragma acc routine worker" routine that contains multiple loops with vector (and worker) level parallelism, how do vector_length and num_workers work?

I played around with some code (see below) and stumbled upon a few things:

  1. Setting the vector length of these loops seriously confuses me. Using the vector_length(#) clause on the outer parallel region behaves strangely when I compare run times: when I increase the vector length to huge numbers, e.g. 4096, the run time actually gets smaller. In my understanding, a huge number of threads should lie dormant when the vector loop has only 10 iterations. Am I doing something wrong here? (A stripped-down sketch of the pattern follows right after this list.)
  2. I noticed that the output weirdly depends on the number of workers in foo(): if it is 16 or smaller, the output is "correct"; if it is 32 or even much larger, the loops inside the worker routine somehow get executed twice. What am I missing here?
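
To make the first point concrete, here is the pattern boiled down to a self-contained sketch. It is not my actual code (that follows further below); sum_routine, W and V are just placeholder names for the routine and the clause values I varied:

#include <cstdio>

// W and V are placeholders for the num_workers/vector_length values I varied.
#define W 16
#define V 32

#pragma acc routine worker
void sum_routine(const int* data, int n, int* result){
    int acc = 0;
    // a 10-iteration vector loop, like the first loop in some_worker below
    #pragma acc loop vector reduction(+: acc)
    for(int j = 0; j < n; ++j){
        acc += data[j];
    }
    #pragma acc atomic update
    *result += acc;
}

int main(){
    const int n = 10, iters = 1000;
    int data[n];
    for(int j = 0; j < n; ++j) data[j] = 1;
    int result = 0;

    // one gang per i; W workers and V vector lanes are available inside the routine
    #pragma acc parallel loop gang num_workers(W) vector_length(V) copyin(data[0:n]) copy(result)
    for(int i = 0; i < iters; ++i){
        sum_routine(data, n, &result);
    }
    printf("%d (expected %d)\n", result, n * iters); // 10000 if each call sums exactly once
    return 0;
}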

Can someone give me a hand with the OpenACC routine clause? Many thanks in advance.


Here is the example code:

#include <iostream>
#include <chrono>

class A{
public:
    int out;
    int* some_array;
    A(){
        some_array = new int[1000*100*10];
        for(int i = 0; i < 1000*100*10; ++i){
            some_array[i] = 1;
        }
        #pragma acc enter data copyin(this, some_array[0:1000*100*10])
    };
    
    ~A(){ 
        #pragma acc exit data delete(some_array, this)
        delete [] some_array;
    }
    
    #pragma acc routine worker
    void some_worker(int i){
        int private_out = 10;
        #pragma acc loop vector reduction(+: private_out)
        for(int j=0; j < 10; ++j){
            //do some stuff
            private_out -= some_array[j];
        }
        #pragma acc loop reduction(+: private_out) worker
        for(int j=0; j < 100; ++j){
            #pragma acc loop reduction(+: private_out) vector
            for(int k=0; k < 10; ++k){
                //do some other stuff
                private_out += some_array[k+j*10+i*10*100];
            }
        }
        #pragma acc atomic update
        out += private_out;
    }
    
    void foo(){
        #pragma acc data present(this, some_array[0:1000*100*10]) pcreate(out)
        {
            #pragma acc serial
            out=0;
            //#######################################################
            //# setting num_workers and vector_length produce weird #
            //# results and runtimes                                #
            //#######################################################
            #pragma acc parallel loop gang num_workers(64) vector_length(4096)
            for(int i=0; i < 1000; ++i){
                some_worker(i);
            }
            #pragma acc update host(out)
        }
    }
};

int main() {
    using namespace std::chrono;
    A a;
    auto start = high_resolution_clock::now();
    a.foo();
    auto stop = high_resolution_clock::now();
    std::cout << a.out << std::endl
              << "took " << duration_cast<microseconds>(stop - start).count() << "ms" << std::endl;
    //output for num_workers(16) vector_length(4096)
    //1000000
    //took 844us
    //
    //output for num_workers(16) vector_length(2)
    //1000000
    //took 1145us
    //
    //output for num_workers(32) vector_length(2)
    //1990000
    //took 1480us
    //
    //output for num_workers(64) vector_length(1)
    //1990000
    //took 502us
    //
    //output for num_workers(64) vector_length(4096)
    //1000000
    //took 853us
    return 0;
}
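
For reference, 1000000 is what I consider the "correct" output: some_array is filled with ones, so in each some_worker(i) call private_out starts at 10, the first loop subtracts 10 * 1, and the second loop adds 100 * 10 * 1 = 1000. Every call should therefore contribute 10 - 10 + 1000 = 1000 to out, and 1000 calls give 1000000.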

Machine specs: nvc++ 21.3-0 with OpenACC 2.7, Tesla K20c with cc35, NVIDIA-driver 470.103.01 with CUDA 11.4


Edit:

Additional information for 2.:

I simply used some printfs in the worker to look at the intermediate results, placed at the implicit barriers between the loops. I could see that the value of private_out went from initially 10

  • to -10 instead of 0 between the loops and
  • to 1990 instead of 1000 after the second loop.

This just looks to me like both loops are being executed twice.
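
Concretely, the printfs sat roughly like this (a sketch, not my exact code; the i == 0 guard and the format strings are only there to keep the output readable, and #include <cstdio> has to be added at the top):

#pragma acc routine worker
void some_worker(int i){
    int private_out = 10;
    #pragma acc loop vector reduction(+: private_out)
    for(int j=0; j < 10; ++j){
        private_out -= some_array[j];
    }
    // expected 0 here; with num_workers(32) or larger I see -10
    if(i == 0) printf("after first loop:  %d\n", private_out);
    #pragma acc loop reduction(+: private_out) worker
    for(int j=0; j < 100; ++j){
        #pragma acc loop reduction(+: private_out) vector
        for(int k=0; k < 10; ++k){
            private_out += some_array[k+j*10+i*10*100];
        }
    }
    // expected 1000 here; with num_workers(32) or larger I see 1990
    if(i == 0) printf("after second loop: %d\n", private_out);
    #pragma acc atomic update
    out += private_out;
}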

More results for convenience

To add to the strangeness of this example: the code does not compile for some combinations of num_workers/vector_length, e.g. leaving num_workers at 64 and setting vector_length to 2, 4, 8, 16, or even 32 (which pushes the thread count over the limit of 1024). It gives the error message

ptxas error   : Entry function '_ZN1A14foo_298_gpu__1Ev' with max regcount of 32 calls function '_ZN1A11some_workerEi' with regcount of 41

However, after simply inserting the printfs described above, it suddenly compiles fine but then fails at run time with "call to cuLaunchKernel returned error 1: Invalid value".
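
As a side note, the 1024-threads-per-block limit mentioned above can be checked with a plain CUDA runtime query like the one below (this is just a sanity check, not part of the example code; it needs the CUDA runtime headers and library):

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    cudaDeviceProp prop;
    if(cudaGetDeviceProperties(&prop, 0) != cudaSuccess){
        std::fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }
    // On the Tesla K20c (cc35) this reports 1024, which
    // num_workers(64) x vector_length(32) = 2048 threads would exceed.
    std::printf("%s: maxThreadsPerBlock = %d\n", prop.name, prop.maxThreadsPerBlock);
    return 0;
}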

But strangest of all, it compiles and runs fine for 64/64 yet returns incorrect results. Below is the output of this setting with NV_ACC_TIME=1; note that the output is almost exactly the same for all configurations that compile and run, except for the block: [1x#-######] part.

Accelerator Kernel Timing data
/path/to/src/main.cpp
  _ZN1AC1Ev  NVIDIA  devicenum=0
    time(us): 665
    265: data region reached 1 time
        265: data copyin transfers: 3
             device time(us): total=665 max=650 min=4 avg=221
/path/to/src/main.cpp
  _ZN1AD1Ev  NVIDIA  devicenum=0
    time(us): 8
    269: data region reached 1 time
        269: data copyin transfers: 1
             device time(us): total=8 max=8 min=8 avg=8
/path/to/src/main.cpp
  _ZN1A3fooEv  NVIDIA  devicenum=0
    time(us): 1,243
    296: data region reached 2 times
    298: compute region reached 2 times
        298: kernel launched 2 times
            grid: [1-1000]  block: [1-32x1-24]
             device time(us): total=1,230 max=1,225 min=5 avg=615
            elapsed time(us): total=1,556 max=1,242 min=314 avg=778
    304: update directive reached 1 time
        304: data copyout transfers: 1
             device time(us): total=13 max=13 min=13 avg=13


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
