How do vector_length and num_workers work in an OpenACC routine?
When using an OpenACC "#pragma acc routine worker" routine that contains multiple loops of vector (and worker) level parallelism, how do vector_length and num_workers work?
I played around with some code (see below) and stumbled upon a few things:
1. Setting the vector length of these loops is seriously confusing me. Using the vector_length(#) clause on the outer parallel region seems to work weirdly when comparing run times. When I increase the vector length to huge numbers, say e.g. 4096, the run time actually gets smaller. In my understanding, a huge number of threads should lie dormant when there are only as many as 10 iterations in the vector loop. Am I doing something wrong here?
2. I noticed that the output weirdly depends on the number of workers in foo(). If it is 16 or smaller, the output is "correct". If it is 32 or even much larger, the loops inside the worker routine somehow get executed twice. What am I missing here?
Can someone give me a hand with the OpenACC routine clause? Many thanks in advance.
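Condensed to its skeleton (the full, runnable example follows below), the structure I am asking about is a gang loop whose num_workers/vector_length clauses sit on the outer parallel construct, while the routine itself only marks its loops as worker/vector:

#pragma acc routine worker
void some_worker(int i){
    int private_out = 10;
    #pragma acc loop vector reduction(+: private_out)
    for(int j = 0; j < 10; ++j){ /* ... */ }
    #pragma acc loop worker reduction(+: private_out)
    for(int j = 0; j < 100; ++j){
        #pragma acc loop vector reduction(+: private_out)
        for(int k = 0; k < 10; ++k){ /* ... */ }
    }
    #pragma acc atomic update
    out += private_out;
}

// caller: only here do num_workers/vector_length appear
#pragma acc parallel loop gang num_workers(64) vector_length(4096)
for(int i = 0; i < 1000; ++i){
    some_worker(i);
}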
Here is the example code:
#include <iostream>
#include <chrono>

class A{
public:
    int out;
    int* some_array;

    A(){
        some_array = new int[1000*100*10];
        for(int i = 0; i < 1000*100*10; ++i){
            some_array[i] = 1;
        }
        // create device copies of the object and its array
        #pragma acc enter data copyin(this, some_array[0:1000*100*10])
    }

    ~A(){
        #pragma acc exit data delete(some_array, this)
        delete [] some_array;
    }

    #pragma acc routine worker
    void some_worker(int i){
        int private_out = 10;

        // first loop: vector-level only; subtracts 10 ones, so private_out should drop from 10 to 0
        #pragma acc loop vector reduction(+: private_out)
        for(int j=0; j < 10; ++j){
            //do some stuff
            private_out -= some_array[j];
        }

        // second loop nest: worker x vector; adds 100*10 ones, so private_out should end at 1000
        #pragma acc loop reduction(+: private_out) worker
        for(int j=0; j < 100; ++j){
            #pragma acc loop reduction(+: private_out) vector
            for(int k=0; k < 10; ++k){
                //do some other stuff
                private_out += some_array[k+j*10+i*10*100];
            }
        }

        // accumulate this call's result into the shared member
        #pragma acc atomic update
        out += private_out;
    }

    void foo(){
        #pragma acc data present(this, some_array[0:1000*100*10]) pcreate(out)
        {
            #pragma acc serial
            out=0;

            //#######################################################
            //# setting num_workers and vector_length produce weird #
            //# results and runtimes                                #
            //#######################################################
            #pragma acc parallel loop gang num_workers(64) vector_length(4096)
            for(int i=0; i < 1000; ++i){
                some_worker(i);
            }

            #pragma acc update host(out)
        }
    }
};

int main() {
    using namespace std::chrono;
    A a;

    auto start = high_resolution_clock::now();
    a.foo();
    auto stop = high_resolution_clock::now();

    std::cout << a.out << std::endl
              << "took " << duration_cast<microseconds>(stop - start).count() << "ms" << std::endl;

    //output for num_workers(16) vector_length(4096)
    //1000000
    //took 844ms
    //
    //output for num_workers(16) vector_length(2)
    //1000000
    //took 1145ms
    //
    //output for num_workers(32) vector_length(2)
    //1990000
    //took 1480ms
    //
    //output for num_workers(64) vector_length(1)
    //1990000
    //took 502ms
    //
    //output for num_workers(64) vector_length(4096)
    //1000000
    //took 853ms

    return 0;
}
Machine specs: nvc++ 21.3-0 with OpenACC 2.7, Tesla K20c with cc35, NVIDIA-driver 470.103.01 with CUDA 11.4
Edit:
Additional information for point 2:
I simply used some printfs in the worker to look at the intermediate results; I placed them at the implicit barriers between the loops. I could see that the value of private_out went from initially 10
- to -10 instead of 0 between the loops, and
- to 1990 instead of 1000 after the second loop.
This just looks to me like both loops are being executed twice.
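The routine with the printfs added looked roughly like this (the format strings are illustrative, not verbatim; it additionally needs #include <cstdio>). As far as I understand, code outside the loops runs in worker-single/vector-single mode, so each printf should fire once per call:

#pragma acc routine worker
void some_worker(int i){
    int private_out = 10;

    #pragma acc loop vector reduction(+: private_out)
    for(int j=0; j < 10; ++j){
        private_out -= some_array[j];
    }

    // expected 0 here, but observed -10
    printf("i=%d between loops: private_out=%d\n", i, private_out);

    #pragma acc loop reduction(+: private_out) worker
    for(int j=0; j < 100; ++j){
        #pragma acc loop reduction(+: private_out) vector
        for(int k=0; k < 10; ++k){
            private_out += some_array[k+j*10+i*10*100];
        }
    }

    // expected 1000 here, but observed 1990
    printf("i=%d after loops: private_out=%d\n", i, private_out);

    #pragma acc atomic update
    out += private_out;
}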
More results for convenience
To add to the strangeness of this example: the code does not compile for some combinations of num_workers/vector_length, e.g. leaving num_workers at 64 and setting vector_length to 2, 4, 8, 16, or even 32 (which takes the thread count over the limit of 1024). It gives the error message
ptxas error : Entry function '_ZN1A14foo_298_gpu__1Ev' with max regcount of 32 calls function '_ZN1A11some_workerEi' with regcount of 41
However, after simply inserting the printfs described above, it suddenly compiles fine but fails at runtime with "call to cuLaunchKernel returned error 1: Invalid value".
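For reference, here are all the combinations mentioned in this post and what I observed with each (only the clauses on the outer parallel loop change; results collected from the comments in main() above):

#pragma acc parallel loop gang num_workers(16) vector_length(4096) // compiles, out == 1000000
#pragma acc parallel loop gang num_workers(16) vector_length(2)    // compiles, out == 1000000
#pragma acc parallel loop gang num_workers(32) vector_length(2)    // compiles, out == 1990000
#pragma acc parallel loop gang num_workers(64) vector_length(1)    // compiles, out == 1990000
#pragma acc parallel loop gang num_workers(64) vector_length(4096) // compiles, out == 1000000
#pragma acc parallel loop gang num_workers(64) vector_length(2)    // ptxas regcount error (same for 4, 8, 16, 32)
#pragma acc parallel loop gang num_workers(64) vector_length(64)   // compiles and runs, but result is wrong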
But the strangest part is that it compiles and runs fine for 64/64, yet returns incorrect results. Below is the output of this setting with NV_ACC_TIME=1; note that the output is almost exactly the same for all configurations that compile and run, except for the block: [1x#-######] part.
Accelerator Kernel Timing data
/path/to/src/main.cpp
_ZN1AC1Ev NVIDIA devicenum=0
time(us): 665
265: data region reached 1 time
265: data copyin transfers: 3
device time(us): total=665 max=650 min=4 avg=221
/path/to/src/main.cpp
_ZN1AD1Ev NVIDIA devicenum=0
time(us): 8
269: data region reached 1 time
269: data copyin transfers: 1
device time(us): total=8 max=8 min=8 avg=8
/path/to/src/main.cpp
_ZN1A3fooEv NVIDIA devicenum=0
time(us): 1,243
296: data region reached 2 times
298: compute region reached 2 times
298: kernel launched 2 times
grid: [1-1000] block: [1-32x1-24]
device time(us): total=1,230 max=1,225 min=5 avg=615
elapsed time(us): total=1,556 max=1,242 min=314 avg=778
304: update directive reached 1 time
304: data copyout transfers: 1
device time(us): total=13 max=13 min=13 avg=13
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow