How can I control the data placement and work execution of a TBB program on a NUMA CPU?
I am trying to create a parallel application with TBB's high-level NUMA functionality that distributes the data and work across a machine with 2 NUMA nodes. To better compartmentalize the questions I am having about TBB's high-level NUMA support, I have provided a self-contained example below, in which the worker threads store half of each vector in their local memory (relying on the first-touch page-placement policy) through tbb::parallel_for with the static_partitioner. Afterwards, a task_arena is created for each NUMA node to run the vector-addition tasks. For test purposes I let both nodes perform the whole addition and store the result in distinct arrays. When the addition is measured separately on the two halves of the vectors for each NUMA node, one half consistently takes longer than the other, as can be seen in the results further below.
#include <iostream>
#include <vector>
#include <tbb/tbb.h>

typedef double value_type;

int main(int argc, char* argv[]){
    std::size_t size = 1000000000;
    value_type * A = (value_type *) malloc(sizeof(value_type)*size);
    value_type * B = (value_type *) malloc(sizeof(value_type)*size);
    value_type * C = (value_type *) malloc(sizeof(value_type)*size);
    value_type * D = (value_type *) malloc(sizeof(value_type)*size);
    value_type ** SUMVEC = (value_type **) malloc(sizeof(value_type*)*2);
    SUMVEC[0] = C;
    SUMVEC[1] = D;

    // First-touch initialization: each worker thread writes (and thereby
    // places) the pages of the range assigned to it by the static_partitioner.
    tbb::parallel_for(static_cast<std::size_t>(0), size,
        [&A, &B, &SUMVEC](std::size_t i)
        {
            A[i] = B[i] = 1;
            SUMVEC[0][i] = SUMVEC[1][i] = 0;
        },
        tbb::static_partitioner());

    // One task_arena constrained to each NUMA node.
    std::vector<int> numa_indexes = tbb::info::numa_nodes();
    std::vector<tbb::task_arena> arenas(numa_indexes.size());
    for(unsigned j = 0; j < numa_indexes.size(); j++){
        arenas[j].initialize(tbb::task_arena::constraints(numa_indexes[j]));
    }

    for(int i : numa_indexes){
        std::cout << "NUMA NODE: " << i << std::endl;
    }

    // Each node performs the whole addition; the two halves are timed separately.
    std::vector<double> time1 = {0,0};
    std::vector<double> time2 = {0,0};
    for(unsigned j = 0; j < numa_indexes.size(); j++){
        arenas[j].execute(
            [&A, &B, &SUMVEC, &size, &j, &time1, &time2](){
                tbb::tick_count t0 = tbb::tick_count::now();
                for(std::size_t i = 0; i < size/2; i++){
                    SUMVEC[j][i] = A[i] + B[i];
                }
                tbb::tick_count t1 = tbb::tick_count::now();
                time1[j] = (t1-t0).seconds();

                t0 = tbb::tick_count::now();
                for(std::size_t i = size/2; i < size; i++){
                    SUMVEC[j][i] = A[i] + B[i];
                }
                t1 = tbb::tick_count::now();
                time2[j] = (t1-t0).seconds();
            });
    }

    for(int i : numa_indexes){
        std::cout << "NUMA NODE " << i << " Time 1: " << time1[i] << std::endl;
        std::cout << "NUMA NODE " << i << " Time 2: " << time2[i] << std::endl;
    }

    free(A); free(B); free(C); free(D); free(SUMVEC);
}
First run:

NUMA NODE: 0
NUMA NODE: 1
NUMA NODE 0 Time 1: 2.20939
NUMA NODE 0 Time 2: 0.985955
NUMA NODE 1 Time 1: 0.999695
NUMA NODE 1 Time 2: 2.11766

Second run:

NUMA NODE: 0
NUMA NODE: 1
NUMA NODE 0 Time 1: 0.987052
NUMA NODE 0 Time 2: 2.35013
NUMA NODE 1 Time 1: 2.02715
NUMA NODE 1 Time 2: 0.991963
My questions would be:

1. As you can see from the results, it is not guaranteed that the first half of a vector is stored in the first NUMA node. Would using the task_arena for the data initialization step guarantee that, if the program is run on a CPU with 2 NUMA nodes?
2. If I omit the task_arena construct and parallelize both the data initialization and the vector addition step through parallel_for with the static_partitioner, is it guaranteed that in the second parallel_for a task operating on a certain section of the vector is sent to the thread that wrote that section to its local memory, or at least to another thread working on the same NUMA node?
3. Does the task_arena section have an implicit barrier? If I were to create another one after it, would all threads working on the previous one have to finish before the next section is run?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow