How can I control the data placement and work execution of a TBB program on a NUMA CPU?

I am trying to build a parallel application with TBB's high-level NUMA facilities that distributes data and work across a machine with 2 NUMA nodes. To narrow down my questions about TBB's high-level NUMA support, I put together the self-contained example below. The worker threads first store half of each vector in their local memory via tbb::parallel_for with a static_partitioner. Afterwards, one task_arena is created per NUMA node and both arenas are given the vector-addition work. For test purposes, each node performs the whole addition and stores its result in a separate array. When I measure the addition separately for the two halves of the vector on each NUMA node, one half consistently takes noticeably longer than the other, as the results further below show.

#include <cstdlib>
#include <iostream>
#include <vector>
#include <tbb/tbb.h>


typedef double value_type;

int main(int argc, char* argv[]){

    std::size_t size = 1000000000;
    value_type pulse = 0.2;

    value_type * A = (value_type *) malloc(sizeof(value_type)*size);
    value_type * B = (value_type *) malloc(sizeof(value_type)*size);
    value_type * C = (value_type *) malloc(sizeof(value_type)*size);
    value_type * D = (value_type *) malloc(sizeof(value_type)*size);

    value_type ** SUMVEC = (value_type **) malloc(sizeof(value_type*)*2);
    SUMVEC[0] = C;
    SUMVEC[1] = D;

    // First-touch initialization: each worker thread writes (and thereby
    // places) the iteration range that the static_partitioner assigns to it.
    tbb::parallel_for(static_cast<std::size_t>(0), size,
      [&A, &B, &SUMVEC](std::size_t i)
    {
        A[i] = B[i] = 1;
        SUMVEC[0][i] = SUMVEC[1][i] = 0;
    },
    tbb::static_partitioner());

    std::vector<int> numa_indexes = tbb::info::numa_nodes();
    std::vector<tbb::task_arena> arenas(numa_indexes.size());
    for(unsigned j = 0; j < numa_indexes.size(); j++){
        arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
    }
    for(int i : numa_indexes){
        std::cout << "NUMA NODE:  " << i << std::endl;
    }

    std::vector<double> time1(numa_indexes.size(), 0);
    std::vector<double> time2(numa_indexes.size(), 0);

    for(unsigned j = 0; j < numa_indexes.size(); j++){
        arenas[j].execute(
            [&A, &B, &SUMVEC, &size, &j, &time1, &time2](){
                tbb::tick_count t0 = tbb::tick_count::now();
                for(std::size_t i = 0; i<size/2; i++){
                    SUMVEC[j][i] = A[i] + B[i];
                }
                tbb::tick_count t1 = tbb::tick_count::now();
                time1[j] = (t1-t0).seconds();
                t0 = tbb::tick_count::now();
                for(std::size_t i = size/2; i<size; i++){
                    SUMVEC[j][i] = A[i] + B[i];
                }
                t1 = tbb::tick_count::now();
                time2[j] = (t1-t0).seconds();
        });
    }

    for(unsigned j = 0; j < numa_indexes.size(); j++){
        std::cout << "NUMA NODE "<< numa_indexes[j] <<" Time 1: " << time1[j] << std::endl;
        std::cout << "NUMA NODE "<< numa_indexes[j] <<" Time 2: " << time2[j] << std::endl;
    }

    free(A); free(B); free(C); free(D);
    free(SUMVEC);
}
Output of two runs:

NUMA NODE:  0
NUMA NODE:  1
NUMA NODE 0 Time 1: 2.20939
NUMA NODE 0 Time 2: 0.985955
NUMA NODE 1 Time 1: 0.999695
NUMA NODE 1 Time 2: 2.11766
NUMA NODE:  0
NUMA NODE:  1
NUMA NODE 0 Time 1: 0.987052
NUMA NODE 0 Time 2: 2.35013
NUMA NODE 1 Time 1: 2.02715
NUMA NODE 1 Time 2: 0.991963

My questions are:

  1. As the results show, it is not guaranteed that the first half of a vector ends up on the first NUMA node. Would performing the data initialization inside the task_arenas (see the first sketch after this list) guarantee that, if the program runs on a CPU with 2 NUMA nodes?

  2. If I omit the task_arena construct and parallelize both the data initialization and the vector addition with parallel_for and the static_partitioner (second sketch below), is it guaranteed that in the second parallel_for a task operating on a given section of the vector is scheduled onto the thread that wrote that section into its local memory, or at least onto another thread running on the same NUMA node?

  3. Does the task_arena section have an implicit barrier? If I created another such section after it (third sketch below), would all threads working on the previous one have to finish before the next section runs?
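For question 1, this is the kind of initialization I have in mind: each part of the arrays is touched from inside the arena that is constrained to the corresponding NUMA node, under the assumption that the OS uses a first-touch page allocation policy (which I believe is the Linux default). It is only a sketch that reuses the variables of the program above, not tested code:

    // Sketch for question 1 (untested): initialize each part from inside the
    // arena pinned to the corresponding NUMA node, so that, assuming a
    // first-touch allocation policy, the pages of that part land on that node.
    for(unsigned j = 0; j < numa_indexes.size(); j++){
        arenas[j].execute([&, j](){
            std::size_t chunk = size / numa_indexes.size();
            std::size_t begin = j * chunk;
            std::size_t end = (j + 1 == numa_indexes.size()) ? size : begin + chunk;
            tbb::parallel_for(begin, end,
                [&](std::size_t i){
                    A[i] = B[i] = 1;
                    SUMVEC[0][i] = SUMVEC[1][i] = 0;
                },
                tbb::static_partitioner());
        });
    }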
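For question 2, the variant I mean looks like this (again just a sketch, using C as the only result array):

    // Sketch for question 2 (untested): no explicit task_arenas; both phases
    // are plain parallel_for calls with a static_partitioner. The question is
    // whether the second call reuses the first call's thread/range mapping.
    tbb::parallel_for(static_cast<std::size_t>(0), size,
        [&](std::size_t i){
            A[i] = B[i] = 1;
            C[i] = 0;
        },
        tbb::static_partitioner());

    tbb::parallel_for(static_cast<std::size_t>(0), size,
        [&](std::size_t i){
            C[i] = A[i] + B[i];
        },
        tbb::static_partitioner());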
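For question 3, what I would otherwise try in order to run both arenas concurrently and synchronize explicitly is one task_group per arena, along these lines (sketch, not verified):

    // Sketch for question 3 (untested): execute() only returns after the
    // functor returns, but here the functor merely spawns a task via run(),
    // so both arenas can work concurrently; the second loop then waits for
    // each arena's task_group explicitly.
    std::vector<tbb::task_group> groups(numa_indexes.size());

    for(unsigned j = 0; j < numa_indexes.size(); j++){
        arenas[j].execute([&, j](){
            groups[j].run([&, j](){
                for(std::size_t i = 0; i < size; i++)
                    SUMVEC[j][i] = A[i] + B[i];
            });
        });
    }

    for(unsigned j = 0; j < numa_indexes.size(); j++){
        arenas[j].execute([&, j](){ groups[j].wait(); });
    }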


