Optimization for parallel reduction on a multi-dimensional vector? [duplicate]

I am looking into potential ways to optimize my kernel code. Mark Harris's blog gives a good example of parallel reduction on a 1-D vector. How can I parallelize the code for multi-dimensional data?

For example, I have two rows of data, and I want the average value of each row. This pseudocode describes what I want to do:

tensor res({data.size[0]});

for (int i = 0; i < data.size[0]; i++) {
    float tmp = 0.0f;
    for (int j = 0; j < data.size[1]; j++) {
        // accumulate the sum of the i-th row
        tmp += data.at(i, j);
    }
    // average value of the i-th row
    res.at(i) = tmp / float(data.size[1]);
}

For the inner loop, I can easily adapt the blog's method to parallelize the execution. Is there any suggestion for optimizing the outer loop, so that the computation is parallelized across multiple rows as well?
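A common way to handle both loops at once (this is a sketch of the standard technique, not code from the post: the kernel name `rowMeanKernel`, the dense row-major layout, and the block size of 256 are my own illustrative assumptions) is to map the outer loop onto the grid — one block per row — and run the shared-memory tree reduction from Harris's slides inside each block for the inner loop:

```cuda
#include <cuda_runtime.h>

// One block per row: blockIdx.x plays the role of the outer loop index i.
// Each block reduces its row with the classic shared-memory tree reduction,
// then thread 0 writes the row mean. blockDim.x must be a power of two.
__global__ void rowMeanKernel(const float *data, float *res,
                              int rows, int cols) {
    extern __shared__ float sdata[];
    int row = blockIdx.x;
    int tid = threadIdx.x;

    // Strided load: each thread accumulates a partial sum of its row,
    // so rows longer than the block size are still covered.
    float sum = 0.0f;
    for (int j = tid; j < cols; j += blockDim.x)
        sum += data[row * cols + j];
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory (the parallelized inner loop).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        res[row] = sdata[0] / float(cols);
}

// Launch: one block per row, e.g. 256 threads per block,
// dynamic shared memory sized to one float per thread.
// rowMeanKernel<<<rows, 256, 256 * sizeof(float)>>>(d_data, d_res, rows, cols);
```

Because the rows are independent, no synchronization is needed between blocks; the grid dimension alone parallelizes the outer loop, and the same idea extends to higher-rank tensors by flattening all non-reduced dimensions into the grid.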



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
