Optimizing a matrix transpose function with OpenMP

I have this code that transposes a matrix using loop tiling strategy.

void transposer(int n, int m, double *dst, const double *src) {
   int blocksize = 32;  /* tile size; assumed to divide n and m evenly */
   for (int i = 0; i < n; i += blocksize) {
      for (int j = 0; j < m; j += blocksize) {
          // transpose the block beginning at [i,j]
          for (int k = i; k < i + blocksize; ++k) {
              for (int l = j; l < j + blocksize; ++l) {
                  dst[k + l*n] = src[l + k*m];
               }
          }
      }
   }
}

I want to optimize this with multi-threading using OpenMP, but I am not sure what to do with so many nested for loops. I thought about just adding #pragma omp parallel for, but doesn't that only parallelize the outer loop?



Solution 1:[1]

You can use the collapse clause to parallelize over two loops.

#   pragma omp parallel for collapse(2) 
    for (int i = 0; i < n; i += blocksize) {
        for (int j = 0; j < m; j += blocksize) {
            // transpose the block beginning at [i,j]
            for (int k = i; k < i + blocksize; ++k) {
                for (int l = j; l < j + blocksize; ++l) {
                    dst[k + l*n] = src[l + k*m];
                }
            }
        }
    }

As a side note, I think you should swap the two innermost loops. When you have a choice between reading sequentially and writing sequentially, making the writes sequential is usually more important for performance.
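For illustration, here is one way that swap could look (a sketch, not the original answer's code): with l as the outer and k as the inner of the two block loops, consecutive inner iterations write to consecutive dst elements. It assumes blocksize divides n and m evenly.

```c
/* Tiled transpose of an n-by-m row-major matrix src into the m-by-n
 * matrix dst. The innermost loop now runs over k, so consecutive
 * iterations write to consecutive dst elements (dst[k + l*n]).
 * Assumes blocksize divides both n and m evenly. */
void transposer_swapped(int n, int m, double *dst, const double *src) {
    int blocksize = 4;  /* tile size; use something like 32 for large matrices */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i += blocksize) {
        for (int j = 0; j < m; j += blocksize) {
            for (int l = j; l < j + blocksize; ++l) {      /* outer: strided reads */
                for (int k = i; k < i + blocksize; ++k) {  /* inner: sequential writes */
                    dst[k + l*n] = src[l + k*m];
                }
            }
        }
    }
}
```

Without -fopenmp the pragma is simply ignored and the function runs sequentially, so the change is easy to benchmark in isolation.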

Solution 2:[2]

I thought about just adding #pragma omp parallel for but doesn't this just parallelize the outer loop?

Yes. To parallelize multiple nested loops one can use OpenMP's collapse clause. Bear in mind, however, that:

  • (As pointed out by Victor Eijkhout.) Even though it does not directly apply to your code snippet, each additional loop you parallelize deserves its own reasoning about the race conditions that parallelization might introduce — for example, different threads writing concurrently to the same dst position.

  • In some cases, parallelizing nested loops can result in slower execution than parallelizing a single loop. The implementation of the collapse clause uses a more complex heuristic than simple loop parallelization to divide the iterations among threads, and that can introduce an overhead higher than the gains it provides.
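As a minimal illustration of the first point (a hypothetical example, not taken from the question's code): parallelizing a loop whose iterations all update the same variable introduces a data race unless the update is protected, e.g. with a reduction clause.

```c
/* Summing an array: without the reduction clause, all threads would
 * update `sum` concurrently -- a classic data race. The reduction
 * clause gives each thread a private copy and combines them safely. */
double parallel_sum(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += a[i];  /* race-free thanks to reduction(+:sum) */
    }
    return sum;
}
```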

You should benchmark with a single parallel loop, then with two, and so on, and compare the results.

void transposer(int n, int m, double *dst, const double *src) {
    int blocksize = 32;  /* tile size; assumed to divide n and m evenly */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i += blocksize)
       for (int j = 0; j < m; j += blocksize)
           for (int k = i; k < i + blocksize; ++k)
               for (int l = j; l < j + blocksize; ++l)
                  dst[k + l*n] = src[l + k*m];
}

Depending on the number of threads, cores, and matrix sizes, among other factors, the sequential version may actually be faster than the parallel ones. This is especially true because your code is not very CPU-intensive — the work per iteration is a single copy (dst[k + l*n] = src[l + k*m];), so the loop is mostly memory-bound.
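For the benchmarking suggested above, a simple harness around omp_get_wtime works (a sketch with a clock() fallback so it also builds without OpenMP; the function names are placeholders for whichever variants you compare):

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
#endif

/* collapse(2) variant under test; assumes blocksize divides n and m evenly */
static void transpose_collapse2(int n, int m, double *dst, const double *src) {
    int blocksize = 32;
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < m; j += blocksize)
            for (int k = i; k < i + blocksize; ++k)
                for (int l = j; l < j + blocksize; ++l)
                    dst[k + l*n] = src[l + k*m];
}

/* Times one call of a transpose variant and returns the elapsed seconds. */
static double time_transpose(void (*f)(int, int, double *, const double *),
                             int n, int m, double *dst, const double *src) {
    double t0 = omp_get_wtime();
    f(n, m, dst, src);
    return omp_get_wtime() - t0;
}
```

Run each variant on the same large matrices (say, 2048 by 2048), ideally several times to warm caches, and compare the elapsed times.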

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Homer512
Solution 2