Need Help Understanding OpenMP Matrix Multiplication C++ code

Here is the matrix multiplication C++ code I have written, which I am trying to optimize with OpenMP. The sequential version ran in 7 seconds, but after adding the OpenMP statements it only got about 3 seconds faster. I expected a much larger speedup and don't understand whether I'm doing it right.

The OpenMP statements are in the fill_random function and in the matrix multiplication triple for loop section in main.

I would appreciate any help or advice you can give to understand this!

#include <iostream>
#include <cassert>
#include <omp.h>
#include <chrono>

using namespace std::chrono;


double** fill_random(int rows, int cols )
{
    
    double** mat = new double* [rows]; //Allocate rows.
    #pragma omp  parallell collapse(2) 
    for (int i = 0; i < rows; ++i)
    {
        mat[i] = new double[cols];           // added
        for( int j = 0;  j < cols; ++j)
        {
            mat[i][j] = rand() % 10;
        }
       
    }
     return mat;
}


double** create_matrix(int rows, int cols)
{
    double** mat = new double* [rows]; //Allocate rows.
    for (int i = 0; i < rows; ++i)
    {
        mat[i] = new double[cols](); //Allocate each row and zero initialize..
    }
    return mat;
}

void destroy_matrix(double** &mat, int rows)
{
    if (mat)
    {
        for (int i = 0; i < rows; ++i)
        {
            delete[] mat[i]; //delete each row..
        }

        delete[] mat;  //delete the rows..
        mat = nullptr;
    }
}

int main()
{
    int rowsA = 1000; // number of rows
    int colsA= 1000; // number of columns
    double** matA = fill_random(rowsA, colsA);


    int rowsB = 1000; // number of rows
    int colsB = 1000; // number of columns
    double** matB = fill_random(rowsB, colsB);


//Checking matrix multiplication qualification
    assert(colsA == rowsB);


    double** matC = create_matrix(rowsA, colsB);

    //measure the multiply only
    const auto start = high_resolution_clock::now();

    //Multiplication
    #pragma omp parallel for 
    
    for(int i = 0; i < rowsA; ++i)
    {
        for(int j = 0; j < colsB; ++j)
        {
            for(int k = 0; k < colsA; ++k) //ColsA..
            {
                matC[i][j] += matA[i][k] * matB[k][j];
            }
        }
        
    }

    const auto stop = high_resolution_clock::now();
    const auto duration = duration_cast<seconds>(stop - start);

    std::cout << "Time taken by function: " << duration.count() << " seconds" << std::endl;



    //Clean up..
    destroy_matrix(matA, rowsA);
    destroy_matrix(matB, rowsB);
    destroy_matrix(matC, rowsA);

    return 0;
}


Solution 1:[1]

  1. Your problem is rather small.
  2. The collapse in the matrix creation does nothing because the loops are not perfectly nested. On the other hand, in the multiplication routine you should add a collapse(2) directive.
  3. Creating a matrix with an array of pointers means that the expression matB[k][j] dances all over memory. Allocate your matrices as a single array and then use i*N+j as an indexing expression. (Of course I would put that in a macro or so.)

Solution 2:[2]

A 1000x1000 matrix with double (64-bit) elements takes 8 MB. Multiplying two matrices therefore reads 16 MB, and writing the result matrix brings the total to 24 MB of data touched.

If the L3 cache is smaller than 24 MB, then RAM bandwidth is the bottleneck. A single thread may not saturate that bandwidth, but once OpenMP adds threads it is fully used; in your case there was only about 50% headroom, which explains the modest speedup.

The naive version does not use the cache well. You need to swap the order of two loops to gain more cache reuse:

loop i
  loop k
    loop j
      C[i][j] += A[i][k] * B[k][j]

Although accumulating into C no longer reuses a register in this reordered version, it reuses the cache, which matters more in this case. With this change, the computation should take roughly 100-200 milliseconds even single-threaded.

Also, if you need performance, don't allocate row by row as in:

//Allocate each row and zero initialize..

Allocate the whole matrix at once so that it is not scattered in memory.

To use more threads efficiently, decompose the full multiplication into sub-matrix (tile) multiplications. Scan-line parallelization (one row of C per thread at a time) load-balances poorly between threads; sub-matrix tiles distribute the work better and, thanks to caching, perform more floating-point operations per element fetched from memory.

Edit:

Swapping the order of the loops also lets the compiler vectorize the innermost loop, because one of the input elements (A[i][k]) becomes a constant for the duration of that loop.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution sources:
  Solution 1: Victor Eijkhout
  Solution 2