'Looping reshape and matmul in OpenMP

I was debugging some piece of parallel code and found a reshape operation messed up OpenMP. This is a demo to reproduce the issue. I am not very familiar with using OpenMP yet so I'd like to know the reason about what am I doing wrong here, and if there is a better way to do things, (i.e. how best to have reshape and matmul nested in do loops). I have read OpenBLAS as a potential solution but would first like to know why. Thank you in advance

program unittest

    complex*16, save, dimension(10,10) :: testmat
    integer :: i
    real :: t0, t1, t2

    !$call OMP_set_num_threads(12);
    !$call OMP_set_dynamic(.FALSE.);
    testmat = 0.d0;

    call cpu_time(t0);
    !$OMP parallel
    !$OMP DO
    do i=1,1000000
         testmat = reshape(reshape(testmat,(/100,1/)),(/10,10/));
    end do
    !$OMP END DO
    !$OMP end parallel
    call cpu_time(t1);
    do i=1,1000000
         testmat = reshape(reshape(testmat,(/100,1/)),(/10,10/));
    end do
    call cpu_time(t2);
    print *, 'parallel time, ', t1-t0, 's, single thread time, ', t2-t1, 's'

end program unittest

Compiled with gfortran on MinGW. Output on my machine is

(with parallel) 10.01 s (single thread) 0.328 s

CPU registers less than 20% usage overall for the parallel case which probably means something is holding up OpenMP?

====================

Edit:

Thank you. Some clarification, the following is okay-ish, as in, the parallel version does not run slower (both completes around the same amount of time)

    !$OMP parallel private(testmat2)
    !$OMP DO
    do i=1,1000000
        testmat2 = testmat * 10.d0;
    end do
    !$OMP END DO
    !$OMP end parallel

but this runs much slower in parallel than on single thread (takes 50x more time in parallel than single)

    !$OMP parallel private(testmat2)
    !$OMP DO
    do i=1,1000000
        testmat2 = reshape(reshape(testmat,(/100,1/)),(/10,10/));
    end do
    !$OMP END DO
    !$OMP end parallel

So... what is special about reshape that causes this?



Solution 1:[1]

First of all, a reshape operation is not an operation. It's a matter of internal book keeping: you're telling Fortran that an 100x1 array is now to be interpreted as 10x10, or so. This basically takes zero time.

Next, you have a parallel do, which means that the loop iterations get divided over the available threads. Meaning: they get done multiple times. So now you have something that is some internal book keeping, and you do that multiple times. I'm only a little surprised that that gives strange results.

If you want to use a parallel for, you need to have a loop where each iteration does something involving the loop index. For instance v(i) = .....

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Victor Eijkhout