'Looping reshape and matmul in OpenMP
I was debugging some piece of parallel code and found a reshape operation messed up OpenMP. This is a demo to reproduce the issue. I am not very familiar with using OpenMP yet so I'd like to know the reason about what am I doing wrong here, and if there is a better way to do things, (i.e. how best to have reshape and matmul nested in do loops). I have read OpenBLAS as a potential solution but would first like to know why. Thank you in advance
program unittest
complex*16, save, dimension(10,10) :: testmat
integer :: i
real :: t0, t1, t2
!$call OMP_set_num_threads(12);
!$call OMP_set_dynamic(.FALSE.);
testmat = 0.d0;
call cpu_time(t0);
!$OMP parallel
!$OMP DO
do i=1,1000000
testmat = reshape(reshape(testmat,(/100,1/)),(/10,10/));
end do
!$OMP END DO
!$OMP end parallel
call cpu_time(t1);
do i=1,1000000
testmat = reshape(reshape(testmat,(/100,1/)),(/10,10/));
end do
call cpu_time(t2);
print *, 'parallel time, ', t1-t0, 's, single thread time, ', t2-t1, 's'
end program unittest
Compiled with gfortran on MinGW. Output on my machine is
(with parallel) 10.01 s (single thread) 0.328 s
CPU registers less than 20% usage overall for the parallel case which probably means something is holding up OpenMP?
====================
Edit:
Thank you. Some clarification, the following is okay-ish, as in, the parallel version does not run slower (both completes around the same amount of time)
!$OMP parallel private(testmat2)
!$OMP DO
do i=1,1000000
testmat2 = testmat * 10.d0;
end do
!$OMP END DO
!$OMP end parallel
but this runs much slower in parallel than on single thread (takes 50x more time in parallel than single)
!$OMP parallel private(testmat2)
!$OMP DO
do i=1,1000000
testmat2 = reshape(reshape(testmat,(/100,1/)),(/10,10/));
end do
!$OMP END DO
!$OMP end parallel
So... what is special about reshape that causes this?
Solution 1:[1]
First of all, a reshape operation is not an operation. It's a matter of internal book keeping: you're telling Fortran that an 100x1 array is now to be interpreted as 10x10, or so. This basically takes zero time.
Next, you have a parallel do, which means that the loop iterations get divided over the available threads. Meaning: they get done multiple times. So now you have something that is some internal book keeping, and you do that multiple times. I'm only a little surprised that that gives strange results.
If you want to use a parallel for, you need to have a loop where each iteration does something involving the loop index. For instance v(i) = .....
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Victor Eijkhout |
