How to parallelize two serial for loops such that the work of both loops is distributed over the threads

I have written the code below to parallelize two 'for' loops.

#include <iostream>
#include <omp.h>
#define SIZE 100

int main()
{
    int arr[SIZE];
    int sum = 0;
    int i, tid, numt, prod;
    double t1, t2;
    for (i = 0; i < SIZE; i++)
        arr[i] = 0;

    t1 = omp_get_wtime();

#pragma omp parallel private(tid, prod)
    {
        tid = omp_get_thread_num();
        numt = omp_get_num_threads();
        std::cout << "Tid: " << tid << " Thread: " << numt << std::endl;
#pragma omp for reduction(+: sum)
        for (i = 0; i < 50; i++) {
            prod = arr[i]+1;
            sum += prod;
        }

#pragma omp for reduction(+: sum)
        for (i = 50; i < SIZE; i++) {
            prod = arr[i]+1;
            sum += prod;
        }
    }

    t2 = omp_get_wtime();
    std::cout << "Time taken: " << (t2 - t1) << ", Parallel sum: " << sum << std::endl;

    return 0;
}

In this case the first 'for' loop is executed in parallel by all the threads and the result is accumulated in the sum variable. Only after the first loop has finished do the threads start executing the second loop, again accumulating into sum. So the execution of the second loop clearly waits for the first loop to complete: there is an implicit barrier at the end of each omp for construct.

I want the two 'for' loops to be processed simultaneously across the threads. How can I do that? Is there any other way I can write this code more efficiently? Ignore the dummy work that I am doing inside the loops.



Solution 1:

If you use #pragma omp for nowait, all threads are still assigned to the first loop; a thread that finishes its share of the first loop moves on to the second loop without waiting at a barrier, so the second loop only starts once at least one thread has finished its part of the first. Unfortunately, there is no way to tell the omp for construct to use, say, only half of the threads.

Fortunately, there is a way to run the two loops in parallel by using tasks. The following code uses the taskloop construct with the num_tasks clause so that roughly half of the threads run the first loop while the other half run the second. Note the nogroup clause on the first taskloop: without it, the construct waits for all of its tasks to finish before the second taskloop is even created, so the loops would again run one after the other. This will do what you intended, but you have to test which solution is faster in your case.

#pragma omp parallel
#pragma omp single
{
    int n = omp_get_num_threads();
    #pragma omp taskloop num_tasks(n/2) nogroup
    for (int i = 0; i < 50; i++) {
        //do something
    }
    #pragma omp taskloop num_tasks(n/2)
    for (int i = 50; i < SIZE; i++) {
        //do something
    }
}

UPDATE: The first paragraph is not entirely correct: by changing the chunk_size you do have some control over how many threads are used in the first loop, e.g. via the schedule(static, chunk_size) clause. So I thought setting the chunk_size would do the trick:

#pragma omp parallel
{       
    int n=omp_get_num_threads();

    #pragma omp single
    printf("num_threads=%d\n",n);

    #pragma omp for schedule(static,2) nowait
       for (int i = 0; i < 4; i++) {
                printf("thread %d running 1st loop\n", omp_get_thread_num());
        }    
    #pragma omp for schedule(static,2)
        for (int i = 4; i < 8; i++) {   // 8 rather than SIZE, so the run matches the output below
                printf("thread %d running 2nd loop\n", omp_get_thread_num());
        }
}

BUT at first the result seems surprising:

num_threads=4
thread 0 running 1st loop
thread 0 running 1st loop
thread 0 running 2nd loop
thread 0 running 2nd loop
thread 1 running 1st loop
thread 1 running 1st loop
thread 1 running 2nd loop
thread 1 running 2nd loop

What is going on? Why are threads 2 and 3 not used? The OpenMP runtime guarantees that if you have two separate loops with the same number of iterations and execute them with the same number of threads using static scheduling, then each thread will receive exactly the same iteration ranges in both loops. With only 4 iterations and a chunk size of 2 there are just two chunks per loop, so threads 0 and 1 get all the work and threads 2 and 3 receive nothing, in both loops. On the other hand, the result of using the schedule(dynamic,2) clause was quite surprising: only one thread is used, CodeExplorer link is here.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
