C++ with OpenMP: trying to avoid false sharing in a tight loop over an array
I am trying to introduce OpenMP into my C++ code to improve performance, using the simple test case shown below:
#include <omp.h>
#include <chrono>
#include <iostream>
#include <cmath>
using std::cout;
using std::endl;
#define NUM 100000
int main()
{
    // 128-byte alignment so that 16-element chunks of doubles start on a 128-byte boundary.
    double data[NUM] __attribute__ ((aligned (128)));
#ifdef _OPENMP
    auto t1 = omp_get_wtime();
#else
    auto t1 = std::chrono::steady_clock::now();
#endif
    for(long int k=0; k<100000; ++k)
    {
        // A parallel region is entered on every iteration of the outer loop.
        #pragma omp parallel for schedule(static, 16) num_threads(4)
        for(long int i=0; i<NUM; ++i)
        {
            data[i] = cos(sin(i*i+ k*k));
        }
    }
#ifdef _OPENMP
    auto t2 = omp_get_wtime();
    auto duration = t2 - t1;
    cout<<"OpenMP Elapsed time (second): "<<duration<<endl;
#else
    auto t2 = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    cout<<"No OpenMP Elapsed time (second): "<<duration/1e6<<endl;
#endif
    // Read the array back and print a sum so the computed values are actually used.
    double tempsum = 0.;
    for(long int i=0; i<NUM; ++i)
    {
        int nextind = (i == 0 ? 0 : i-1);
        tempsum += i + sin(data[i]) + cos(data[nextind]);
    }
    cout<<"Raw data sum: "<<tempsum<<endl;
    return 0;
}
The program repeatedly overwrites every element of a double array (NUM elements) in a tight loop, either in parallel or serially.
Build with
g++ -o test test.cpp
or
g++ -o test test.cpp -fopenmp
The program reported the following results:
No OpenMP Elapsed time (second): 427.44
Raw data sum: 5.00009e+09
OpenMP Elapsed time (second): 113.017
Raw data sum: 5.00009e+09
Environment: Intel 10th-generation CPU, Ubuntu 18.04, GCC 7.5, OpenMP 4.5.
I suspect that false sharing of cache lines is hurting the performance of the OpenMP version.
Update: after increasing the loop size I reran the tests and updated the results above; the OpenMP version now runs faster, as expected.
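As a rough check on the false-sharing suspicion, the sketch below tests whether a static chunk of a given size covers a whole number of cache lines. It assumes a 64-byte cache line (typical for x86, but not guaranteed) and an array base that is itself cache-line aligned; `kAssumedCacheLine` and `chunks_fill_whole_lines` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// ASSUMPTION: 64-byte cache lines. Check your CPU; this is not queried from hardware.
constexpr std::size_t kAssumedCacheLine = 64;

// Hypothetical helper: returns true when a chunk of `chunk_elems` elements of
// `elem_bytes` bytes each spans a whole number of cache lines. If the array base
// is cache-line aligned, such chunks never share a line with a neighboring chunk.
bool chunks_fill_whole_lines(std::size_t chunk_elems,
                             std::size_t elem_bytes,
                             std::size_t line_bytes = kAssumedCacheLine)
{
    assert(chunk_elems > 0 && elem_bytes > 0 && line_bytes > 0);
    return (chunk_elems * elem_bytes) % line_bytes == 0;
}
```

Under these assumptions, `chunks_fill_whole_lines(16, sizeof(double))` is true for 8-byte doubles (16 x 8 = 128 bytes, two full lines), so with the 128-byte aligned array, chunks handed out by `schedule(static, 16)` would not be expected to share a cache line with each other. A chunk size of 1 would fail the check.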
Thank you!
Solution 1:[1]
- Since you're writing C++, use the C++ random number generator, which is threadsafe, unlike the C legacy one you're using.
- Also, you're not using your data array, so the compiler is actually at liberty to remove your loop completely.
- You should touch all your data once before you do the timed loop. That way you ensure that pages are instantiated and that the data is in (or out of) cache, whichever you intend to measure.
- Your loop is pretty short.
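The warm-up and "use your results" points above could be sketched roughly as follows. This is only an illustration under stated assumptions: `fill_pass`, `timed_run` and `TimedResult` are hypothetical names, a `std::vector` stands in for the question's stack array, and the timing uses `std::chrono` in both builds. The arithmetic mirrors the question's inner loop.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// One pass over the array, same arithmetic as the question's inner loop.
// The pragma is ignored when the file is compiled without -fopenmp.
void fill_pass(std::vector<double>& data, long long k)
{
    const long long n = static_cast<long long>(data.size());
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
    {
        data[i] = std::cos(std::sin(static_cast<double>(i * i + k * k)));
    }
}

// Hypothetical result type: elapsed seconds plus a checksum of the final array.
struct TimedResult
{
    double seconds;
    double checksum;
};

TimedResult timed_run(std::size_t n, long long passes)
{
    assert(n > 0 && passes > 0);
    std::vector<double> data(n);   // zero-filled on construction
    fill_pass(data, 0);            // untimed warm-up pass over every element

    const auto t1 = std::chrono::steady_clock::now();
    for (long long k = 0; k < passes; ++k)
    {
        fill_pass(data, k);
    }
    const auto t2 = std::chrono::steady_clock::now();

    // Summing and returning the data gives the stores an observable effect.
    const double checksum = std::accumulate(data.begin(), data.end(), 0.0);
    return {std::chrono::duration<double>(t2 - t1).count(), checksum};
}
```

A caller would print both `seconds` and `checksum`; printing the checksum is what keeps the timed loop from being treated as dead code.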
Solution 2:[2]
- `rand()` is not thread-safe (see here). Use an array of C++ random-number generators instead, one for each thread. See `std::uniform_int_distribution` for details.
- You can drop the `#ifdef _OPENMP` variations in your code. In a Bash terminal, you can call your application as `OMP_NUM_THREADS=1 ./test`. See here for details.
- So you can remove `num_threads(4)` as well, because you can explicitly specify the amount of parallelism.
- Use Google Benchmark or command-line parameters so you can parameterize the number of threads and array size.
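A minimal sketch of the one-generator-per-thread idea, with some liberties taken: instead of a literal array of generators, each thread constructs its own `std::mt19937` inside the parallel region, which has the same effect of never sharing an engine between threads. The function name `fill_random` and the seeding scheme (base seed plus thread number) are assumptions made for illustration, not a recommendation for statistical quality.

```cpp
#include <cassert>
#include <random>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: fill `data` with integers in [lo, hi].
// Every thread owns a private engine, so no generator state is shared.
void fill_random(std::vector<int>& data, int lo, int hi, unsigned base_seed)
{
    assert(lo <= hi);
    const long long n = static_cast<long long>(data.size());
    #pragma omp parallel
    {
#ifdef _OPENMP
        const unsigned tid = static_cast<unsigned>(omp_get_thread_num());
#else
        const unsigned tid = 0;  // serial build: a single "thread"
#endif
        // ASSUMPTION: seeding with base_seed + tid is good enough for a demo.
        std::mt19937 rng(base_seed + tid);
        std::uniform_int_distribution<int> dist(lo, hi);

        #pragma omp for schedule(static)
        for (long long i = 0; i < n; ++i)
        {
            data[i] = dist(rng);
        }
    }
}
```

No `num_threads` clause is used, so the thread count comes from `OMP_NUM_THREADS` (or the runtime default) at launch time.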
From here, I expect you will see:
- The performance when you call `OMP_NUM_THREADS=1 ./test` is close to your non-OpenMP version.
- The array of C++ RNG generators is faster than calling `rand()` from multiple threads.
- The multi-threaded version is still slower than the single-threaded version when using a 10,000 element array.
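To run the comparisons above without recompiling, the array size could come from the command line while the thread count comes from `OMP_NUM_THREADS`. The helpers below are a sketch under assumptions: `array_size_from_args` and `available_threads` are hypothetical names, and the size is assumed to be passed as the first argument.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: read the array size from argv[1].
// Falls back to `fallback` when the argument is missing, not a plain
// decimal number, or zero.
std::size_t array_size_from_args(int argc, const char* const* argv,
                                 std::size_t fallback)
{
    assert(fallback > 0);
    if (argc < 2)
    {
        return fallback;
    }
    char* end = nullptr;
    const unsigned long long parsed = std::strtoull(argv[1], &end, 10);
    if (end == argv[1] || *end != '\0' || parsed == 0)
    {
        return fallback;
    }
    return static_cast<std::size_t>(parsed);
}

// Upper bound on the team size of the next parallel region; this reflects
// OMP_NUM_THREADS when it is set. Returns 1 when built without OpenMP.
int available_threads()
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

With these in place, `main(int argc, char** argv)` could call `array_size_from_args(argc, argv, 10000)` and the program could be launched as, for example, `OMP_NUM_THREADS=4 ./test 100000`.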
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Victor Eijkhout |
| Solution 2 | Daniel Dearlove |
