Vectorization with gcc and OpenMP

With g++, I have often achieved effective parallel speed improvements using simple OpenMP annotations, but not with the following near-trivial vectorization example. I am using g++ 7 on Ubuntu 16.04, and understand that OpenMP ships with the compiler.

#include <iostream>
#include <omp.h>
#include <chrono>
#include <ctime>
#include <cmath>
#include <random>

int main() {
    using namespace std;
    using namespace std::chrono;

    const unsigned N = 10000000;
    float* table1 = new float[N];
    float* table2 = new float[N];
    float* table3 = new float[N];

    std::mt19937 RND(123);
    std::uniform_real_distribution<float> dist(0,1);

    for (unsigned n = 0; n < N; ++n) { /*Initialize table1 and table2*/
        table1[n]=dist(RND);
        table2[n]=dist(RND);
    }

    auto start = duration_cast<milliseconds>(system_clock::now().time_since_epoch());
    
    for(unsigned k=0;k<500;k++) { /*Do inner loop a lot*/

        //#pragma omp parallel for
        //#pragma omp simd
        for (unsigned n = 0; n < N; ++n) /*VECTORIZE ME*/
        { 
            table3[n]=table1[n]+table2[n];
        }

    }

    auto end = duration_cast<milliseconds>(system_clock::now().time_since_epoch());
    std::cout << "Time " << end.count() - start.count() << std::endl;
        
    for (unsigned n = 0; n < N; ++n) { /*Use the result.*/
        if (std::fabs(table3[n] - (table1[n] + table2[n])) > 0.01f) {
            throw false;
        }
    }

    delete[] table1; delete[] table2; delete[] table3;
}

For a baseline, compiling with g++ -o "openmp-sandpit" "openmp-sandpit.cpp" and running yields a time of 14662ms, with top showing 25% (I have an i7 with four cores, and am running top with Irix mode off).

Next we add -O1, -O2 and -O3, achieving 8524ms, 7473ms and 7376ms respectively, all with top at 25%.

  • Secondary Question #1 Has g++ made use of SIMD vectorization in achieving these optimizations?

Next we uncomment #pragma omp parallel for and compile with -fopenmp, achieving 7553ms and a top of 100%. Adding the optimization flags -O1, -O2 and -O3 achieves 8411ms, 7463ms and 7415ms respectively, all with top just below 100%.

Notice that OpenMP on four cores (top 100%) achieves 7553ms, which is worse than vanilla g++ at -O2 and -O3, and similar to -O1.

  • Secondary Question #2 Why is OpenMP, when using all four cores (top 100%), outperformed by optimized g++ on a single core (top 25%)?

Finally, re-commenting #pragma omp parallel for, uncommenting #pragma omp simd and compiling with the single option -fopenmp-simd achieves (a terrible) 15006ms with an expected top of 25%. Adding the optimization flags -O1, -O2 and -O3 achieves 7911ms, 7350ms and 7364ms respectively, all with top at 25%.

  • Main Question What is wrong with my openmp-simd code? Why is it not vectorizing?

If I could vectorize the inner n loop (omp simd), I could then parallelize the outer k loop (omp parallel for), and should get a 2x-4x speed-up for the outer loop (over the four cores) and a 4x-8x speed-up for the inner loop (SIMD on each core), achieving an 8x-32x speed improvement overall. Surely?

[... It appears that g++ vectorization is turned on by default at -O3. This I have tested and verified ...]

[... The best result is obtained by avoiding omp simd. The following code uses OpenMP to split the work across 4 cores and relies on g++ auto-vectorization.

#pragma omp parallel for
    for (int k = 0; k < 500; ++k) { /*Do inner loop a lot*/

        for (int n = 0; n < N; ++n) { /*VECTORIZE ME*/
            table3[n] = table1[n] + table2[n];
        }

    }

Compiling with g++-7 -O3 -march=native -fopenmp (thanks to @Marc Glisse for -march=native) yields 3912ms. No other combination comes close. ...]



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
