OpenMP behavior: Using ICC and GCC gives significantly different run times

For a small benchmark of OpenMP on an i7-6700K I wrote the following code:

#include <iostream>
#include <omp.h>
#include <vector>
#include <chrono>

constexpr int bench_rounds = 32;

int main(void) {
    using std::chrono::high_resolution_clock;
    using std::chrono::duration_cast;
    using std::chrono::duration;
    using std::chrono::milliseconds;

    
    size_t vec_size = 16;
    size_t large_vec_size = 1024*16;
    std::vector<double> test_vec(large_vec_size * large_vec_size, 0);

    auto t1 = high_resolution_clock::now(); 
    for(int k = 0; k < bench_rounds; ++k) {
    #pragma omp parallel for collapse(2)
    for(int j = 0; j < large_vec_size; ++j) {
        for(int i = 0; i < large_vec_size; ++i) {
            test_vec[j * large_vec_size + i] = i + j + test_vec[j * large_vec_size + i] * 1e-13;
        }
    }
    }
    auto t2 = high_resolution_clock::now();
    duration<double, std::milli> ms_double = t2 - t1;
    std::cout << ms_double.count() << "ms\n";

    return 0;
}

I compiled it using four different approaches:

  1. With the latest Intel compiler, using icpc main.cpp -o test
  2. With the latest Intel compiler, using icpc -qopenmp main.cpp -o test -liomp5
  3. With GCC 11.2, using g++ main.cpp -o test
  4. With GCC 11.2, using g++ -fopenmp main.cpp -o test -lgomp

My obtained results were:

  1. Warning "unrecognized OpenMP #pragma #pragma omp parallel for collapse(2)", runtime: 2490 ms
  2. No warning, runtime: 14080 ms
  3. No warning, runtime: 45550 ms
  4. No warning, runtime: 13400 ms

The result for GCC is more or less as expected: I am running it on four cores, and my speed-up is slightly larger than three. But for the Intel compiler I do not understand the result: why is it so much faster, especially when it disregards the OpenMP pragma?

To extend the benchmark data, as requested in the comments: my compile lines are

g++ main.cpp -o test_gcc_clean
g++ -fopenmp main.cpp -o test_gcc_omp -lgomp
g++ -fopenmp -march=native -mavx -O3 main.cpp -o test_gcc_opt -lgomp
icpc main.cpp -o test_icc_clean
icpc -qopenmp main.cpp -o test_icc_omp -liomp5
icpc -qopenmp -march=native -mavx -O3 main.cpp -o test_icc_opt -liomp5

and my run file is:

echo "Clean GCC"
./test_gcc_clean
echo "GCC with OpenMP"
./test_gcc_omp
echo "Optimized GCC"
./test_gcc_opt
echo "Clean ICC"
./test_icc_clean
echo "ICC with OpenMP"
./test_icc_omp
echo "Optimized ICC"
./test_icc_opt

The output then is:

Clean GCC
45641.6ms
GCC with OpenMP
13358.6ms
Optimized GCC
4949.53ms
Clean ICC
2471.96ms
ICC with OpenMP
14014.6ms
Optimized ICC
13662.9ms

reflecting my earlier results. Interestingly, a program compiled with

icpc -march=native -mavx -O3 main.cpp -o test_icc_nomp

is even faster, with a runtime of 1286 ms, but complains during compilation that it does not recognize the OpenMP pragma.

Edit:
Following the suggestions in the answer, I decided to extend the question for the sake of completeness. The extended benchmark tests the assumption that the comparison between size_t and int slows down the code, and adds a final verification to prevent the compiler from optimizing away the test vectors. I used the following code:

#include <iostream>
#include <omp.h>
#include <vector>
#include <chrono>
#include <algorithm>

constexpr int bench_rounds = 32;

int main(void) {
    using std::chrono::high_resolution_clock;
    using std::chrono::duration_cast;
    using std::chrono::duration;
    using std::chrono::milliseconds;

    
    size_t vec_size = 16;
    size_t large_vec_size = 1024*16;
    std::vector<double> test_vec(large_vec_size * large_vec_size, 0),
    test_vec_II(large_vec_size * large_vec_size, 0),
    test_vec_III(large_vec_size * large_vec_size, 0),
    test_vec_IV(large_vec_size * large_vec_size, 0);

    auto t1 = high_resolution_clock::now(); 
    for(int k = 0; k < bench_rounds; ++k) {
    #pragma omp parallel for collapse(2)
    for(int j = 0; j < large_vec_size; ++j) {
        for(int i = 0; i < large_vec_size; ++i) {
            test_vec[j * large_vec_size + i] = i + j + test_vec[j * large_vec_size + i] * 1e-13;
        }
    }
    }
    auto t2 = high_resolution_clock::now();

    auto t3 = high_resolution_clock::now(); 
    for(int k = 0; k < bench_rounds; ++k) {
    #pragma omp parallel for
    for(int j = 0; j < large_vec_size; ++j) {
        #pragma omp simd
        for(int i = 0; i < large_vec_size; ++i) {
            test_vec_II[j * large_vec_size + i] = i + j + test_vec_II[j * large_vec_size + i] * 1e-13;
        }
    }
    }
    auto t4 = high_resolution_clock::now();

    auto t5 = high_resolution_clock::now(); 
    for(int k = 0; k < bench_rounds; ++k) {
    #pragma omp parallel for collapse(2)
    for(size_t j = 0; j < large_vec_size; ++j) {
        for(size_t i = 0; i < large_vec_size; ++i) {
            test_vec_III[j * large_vec_size + i] = i + j + test_vec_III[j * large_vec_size + i] * 1e-13;
        }
    }
    }
    auto t6 = high_resolution_clock::now();

    auto t7 = high_resolution_clock::now(); 
    for(int k = 0; k < bench_rounds; ++k) {
    #pragma omp parallel for
    for(size_t j = 0; j < large_vec_size; ++j) {
        #pragma omp simd
        for(size_t i = 0; i < large_vec_size; ++i) {
            test_vec_IV[j * large_vec_size + i] = i + j + test_vec_IV[j * large_vec_size + i] * 1e-13;
        }
    }
    }
    auto t8 = high_resolution_clock::now();


    duration<double, std::milli> ms_double = t2 - t1, 
    ms_double_simd = t4 - t3, 
    ms_double_sizet = t6 - t5, 
    ms_double_simd_sizet = t8 - t7;
    std::cout << "Coll: " << ms_double.count() << " ms\n";
    std::cout << "SIMD: " << ms_double_simd.count() << " ms\n";
    std::cout << "CoST: " << ms_double_sizet.count() << " ms\n";
    std::cout << "SIST: " << ms_double_simd_sizet.count() << " ms\n";

    std::cout << "Vectors are equal: ";
    if(std::equal(test_vec.begin(), test_vec.begin() + large_vec_size * large_vec_size, test_vec_II.begin())) {
        std::cout << "True\n";
    } else {
        std::cout << "False\n";
    }
    std::cout << "Vectors are equal: ";
    if(std::equal(test_vec.begin(), test_vec.begin() + large_vec_size * large_vec_size, test_vec_III.begin())) {
        std::cout << "True\n";
    } else {
        std::cout << "False\n";
    }
    std::cout << "Vectors are equal: ";
    if(std::equal(test_vec.begin(), test_vec.begin() + large_vec_size * large_vec_size, test_vec_IV.begin())) {
        std::cout << "True\n";
    } else {
        std::cout << "False\n";
    }

    return 0;
}

and obtained the following results:

Clean GCC
Coll: 46281.8 ms
SIMD: 47917.9 ms
CoST: 44322 ms
SIST: 44275.4 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
GCC with OpenMP
Coll: 13799.6 ms
SIMD: 14546 ms
CoST: 12913.8 ms
SIST: 13113.1 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
Optimized GCC
Coll: 4955.54 ms
SIMD: 5080.45 ms
CoST: 5203.64 ms
SIST: 5011.17 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
Optimized GCC, no OpenMP
Coll: 5201.49 ms
SIMD: 5198.48 ms
CoST: 6148.23 ms
SIST: 6279.94 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
Clean ICC
Coll: 2579.12 ms
SIMD: 5315.75 ms
CoST: 5296.52 ms
SIST: 6892.02 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
ICC with OpenMP
Coll: 14466 ms
SIMD: 4974.81 ms
CoST: 13539.5 ms
SIST: 4963.63 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
Optimized ICC
Coll: 15753.4 ms
SIMD: 5114.96 ms
CoST: 13509.4 ms
SIST: 5100.88 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
Optimized ICC, no OpenMP
Coll: 1302.34 ms
SIMD: 5200.3 ms
CoST: 5535.02 ms
SIST: 5565.15 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True

At least on my platform I get some interesting results:

  • A loop mixing int and size_t can be optimized better by ICC (and in some cases by GCC) than a loop using only size_t
  • #pragma omp simd is faster than #pragma omp parallel for collapse(2), but the version without OpenMP is still significantly faster, especially with ICC


Solution 1:[1]

The problem comes from the collapse(2) clause and is also related to auto-vectorization. Indeed, neither compiler is able to auto-vectorize the loop with the collapse, but ICC uses a very expensive idiv instruction in the middle of the hot loop (which is very bad), while GCC produces better code. This comes from the collapse(2) clause, which is not well optimized (by many compilers).

You can see that on Godbolt. Note that optimizing a kernel using a collapse(2) clause is not easy, since the compiler does not know the bounds of the loops and therefore the associated cost (nor the divisor for the modulo).
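To illustrate where the division comes from, here is a rough sketch (my own illustration, not the actual compiler output) of what collapse(2) conceptually lowers the loop nest to, reusing the variables from the benchmark above: the two loops become one linear iteration space, and each iteration has to recover (j, i) with a division and a modulo.

const long long total = (long long)large_vec_size * large_vec_size;
#pragma omp parallel for
for(long long idx = 0; idx < total; ++idx) {
    const long long j = idx / large_vec_size; // per-iteration division (the costly idiv)
    const long long i = idx % large_vec_size; // per-iteration modulo
    test_vec[j * large_vec_size + i] = i + j + test_vec[j * large_vec_size + i] * 1e-13;
}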

Without the collapse(2), GCC is able to vectorize the loop successfully but, surprisingly, ICC is not. Fortunately, we can help ICC with the simd directive. Once it is used, the two compilers generate relatively good code. It is still not optimal, because size_t is generally 8 bytes while int is 4 bytes on mainstream x86-64 platforms, and comparing the int loop counter against the size_t bound makes the code harder to vectorize efficiently and to lower to the best scalar instructions. You can use a temporary variable to fix that. You can see the resulting assembly code here.

Note that the assembly generated by ICC is very good once the code is fixed. The code is memory bound, and the final code should saturate the RAM with only a few threads. Even the L3 cache bandwidth should be saturated by the ICC-generated assembly if the input array could fit into it.
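As a rough back-of-the-envelope check of the memory-bound claim (my numbers, not part of the original answer, assuming dual-channel DDR4-2133 on the i7-6700K):

array size        : 16384 * 16384 doubles * 8 B ≈ 2 GiB
traffic per round : one read + one write of the array ≈ 4 GiB
total traffic     : 32 rounds ≈ 128 GiB ≈ 137 GB
at ~30 GB/s sustained bandwidth : ≈ 4.5 s

which is in the same ballpark as the best ~5 s timings measured above.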

Here is the fixed code:

for(int k = 0; k < bench_rounds; ++k) {
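    // Using an int copy of the bound keeps both loop counters and the
    // comparison in the same signed 4-byte type, which helps vectorization.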
    int limit = large_vec_size;
    #pragma omp parallel for
    for(int j = 0; j < limit; ++j) {
        #pragma omp simd
        for(int i = 0; i < limit; ++i) {
           test_vec[j * large_vec_size + i] = i + j + test_vec[j * large_vec_size + i] * 1e-13;
        }
    }
}

Moreover, note that the result is not used, so optimizing compilers can simply remove the benchmark code because it has no observable side effects (dead-code elimination). Generally, you can fix that by just printing the result. This is also useful to check the results.
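A minimal sketch of that idea (my addition, reusing test_vec from the benchmark above): accumulate a checksum and print it, so the compiler has to keep the loops that produce it.

double checksum = 0.0;                       // consume the result so dead-code
for (double v : test_vec) checksum += v;     // elimination cannot drop the loops
std::cout << "checksum: " << checksum << '\n';

The std::equal comparisons added in the extended benchmark serve the same purpose.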

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source:
[1] Solution 1: Peter Cordes