Explanation for strange peak in weak scaling plot

I'm running a simple kernel which adds two 3rd-order tensors of double-precision complex values. I've parallelized it using OpenMP with custom scheduling: the slice_indices container holds a different set of slice indices for each thread.

    for (const auto& index : slice_indices)
    {
        auto* tens1_data_stream = tens1.get_slice_data(index);
        const auto* tens2_data_stream = tens2.get_slice_data(index);
        #pragma omp simd safelen(8)
        for (auto d_index = std::size_t{}; d_index < tens1.get_slice_size(); ++d_index)
        {
            tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
            tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
        }
    }
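
To make the setup concrete, here is a stripped-down, self-contained stand-in for what I do: the tensor3 class, the slice layout and the block-wise distribution of slice_indices below are simplified placeholders that mimic my real code, not the actual implementation.

    // Simplified stand-in: the tensor class, slice layout and the way slice_indices
    // is filled are placeholders that mimic the real code, not the real classes.
    #include <cstddef>
    #include <vector>
    #include <omp.h>

    struct complex_t { double real; double imag; };   // plain 16-byte complex value

    // Minimal 3rd-order tensor: N slices of N*N elements, stored contiguously.
    class tensor3
    {
    public:
        explicit tensor3(std::size_t n) : n_{n}, data_(n * n * n) {}
        complex_t*       get_slice_data(std::size_t s)       { return data_.data() + s * n_ * n_; }
        const complex_t* get_slice_data(std::size_t s) const { return data_.data() + s * n_ * n_; }
        std::size_t      get_slice_size() const { return n_ * n_; }
        std::size_t      num_slices()     const { return n_; }
    private:
        std::size_t n_;
        std::vector<complex_t> data_;
    };

    void add(tensor3& tens1, const tensor3& tens2)
    {
        #pragma omp parallel
        {
            // "Custom scheduling": each thread gets its own contiguous block of slices.
            const auto n_threads = static_cast<std::size_t>(omp_get_num_threads());
            const auto thread_id = static_cast<std::size_t>(omp_get_thread_num());
            const auto n_slices  = tens1.num_slices();
            const auto begin = thread_id * n_slices / n_threads;
            const auto end   = (thread_id + 1) * n_slices / n_threads;

            std::vector<std::size_t> slice_indices;
            for (auto s = begin; s < end; ++s)
                slice_indices.push_back(s);

            // The kernel from above, unchanged.
            for (const auto& index : slice_indices)
            {
                auto* tens1_data_stream = tens1.get_slice_data(index);
                const auto* tens2_data_stream = tens2.get_slice_data(index);
                #pragma omp simd safelen(8)
                for (auto d_index = std::size_t{}; d_index < tens1.get_slice_size(); ++d_index)
                {
                    tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
                    tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
                }
            }
        }
    }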

The target computer has an Intel(R) Xeon(R) Platinum 8168 CPU with

  1. 24 cores @ 2.70 GHz,
  2. L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB,
  3. a memory bandwidth of 115 GB/s (source: Intel Advisor's roofline plot).

Now the weak scaling plot: [image: weak scaling plot]

Notes:

  1. In the above, if N=90 then the data size of one tensor is 90*90*90*16 bytes ≈ 11.12 MiB.
  2. The cache sizes expressed in terms of the problem size (the largest N whose single tensor still fits) are L1 ~ 12, L2 ~ 40 and L3 ~ 129; the arithmetic is spelled out in the sketch below.
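
For reference, here is the arithmetic behind notes 1 and 2, plus a crude bandwidth bound; the 3x traffic factor (read both tensors, write one back) is my assumption, as is ignoring write-allocate traffic:

    // Back-of-the-envelope numbers only, no measurement involved.
    #include <cmath>
    #include <cstdio>

    int main()
    {
        const double elem_bytes = 16.0;                     // one double-precision complex value
        const double n = 90.0;
        const double tensor_bytes = n * n * n * elem_bytes; // one tensor at N = 90
        std::printf("one tensor at N=90: %.2f MiB\n", tensor_bytes / (1024.0 * 1024.0));

        // Largest N whose single tensor still fits into each cache level:
        // N = floor(cbrt(cache_bytes / 16)).
        const double cache_bytes[] = {32.0 * 1024.0, 1024.0 * 1024.0, 33.0 * 1024.0 * 1024.0};
        for (const double c : cache_bytes)
            std::printf("%8.0f KiB cache -> N ~ %.0f\n",
                        c / 1024.0, std::floor(std::cbrt(c / elem_bytes)));

        // Crude lower bound on runtime at N = 90, assuming 3x traffic
        // (read tens1, read tens2, write tens1 back) and the 115 GB/s from Advisor.
        const double traffic_bytes = 3.0 * tensor_bytes;
        std::printf("bandwidth bound at N=90: %.3f ms\n", 1e3 * traffic_bytes / 115.0e9);
    }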

Question 1: why is there an unexpected peak at N=90? In fact, why isn't the whole plot just a simple straight line?

If you suspect cache effects, I would be grateful if you could explain them in more detail.

I can provide more information if necessary. Thanks in advance.

EDIT:

Here's a weak-scaling plot with finer resolution: [image: weak scaling plot, finer resolution]

And here's a problem scaling plot with only 1 thread: [image: single-thread problem scaling plot]

Question 2: I initialize my data by touching only the first entry of each memory page, i.e. the data has not been loaded into the cache yet. Why do I still see the typical stepped performance graph?
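
To clarify what I mean by that initialization, here is a rough sketch; the 4 KiB page size and the helper name are illustrative assumptions, not my actual code:

    // Roughly what "touching only the first entry of each memory page" means here;
    // the 4 KiB page size and the function name are assumptions for illustration.
    #include <cstddef>

    struct complex_t { double real; double imag; };

    void touch_first_entry_per_page(complex_t* data, std::size_t n_elements)
    {
        constexpr std::size_t page_bytes = 4096;                               // typical small page
        constexpr std::size_t elems_per_page = page_bytes / sizeof(complex_t); // 256 elements

        // Writing one element per page makes the OS back every page with physical
        // memory, while only a single cache line per page is actually loaded.
        for (auto i = std::size_t{}; i < n_elements; i += elems_per_page)
        {
            data[i].real = 0.0;
            data[i].imag = 0.0;
        }
    }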


