Explanation for strange peak in weak scaling plot

I'm running a simple kernel which adds two 3rd-order tensors of double-precision complex values. I've parallelized it using OpenMP with custom scheduling: the slice_indices container holds a different set of slice indices for each thread.

    for (const auto& index : slice_indices)
    {
        auto* tens1_data_stream = tens1.get_slice_data(index);
        const auto* tens2_data_stream = tens2.get_slice_data(index);
        #pragma omp simd safelen(8)
        for (auto d_index = std::size_t{}; d_index < tens1.get_slice_size(); ++d_index)
        {
            tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
            tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
        }
    }
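
To make the setup concrete, here is a stripped-down, self-contained stand-in for what I do: the tensor3 class, the slice layout and the block-wise distribution of slice_indices below are simplified placeholders that mimic my real code, not the actual implementation.

    // Simplified stand-in: the tensor class, slice layout and the way slice_indices
    // is filled are placeholders that mimic the real code, not the real classes.
    #include <cstddef>
    #include <vector>
    #include <omp.h>

    struct complex_t { double real; double imag; };   // plain 16-byte complex value

    // Minimal 3rd-order tensor: N slices of N*N elements, stored contiguously.
    class tensor3
    {
    public:
        explicit tensor3(std::size_t n) : n_{n}, data_(n * n * n) {}
        complex_t*       get_slice_data(std::size_t s)       { return data_.data() + s * n_ * n_; }
        const complex_t* get_slice_data(std::size_t s) const { return data_.data() + s * n_ * n_; }
        std::size_t      get_slice_size() const { return n_ * n_; }
        std::size_t      num_slices()     const { return n_; }
    private:
        std::size_t n_;
        std::vector<complex_t> data_;
    };

    void add(tensor3& tens1, const tensor3& tens2)
    {
        #pragma omp parallel
        {
            // "Custom scheduling": each thread gets its own contiguous block of slices.
            const auto n_threads = static_cast<std::size_t>(omp_get_num_threads());
            const auto thread_id = static_cast<std::size_t>(omp_get_thread_num());
            const auto n_slices  = tens1.num_slices();
            const auto begin = thread_id * n_slices / n_threads;
            const auto end   = (thread_id + 1) * n_slices / n_threads;

            std::vector<std::size_t> slice_indices;
            for (auto s = begin; s < end; ++s)
                slice_indices.push_back(s);

            // The kernel from above, unchanged.
            for (const auto& index : slice_indices)
            {
                auto* tens1_data_stream = tens1.get_slice_data(index);
                const auto* tens2_data_stream = tens2.get_slice_data(index);
                #pragma omp simd safelen(8)
                for (auto d_index = std::size_t{}; d_index < tens1.get_slice_size(); ++d_index)
                {
                    tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
                    tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
                }
            }
        }
    }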

The target computer has an Intel(R) Xeon(R) Platinum 8168 CPU with

  1. 24 cores @ 2.70 GHz,
  2. L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB,
  3. a memory bandwidth of 115 GB/s (source: Intel Advisor's roofline plot).

Now the weak scaling plot: [image: weak scaling plot]

Notes:

  1. In the above, if N=90 then the data size of one tensor is 90*90*90*16 bytes ≈ 11.12 MiB.
  2. The cache sizes expressed in terms of the problem size (the largest N whose single tensor still fits) are L1 ~ 12, L2 ~ 40 and L3 ~ 129; the arithmetic is spelled out in the sketch below.
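
For reference, here is the arithmetic behind notes 1 and 2, plus a crude bandwidth bound; the 3x traffic factor (read both tensors, write one back) is my assumption, as is ignoring write-allocate traffic:

    // Back-of-the-envelope numbers only, no measurement involved.
    #include <cmath>
    #include <cstdio>

    int main()
    {
        const double elem_bytes = 16.0;                     // one double-precision complex value
        const double n = 90.0;
        const double tensor_bytes = n * n * n * elem_bytes; // one tensor at N = 90
        std::printf("one tensor at N=90: %.2f MiB\n", tensor_bytes / (1024.0 * 1024.0));

        // Largest N whose single tensor still fits into each cache level:
        // N = floor(cbrt(cache_bytes / 16)).
        const double cache_bytes[] = {32.0 * 1024.0, 1024.0 * 1024.0, 33.0 * 1024.0 * 1024.0};
        for (const double c : cache_bytes)
            std::printf("%8.0f KiB cache -> N ~ %.0f\n",
                        c / 1024.0, std::floor(std::cbrt(c / elem_bytes)));

        // Crude lower bound on runtime at N = 90, assuming 3x traffic
        // (read tens1, read tens2, write tens1 back) and the 115 GB/s from Advisor.
        const double traffic_bytes = 3.0 * tensor_bytes;
        std::printf("bandwidth bound at N=90: %.3f ms\n", 1e3 * traffic_bytes / 115.0e9);
    }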

Question 1: why is there an unexpected peak at N=90? In fact, why isn't the whole plot just a simple straight line?

If you suspect cache effects, I would be grateful if you could explain them in more detail.

I can provide more information if necessary. Thanks in advance.

EDIT:

Here's a weak-scaling plot with finer resolution: [image: weak scaling plot, finer resolution]

And here's a problem scaling plot with only 1 thread: [image: single-thread problem scaling plot]

Question 2: I initialize my data by touching only the first entry of each memory page, i.e. the data has not been loaded into the cache yet. Why do I still see the typical stepped performance graph?
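
To clarify what I mean by that initialization, here is a rough sketch; the 4 KiB page size and the helper name are illustrative assumptions, not my actual code:

    // Roughly what "touching only the first entry of each memory page" means here;
    // the 4 KiB page size and the function name are assumptions for illustration.
    #include <cstddef>

    struct complex_t { double real; double imag; };

    void touch_first_entry_per_page(complex_t* data, std::size_t n_elements)
    {
        constexpr std::size_t page_bytes = 4096;                               // typical small page
        constexpr std::size_t elems_per_page = page_bytes / sizeof(complex_t); // 256 elements

        // Writing one element per page makes the OS back every page with physical
        // memory, while only a single cache line per page is actually loaded.
        for (auto i = std::size_t{}; i < n_elements; i += elems_per_page)
        {
            data[i].real = 0.0;
            data[i].imag = 0.0;
        }
    }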


