Explanation for strange peak in weak scaling plot
I'm running a simple kernel which adds two 3rd-order tensors of double-precision complex values. I've parallelized it using OpenMP with custom scheduling: the slice_indices container holds a different set of slice indices for each thread.
```cpp
for (const auto& index : slice_indices)
{
    auto* tens1_data_stream = tens1.get_slice_data(index);
    const auto* tens2_data_stream = tens2.get_slice_data(index);

    #pragma omp simd safelen(8)
    for (auto d_index = std::size_t{}; d_index < tens1.get_slice_size(); ++d_index)
    {
        tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
        tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
    }
}
```
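For context, here is a minimal self-contained version of what the kernel does. The storage layout (one contiguous N·N·N array with each slice an N·N plane), the `Complex` struct, and the round-robin slice distribution are simplifications I'm using for illustration, not my actual partitioning logic:

```cpp
#include <cstddef>
#include <vector>
#include <omp.h>

// Stand-in for the element type; mirrors the .real/.imag members used above.
struct Complex { double real; double imag; };

// tens1 += tens2, with each tensor stored contiguously as n*n*n elements and
// each "slice" taken to be one n*n plane.
void add_tensors(std::vector<Complex>& tens1, const std::vector<Complex>& tens2,
                 const std::size_t n)
{
    const auto slice_size = n * n;  // elements per slice

    #pragma omp parallel
    {
        const auto tid = static_cast<std::size_t>(omp_get_thread_num());
        const auto num_threads = static_cast<std::size_t>(omp_get_num_threads());

        // Round-robin stand-in for the custom scheduling:
        // thread t handles slices t, t + T, t + 2T, ...
        for (auto slice = tid; slice < n; slice += num_threads)
        {
            auto* tens1_data_stream = tens1.data() + slice * slice_size;
            const auto* tens2_data_stream = tens2.data() + slice * slice_size;

            #pragma omp simd safelen(8)
            for (auto d_index = std::size_t{}; d_index < slice_size; ++d_index)
            {
                tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
                tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
            }
        }
    }
}
```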
The target computer has an Intel(R) Xeon(R) Platinum 8168 CPU with
- 24 cores @ 2.70 GHz,
- L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB,
- the memory bandwidth is 115 GB/s (source: Intel Advisor's roofline plot).
Notes:
- If N = 90, the data size is 90 · 90 · 90 · 16 B ≈ 11.66 MB per tensor.
- Expressed in terms of the problem size N, the caches correspond to roughly L1 ~ 12, L2 ~ 40 and L3 ~ 129.
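These figures can be reproduced with a few lines. This check is just my own arithmetic (it counts a single tensor against each cache level), not part of the benchmark:

```cpp
#include <cmath>
#include <cstdio>

// Sanity check of the numbers in the notes: one element is a double-precision
// complex (16 B), one tensor holds N^3 elements.
int main()
{
    constexpr double elem_bytes = 16.0;

    const double n = 90.0;
    std::printf("N = 90: %.3f MB per tensor\n", n * n * n * elem_bytes / 1.0e6);

    // Largest N for which a single tensor still fits in each cache level.
    const struct { const char* name; double bytes; } caches[] = {
        {"L1", 32.0 * 1024.0},
        {"L2", 1024.0 * 1024.0},
        {"L3", 33.0 * 1024.0 * 1024.0},
    };
    for (const auto& cache : caches)
    {
        const auto max_n = static_cast<int>(std::cbrt(cache.bytes / elem_bytes));
        std::printf("%s: N ~ %d\n", cache.name, max_n);
    }
    // Prints: 11.664 MB per tensor; L1: N ~ 12, L2: N ~ 40, L3: N ~ 129.
}
```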
Question 1: Why is there an unexpected peak at N = 90? In fact, why isn't the whole plot just a simple straight line?
If you suspect cache effects, I would be grateful if you could explain them in more detail.
I can provide more information if necessary. Thanks in advance.
EDIT:
Here's a weak-scaling plot with finer resolution:

[weak-scaling plot]

And here's a problem-size scaling plot with only 1 thread:

[single-threaded problem-size scaling plot]
Question 2: If I initialize my data by touching only the first entry of each memory page (i.e. my data has not been loaded into cache yet), why am I still seeing the typical stepped performance graph?
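To make the initialization pattern concrete, this is roughly what I mean (a sketch, assuming 4 kB pages and the `Complex` stand-in from above; `first_touch_init` is an illustrative name, not the one in my code):

```cpp
#include <cstddef>

struct Complex { double real; double imag; };  // same stand-in as above

// Touch only the first element of each memory page, so the pages get mapped
// but at most one cache line per page is brought into cache. The 4 kB page
// size is an assumption (the usual default on Linux x86-64).
void first_touch_init(Complex* data, const std::size_t num_elements)
{
    constexpr std::size_t page_bytes = 4096;
    constexpr std::size_t elems_per_page = page_bytes / sizeof(Complex);  // 256

    for (std::size_t i = 0; i < num_elements; i += elems_per_page)
    {
        data[i].real = 0.0;
        data[i].imag = 0.0;
    }
}
```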
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow