'2D blocking with unique matrix transpose problem
I have complex valued data of type struct complex {double real = 0.0; double imag = 0.0;}; organized in the form of an order-3 tensor. The underlying container has a contiguous memory layout aligned with the memory-page boundary.
The natural 'slicing' direction of the tensor is along direction 1. This means the cache-lines extend in directions 3, 2 and finally 1, in that order. In other words, the indexing function looks like this: (i, j, k) -> i * N2 * N3 + j * N3 + k.

I am required to transpose the slices along direction 2. In the first of the above images, the rectangle in red is the slice of the tensor which I wish to have transposed.
My code in C++ looks like this:
for (auto vslice_counter = std::size_t{}; vslice_counter < n2; ++vslice_counter)
{
// blocked loop
for (auto bi1 = std::size_t{}; bi1 < n1; bi1 += block_size)
{
for (auto bi3 = std::size_t{}; bi3 < n3; bi3 += block_size)
{
for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
{
for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
{
const auto i1_block = bi1 + i1;
const auto i3_block = bi3 + i3;
tens_tr(i3_block, vslice_counter, i1_block) = tens(i1_block, vslice_counter, i3_block);
}
}
}
}
}
Machine used for testing: dual socket Intel(R) Xeon(R) Platinum 8168 with
- 24 cores @ 2.70 GHz and
- L1, L2 and L3 caches sized 32 kB, 1 MB and 33 MB respectively.
I plotted the graph of the performance of this function w.r.t block-size but I was surprised to see that there is no variation! In fact, the naive implementation performs just as well as this one.
Question: Is there a reason why 2D blocking isn't helping boost the performance in this case?
Edit:
Additional information:
- The compiler used is the Intel C++ compiler.
block_sizeis an input parameter and not a compile time constant.- During testing, I varied the value of
block_sizefrom 4 to 256. This translates to blocks of 256 B and 1 MB respectively since the blocks are squares of side-lengthblock_size. My aim was to fill up the L1 and/or L2 caches with the blocks. - Compilation flags used for optimization:
-O3;-ffast-math;-march=native;-qopt-zmm-usage=high
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
