2D blocking with unique matrix transpose problem

I have complex-valued data of type `struct complex { double real = 0.0; double imag = 0.0; };` organized in the form of an order-3 tensor. The underlying container has a contiguous memory layout aligned with the memory-page boundary.

The natural 'slicing' direction of the tensor is along direction 1. This means the cache lines extend in directions 3, 2 and finally 1, in that order. In other words, the indexing function looks like this: (i, j, k) -> i * N2 * N3 + j * N3 + k.
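For concreteness, the indexing function above can be written as a small helper (a sketch; `N2` and `N3` are the extents in directions 2 and 3, as in the formula, and `flat_index` is a hypothetical name):

```cpp
#include <cstddef>

// (i, j, k) -> i * N2 * N3 + j * N3 + k: direction 3 is contiguous,
// then direction 2, then direction 1 (illustrative sketch).
constexpr std::size_t flat_index(std::size_t i, std::size_t j, std::size_t k,
                                 std::size_t N2, std::size_t N3)
{
    return i * N2 * N3 + j * N3 + k;
}
```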

I am required to transpose the slices along direction 2. In the first of the images above, the red rectangle marks the slice of the tensor which I wish to have transposed.

My code in C++ looks like this:

    for (auto vslice_counter = std::size_t{}; vslice_counter < n2; ++vslice_counter)
    {
        // blocked loop over the (1, 3) plane of each slice
        for (auto bi1 = std::size_t{}; bi1 < n1; bi1 += block_size)
        {
            for (auto bi3 = std::size_t{}; bi3 < n3; bi3 += block_size)
            {
                // traverse the current block_size x block_size tile
                for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
                {
                    for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
                    {
                        const auto i1_block = bi1 + i1;
                        const auto i3_block = bi3 + i3;
                        tens_tr(i3_block, vslice_counter, i1_block) =
                            tens(i1_block, vslice_counter, i3_block);
                    }
                }
            }
        }
    }
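
As written, the inner loops assume n1 and n3 are exact multiples of block_size. A self-contained sketch of the same 2D blocking, with `std::min` clamps for partial edge blocks, might look like the following. The `Tensor3` stand-in type (storing one `double` per element rather than the complex struct) and the function name are hypothetical; the question's `tens`/`tens_tr` accessors play the same role:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal stand-in for the question's tensor type, using the layout
// (i, j, k) -> i * n2 * n3 + j * n3 + k described above.
struct Tensor3 {
    std::size_t n1, n2, n3;
    std::vector<double> data;
    Tensor3(std::size_t a, std::size_t b, std::size_t c)
        : n1(a), n2(b), n3(c), data(a * b * c, 0.0) {}
    double& operator()(std::size_t i, std::size_t j, std::size_t k)
    { return data[i * n2 * n3 + j * n3 + k]; }
    double operator()(std::size_t i, std::size_t j, std::size_t k) const
    { return data[i * n2 * n3 + j * n3 + k]; }
};

// Same 2D blocking as in the question, but the tile bounds are clamped
// with std::min so n1 and n3 need not be multiples of block_size.
// tens_tr must have dimensions (n3, n2, n1).
void transpose_slices(const Tensor3& tens, Tensor3& tens_tr,
                      std::size_t block_size)
{
    for (std::size_t vslice = 0; vslice < tens.n2; ++vslice)
        for (std::size_t bi1 = 0; bi1 < tens.n1; bi1 += block_size)
            for (std::size_t bi3 = 0; bi3 < tens.n3; bi3 += block_size)
            {
                const std::size_t i1_end = std::min(bi1 + block_size, tens.n1);
                const std::size_t i3_end = std::min(bi3 + block_size, tens.n3);
                for (std::size_t i1 = bi1; i1 < i1_end; ++i1)
                    for (std::size_t i3 = bi3; i3 < i3_end; ++i3)
                        tens_tr(i3, vslice, i1) = tens(i1, vslice, i3);
            }
}
```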

Machine used for testing: dual socket Intel(R) Xeon(R) Platinum 8168 with

  1. 24 cores @ 2.70 GHz and
  2. L1, L2 and L3 caches sized 32 kB, 1 MB and 33 MB respectively.

I plotted the performance of this function against block size, and I was surprised to see that there is no variation! In fact, the naive implementation performs just as well as this one.

Question: Is there a reason why 2D blocking isn't helping boost the performance in this case?


Edit:

Additional information:

  1. The compiler used is the Intel C++ compiler.
  2. block_size is an input parameter and not a compile time constant.
  3. During testing, I varied the value of block_size from 4 to 256. Since the blocks are squares of side length block_size and each complex element occupies 16 B (two doubles), this translates to blocks of 256 B and 1 MB respectively. My aim was to fill up the L1 and/or L2 caches with the blocks.
  4. Compilation flags used for optimization: -O3;-ffast-math;-march=native;-qopt-zmm-usage=high
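
The byte figures in point 3 follow from the element size: the complex struct holds two doubles, i.e. 16 B per element, so a square block occupies block_size² × 16 B. A quick sanity check of that arithmetic (the helper name is illustrative only):

```cpp
#include <cstddef>

// Bytes occupied by one square block of complex elements
// (two doubles per element = 16 B); illustrative arithmetic only.
constexpr std::size_t block_bytes(std::size_t block_size)
{
    constexpr std::size_t elem_bytes = 2 * sizeof(double); // 16 B per complex
    return block_size * block_size * elem_bytes;
}
```

block_bytes(4) gives 256 B and block_bytes(256) gives 1 MB, matching the range stated above; note the working set is actually twice that, since both the source and destination blocks are touched.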


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
