'Why can adding padding make your loop faster?

People told me that adding padding can help to have better performance because it's using the cache in a better way.

I don't understand how is it possible that by making your data bigger you get better performance.

Can someone understand why?



Solution 1:[1]

Array padding

The array padding technique consists of increasing the size of the array dimensions in order to reduce conflict misses when accessing a cache memory. This type of miss can occur when the number of accessed elements mapping to the same set is greater than the degree of associativity of the cache. Padding changes the data layout and can be applied (1) between variables (Inter-Variable Padding) or (2) to a variable (Intra-Variable Padding):

1. Inter-Variable Padding

float x[LEN], padding[P], y[LEN];

float redsum() {
    float s = 0;

    for (int i = 0; i < LEN; i++)
        s = s + x[i] + y[i];

    return s;
}

If we have a direct mapped cache and the elements x[i] and y[i] are mapped into the same set, accesses to x will evict a block from y and vice versa, resulting in a high miss rate and low performance.

2. Intra-Variable Padding

float x[LEN][LEN+PAD], y[LEN][LEN];

void symmetrize() {
    for (int i = 0; i < LEN; i++) {
        for (int j = 0; j < LEN; j++)
            y[i][j] = 0.5 *(x[i][j] + x[j][i]);
    }
}

In this case, if the elements of a column are mapped into a small number of sets, their sequence of accesses may lead to conflict misses, so that the spatial locality would not be exploited.

For example, suppose that during the first iteration of the outer loop, the block containing x[0][0] x[0][1] ... x[0][15] is evicted to store the block containing the element x[k][0]. Then, at the start of the second iteration, the reference to x[0][1] would cause a cache miss.

This technical document analyses the performance of the Fast Fourier Transform (FFT) as a function of the size of the matrix used in the calculations:

https://www.intel.com/content/www/us/en/developer/articles/technical/fft-length-and-layout-advisor.html

References

Gabriel Rivera and Chau-Wen Tseng. Data transformations for eliminating conflict misses. PLDI 1998. DOI: https://doi.org/10.1145/277650.277661

Changwan Hong et al. Effective padding of multidimensional arrays to avoid cache conflict misses. PLDI 2016. DOI: https://doi.org/10.1145/2908080.2908123

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 chus