'Using one loop vs two loops

I was reading this blog :- https://developerinsider.co/why-is-one-loop-so-much-slower-than-two-loops/. And I decided to check it out using C++ and Xcode. So, I wrote a simple program given below and when I executed it, I was surprised by the result. Actually the 2nd function was slower compared to the first function contrary to what is stated in the article. Can anyone please help me figure out why this is the case?

#include <iostream>
#include <vector>
#include <chrono>
    
using namespace std::chrono;
    
void function1() {
    const int n=100000;
            
    int a1[n], b1[n], c1[n], d1[n];
            
    for(int j=0;j<n;j++){
        a1[j] = 0;
        b1[j] = 0;
        c1[j] = 0;
        d1[j] = 0;
    }
            
    auto start = high_resolution_clock::now();
        
    for(int j=0;j<n;j++){
        a1[j] += b1[j];
        c1[j] += d1[j];
    }
            
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);
        
    std::cout << duration.count() << " Microseconds." << std::endl;  
}
    
void function2() {
    const int n=100000;
            
    int a1[n], b1[n], c1[n], d1[n];
            
    for(int j=0; j<n; j++){
        a1[j] = 0;
        b1[j] = 0;
        c1[j] = 0;
        d1[j] = 0;
    }
            
    auto start = high_resolution_clock::now();
            
    for(int j=0; j<n; j++){
        a1[j] += b1[j];
    }
    
    for(int j=0;j<n;j++){
        c1[j] += d1[j];
    }
            
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);
        
    std::cout << duration.count() << " Microseconds." << std::endl;
}
        
int main(int argc, const char * argv[]) {
    function1();
    function2();
    
    return 0;
}


Solution 1:[1]

TL;DR: The loops are basically the same, and if you are seeing differences, then your measurement is wrong. Performance measurement and more importantly, reasoning about performance requires a lot of computer knowledge, some scientific rigor, and much engineering acumen. Now for the long version...


Unfortunately, there is some very inaccurate information in the article to which you've linked, as well as in the answers and some comments here.

Let's start with the article. There won't be any disk caching that has any effect on the performance of these functions. It is true that virtual memory is paged to disk, when demand on physical memory exceeds what's available, but that's not a factor that you have to consider for programs that touch 1.6MB of memory (4 * 4 * 100K).

And if paging comes into play, the performance difference won't exactly be subtle either. If these arrays were paged to disk and back, the performance difference would be in order of 1000x for fastest disks, not 10% or 100%.

Paging and page faults and its effect on performance is neither trivial, nor intuitive. You need to read about it, and experiment with it seriously. What little information that article has is completely inaccurate to the point of being misleading.

The second is your profiling strategy and the micro-benchmark itself. Clearly, with such simple operations on data (an add,) the bottleneck will be memory bandwidth itself (maybe instruction retire limits or something like that with such a simple loop.) And since you only read memory linearly, and use all you read, whether its in 4 interleaving streams or 2, you are making use of all the bandwidth that is available.

However, if you call your function1 or function2 in a loop, you will be measuring the bandwidth of different parts of the memory hierarchy depending on N, from L1 all the way to L3 and main memory. (You should know the size of all levels of cache on your machine, and how they work.) This is obvious if you know how CPU caches work, and really mystifying otherwise. Do you want to know how fast this is when you do it the first time, when the arrays are cold, or do you want to measure the hot access?

Is your real use case copying the same mid-sized array over and over again?

If not, what is it? What are you benchmarking? Are you trying to measure something or just experimenting?

Shouldn't you be measuring the fastest run through a loop, rather than the average since that can be massively affected by a (basically random) context switch or an interrupt?

Have you made sure you are using the correct compiler switches? Have you looked at the generated assembly code to make sure the compiler is not adding debug checks and what not, and is not optimizing stuff away that it shouldn't (after all, you are just executing useless loops, and an optimizing compiler wants nothing more than to avoid generating code that is not needed).

Have you looked at the theoretical memory/cache bandwidth number for your hardware? Your specific CPU and RAM combination will have theoretical limits. And be it 5, 50, or 500 GiB/s, it will give you an upper bound on how much data you can move around and work with. The same goes with the number of execution units, the IPC or your CPU, and a few dozen other numbers that will affect the performance of this kind of micro-benchmark.

If you are reading 4 integers (4 bytes each, from a, b, c, and d) and then doing two adds and writing the two results back, and doing it 100'000 times, then you are - roughly - looking at 2.4MB of memory read and write. If you do it 10 times in 300 micro-seconds, then your program's memory (well, store buffer/L1) throughput is about 80 GB/s. Is that low? Is that high? Do you know? (You should have a rough idea.)

And let me tell you that the other two answers here at the time of this writing (namely this and this) do not make sense. I can't make heads nor tails of the first one, and the second one is almost completely wrong (conditional branches in a 100'000-times for loop are bad? allocating an additional iterator variable is costly? cold access to array on stack vs. on the heap has "serious performance implications?)

And finally, as written, the two functions have very similar performances. It is really hard separating the two, and unless you can measure a real difference in a real use case, I'd say write whichever one that makes you happier.

If you really really want a theoretical difference between them, I'd say the one with two separate loops is very slightly better because it is usually not a good idea interleaving access to unrelated data.

Solution 2:[2]

This has nothing to do with caching or instruction efficiency. Simple iterations over long vectors are purely a matter of bandwidth. (Google: stream benchmark.) And modern CPUs have enough bandwidth to satisfy not all of their cores, but a good deal.

So if you combine the two loops, executing them on a single core, there is probably enough bandwidth for all loads and stores at the rate that memory can sustain. But if you use two loops, you leave bandwidth unused, and the runtime will be a little less than double.

Solution 3:[3]

The reasons why the second is faster in your case (I do not think that this works on any machine) is better cpu caching at the point at ,which you cpu has enough cache to store the arrays, the stuff your OS requires and so on, the second function will probably be much slower than the first. from a performance standpoint. I doubt that the two loop code will give better performance if there are enough other programs running as well, because the second function has obviously worse efficiency then the first and if there is enough other stuff cached the performance lead throw caching will be eliminated.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3