'When should we use prefetch?

Some CPU and compilers supply prefetch instructions. Eg: __builtin_prefetch in GCC Document. Although there is a comment in GCC's document, but it's too short to me.

I want to know, in practice, when should we use prefetch? Are there some examples?

Solution 1:^[1]

On recent Intel chips one reason you apparently might want to use prefetching is to avoid CPU power-saving features artificially limiting your achieved memory bandwidth. In this scenario, simple prefetching can as much as double your performance versus the same code without prefetching, but it depends entirely on the selected power management plan.

I ran a simplified version (code here)of the test in Leeor's answer, which stresses the memory subsystem a bit more (since that's where prefetch will help, hurt or do nothing). The original test stressed the CPU in parallel with the memory subsystem since it added together every int on each cache line. Since typical memory read bandwidth is in the region of 15 GB/s, that's 3.75 billion integers per second, putting a pretty hard cap on the maximum speed (code that isn't vectorized will usually process 1 int or less per cycle, so a 3.75 GHz CPU will be about equally CPU and memory bount).

First, I got results that seemed to show prefetching kicking butt on my i7-6700HQ (Skylake):

Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=221, MiB/s=11583
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=221, MiB/s=11583
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=204, MiB/s=12549
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=200, MiB/s=12800
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=201, MiB/s=12736
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=197, MiB/s=12994
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305

Eyeballing the numbers, prefetch achieves something a bit above 16 GiB/s and without only about 12.5, so prefetch is increasing speed by about 30%. Right?

Not so fast. Remembering that the powersaving mode has all sorts of wonderful interactions on modern chips, I changed my Linux CPU governor to performance from the default of powersave¹. Now I get:

Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=155, MiB/s=16516
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=144, MiB/s=17777
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=144, MiB/s=17777
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=152, MiB/s=16842
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=159, MiB/s=16100
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=163, MiB/s=15705
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=161, MiB/s=15900

It's a total toss-up. Both with and without prefetching seem to perform identically. So either hardware prefetching is less aggressive in the high powersaving modes, or there is some other interaction with power saving that behaves differently with the explicit software prefetches.

Investigation

In fact, the difference between prefetching and not is even more extreme if you change the benchark. The existing benchmark alternates between runs with prefetching on and off, and it turns out that this helped the "off" variant because the speed increase which occurs in the "on" test partly carries over to the subsequent off test². If you run only the "off" test you get results around 9 GiB/s:

Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=280, MiB/s=9142
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=277, MiB/s=9241
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=285, MiB/s=8982

... versus about 17 GiB/s for the prefetching version:

Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=149, MiB/s=17181
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=148, MiB/s=17297
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=148, MiB/s=17297

So the prefetching version is almost twice as fast.

Let's take a look at what's going on with perf stat, for both the **off* version:

Performance counter stats for './prefetch-test off':

   2907.485684      task-clock (msec)         #    1.000 CPUs utilized                                          
 3,197,503,204      cycles                    #    1.100 GHz                    
 2,158,244,139      instructions              #    0.67  insns per cycle        
   429,993,704      branches                  #  147.892 M/sec                  
        10,956      branch-misses             #    0.00% of all branches

... and the on version:

   1502.321989      task-clock (msec)         #    1.000 CPUs utilized                          
 3,896,143,464      cycles                    #    2.593 GHz                    
 2,576,880,294      instructions              #    0.66  insns per cycle        
   429,853,720      branches                  #  286.126 M/sec                  
        11,444      branch-misses             #    0.00% of all branches

The difference is that the version with prefetching on consistently runs at the max non-turbo frequency of ~2.6 GHz (I have disabled turbo via an MSR). The version without prefetching, however, has decided to run at a much lower speed of 1.1 GHz. Such large CPU differences often also reflect a large difference in uncore frequency, which can explain the worse bandwdith.

Now we've seen this before, and it is probably an outcome of the Energy Efficient Turbo feature on recent Intel chips, which try to ramp down the CPU frequency when they determine a process is mostly memory bound, presumably since increased CPU core speed doesn't provide much benefit in those cases. As we can see here, this assumption isn't always true, but it isn't clear to me if the tradeoff is a bad one in general, or perhaps the heuristic only occasionally gets it wrong.

¹ I'm running the intel_pstate driver, which is the default for Intel chips on recent kernels which implements "hardware p-states", also known as "HWP". Command used: sudo cpupower -c 0,1,2,3 frequency-set -g performance.

² Conversely, the slowdown from the "off" test partly carries over into the "on" test, although the effect is less extreme, possibly because the powersaving "ramp up" behavior is faster than "ramp down".

Solution 2:^[2]

Here's a brief summary of cases that I'm aware of in which software prefetching may prove especially useful. Some may not apply to all hardware.

This list should be read from the point of view that the most obvious place software prefetches could be used is where the stream of accesses can be predicted in software, and yet this case isn't necessarily such an obvious win for SW prefetch because out-of-order processing often ends up having a similar effect since it can execute behind existing misses in order to get more misses in flight.

So this list is more a "in light of the fact that SW prefetch isn't as obviously useful as it might first seem, here are some places it might still be useful anyways", often compared to the alternative of either just letting out-of-order processing do its thing or just using "plain loads" to load some values before they are needed.

Fitting more loads in the out-of-order window

Although out-of-order processing can potentially expose the same type of MLP (Memory-Level Parallelism) as software prefetches, there are limits inherent to the total possible lookahead distance after a cache miss. These include reorder-buffer capacity, load buffer capacity, scheduler capacity and so on. See this blog post for an example of where extra work seriously hinders MLP because the CPU can't run ahead far enough to get enough loads executing at once.

In this case, software prefetch allows you to effectively stuff more loads earlier in the instruction stream. As an example, imagine you have a loop which performs one load and then 20 instructions worth of work on the loaded data, and your CPU has an out-of-order buffer of 100 instructions and that loads are independent from each other (e.g,. accessing an array with a known stride).

After the first miss, you can run ahead 99 more instructions which will be composed of 95 non-load and 5 load instructions (including the first load). So your MLP is inherently limited to 5 by the size of the out-of-order buffer. If instead you paired every load with two software prefetches to a location say 6 or more iterations ahead, you'd end up instead with 90 non-load instructions, 5 loads and 5 software prefetches and since all those loads are you just doubled your MLP to 10².

There is of course no limit of one additional prefetch per load: you could add more to hit higher numbers, but there is a point of diminishing and then negative returns as you hit the MLP limits of your machine and the prefetches take up resources you'd rather spend on other things.

This is similar to software pipelining, where you load data for a future iteration, and then don't touch that register until after a significant amount of other work. This was mostly used on in-order machines to hide latency of computation as well as memory. Even on a RISC with 32 architectural registers, software-pipelining typically can't place the loads as far ahead of use as an optimal prefetch-distance on a modern machine; the amount of work a CPU can do during one memory latency has grown a lot since the early days of in-order RISCs.

In-order machines

Not all machines are bit out-of-order cores: in-order CPUs are still common in some places (especially outside x86), and you'll also find "weak" out of order cores that don't have the capability to run ahead very far and so partly act like in-order machines.

On these machines software prefetches may help gain MLP that you wouldn't otherwise be able access (of course, an in-order machine probably doesn't support a lot of inherent MLP otherwise).

Working around hardware prefetch restrictions

Hardware prefetch may have restrictions which you could work around using software prefetch.

For example, Leeor's answer has an example of hardware prefetch stopping at page boundaries, while software prefetch doesn't have any such restriction.

Another example might be any time that hardware prefetch is too aggressive or too conservative (after all it has to guess at your intentions): you might use software prefetch instead since you know exactly how your application will behave.

Examples of the latter include prefetching discontiguous areas: such as rows in a sub-matrix of a larger matrix: hardware prefetch won't understand the boundaries of the "rectangular" region and will constantly prefetch beyond the end of each row, and then take a bit of time to pick up the new row pattern. Software prefetching can get this exactly right: never issuing any useless prefetches at all (but it often requires ugly splitting of loops).

If you do enough software prefetches, the hardware prefeteches should in theory mostly shut down, because the activity of the memory subsystem is one heuristic they use to decide whether to activate.

Counterpoint

I should note here that software prefetching is not equivalent to hardware prefetching when it comes to possible speedups for cases the hardware prefetching can pick up: hardware prefetching can be considerably faster. That is because hardware prefetching can start working closer to memory (e.g., from the L2) where it has a lower latency to memory and also access to more buffers (in the so-called "superqueue" on Intel chips) and so more concurrency. So if you turn off hardware prefetching and try to implement a memcpy or some other streaming load with pure software prefetching, you'll find that it is likely slower.

Special load hints

Prefetching may give you access to special hints that you can't achieve with regular loads. For example x86 has the prefetchnta, prefetcht0, prefetcht1, and prefetchw instructions which hint to the processor how to treat the loaded data in the caching subsystem. You can't achieve the same effect with plain loads (at least on x86).

² It's not actually as simple as just adding a single prefetch to the loop, since after the first five iterations, the loads will start hitting already prefetched values, reducing your MLP back to 5 - but the idea still holds. A real implementation would also involve reorganizing the loop so that the MLP can be sustained (e.g., "jamming" the loads and prefetches together every few iterations).

Solution 3:^[3]

There are definitely situations where software prefetch provides significant performance improvements.

For example, if you are accessing a relatively slow memory device such as Optane DC Persistent Memory, which has access times of several hundred nanoseconds, prefetching can reduce effective latency by 50 percent or more if you can do it far enough in advance of the read or write.

This isn't a very common case at present but it will become a lot more common if and when such storage devices become mainstream.

Solution 4:^[4]

The article 'What Every Programmer Should Know About Memory Ulrich Drepper' discusses situations where pre-fetching is advantageous; http://www.akkadia.org/drepper/cpumemory.pdf , warning: this is quite a long article that discusses things like memory architecture / how the cpu works, etc.

prefetching gives something if the data is aligned to cache lines; and if you are loading data that is about to be accessed by the algorithm;

In any event one should do this when trying to optimize highly used code; benchmarking is a must and things usually work out differently than one might use to think.

Solution 5:^[5]

It seems, that the best policy to follow is to never use __builtin_prefetch (and its friend, __builtin_expect) at all. On some platforms those may help (and even help a lot) - however, one must always do some benchmarking to confirm this. The real question, whether the short term performance gains will worth the trouble in the longer run.

First, one may ask the following question: what these statements actually do when fed to a higher end modern CPU? The answer is: nobody really knows (except, may be, few guys on the CPU's core architecture team, but they are not going to tell anybody). Modern CPUs are very complex machines, capable of instruction reordering, speculative execution of instructions across possibly not taken branches, etc., etc. Moreover, the details of this complex behavior may (and will) differ considerably between CPU generations and vendors (Intel Core vs Intel I* vs AMD Opteron; with more fragmented platforms like ARM the situation is even worse).

One neat example (not prefetch related, but still) of CPU functionality which used to speed things up on older Intel CPUs, but sucks badly on the more modern one is outlined here: http://lists-archives.com/git/744742-git-gc-speed-it-up-by-18-via-faster-hash-comparisons.html. In that particular case, it was possible to achieve 18% performance increase by replacing the optimized version of gcc supplied memcmp with an explicit ("naive" so to say) loop.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2
Solution 3	s. heller
Solution 4	MichaelMoser
Solution 5	oakad

'When should we use prefetch?

Solution 1:[1]