'Why don't compilers merge redundant std::atomic writes?

I'm wondering why no compilers are prepared to merge consecutive writes of the same value to a single atomic variable, e.g.:

#include <atomic>
std::atomic<int> y(0);
void f() {
  auto order = std::memory_order_relaxed;
  y.store(1, order);
  y.store(1, order);
  y.store(1, order);
}

Every compiler I've tried will issue the above write three times. What legitimate, race-free observer could see a difference between the above code and an optimized version with a single write (i.e. doesn't the 'as-if' rule apply)?

If the variable had been volatile, then obviously no optimization is applicable. What's preventing it in my case?

Here's the code in compiler explorer.

c++multithreading c++11 compiler-optimization stdatomic

Solution 1:^[1]

You are referring to dead-stores elimination.

It is not forbidden to eliminate an atomic dead store but it is harder to prove that an atomic store qualifies as such.

Traditional compiler optimizations, such as dead store elimination, can be performed on atomic operations, even sequentially consistent ones.
Optimizers have to be careful to avoid doing so across synchronization points because another thread of execution can observe or modify memory, which means that the traditional optimizations have to consider more intervening instructions than they usually would when considering optimizations to atomic operations.
In the case of dead store elimination it isn’t sufficient to prove that an atomic store post-dominates and aliases another to eliminate the other store.

^{from N4455 No Sane Compiler Would Optimize Atomics}

The problem of atomic DSE, in the general case, is that it involves looking for synchronization points, in my understanding this term means points in the code where there is happen-before relationship between an instruction on a thread A and instruction on another thread B.

Consider this code executed by a thread A:

y.store(1, std::memory_order_seq_cst);
y.store(2, std::memory_order_seq_cst);
y.store(3, std::memory_order_seq_cst);

Can it be optimised as y.store(3, std::memory_order_seq_cst)?

If a thread B is waiting to see y = 2 (e.g. with a CAS) it would never observe that if the code gets optimised.

However, in my understanding, having B looping and CASsing on y = 2 is a data race as there is not a total order between the two threads' instructions.
An execution where the A's instructions are executed before the B's loop is observable (i.e. allowed) and thus the compiler can optimise to y.store(3, std::memory_order_seq_cst).

If threads A and B are synchronized, somehow, between the stores in thread A then the optimisation would not be allowed (a partial order would be induced, possibly leading to B potentially observing y = 2).

Proving that there is not such a synchronization is hard as it involves considering a broader scope and taking into account all the quirks of an architecture.

As for my understanding, due to the relatively small age of the atomic operations and the difficulty in reasoning about memory ordering, visibility and synchronization, compilers don't perform all the possible optimisations on atomics until a more robust framework for detecting and understanding the necessary conditions is built.

I believe your example is a simplification of the counting thread given above, as it doesn't have any other thread or any synchronization point, for what I can see, I suppose the compiler could have optimised the three stores.

Solution 2:^[2]

While you are changing the value of an atomic in one thread, some other thread may be checking it and performing an operation based on the value of the atomic. The example you gave is so specific that compiler developers don't see it worth optimizing. However, if one thread is setting e.g. consecutive values for an atomic: 0, 1, 2, etc., the other thread may be putting something in the slots indicated by the value of the atomic.

Solution 3:^[3]

NB: I was going to comment this but it's a bit too wordy.

One interesting fact is that this behavior isn't in the terms of C++ a data race.

Note 21 on p.14 is interesting: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf (my emphasis):

The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic

Also on p.11 note 5 :

“Relaxed” atomic operations are not synchronization operations even though, like synchronization operations, they cannot contribute to data races.

So a conflicting action on an atomic is never a data race - in terms of the C++ standard.

These operations are all atomic (and specifically relaxed) but no data race here folks!

I agree there's no reliable/predictable difference between these two on any (reasonable) platform:

include <atomic>
std::atomic<int> y(0);
void f() {
  auto order = std::memory_order_relaxed;
  y.store(1, order);
  y.store(1, order);
  y.store(1, order);
}

and

include <atomic>
std::atomic<int> y(0);
void f() {
  auto order = std::memory_order_relaxed;
  y.store(1, order);
}

But within the definition provided C++ memory model it isn't a data race.

I can't easily understand why that definition is provided but it does hand the developer a few cards to engage in haphazard communication between threads that they may know (on their platform) will statistically work.

For example, setting a value 3 times then reading it back will show some degree of contention for that location. Such approaches aren't deterministic but many effective concurrent algorithms aren't deterministic. For example, a timed-out try_lock_until() is always a race condition but remains a useful technique.

What it appears the C++ Standard is providing you with certainty around 'data races' but permitting certain fun-and-games with race conditions which are on final analysis different things.

In short the standard appears to specify that where other threads may see the 'hammering' effect of a value being set 3 times, other threads must be able to see that effect (even if they sometimes may not!). It's the case where pretty much all modern platforms that other thread may under some circumstances see the hammering.

Solution 4:^[4]

In short, because the standard (for example the paragaraphs around and below 20 in [intro.multithread]) disallows for it.

There are happens-before guarantees which must be fulfilled, and which among other things rule out reordering or coalescing writes (paragraph 19 even says so explicitly about reordering).

If your thread writes three values to memory (let's say 1, 2, and 3) one after another, a different thread may read the value. If, for example, your thread is interrupted (or even if it runs concurrently) and another thread also writes to that location, then the observing thread must see the operations in exactly the same order as they happen (either by scheduling or coincidence, or whatever reason). That's a guarantee.

How is this possible if you only do half of the writes (or even only a single one)? It isn't.

What if your thread instead writes out 1 -1 -1 but another one sporadically writes out 2 or 3? What if a third thread observes the location and waits for a particular value that just never appears because it's optimized out?

It is impossible to provide the guarantees that are given if stores (and loads, too) aren't performed as requested. All of them, and in the same order.

Solution 5:^[5]

A practical use case for the pattern, if the thread does something important between updates that does not depend on or modify y, might be: *Thread 2 reads the value of y to check how much progress Thread 1 has made.`

So, maybe Thread 1 is supposed to load the configuration file as step 1, put its parsed contents into a data structure as step 2, and display the main window as step 3, while Thread 2 is waiting on step 2 to complete so it can perform another task in parallel that depends on the data structure. (Granted, this example calls for acquire/release semantics, not relaxed ordering.)

I’m pretty sure a conforming implementation allows Thread 1 not to update y at any intermediate step—while I haven’t pored over the language standard, I would be shocked if it does not support hardware on which another thread polling y might never see the value 2.

However, that is a hypothetical instance where it might be pessimal to optimize away the status updates. Maybe a compiler dev will come here and say why that compiler chose not to, but one possible reason is letting you shoot yourself in the foot, or at least stub yourself in the toe.

Solution 6:^[6]

Let's walk a little further away from the pathological case of the three stores being immediately next to each other. Let's assume there's some non-trivial work being done between the stores, and that such work does not involve y at all (so that data path analysis can determine that the three stores are in fact redundant, at least within this thread), and does not itself introduce any memory barriers (so that something else doesn't force the stores to be visible to other threads). Now it is quite possible that other threads have an opportunity to get work done between the stores, and perhaps those other threads manipulate y and that this thread has some reason to need to reset it to 1 (the 2nd store). If the first two stores were dropped, that would change the behaviour.

Solution 7:^[7]

The compiler writer cannot just perform the optimisation. They must also convince themselves that the optimisation is valid in the situations where the compiler writer intends to apply it, that it will not be applied in situations where it is not valid, that it doesn't break code that is in fact broken but "works" on other implementations. This is probably more work than the optimisation itself.

On the other hand, I could imagine that in practice (that is in programs that are supposed to do a job, and not benchmarks), this optimisation will save very little in execution time.

So a compiler writer will look at the cost, then look at the benefit and the risks, and probably will decide against it.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Margaret Bloom
Solution 2	Serge Rogatch
Solution 3
Solution 4	Damon
Solution 5	Davislor
Solution 6	Andre Kostur
Solution 7	gnasher729

'Why don't compilers merge redundant std::atomic writes?

Solution 1:[1]

Solution 2:[2]

Solution 3:[3]

Solution 4:[4]

Solution 5:[5]

Solution 6:[6]

Solution 7:[7]