Why does the second spin in a Spinlock give a performance boost?
Here is a basic Spinlock implemented with std::atomic_flag.
The author of the book claims that the second while loop in lock() boosts performance.
```cpp
class Spinlock
{
    std::atomic_flag flag{};
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            while (flag.test(std::memory_order_acquire)); // Spin here
        }
    }

    void unlock() {
        flag.clear(std::memory_order_release);
    }
};
```
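For context, here is a minimal usage sketch (not from the original question; thread and iteration counts are arbitrary) in which several threads contend for the lock around a shared counter. It is under this kind of contention that the inner spin loop matters:

```cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

// Assumes the Spinlock class shown above (requires C++20 for atomic_flag::test()).
int main() {
    Spinlock spin;
    long counter = 0;

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([&] {
            for (int j = 0; j < 100000; ++j) {
                spin.lock();   // contended test_and_set / test loop
                ++counter;     // critical section
                spin.unlock(); // release store
            }
        });
    }
    for (auto& t : workers) t.join();

    std::cout << counter << '\n'; // expected: 400000
}
```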
> The reason we use `test()` in an extra inner loop is performance: `test()` doesn't invalidate the cache line, whereas `test_and_set()` does.
Can someone please elaborate on this quote? `test()` is still a read operation and needs to read from memory, right?
Solution 1: [1]
An (atomic) store needs exclusive access to a cache line, but a plain atomic (release) store is still relatively efficient because the store buffer acts as a 'cache'; i.e. immediate access to the cache line is not required.
In the spinlock example, test_and_set is an atomic read-modify-write (RMW) operation, which is significantly slower by nature: it both reads and writes in a single operation.
That requires immediate access to the cache line, and anything still pending in the store buffer must be flushed before it can proceed.
In addition, the cache line must be locked to ensure exclusive access between the read and write.
This is expensive, but it's required for the spinlock to guarantee exclusive lock access.
When multiple threads are stomping on the same test_and_set operation, the cache line keeps bouncing between CPU cores.
The inner loop (with the atomic read) is an optimization because multiple threads/cores can have read-only access to the same cache line ('SHARED' in the MESI cache coherency protocol).
Once the inner-loop test operation reads false, the test_and_set in the outer loop attempts to acquire the lock, and only at that point is std::memory_order_acquire needed.
Using 'acquire' on the test operation in the inner loop is therefore pointless and a pessimization on weaker architectures where memory_order_acquire entails an extra CPU instruction (e.g. PowerPC).
Since the inner loop is all about optimization, it should use std::memory_order_relaxed.
```cpp
void lock() {
    while (flag.test_and_set(std::memory_order_acquire)) {
        while (flag.test(std::memory_order_relaxed)); // Spin here
    }
}
```
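As an aside, std::atomic_flag::test() is only available since C++20; on older toolchains the same test-and-test-and-set pattern can be sketched with std::atomic&lt;bool&gt; (the class name below is illustrative, not from the original answer):

```cpp
#include <atomic>

class SpinlockBool
{
    std::atomic<bool> flag{false};
public:
    void lock() {
        // exchange() is the RMW that actually takes the lock
        while (flag.exchange(true, std::memory_order_acquire)) {
            // plain load lets the cache line stay in the SHARED state
            while (flag.load(std::memory_order_relaxed)); // Spin here
        }
    }

    void unlock() {
        flag.store(false, std::memory_order_release);
    }
};
```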
To optimize further, a standalone fence could be used so that the CPU fence instruction is only executed after a thread has acquired the lock flag.
```cpp
class Spinlock
{
    std::atomic_flag flag{};
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_relaxed)) {
            while (flag.test(std::memory_order_relaxed)); // Spin here
        }
        std::atomic_thread_fence(std::memory_order_acquire);
    }

    void unlock() {
        flag.clear(std::memory_order_release);
    }
};
```
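Since both variants expose lock() and unlock(), they satisfy the BasicLockable requirements, so standard RAII wrappers can manage them. A small sketch (the function name is made up for illustration):

```cpp
#include <mutex>

void increment(Spinlock& spin, long& counter) {
    std::lock_guard<Spinlock> guard(spin); // unlocks automatically on scope exit
    ++counter;
}
```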
Is having an inner loop a useful optimization? Perhaps; it won't hurt, but if the inner loop is frequently reached, there is contention and a spinlock is probably not the right tool.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | LWimsey |
