While it gives examples of all the alternatives, this blog post does not explain clearly enough the differences between them.
For releasing a lock, it is obvious enough that a single atomic store is sufficient. All CPUs provide store instructions that are atomic when certain restrictions on size and alignment are observed.
So the only tricky part is acquiring the lock. A loop is needed so that, while the lock is owned by another thread, the acquiring thread waits until the lock is released.
So the question is what must be put in the wait loop.
There are many variants, but they can be grouped into 3 classes, from the least efficient to the most efficient.
The worst method for acquiring the lock is to use an atomic read-modify-write instruction in the wait loop, as in the first example from the blog post.
Historically, this is the earliest proposed method, with atomic test-and-set used first by Univac, then atomic swap proposed by Dijkstra, then atomic compare-and-swap used by IBM System/370.
In modern computers, where memory is very slow in comparison with the CPUs and the cache hierarchies are complex, the atomic read-modify-write instructions have a very high overhead, and using any of them inside a wait loop is a big mistake.
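To make this concrete, here is a minimal C11 sketch of the first class (my illustration, not the blog's code); the whole problem is that the atomic exchange sits inside the wait loop:

    #include <stdatomic.h>

    _Atomic int lock;  /* 0 = free, 1 = held */

    void bad_acquire(_Atomic int *lock) {
        /* read-modify-write inside the wait loop: every iteration of
           every waiter demands exclusive ownership of the cache line */
        while (atomic_exchange_explicit(lock, 1, memory_order_acquire))
            ;
    }

    void bad_release(_Atomic int *lock) {
        atomic_store_explicit(lock, 0, memory_order_release);
    }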
The second class of methods for acquiring the lock is improved over the first by moving the atomic read-modify-write instruction before the wait loop and using just a load-acquire instruction inside the loop.
The best known method of this second class is the ticket lock, where, before waiting, a ticket is obtained with an atomic fetch-and-increment instruction, then the order number for the next owner of the lock is loaded in the loop, until it becomes equal to the ticket obtained previously.
For this second class, there remains one cause of low efficiency: all the threads that wait to acquire the lock load a value from the same address. Most of the time they load the value from their own cache, with minimal overhead, but every time a thread releases the lock, the new value has to be broadcast to the caches of all the other waiting threads, generating a great number of transfers.
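For reference, a minimal C11 sketch of a ticket lock (again my illustration): the fetch-and-increment happens once, before the loop, and the wait loop itself contains only a load-acquire:

    #include <stdatomic.h>

    typedef struct {
        _Atomic unsigned next_ticket;  /* incremented by each arriving thread */
        _Atomic unsigned now_serving;  /* incremented on each release */
    } ticket_lock;

    void ticket_acquire(ticket_lock *l) {
        unsigned me = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                                memory_order_relaxed);
        /* only a load-acquire in the wait loop, but all the waiters
           still load from the same address */
        while (atomic_load_explicit(&l->now_serving,
                                    memory_order_acquire) != me)
            ;
    }

    void ticket_release(ticket_lock *l) {
        /* only the lock holder writes now_serving, so a plain
           load + store of the incremented value is safe */
        unsigned next = atomic_load_explicit(&l->now_serving,
                                             memory_order_relaxed) + 1;
        atomic_store_explicit(&l->now_serving, next, memory_order_release);
    }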
The third class of methods for acquiring a lock is like the second, except that it eliminates the contention for a single memory address that is loaded in the wait loop.
To achieve this, the unique location that is known by all threads is no longer the location tested inside the loop. Instead, it is used outside the loop to provide the address that should be tested inside the loop, an address that is different for each waiting thread, so there is no contention inside the wait loops, as all the loads are from different addresses.
This kind of queuing for acquiring a lock can be achieved, e.g., when each thread that wants to acquire the lock uses an atomic exchange instruction to obtain the address from which it should load values to determine when the lock has been released, while simultaneously storing the address where it will itself signal the release of the lock, for the next thread that wants to acquire it.
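A minimal C11 sketch of this idea, roughly a CLH-style queued lock (my illustration): the exchange on the one globally known tail pointer happens once, and the wait loop then loads only from the predecessor's per-thread node:

    #include <stdatomic.h>

    typedef struct clh_node { _Atomic int busy; } clh_node;

    /* the single globally known location; initialized to point
       to a dummy node whose busy flag is 0 */
    _Atomic(clh_node *) clh_tail;

    /* returns the node the thread must reuse for its next acquisition */
    clh_node *clh_acquire(clh_node *me) {
        atomic_store_explicit(&me->busy, 1, memory_order_relaxed);
        clh_node *pred =
            atomic_exchange_explicit(&clh_tail, me, memory_order_acq_rel);
        /* each waiter spins on a different address:
           its predecessor's node */
        while (atomic_load_explicit(&pred->busy, memory_order_acquire))
            ;
        return pred;
    }

    void clh_release(clh_node *me) {
        atomic_store_explicit(&me->busy, 0, memory_order_release);
    }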
Normally the first 2 classes of lock-acquiring methods are only of historical interest; there is no reason to ever use them, because they have no advantage over the third class of methods, which minimizes the overhead of the wait loops by having inside the loops only a load-acquire instruction from an address that is distinct for each thread.
The first method has the advantage of being very simple and it only takes one bit of space. Also, if your spin lock is almost always uncontended there is very little difference, while if it is contended, a spin lock is probably not appropriate anyway.
So I wouldn't consider trivial spinlocks just of historical interest.
In user-mode programs where simple mutual exclusion is needed, trivial spinlocks should never be used; only operating system functions like FUTEX_WAIT on Linux or WaitOnAddress on Windows.
In privileged code or in sophisticated user-mode applications, e.g. where the OS scheduler is bypassed and the threads are pinned to cores, trivial spin locks should never be used; only deterministic, efficient and fair spin locks.
So in my opinion there is no use case in which trivial spin locks can be recommended.
The spin lock acquiring methods of the first class that I have mentioned, even though they require the least storage, i.e. only one cache line, are not even deterministic, i.e. they do not guarantee forward progress for a thread that attempts to acquire the lock.
The second and the third class always guarantee forward progress. In the simplest case they ensure FIFO behavior for the lock, but they can be modified to implement a priority queue for the threads waiting to acquire the lock.
A ticket lock requires 2 cache lines instead of 1, while the third class of methods requires one cache line per thread, besides the one cache line known globally.
IMO the only use case for trivial spin locks is inside a more sophisticated mutex implementation: a mutex that starts with a spin lock, spins for a short while during lock acquisition, and then falls back to FUTEX_WAIT.
Sure, because putting futex into the critical path of uncontended locks would be an efficiency disaster. All the major libraries and runtimes spin for a bit before resorting to futex.
I do not believe that spinning a bit before invoking the operating system is an efficient method.
Checking at least once if the lock is free is absolutely necessary before invoking the operating system.
Optionally, it is good to wait a short time (e.g. with an empty loop) and then check again if the lock is free, before invoking the operating system to wait.
On the other hand, executing atomic read-modify-write instructions repeatedly in a loop, on an address contended with other threads, wastes energy and wastes time for the other threads, due to the generated cache traffic. Even if spinning on the lock gives your thread a higher chance of being the first to get the lock, it makes performance worse for the whole system. If you must be certain that no other thread gets the lock before yours, at least a ticket lock should be used, if not a better queued lock.
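As a sketch of what I mean, on Linux (error handling omitted; this is essentially the classic futex mutex from Drepper's "Futexes Are Tricky", with one recheck before sleeping):

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* 0 = free, 1 = held, 2 = held and there may be sleepers */
    void lock_slowly(_Atomic int *m) {
        int c = 0;
        if (atomic_compare_exchange_strong_explicit(m, &c, 1,
                memory_order_acquire, memory_order_relaxed))
            return;                        /* uncontended fast path */
        for (volatile int i = 0; i < 64; i++)
            ;                              /* short wait, no cache traffic */
        c = 0;
        if (atomic_compare_exchange_strong_explicit(m, &c, 1,
                memory_order_acquire, memory_order_relaxed))
            return;                        /* one recheck before the syscall */
        for (;;) {
            /* advertise a sleeper; if the lock was free we now hold it */
            if (atomic_exchange_explicit(m, 2, memory_order_acquire) == 0)
                return;
            syscall(SYS_futex, m, FUTEX_WAIT, 2, NULL, NULL, 0);
        }
    }

    void unlock(_Atomic int *m) {
        if (atomic_exchange_explicit(m, 0, memory_order_release) == 2)
            syscall(SYS_futex, m, FUTEX_WAKE, 1, NULL, NULL, 0);
    }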
If you define spinning to mean the most naive and bad way to repeatedly check a condition, then yes that would be bad, tautologically. Some libraries have used the x86 PAUSE instruction or its equivalent. Recent x86 offers UMWAIT which is also applicable. These are both good and efficient ways to briefly wait.
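For instance, a bounded polite wait on x86 might look like this (my sketch; UMWAIT needs the waitpkg extension and its own _umonitor/_umwait intrinsics, so only PAUSE is shown):

    #include <stdatomic.h>
    #include <immintrin.h>

    /* returns 1 if the lock was observed free within the budget */
    static int brief_wait(_Atomic int *lock, int spins) {
        while (spins--) {
            if (atomic_load_explicit(lock, memory_order_relaxed) == 0)
                return 1;
            _mm_pause();  /* hint to the core: this is a spin-wait loop */
        }
        return 0;
    }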
I have only used naive spinlocks in hacky duct-tape code for embedded RTOSes. And most of the time those will provide better mutex or semaphore functions that you should be using instead of the spin-wait.
Interesting that the example uses ARM but the moral of the story is "you shouldn't need to do this, and if you do, you have bigger problems"
There are plans to support userspace spinlocks in Linux by letting a userspace program borrow from the next scheduling quantum: https://lwn.net/Articles/948870/
I have used one-bit spinlocks to protect the wait queue of a more sophisticated waiting primitive for a user-space[1] application. Adding to the queue and signalling all could be done lock-free, but lock-free removal of a waiter before a signal (which I needed to support) is quite complicated, and borrowing a bit from the wait-list head field to encode a list-deleter spinlock was the best solution.
[1] one-thread per core, pinned, isolcpu, NO_HZ, etc. The usual.
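The trick looks roughly like this (a sketch with hypothetical names; it works because the list nodes are aligned, so bit 0 of the head pointer is free):

    #include <stdatomic.h>
    #include <stdint.h>

    #define DELETER_LOCK ((uintptr_t)1)

    /* wait-list head with the deleter spinlock folded into bit 0;
       lock-free pushes must preserve that bit when they CAS the head */
    _Atomic uintptr_t wait_head;

    static void lock_deleter(_Atomic uintptr_t *head) {
        uintptr_t old = atomic_load_explicit(head, memory_order_relaxed);
        for (;;) {
            old &= ~DELETER_LOCK;  /* only try from an unlocked snapshot */
            if (atomic_compare_exchange_weak_explicit(head, &old,
                    old | DELETER_LOCK,
                    memory_order_acquire, memory_order_relaxed))
                return;
        }
    }

    static void unlock_deleter(_Atomic uintptr_t *head) {
        atomic_fetch_and_explicit(head, ~DELETER_LOCK,
                                  memory_order_release);
    }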
The second and third class, as described, do not support release/“stop wait” before acquisition. Once you start waiting, you must continue waiting until you acquire the lock and then release it; otherwise the queue stalls forever.
I imagine it might be solvable if each acquirer, B, also atomically communicates its acquire criteria to the previous acquirer, A, so that if A does an early release it can forward its acquire criteria to B.
> For releasing a lock, it is obvious enough that a single atomic store is sufficient
It is not obvious at all. You need a fence plus a store (or a fenced store). And for a long time many lock implementations used a full fence. IIRC, for example, only relatively recently was there enough confidence to relax the spinlock release to a release barrier on Linux.
You are right, but I have not mentioned this detail because it is common to all kinds of locks, and I wanted to point out the important differences between them, not to give an exhaustive description of a spin lock algorithm.
Releasing a lock must be done with a release barrier followed by an atomic store.
On ISAs that do not have release barriers, a full memory barrier is required instead.
All x86 stores include an implicit release barrier, and the 64-bit Armv8 and Armv9 CPUs have store-release instructions that include an implicit release barrier, so on such CPUs an explicit release barrier is not needed.
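In portable C11 terms, all of the above is just (a sketch):

    #include <stdatomic.h>

    void spin_unlock(_Atomic int *lock) {
        /* compiles to a plain MOV on x86 (TSO already gives the
           ordering) and to STLR on 64-bit Arm */
        atomic_store_explicit(lock, 0, memory_order_release);
    }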
Clever. But aside from the fact that it would hardly be appropriate for a generic spinlock implementation, I'm not sure how you would thread a store-store dependency without a store fence (a load-store dependency would work, of course, but that's not enough).
I.e. how do you make your spin lock release store depend on all stores in the critical section? A store->load->store wouldn't work because of store buffers and load forwarding.
[Edit: another way of saying this is that there is no store equivalent of load_consume]
But I'm really only familiar with x86 and I do not know all the control and data dependency tricks used in more relaxed arches.
A swap that takes only 3 cycles is nice. I can see how you could write very specialized code for some critical sections, completely omitting all barriers even on the acquire side. But 3 cycles is low even for plain loads. Is the address-to-use latency so low on Firestorm?
3 cycles is throughput, not latency; latency for normal loads is not very different from desktop parts. They win for atomic ops because they don't have a fence built in; I don't understand why intel hasn't specced non-fencing atomic ops (there are far atomics, but those are different).
Ah, right, misread the table. I think the latency would be more relevant in this case.
Re Intel atomics, yes being fully fenced by default is annoying, but it kind of falls out of TSO: it would be strange if an RMW were to be more relaxed than normal stores, so, like all stores, it has to wait for prior stores to be flushed out from the store buffer before continuing, which is high latency. Normally the latency is not observable for stores, but an RMW necessarily cannot execute the load ahead of the store.
I haven't really found much documentation on them, but it is possible the far atomics might end up having these relaxed semantics (and I suspect most of the time they won't be implemented as actual far atomics). Do you have any more info on them than what's on the Intel docs?
The test-and-set lock has the "advantage" of being unfair. Being unfair could be seen as a problem, and indeed it is, especially when access to memory takes a time that depends on who performs the access. However, with a fair lock, if the lock holder or a waiter can be preempted and you have no way to schedule it back, it will cause all the waiters behind it to be delayed even if one of them is runnable, and that will quickly send performance down a cliff.
For that reason test-and-set spinlocks are sometimes used in userspace or in virtual machines.
> In modern computers with memories that are very slow in comparison with the CPUs and with complex cache hierarchies, the atomic read-modify-write instructions have a very high overhead and using any of them inside a wait loop is a big mistake
AFAIK what looks like a memory write to you just gets pushed into the cache; then, for any other processor accessing that location, it goes through the cache coherency mechanism, which is independent of the RAM and vastly faster. So my understanding is that there won't be the speed cost you expect. (But I am not an expert)
The way you wrote it there is a use-after-free because other_flag is not valid anymore immediately after the previous holder exits. Instead you spin on your own flag (which the holder sets to true when releasing the lock).
The correct code requires the local atomic to be a pointer to struct, containing both a pointer to the next link and the "still locked" flag. Alternatively, the locked flag can be snuck into the low bit of the pointer.
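A sketch of that structure, the classic MCS lock (here the flag is 1 while still locked, and the previous holder clears it on release):

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct mcs_node {
        _Atomic(struct mcs_node *) next;  /* link to our successor */
        _Atomic int locked;               /* 1 while we must keep waiting */
    } mcs_node;

    void mcs_acquire(_Atomic(mcs_node *) *tail, mcs_node *me) {
        atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
        atomic_store_explicit(&me->locked, 1, memory_order_relaxed);
        mcs_node *pred =
            atomic_exchange_explicit(tail, me, memory_order_acq_rel);
        if (pred == NULL)
            return;                       /* queue was empty: lock is ours */
        atomic_store_explicit(&pred->next, me, memory_order_release);
        /* spin on our own node, which stays valid as long as we wait */
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;
    }

    void mcs_release(_Atomic(mcs_node *) *tail, mcs_node *me) {
        mcs_node *succ =
            atomic_load_explicit(&me->next, memory_order_acquire);
        if (succ == NULL) {
            mcs_node *expect = me;
            if (atomic_compare_exchange_strong_explicit(tail, &expect,
                    NULL, memory_order_acq_rel, memory_order_acquire))
                return;                   /* nobody was waiting */
            /* a successor swapped the tail but has not linked in yet */
            while ((succ = atomic_load_explicit(&me->next,
                        memory_order_acquire)) == NULL)
                ;
        }
        atomic_store_explicit(&succ->locked, 0, memory_order_release);
    }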
ARM implementation with LDAXR: https://github.com/ARM-software/arm-trusted-firmware/blob/ma...