Even after watching these videos and reading lots of articles on the topic, I still find the full C++ memory model extremely hard to understand. However, on x86 there are actually only a couple of things that one needs to understand to write correct lock-free code. This is laid out in a blog post: https://databasearchitects.blogspot.com/2020/10/c-concurrenc...
C++'s "seq_cst" model is simple. If you're having any issues understanding anything at all, just stick with seq_cst.
If you want slightly better performance on some processors, you can dip down into acquire-release. This second memory model is faster because of the concept of half-barriers.
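Concretely, in C++11 the ordering is chosen per operation. A minimal sketch (x here is just a hypothetical shared counter, not anything from the article):

#include <atomic>

std::atomic<int> x{0};

void demo() {
    x.store(1);                                // defaults to std::memory_order_seq_cst: the simple, safe choice
    int v = x.load();                          // also seq_cst by default
    x.store(v + 1, std::memory_order_release); // opting down into the acquire-release model
}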
Suppose you write:
a();
b();
acquire_barrier();
c();
d();
e();
release_barrier();
f();
g();
The compiler, CPU, and cache are ALLOWED to rearrange the code into the following:
acquire_barrier(); // Optimizer moved a() and b() from outside the barrier to inside the barrier
a();
b();
d();
c();
e();
g();
f();
release_barrier(); // Optimizer moved g() and f() from outside the barrier to inside the barrier
You're allowed to move code "inside", towards the barriers, but you are not allowed to move code "outside" of the half-barrier region: the acquire barrier only stops code below it from floating up above it, and the release barrier only stops code above it from sinking below it. Because more optimizations are available (to the compiler, the CPU, or the caches), half-barriers execute slightly faster than full sequential consistency.
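In C++11, the closest spelling of these free-standing half-barriers is std::atomic_thread_fence. A rough sketch of the shape above (a() through g() are stand-in functions, and note that real fences only synchronize when combined with atomic operations, so this illustrates the reordering rules rather than being a working synchronization recipe):

#include <atomic>

void a() {} void b() {} void c() {} void d() {}
void e() {} void f() {} void g() {}

void region() {
    a();
    b();
    std::atomic_thread_fence(std::memory_order_acquire); // half-barrier: code below may not float up above it
    c();
    d();
    e();
    std::atomic_thread_fence(std::memory_order_release); // half-barrier: code above may not sink down below it
    f();
    g();
}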
----------
Now that we've talked about things in the abstract, let's think about "actual" code. Let's say we have:
int i = 0;      // a()
i += 1;         // b()
full_barrier();
i += 2;         // c()
i += 3;         // d()
i += 4;         // e()
full_barrier();
i += 5;         // f()
i += 6;         // g()
As the optimizer, you're only allowed to optimize to...
int i = 1;      // a() and b() rearranged into the same line
full_barrier(); // Not allowed to optimize past this line
i += 9;         // c, d, and e rearranged
full_barrier();
i += 11;        // f and g rearranged
With half-barriers instead of full barriers, all of the code can be rearranged to the "inside" of the barriers, so the optimizer can simply write:
i = 21;
Therefore, half-barriers are faster.
----------
Now, instead of the compiler rearranging code, imagine the L1 cache rearranging writes to memory. With full barriers, the L1 cache has to write:
i = 1;
full_barrier(); // Ensure all other cores see that i is now = 1;
i = 10; // L1 cache allows CPU to do +2, +3, and +4 operations, but L1 "merges them together" and other cores do NOT see the +2, +3, or +4 operations
full_barrier(); // L1 cache communicates to other cores that i = 10 now;
i = 21; // L1 cache allows CPU to do +5 and +6 operations
// Without a barrier, L1 cache doesn't need to tell anyone that i is 21 now. No communication is guaranteed.
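In real C++11, that full_barrier() behavior loosely corresponds to plain seq_cst stores (a sketch; i is a hypothetical shared atomic):

#include <atomic>

std::atomic<int> i{0};

void writer() {
    i.store(1);  // seq_cst by default: other cores can observe i == 1 before i == 10
    i.store(10); // other cores can observe i == 10 before i == 21
    i.store(21); // all cores agree these three stores happened in this order
}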
----------
Similarly, with half-barriers instead, the L1 cache's communication to other cores only has to be:
i = 21; // L1 cache can "lazily" inform other cores, allowing the CPU to perform i+=1, i+=2... i+=6.
So for CPUs that implement half-barriers (like ARM), the L1 cache can communicate ever so slightly more efficiently, if the programmer specifies these barriers.
----------
Finally, you have "weak ordered atomics" ("relaxed" atomics in C++ terms: memory_order_relaxed), which have no barriers involved at all. The atomics are guaranteed to execute atomically, but their ordering is completely unspecified.
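The classic use is a statistics counter, where you only care that increments aren't lost, not when other threads observe them. A sketch (hits is a hypothetical counter):

#include <atomic>

std::atomic<long> hits{0};

void record_hit() {
    // Atomic, so no increments are lost, but memory_order_relaxed
    // imposes no ordering on the loads/stores around it.
    hits.fetch_add(1, std::memory_order_relaxed);
}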
There's also consume / release barriers, which almost no one understands and which no compiler truly implements (they all just promote consume to acquire). So ignore those. :-) They're trying to make consume/release easier to understand in a future standard... and I don't think they've gotten all the "bugs" out of the consume/release specification yet.
----------
EDIT: Now that I think of it, acquire_barriers / release_barriers are often baked into a load/store operation and are "relative" to a variable. So the above discussion is still inaccurate. Nonetheless, I think it's a simplified discussion that kinda explains why these barriers exist and why programmers were driven to make a "more efficient barrier" mechanism.
To the edit: right. I like the description using half-barriers, but I have trouble reconciling that with the Linux kernel's READ_ONCE / WRITE_ONCE macros, which guarantee no tearing/alignment issues, but boil down to reads/writes through casts to volatile-qualified pointer dereferences. I guess those don't have the same notion of memory ordering that the C++11 API has... Maybe rmb()/wmb()...
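For reference, the volatile-cast trick being described looks roughly like this (a simplified sketch using the GNU __typeof__ extension, not the actual kernel source, which handles more cases):

#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))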
Well... I didn't describe the C++11 memory model above. It was a gross simplification, because I didn't account for how the C++11 memory model acts "relative to a variable". (And this "variable" is typically the mutex itself.)
My understanding is that WRITE_ONCE / READ_ONCE are meant for this "relative to a variable" issue. It's _precisely_ the issue I ignored in my post above.
All C++11 atomics are "relative to a variable". There are typically no memory barriers floating around by themselves (there can be, but you probably don't need free-floating memory barriers to get the job done).
So you wouldn't write "acquire_barrier()" in C++11. You'd write "atomic_var.store(value, memory_order_release)" (or "atomic_var.load(memory_order_acquire)"), saying that the half-barrier is relative to atomic_var itself.
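For example, the classic message-passing pattern, where the release store and the acquire load are both "relative to" the same flag (a sketch; ready and payload are hypothetical names):

#include <atomic>

std::atomic<bool> ready{false};
int payload = 0;

void producer() {
    payload = 42;                                 // plain, non-atomic write
    ready.store(true, std::memory_order_release); // half-barrier relative to `ready`
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) // half-barrier relative to `ready`
        ;                                          // spin
    // Seeing ready == true guarantees the payload write is visible:
    int v = payload; // v == 42
    (void)v;
}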
----------
// assuming #include <atomic> and #include <thread>
std::atomic<int> spinlock{0};

a();
b();
while (spinlock.exchange(1, std::memory_order_acquire) != 0) // half-barrier: atomically write 1 into the spinlock while reading its previous value
    std::this_thread::yield();
c();
d();
e();
spinlock.store(0, std::memory_order_release); // half-barrier: 0 means we're done with the lock
f();
g();
So the C++ acquire/release model is always relative to a variable, commonly the spinlock.
This means that "c, d, and e" are protected by the spinlock (or whatever synchronization variable you're working with). Moving a() or b() "inside the lock" is fine, because they come from the "unlocked region", and the higher-level programmer is fine with "any order" for code outside of the locked region.
Note: this means that c(), d(), and e() are free to be rearranged as necessary. For example:
while (spinlock.exchange(1, std::memory_order_acquire) != 0) // half-barrier: atomically write 1 into the spinlock while reading its previous value
    std::this_thread::yield();
for (int i = 0; i < 100; i++) {
    value += i;
}
spinlock.store(0, std::memory_order_release); // half-barrier: 0 means we're done with the lock
The optimizer is allowed to reorder the code inside into:
while (spinlock.exchange(1, std::memory_order_acquire) != 0) // half-barrier
    std::this_thread::yield();
for (int i = 99; i >= 0; i--) { // decrement-and-test form is faster on many processors
    value += i;
}
spinlock.store(0, std::memory_order_release); // half-barrier
It's the ordering "relative" to the spinlock that needs to be kept, not the order of any of the other loads or stores that happen. As long as all the value += i stores are done "before" the release store into the spinlock, and "after" the acquire exchange on the spinlock, all reorderings are valid.
So reordering from "value += 0, value += 1, ... value += 99" into "value += 99, value += 98, ... value += 0" is an allowable optimization.
----------
It seems like WRITE_ONCE / READ_ONCE were written for the DEC Alpha, which is far weaker (fewer guarantees about ordering) than even ARM. The Alpha was one of the first popular multiprocessor systems, but its memory model allowed a huge number of reorderings.
WRITE_ONCE / READ_ONCE probably compile into plain loads/stores (no extra barrier instructions) on ARM or x86. I'm not 100% sure, but that'd be my guess. I think the last 20 years of CPU design have overall said that the DEC Alpha's reorderings were just too confusing to handle in the general case, so CPU designers / low-level programmers just avoid that situation entirely.
"dependent memory accesses" is very similar to the confusing language of memory_order_consume. Which is again: a model almost no one understands, and almost no C++ compiler implements. :-) So we can probably ignore that.
Wrt Alpha, I think a large part of the weirdness was that some early variants had split caches which weren't coherent with each other, or something like that. So if a pointer and the value it pointed to were in different cache banks, you could get funny effects.