1. The compiler's #1 job is removing code (aka optimizing): it does this by rearranging code, merging code together, and other transformations. That is to say: "i=0; i++; i++" wants to optimize down to "i=2;", but in a multithreaded context, that means another thread will "never" see "i==0" or "i==1" (see the sketch after this list). Turns out that "losing" these intermediate states can be an issue if you're building multithreaded primitives (lock-free code, atomics, etc. etc.)
2. The CPU and L1 cache also "effectively move" code on relaxed architectures like ARM / POWER9 / DEC Alpha. So it turns out that #1 is true "even if the compiler wasn't involved".
3. Because of #2, you might as well constrain the compiler / CPU / L1 cache with the same set of primitives (the "memory model") that defines the allowed orderings.
4. Turns out that a sizable minority of programmers want to experiment with low-level multithreading primitives: researchers, speed-demon professionals, and others do want to go at "full speed" even at the cost of great complexity. Unifying the promises of the compiler + CPU + L1 cache in a SINGULAR model helps dramatically (that way, the programmer only fights the compiler, and the compiler "translates" the model into low-level CPU / cache barriers as appropriate).
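To make #1 concrete, here's a minimal sketch (the function names are mine, not from anywhere in particular):

    #include <atomic>

    int collapsed() {
        int i = 0;  // a compiler may legally rewrite this whole
        i++;        // function as "return 2;" -- the states
        i++;        // i == 0 and i == 1 never exist anywhere
        return i;
    }

    std::atomic<int> a{0};

    void observable() {
        a.store(0, std::memory_order_relaxed);     // with atomics, each access
        a.fetch_add(1, std::memory_order_relaxed); // is a real memory operation,
        a.fetch_add(1, std::memory_order_relaxed); // so other threads can observe
    }                                              // 0, 1, and 2 (in practice,
                                                   // compilers don't merge these)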
-------
It turns out that the biggest source of speed improvements exists in this realm of "rearranging code". That's why CPUs do out-of-order execution. That's why CPUs (like ARM or POWER9) adopt more relaxed memory orderings: to let the CPU rearrange more code in more situations. And that's why L1 cache exists (similarly, L1 wants to reorder reads/writes even more aggressively for even faster operation).
If you make a memory barrier (nominally preventing the L1 cache from rearranging code), you SHOULD have the CPU and compiler respect the barrier as well. After all, if the programmer says "this should not be rearranged", then that probably applies to compiler, CPU, and L1 cache all the same.
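As an illustration of that "one primitive constrains everything" idea, here's the standard fence-based message-passing pattern (a sketch; variable names are mine). The single std::atomic_thread_fence both limits the compiler's own reordering and makes it emit whatever barrier instruction the target CPU needs:

    #include <atomic>

    int data;                        // plain, non-atomic payload
    std::atomic<bool> ready{false};

    void producer() {
        data = 42;
        std::atomic_thread_fence(std::memory_order_release); // one barrier:
        ready.store(true, std::memory_order_relaxed);        // compiler, CPU, and
    }                                                        // cache all keep
                                                             // data before ready
    void consumer() {
        if (ready.load(std::memory_order_relaxed)) {
            std::atomic_thread_fence(std::memory_order_acquire);
            int x = data;  // guaranteed to read 42 here
            (void)x;
        }
    }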
Ten years on, was the C++11 memory model (which I've used) a success? Compared to the Linux kernel memory model (which I haven't used)? I've heard that compilers can't remove dead atomic reads, because those reads can synchronize in rare situations; that sequential consistency was defined in a broken way and later fixed in a standards revision; that memory_order_consume is impossible to implement correctly in a way that's actually more optimized than memory_order_acquire; and that the C++ memory model doesn't translate well to GPUs.
Is this better than the state of affairs prior to standardized atomics (which I haven't experienced)? Is it better than Go "defining enough of a memory model to guide programmers and compiler writers" (which I haven't used)? Or than informally defining a set of use patterns and writing optimizations around those use patterns, rather than a formal model of what code and what optimizations are permitted (resulting in optimization steps that are only incorrect in combination, like global value numbering causing miscompilations [1][2])?
> the critical detail about [relaxed/unsynchronized] operations is not the memory ordering of the operations themselves but the fact that they have no effect on the synchronization of the rest of the program.
> Ten years on, was the C++11 memory model (which I've used) a success?
Concurrent/parallel programming in C and C++ before the memory model was an absolute shitshow. You could either
a) scream "YOLO lol" and resort to abusing volatile (and secretly hope that no one will ever actually execute your code on a CPU with more than one core [1]; see the sketch after this list), or
b) carefully construct synchronization routines in assembly and try to make sure that the single compiler you support doesn't screw you over in its effort to make your program run super-fast (and slightly wrong) [2], or
c) use a library which handles b) for you.
[1] FreeRTOS does this, and it only supports a single core.
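For the curious, option a) typically looked something like this (a sketch with invented names, not from FreeRTOS or any real codebase). It only "works" because a single core observes its own stores in order:

    volatile int ready = 0;
    int data = 0;

    void producer() {
        data = 42;
        ready = 1;  // volatile keeps the compiler from eliding this store, but
    }               // nothing stops it from sinking the plain store to data
                    // below it, and a multi-core relaxed CPU can reorder the
                    // two stores anyway
    void consumer() {
        while (!ready) { }  // fine on a single-core target like FreeRTOS;
        int x = data;       // on a multi-core machine this can read 0, and
        (void)x;            // formally it's a data race (UB) in C++11 terms
    }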
Yikes, that is scary. Perhaps Go was more of a clear improvement than dealing with this "shitshow", and C++ has become more usable for concurrent/parallel code in the years since.
I still find data races (sometimes crashes) in the wild on a regular basis. For example, RSS Guard accesses shared data unsynchronized when syncing settings from a server, so performing two types of syncs at once on 2 different threads will crash when they reallocate the same vector. Qt Creator intermittently crashes (or at least used to) in some tricky CMake handling code with multithreaded copy-on-write string lists. And I see apps now and then that perform unsynchronized writes to memory concurrently read by another thread, and it usually doesn't misbehave.
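The RSS Guard-style bug boils down to something like this sketch (names are mine): two threads mutating one std::vector with no lock, so a reallocation in one thread pulls the buffer out from under the other:

    #include <mutex>
    #include <thread>
    #include <vector>

    std::vector<int> settings;  // shared state being synced
    std::mutex m;               // the missing piece in the buggy apps

    void sync_one_kind() {
        std::lock_guard<std::mutex> lock(m); // remove this lock and two
        settings.push_back(1);               // concurrent push_backs can
    }                                        // reallocate the same buffer:
                                             // a crash if you're lucky
    int main() {
        std::thread t1(sync_one_kind), t2(sync_one_kind);
        t1.join();
        t2.join();
    }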
Acquire-release seems to be an outstanding success. ARMv8 added new instructions to support it... as did NVidia GPUs (clearly CUDA / PTX is moving towards acquire-release semantics), compilers from all around, etc. etc. So many systems have implemented acquire-release that I'm certain it will be relevant into the future.
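The canonical acquire-release use is message passing, which is what ARMv8's ldar/stlr instructions map to directly. A minimal sketch (names are mine):

    #include <atomic>

    int payload;
    std::atomic<bool> flag{false};

    void writer() {
        payload = 42;
        flag.store(true, std::memory_order_release);      // ARMv8: stlr
    }

    void reader() {
        while (!flag.load(std::memory_order_acquire)) { } // ARMv8: ldar
        int x = payload;  // the acquire load synchronizes with the release
        (void)x;          // store, so payload == 42 is guaranteed here
    }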
Consume-release is a failure, but it seems like it was "expected" to be a failure to some degree. Consume-release was apparently the model that ARMv7 / older-POWER assembly designers were going for, but it turned out to be far too complicated to reason about. No compiler seems to implement consume-release anywhere (they all just promote consume to acquire).
From my understanding, the Linux-kernel operations could be consume-release, but only if the compilers fully understood the implications. (But no one seems to fully understand them). Maybe a future standard will fix consume-release, but best to ignore it for now.
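For reference, this is the pattern consume was aimed at (a sketch, invented names): a dependent pointer load, where ARM/POWER already order the two loads for free because the second load's address depends on the first:

    #include <atomic>

    struct Node { int value; };
    std::atomic<Node*> head{nullptr};

    void publish(Node* n) {
        n->value = 42;
        head.store(n, std::memory_order_release);
    }

    int read() {
        // The intent: reading p->value depends on the loaded address,
        // so no barrier would be needed on ARM/POWER. In practice every
        // major compiler treats this exactly like memory_order_acquire,
        // because tracking the dependency through optimizations is too hard.
        Node* p = head.load(std::memory_order_consume);
        return p ? p->value : 0;
    }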
Anyway, ARMv8 and POWER9 have changed their assembly language to include Acq/Release level semantics.
Fully relaxed is... not a model at all, and it does the job spectacularly! Some people don't want any ordering whatsoever, lol.
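The classic legitimate use of relaxed (a sketch, names mine) is a statistics counter where only the final total matters, so no ordering is requested at all:

    #include <atomic>
    #include <thread>

    std::atomic<long> hits{0};

    void worker() {
        for (int i = 0; i < 1000000; ++i)
            hits.fetch_add(1, std::memory_order_relaxed); // atomic, but orders
    }                                                     // nothing else

    int main() {
        std::thread a(worker), b(worker);
        a.join();
        b.join();
        return hits.load(std::memory_order_relaxed) == 2000000 ? 0 : 1; // always 0
    }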
Seq-cst is basically Java's model, and it works for those who don't care about optimizations (it will necessarily be slower than acquire/release, but there are a few cases where acquire/release is a trap and seq-cst is necessary). It doesn't work on GPUs though, as GPUs don't have snooping caches / coherence IIRC. So the strongest you can get in CUDA-land is acq-release.
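The classic acquire/release trap is the store-buffering pattern, sketched here (names invented). Under seq-cst the outcome r1 == 0 && r2 == 0 is forbidden; weaken the operations to release stores and acquire loads and both threads may read 0, which breaks Dekker-style mutual exclusion:

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread1() {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    }

    void thread2() {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    }
    // With seq_cst there is a single total order over these four operations,
    // so at least one thread must see the other's store. With acquire/release
    // only, r1 == 0 && r2 == 0 is allowed -- the "trap".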
I wish there was actual software support for acq/release-like semantics, but somewhat more relaxed: e.g. specifying that two stores (data and pointer-to-data) require in-order visibility, without enforcing a strong ordering of that store pair relative to other (semantically unrelated) stores.
Barrier-based abstractions could handle that, if they supported more than one barrier. For loads, this would allow efficient dependent-load reordering, by enforcing the ordering only where it's needed for concurrency reasons (this mostly helps with speculating loads before the address is confirmed, without needing to snoop for invalidations of the cache line containing the speculated address or killing the load), and it would similarly take pressure off the store buffer by being less strict about the order in which it commits to the L1D$.
RISC-V's proposed WMM has such weak default ordering, but because it relies on fences it's overly strict, to the point where it performs worse on heavily concurrent code that's littered with atomics than a TSO version (of the same softcore) that "just" prefetches exclusive access for writes. That holds even when turning RMWs into relaxed semantics, so the loss is due purely to the overly strict load fence that effectively trashes all shared-state L1D cache lines.