1. The compiler's #1 job is removing code (aka optimizing): it does this by rearranging code, merging code together, and other transformations. That is to say: "i=0; i++; i++" wants to optimize down to "i=2;", but in a multithreaded context, that means another thread will "never" see "i==0" or "i==1" (see the sketch after this list). Turns out that "losing" these intermediate states can be an issue if you're building multithreaded primitives (lock-free code, atomics, etc. etc.)
2. The CPU and L1 cache also "effectively move" code on relaxed architectures like ARM / POWER9 / DEC Alpha. So it turns out that #1 is true "even if the compiler wasn't involved".
3. Because of #2, you might as well constrain the compiler / CPU / L1 cache with the same set of primitives (the "memory model") that defines the allowed orderings.
4. Turns out that a sizable minority of programmers want to experiment with low-level multithreading primitives: researchers, speed-demon professionals, and others do want to go at "full speed" even at the cost of great complexity. Unifying the promises of the compiler + CPU + L1 cache in a SINGULAR model helps dramatically (that way, the programmer only fights the compiler, and the compiler "translates" the model into low-level CPU / cache barriers as appropriate).
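To make #1 concrete, here's a minimal sketch (the function names are mine, not from anywhere in particular):

    #include <atomic>

    int collapsed() {
        int i = 0;  // a compiler may legally rewrite this whole
        i++;        // function as "return 2;" -- the states
        i++;        // i == 0 and i == 1 never exist anywhere
        return i;
    }

    std::atomic<int> a{0};

    void observable() {
        a.store(0, std::memory_order_relaxed);     // with atomics, each access
        a.fetch_add(1, std::memory_order_relaxed); // is a real memory operation,
        a.fetch_add(1, std::memory_order_relaxed); // so other threads can observe
    }                                              // 0, 1, and 2 (in practice,
                                                   // compilers don't merge these)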
-------
It turns out that the biggest source of speed improvements exists in this realm of "rearranging code". That's why CPUs do out-of-order execution. That's why CPUs (like ARM or POWER9) adopt more relaxed memory orderings: to let the CPU rearrange more code in more situations. And that's why L1 cache exists (similarly, L1 wants to reorder reads/writes even more aggressively for even faster operation).
If you make a memory barrier (nominally preventing the L1 cache from rearranging code), you SHOULD have the CPU and compiler respect the barrier as well. After all, if the programmer says "this should not be rearranged", then that probably applies to compiler, CPU, and L1 cache all the same.
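As an illustration of that "one primitive constrains everything" idea, here's the standard fence-based message-passing pattern (a sketch; variable names are mine). The single std::atomic_thread_fence both limits the compiler's own reordering and makes it emit whatever barrier instruction the target CPU needs:

    #include <atomic>

    int data;                        // plain, non-atomic payload
    std::atomic<bool> ready{false};

    void producer() {
        data = 42;
        std::atomic_thread_fence(std::memory_order_release); // one barrier:
        ready.store(true, std::memory_order_relaxed);        // compiler, CPU, and
    }                                                        // cache all keep
                                                             // data before ready
    void consumer() {
        if (ready.load(std::memory_order_relaxed)) {
            std::atomic_thread_fence(std::memory_order_acquire);
            int x = data;  // guaranteed to read 42 here
            (void)x;
        }
    }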
Ten years on, was the C++11 memory model (which I've used) a success? Compared to the Linux kernel memory model (which I haven't used)? I've heard that compilers can't remove dead atomic reads, because those reads can synchronize in rare situations; that sequential consistency was defined in a broken way and later fixed in a standards revision; that memory_order_consume is impossible to implement correctly in a way that's actually more optimized than memory_order_acquire; and that the C++ memory model doesn't translate well to GPUs.
Is this better than the state of affairs prior to standardized atomics (which I haven't experienced)? Is it better than Go "defining enough of a memory model to guide programmers and compiler writers" (which I haven't used)? Or than informally defining a set of use patterns and writing optimizations around those use patterns, rather than a formal model of what code and what optimizations are permitted (resulting in optimization steps that are only incorrect in combination, like global value numbering causing miscompilations [1][2])?
> the critical detail about [relaxed/unsynchronized] operations is not the memory ordering of the operations themselves but the fact that they have no effect on the synchronization of the rest of the program.
> Ten years on, was the C++11 memory model (which I've used) a success?
Concurrent/parallel programming in C and C++ before the memory model was an absolute shitshow. You could either
a) scream "YOLO lol" and resort to abusing volatile (and secretly hope that no one will ever actually execute your code on a CPU with more than one core [1]; see the sketch after this list), or
b) carefully construct synchronization routines in assembly and try to make sure that the single compiler you support doesn't screw you over in its effort to make your program run super-fast (and slightly wrong) [2], or
c) use a library which handles b) for you.
[1] FreeRTOS does this, and it only supports a single core.
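For the curious, option a) typically looked something like this (a sketch with invented names, not from FreeRTOS or any real codebase). It only "works" because a single core observes its own stores in order:

    volatile int ready = 0;
    int data = 0;

    void producer() {
        data = 42;
        ready = 1;  // volatile keeps the compiler from eliding this store, but
    }               // nothing stops it from sinking the plain store to data
                    // below it, and a multi-core relaxed CPU can reorder the
                    // two stores anyway
    void consumer() {
        while (!ready) { }  // fine on a single-core target like FreeRTOS;
        int x = data;       // on a multi-core machine this can read 0, and
        (void)x;            // formally it's a data race (UB) in C++11 terms
    }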
Yikes, that is scary. Perhaps Go was more of a clear improvement than dealing with this "shitshow", and C++ has become more usable for concurrent/parallel code in the years since.
I still find data races (sometimes crashes) in the wild on a regular basis. For example, RSS Guard accesses shared data unsynchronized when syncing settings from a server, so performing two types of syncs at once on 2 different threads will crash when they reallocate the same vector. Qt Creator intermittently crashes (or at least used to) in some tricky CMake handling code with multithreaded copy-on-write string lists. And I see apps now and then that perform unsynchronized writes to memory concurrently read by another thread, and it usually doesn't misbehave.
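The RSS Guard-style bug boils down to something like this sketch (names are mine): two threads mutating one std::vector with no lock, so a reallocation in one thread pulls the buffer out from under the other:

    #include <mutex>
    #include <thread>
    #include <vector>

    std::vector<int> settings;  // shared state being synced
    std::mutex m;               // the missing piece in the buggy apps

    void sync_one_kind() {
        std::lock_guard<std::mutex> lock(m); // remove this lock and two
        settings.push_back(1);               // concurrent push_backs can
    }                                        // reallocate the same buffer:
                                             // a crash if you're lucky
    int main() {
        std::thread t1(sync_one_kind), t2(sync_one_kind);
        t1.join();
        t2.join();
    }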
Acquire-release seems to be an outstanding success. ARMv8 added new instructions to support it... as did NVidia GPUs (clearly CUDA / PTX is moving towards acquire-release semantics), compilers from all around, etc. etc. So many systems have implemented acquire-release that I'm certain it will be relevant into the future.
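The canonical acquire-release use is message passing, which is what ARMv8's ldar/stlr instructions map to directly. A minimal sketch (names are mine):

    #include <atomic>

    int payload;
    std::atomic<bool> flag{false};

    void writer() {
        payload = 42;
        flag.store(true, std::memory_order_release);      // ARMv8: stlr
    }

    void reader() {
        while (!flag.load(std::memory_order_acquire)) { } // ARMv8: ldar
        int x = payload;  // the acquire load synchronizes with the release
        (void)x;          // store, so payload == 42 is guaranteed here
    }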
Consume-release is a failure, but it seems like it was "expected" to be a failure to some degree. Consume-release was apparently the model that ARMv7 / older-POWER assembly designers were going for, but it turned out to be far too complicated to reason about. No compiler seems to implement consume-release anywhere (they all just promote consume to acquire).
From my understanding, the Linux-kernel operations could be consume-release, but only if the compilers fully understood the implications. (But no one seems to fully understand them). Maybe a future standard will fix consume-release, but best to ignore it for now.
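For reference, this is the pattern consume was aimed at (a sketch, invented names): a dependent pointer load, where ARM/POWER already order the two loads for free because the second load's address depends on the first:

    #include <atomic>

    struct Node { int value; };
    std::atomic<Node*> head{nullptr};

    void publish(Node* n) {
        n->value = 42;
        head.store(n, std::memory_order_release);
    }

    int read() {
        // The intent: reading p->value depends on the loaded address,
        // so no barrier would be needed on ARM/POWER. In practice every
        // major compiler treats this exactly like memory_order_acquire,
        // because tracking the dependency through optimizations is too hard.
        Node* p = head.load(std::memory_order_consume);
        return p ? p->value : 0;
    }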
Anyway, ARMv8 and POWER9 have changed their assembly language to include Acq/Release level semantics.
Fully relaxed is... not a model at all, and it does the job spectacularly! Some people don't want any ordering whatsoever, lol.
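The classic legitimate use of relaxed (a sketch, names mine) is a statistics counter where only the final total matters, so no ordering is requested at all:

    #include <atomic>
    #include <thread>

    std::atomic<long> hits{0};

    void worker() {
        for (int i = 0; i < 1000000; ++i)
            hits.fetch_add(1, std::memory_order_relaxed); // atomic, but orders
    }                                                     // nothing else

    int main() {
        std::thread a(worker), b(worker);
        a.join();
        b.join();
        return hits.load(std::memory_order_relaxed) == 2000000 ? 0 : 1; // always 0
    }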
Seq-cst is basically Java's model, and it works for those who don't care about optimizations (it will necessarily be slower than acquire/release, but there are a few cases where acquire/release is a trap and seq-cst is necessary). It doesn't work on GPUs though, as GPUs don't have snooping caches / coherence IIRC. So the strongest you can get in CUDA-land is acq-release.
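The classic acquire/release trap is the store-buffering pattern, sketched here (names invented). Under seq-cst the outcome r1 == 0 && r2 == 0 is forbidden; weaken the operations to release stores and acquire loads and both threads may read 0, which breaks Dekker-style mutual exclusion:

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread1() {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    }

    void thread2() {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    }
    // With seq_cst there is a single total order over these four operations,
    // so at least one thread must see the other's store. With acquire/release
    // only, r1 == 0 && r2 == 0 is allowed -- the "trap".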
I wish there was actual software support for acq/release-like semantics, but somewhat more relaxed: e.g. specifying that two stores (data and pointer-to-data) require in-order visibility, without enforcing a strong ordering of that store pair relative to other (semantically unrelated) stores.
Barrier-based abstractions could handle that, if they supported more than one barrier. For loads, this would allow efficient dependent-load reordering, by enforcing the ordering only where it's needed for concurrency reasons (this mostly helps with speculating loads before the address is confirmed, without needing to snoop for invalidations of the cache line containing the speculated address or killing the load), and it would similarly take pressure off the store buffer by being less strict about the order in which it commits to the L1D$.
RISC-V's proposed WMM has such weak default ordering, but because it relies on fences it's overly strict, to the point where it performs worse on heavily concurrent code that's littered with atomics than a TSO version (of the same softcore) that "just" prefetches exclusive access for writes. That holds even when turning RMWs into relaxed semantics, so the loss is due purely to the overly strict load fence that effectively trashes all shared-state L1D cache lines.