
Acquire-release seems to be an outstanding success. ARMv8 added new instructions to support it, as did NVidia GPUs (CUDA / PTX is clearly moving towards acquire-release semantics), compilers from all around, etc. So many systems have implemented acquire-release that I'm certain it will stay relevant into the future.
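For anyone who hasn't seen it spelled out, here is a minimal C++ sketch of the message-passing pattern that acquire-release is built for. The function and variable names are made up for illustration; the only claim is the standard guarantee that an acquire load which observes a release store also sees everything written before that store.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: one thread publishes a plain int, another reads it.
int message_passing() {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // flag that publishes the payload
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });
    std::thread consumer([&] {
        // Spin until the flag is observed as set.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        // The acquire load that saw `true` synchronizes with the release
        // store, so the earlier write to `payload` is visible here.
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;
}
```

If both orderings were relaxed instead, the read of `payload` would be a data race, since nothing would order it after the producer's write.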

Consume-release is a failure, but it seems like it was "expected" to be a failure to some degree. Consume-release was apparently the model that ARMv7 / Older-POWER assembly designers were going for, but it turned out to be far too complicated to think about. No compiler seems to be using consume-release anywhere (instead turning consume into Acquire).

From my understanding, the Linux-kernel operations could be consume-release, but only if the compilers fully understood the implications (and no one seems to fully understand them). Maybe a future standard will fix consume-release, but it's best to ignore it for now.

Anyway, ARMv8 and POWER9 have changed their assembly language to include Acq/Release level semantics.

Fully relaxed is... not a model at all and does the job spectacularly! Some people don't want any ordering whatsoever, lol.

Seq-cst is basically Java's model, and it works for those who don't care about optimizations. It will necessarily be slower than acquire/release, but there are a few cases where acquire/release is a trap and seq-cst is necessary. It doesn't work on GPUs though, as GPUs don't have snooping caches / coherence IIRC, so the strongest you can get in CUDA-land is acq-release.
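The classic case where acquire/release is a trap is the store-buffering (Dekker-style) litmus test: each thread stores to its own flag and then loads the other's. With release stores and acquire loads, both threads are allowed to read 0; with seq-cst that outcome is forbidden. A rough C++ sketch, with hypothetical function names:

```cpp
#include <atomic>
#include <thread>

// Runs the store-buffering litmus test once with seq_cst operations.
// Returns true if both threads read 0, the outcome seq_cst rules out.
bool both_saw_zero_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread a([&] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });

    a.join();
    b.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the test and counts how often the forbidden outcome shows up.
int count_forbidden(int iterations) {
    int hits = 0;
    for (int i = 0; i < iterations; ++i) {
        if (both_saw_zero_once()) {
            ++hits;
        }
    }
    return hits;
}
```

Swapping the orderings for `memory_order_release` / `memory_order_acquire` makes a nonzero count legal, although whether you actually observe it depends on the hardware and on timing.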




I wish there was actual software support for Acq/Release-like semantics, but somewhat more relaxed, e.g. by specifying that two stores (data and pointer-to-data) require in-order visibility, without enforcing a strong ordering of this store pair relative to other (semantically unrelated) stores.

Barrier-based abstractions could handle that, if they support more than one barrier. For loads, this would allow efficient dependent-load reordering by enforcing the ordering only where it is needed for concurrency reasons. This mostly helps with speculating loads before the address is confirmed, without needing to snoop for invalidations of the cache line that contains the speculated address (and killing the load). Similarly, it would take pressure off the store buffer by being less strict about the order in which it commits to L1D$.

RISC-V's proposed WMM has such weak default ordering, but because it uses fences, it's overly strict, to the point where it performs worse on heavily concurrent code that's littered with atomics than a TSO version (of the same softcore) that "just" prefetches exclusive access for writes. That holds even when the RMWs are given relaxed semantics, so the slowdown comes just from the overly strict load fence, which effectively trashes all shared-state L1D cache lines.


> Fully relaxed is... not a model at all and does the job spectacularly! Some people don't want any ordering whatsoever, lol.

Many people wish C++ memory_order_relaxed had no effect on synchronization and could be optimized like a normal access. It can't be. https://internals.rust-lang.org/t/unordered-as-a-solution-to...
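One way to see that relaxed is still more than a plain access: a relaxed read-modify-write has to stay atomic even though it orders nothing else. A small C++ sketch (the function name and parameters are made up for illustration):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: several threads bump one counter with relaxed RMWs.
long relaxed_count(int threads, int per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;

    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                // No ordering with respect to other memory, but the
                // increment itself is indivisible, so no update is lost.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) {
        th.join();  // join() orders every increment before the final load
    }
    return counter.load(std::memory_order_relaxed);
}
```

With a plain `long` and `++` this would be a data race and the total could come out short, which is exactly the kind of transformation a compiler may not apply to a relaxed atomic.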



