> As high-performance code demands faster and faster systems, our memory models will become more and more relaxed. Acquire/release is quickly becoming the standard model.
Linux relies heavily on performant RCU for scalability, which a pure acquire/release SW programming model can't support.
> The rcu_read_lock() and rcu_read_unlock() primitives read-acquire and release a global reader-writer lock.
It seems RCU operations in the Linux kernel are defined in acquire-barrier and release-barrier terms. I heard a while ago that RCU could also be discussed in terms of release-consume semantics, which are slightly faster but so subtle that very few people understand them.
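For the curious, here's a minimal C++11 sketch of the publication pattern that rcu_dereference() is built around (the names and the Config struct are illustrative, not the kernel API):

    #include <atomic>

    struct Config { int a; int b; };

    std::atomic<Config*> g_config{nullptr};

    // Writer: fully initialize the object, then publish the pointer with
    // a release store so the initialization is visible before the pointer.
    void publish() {
        g_config.store(new Config{1, 2}, std::memory_order_release);
    }

    // Reader: a consume load only orders accesses that are data-dependent
    // on the loaded pointer -- the rcu_dereference() pattern. In practice,
    // every major compiler promotes consume to the slightly stronger acquire.
    int read_config() {
        Config* c = g_config.load(std::memory_order_consume);
        return c ? c->a : -1;  // dependent load: ordered after the pointer load
    }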
As such, release-acquire is probably the memory model of the future. I'm not really aware of anything aside from: fully relaxed (unordered), the obscure release-consume, release-acquire, and finally sequential consistency (too slow for modern systems).
---------
Are you perhaps confusing "acquire-release" semantics (a memory-barrier / cache-coherence principle) with spinlocks? Acquire-release seems to be the fastest practical memory consistency model, since relaxed doesn't work for synchronization and release-consume is too confusing.
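For what it's worth, the two are related: a spinlock is typically built out of acquire/release operations. A rough C++11 sketch (not a production-quality lock):

    #include <atomic>

    // A spinlock is one *use* of acquire/release semantics, not the same
    // thing: the memory orders below are what make the critical section
    // visible across threads.
    class Spinlock {
        std::atomic_flag locked = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            // Acquire: reads/writes in the critical section can't be
            // hoisted above this point.
            while (locked.test_and_set(std::memory_order_acquire)) { /* spin */ }
        }
        void unlock() {
            // Release: reads/writes in the critical section can't sink
            // below this point.
            locked.clear(std::memory_order_release);
        }
    };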
> I'm not really aware of anything aside from: fully relaxed (unordered), the obscure release-consume, release-acquire, and finally sequential consistency (too slow for modern systems).
What about Total Store Ordering (TSO), which is what e.g. the obscure and rare x86(-64) architecture implements (and SPARC as well)?
That is a good point: the x86 model is "stronger" than acquire-release, which is probably why it took so long for acquire-release to become beneficial. Any x86 coder who codes in acquire-release will not see any performance benefit on x86, because x86 implements stronger guarantees at the hardware level.
Well, that is, until you enable gcc -O3 optimizations, which will reorder memory accesses, merge variables together, and perform other such transformations that follow the acquire-release model instead of TSO. Remember that the compiler has to consider the memory-consistency model between registers and RAM (when is a register holding "stale" data that needs to be re-read from RAM?).
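A tiny illustration of that register-vs-RAM issue (the flags here are hypothetical, but any optimizing compiler is allowed to behave this way):

    #include <atomic>

    bool plain_flag = false;            // ordinary variable, no atomic semantics
    std::atomic<bool> done{false};

    // The compiler may legally hoist the read of plain_flag out of the
    // loop and spin on a stale register copy forever -- even on x86/TSO.
    void broken_wait() {
        while (!plain_flag) { /* spin */ }
    }

    // An acquire load is a compiler barrier (and a CPU barrier on weaker
    // hardware): each iteration re-reads the flag, and writes made before
    // the matching release store in another thread become visible after it.
    void correct_wait() {
        while (!done.load(std::memory_order_acquire)) { /* spin */ }
    }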
-------
The thing is, acquire-release is becoming far more popular and is the gold standard that C++11 has more or less settled upon. C++11, ARM, POWER9, CUDA, and OpenCL have all moved to acquire-release semantics for their memory models.
Next-generation interconnects (PCIe 5.0, CXL, OpenCAPI) are all looking at extending cache coherence out to I/O devices such as NVMe flash and GPUs / coprocessors. I'm betting that acquire/release will become more popular in the coming years. TSO is too "strict" in practice; people actually want their reads and writes to "float" out of order with each other in most cases, especially when you're talking about a PCIe pipe that takes 5 microseconds (20,000 clock ticks!) to communicate over.
> That is a good point: the x86 model is "stronger" than acquire-release, which is probably why it took so long for acquire-release to become beneficial. Any x86 coder who codes in acquire-release will not see any performance benefit on x86, because x86 implements stronger guarantees at the hardware level.
Yes, in a way it's a race to the bottom; code that works on weaker acquire-release hw works on TSO hw, but not the other way around: code that silently relies on TSO guarantees breaks on acquire-release hardware. There are only two ways to combat this race: education, and using concurrency libraries written by people who know what they're doing.
> acquire-release is becoming far more popular and is the gold standard that C++11 has more or less settled upon
Hmm, how come? C++11 supports many different models: relaxed, acquire/release, and sequential consistency, with sequential consistency being the default for atomic variables. Now, acquire/release looks like a decent compromise between ease of hw implementation and programming complexity, but AFAICS it's not the anointed one true model.
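For reference, roughly what that menu of models looks like in C++11 code (sequential consistency really is the silent default):

    #include <atomic>

    std::atomic<int> x{0};

    void demo() {
        x.store(1);                                  // defaults to memory_order_seq_cst
        x.store(2, std::memory_order_release);       // opting in to release
        int a = x.load();                            // seq_cst again by default
        int b = x.load(std::memory_order_acquire);   // opting in to acquire
        int c = x.load(std::memory_order_relaxed);   // atomicity only, no ordering
        (void)a; (void)b; (void)c;                   // silence unused warnings
    }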
To some extent I think that's a failing of the C++11 model. Instead of choosing one (sane) model, they made people choose from an array of models with subtle semantics. Picking a single model is what the recent formal Linux kernel memory model did, although that's not ideal either, constrained as it was by the requirement not to stray too far from the previous informal description and by boatloads of legacy code. See http://www0.cs.ucl.ac.uk/staff/j.alglave/papers/asplos18.pdf
In general, it seems to me that progress is being made in formal memory models, and I hope that in a few years' time there will be some kind of synthesis giving us a model that is reasonably easy to implement in hw with good performance, easy enough to reason about, and formally provable. We'll see.
> Hmm, how come? C++11 supports many different models: relaxed, acquire/release, and sequential consistency, with sequential consistency being the default for atomic variables. Now, acquire/release looks like a decent compromise between ease of hw implementation and programming complexity, but AFAICS it's not the anointed one true model.
Well, nothing will ever be "officially" blessed as the one true model. As the saying goes, we programmers are like cats: we'll each wander off in our own direction, doing our own thing.
Overall, I just think that "programmer culture" is beginning to settle on acquire-release semantics. It's just a hunch, but more and more languages (C++, CUDA) and systems (ARM, POWER, NVIDIA GPUs, AMD GPUs) seem to be moving towards acquire-release.
And in the next few years, we'll have cache coherency over PCIe 4.0 or PCIe 5.0 in some form (CXL or other protocols on top of it). A unified memory model is needed across the CPU, DDR4 RAM, the PCIe bus, coprocessors (GPUs, FPGAs, or tensor cores), and high-speed storage (Optane and flash SSDs over NVMe).
The community is just a few years out from having a unified memory model + coherent caches across the I/O fabric. Once this "de facto standard" is written, it will be very hard to change. That's why I think acquire-release is here to stay for the long term. It's the leading memory model right now.
Keep in mind that even when the underlying hardware implements TSO, what programming languages expose is basically the release/acquire model of memory semantics.
This means that as a programmer, you still have to code against the release/acquire model, because the compiler may reorder your memory accesses. Having TSO in hardware is still helpful though, because it means the compiler has to emit fewer explicit barrier instructions. That is, the barriers you do have in your source code end up being a little cheaper (at the cost of an overall more complex hardware architecture).
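A sketch of what that looks like in practice; the codegen notes in the comments reflect what mainstream compilers typically emit, but check a compiler explorer for your exact target:

    #include <atomic>

    int payload = 0;
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;
        // On x86-64 this release store typically compiles to a plain mov:
        // TSO already forbids store->store reordering. On AArch64 it
        // becomes an stlr instruction.
        ready.store(true, std::memory_order_release);
    }

    int consumer() {
        // Plain mov on x86-64 (TSO never reorders loads with older loads);
        // ldar on AArch64.
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        return payload;  // guaranteed to observe 42
    }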
Sure, since it's a stronger model, code which works on weaker acquire/release hw will work on TSO hw as well. You might as well say that sequential consistency maps very efficiently to an acquire/release model in the same way.
Isn't TSO the closest practical implementation to an acquire/release model? What are the practical differences?
I know that TSO makes it easier to recover sequential consistency with additional barriers (Intel strengthened their original memory model to TSO for this reason).
No, acquire/release is a much weaker model than TSO. TSO is basically sequential consistency (SEQCST) + a store buffer. I.e. stores don't go immediately to main memory, but rather via a store buffer. Loads first peek into the store buffer of the local CPU core before going to memory. The practical effect is that, in contrast to SEQCST, a store may be reordered after a later load ("store->load" reordering, more formally).
Acquire-release consistency allows many more reorderings in addition to store->load (load->load, load->store, store->store).
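Here's the classic store-buffer litmus test as a runnable C++11 sketch; on real x86 hardware, running it in a loop will eventually show the TSO-permitted outcome:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0;

    void t1() {
        x.store(1, std::memory_order_relaxed);   // may sit in the store buffer...
        r1 = y.load(std::memory_order_relaxed);  // ...while this load goes first
    }

    void t2() {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    }

    int main() {
        std::thread a(t1), b(t2);
        a.join(); b.join();
        // Under SEQCST, r1 == 0 && r2 == 0 is impossible. Under TSO (and
        // under relaxed/acquire-release) it is allowed, because each store
        // can sit in its core's store buffer while the subsequent load runs.
        std::printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }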