The reason it doesn't help that much in performance is that the difference in how x86 and ARM manage memory is not that ARM reorders more and x86 reorders less, but that ARM reorders openly while x86 reorders behind your back and makes sure everything is where it should be if and when you look.
This is why the relatively primitive memory model of x86 has turned out to be so successful, despite the consensus in the field back in the '90s that it was way too restrictive for performance. It turns out that the x86 memory model is an easy target for a complex reordering backend to present as a "facade" between what is actually happening in the CPU's own L1 and buffers and what everything else thinks is happening.
The only real benefit of weaker memory orderings is that they don't need buffers and queues as large on the backend, since they can retire memory operations earlier. Except that if you just implement the ARM memory model as specified, without a similar facade, it will actually lose to a modern x86, because the x86 chip can use the freedom its machinery provides to reorder even more without any of it ever being visible to software. So you end up implementing a similar system to the x86's, with only very minor gains from the fact that you can sometimes retire accesses earlier and thus get a little more use out of your buffers.
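To make the "facade" point concrete, here is a minimal message-passing litmus test as a C++ sketch (relaxed atomics so the compiler emits plain loads and stores; the names and values are just illustrative):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> data{0};
    std::atomic<int> flag{0};

    void writer() {
        data.store(42, std::memory_order_relaxed);
        flag.store(1, std::memory_order_relaxed);  // on Arm, may become visible before data
    }

    void reader() {
        while (flag.load(std::memory_order_relaxed) == 0) { /* spin */ }
        // Can print 0 on an Arm core (the reordering is visible); on x86
        // it always prints 42, whatever the backend reordered internally.
        std::printf("data = %d\n", data.load(std::memory_order_relaxed));
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
    }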
ARM, especially 64-bit ARM, has genuine advantages over x86. (Notably, decode.) The weak memory model is not one of them, outside the very weakest and tiniest cores (where it allows some reordering with a tiny load-store backend).
Yes. In particular, for a long time x86 had much cheaper memory barriers than the competition. Not only were store-release and load-acquire free, but sequentially consistent barriers were also fairly cheap. Intel had to add optimizations to make them fast much earlier, while weaker RISCs could get away with expensive barriers for much longer.
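At the source level that difference looks something like this (a sketch; the exact code generation varies by compiler and microarchitecture):

    #include <atomic>

    std::atomic<int> x{0};

    void release_store() {
        // On x86 an ordinary MOV store already has release semantics,
        // so this costs the same as a plain store.
        x.store(1, std::memory_order_release);
    }

    int acquire_load() {
        // Likewise, an ordinary MOV load already has acquire semantics.
        return x.load(std::memory_order_acquire);
    }

    void seq_cst_store() {
        // This one needs a full barrier: typically XCHG (or MOV + MFENCE)
        // on x86, which Intel made cheap early on; weakly ordered RISCs
        // historically paid much more for the equivalent full fence.
        x.store(1, std::memory_order_seq_cst);
    }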
Now apparently the M1 has very cheap atomics and barriers, so it is doing as much tracking and speculation as an x86 CPU. But it can probably get away with tracking fewer memory operations, which might improve both performance and power usage.
And on Arm cores you aren't obligated to provide a weak memory model. Some Arm vendors ship cores with stronger memory models: mainly TSO in Arm server land, and sequential consistency for NVIDIA's in-house CPUs.
Retiring memory operations earlier must have some benefit though? If data is being passed between threads, earlier retirement of a store means the data is available earlier to other CPUs. For a mutex, it means the mutex is released sooner, no?
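Sketching what I mean with a hypothetical spinlock:

    #include <atomic>

    // unlock() is a single release store: the earlier it retires and
    // drains out of the store buffer, the sooner another core can observe
    // locked == false and take the lock.
    class SpinLock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            while (locked.exchange(true, std::memory_order_acquire)) { /* spin */ }
        }
        void unlock() {
            locked.store(false, std::memory_order_release);
        }
    };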
32-bit fixed-size instructions that are aligned. That allows you to have 8-wide decoders, for example.

On x86 you can't really do this (and it hasn't happened), because instructions are 1 to 15 bytes long, and you need to decode the prior instruction at least in part to determine where the current one starts. (Currently 4-wide decoders are available for x86 uarches.)
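A sketch of why fixed width makes wide decode cheap: each decode slot can grab its bytes independently, with no serial length computation (illustrative code, not how real decoders are built):

    #include <cstdint>
    #include <cstring>

    // With fixed 4-byte instructions, decode slot i simply reads bytes
    // [4*i, 4*i + 4) of the fetch block -- all slots work in parallel.
    uint32_t fetch_slot(const uint8_t* fetch_block, int slot) {
        uint32_t insn;
        std::memcpy(&insn, fetch_block + 4 * slot, sizeof insn);
        return insn;
    }

    // With x86's 1-to-15-byte instructions, slot i can't know its start
    // offset until slots 0..i-1 have been at least partially decoded.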
Fixed-width instructions are much easier to decode at the same time as their neighbours.

Density is a tradeoff you have to think about, but x86 is not a clean ISA.

For a more alien example, the Mill CPU has very long variable-length instructions for density reasons, but because it is specifically designed around that and doesn't rely on any implicit parallelism, it can use tricks to find instruction boundaries much more easily.
Not having suffixes helps too. AFAIK, on x86, to find the length of an instruction you first have to decode the instruction itself (the decoding can vary depending on the prefixes) to know whether it is followed by a ModRM byte, then decode the ModRM byte to know whether it is followed by an immediate and how big that immediate is; only then do you know where the next instruction starts. The ModRM byte (and the SIB byte) can be thought of as a "suffix" of sorts to the instruction.

Contrast that with RISC-V, for instance, where the first byte of every instruction has everything necessary to determine the instruction length (and you only need 2 bits of that byte unless your core supports instructions longer than 4 bytes).
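As a sketch, for a core with only 16- and 32-bit encodings the length check is a two-bit test on the first halfword:

    #include <cstdint>

    // RISC-V: if the low two bits of the first 16-bit parcel are both 1,
    // the instruction is 32 bits (or longer); anything else means a 16-bit
    // compressed instruction. Assumes no encodings longer than 4 bytes.
    int rv_insn_length(uint16_t first_parcel) {
        return (first_parcel & 0x3) == 0x3 ? 4 : 2;
    }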