
I'm no expert, but the only big architectural differences are a massively larger decoder and a reorder buffer that's several times as large as those in x86 designs.

If these are actually the reasons for the performance difference, and it's difficult to do the same on x86 because of the instruction set, it seems to this amateur that ARM64 really does have an advantage over x86.



Don't forget ARM's more relaxed memory model vs. x86's TSO.


One of the reasons Rosetta 2 works so well is that Apple silicon sticks to the more restricted x86 memory model.


Does it? Apple's documentation seems to disagree [1]:

"A weak memory ordering model, like the one in Apple silicon, gives the processor more flexibility to reorder memory instructions and improve performance, but doesn’t add implicit memory barriers."

[1] https://developer.apple.com/documentation/apple_silicon/addr...
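To make the difference concrete, here's a minimal, generic C++ sketch (nothing Apple-specific, and not how Rosetta actually works): x86's TSO keeps ordinary stores in program order at the hardware level, while a weakly ordered ARM core may make the two stores below visible out of order unless you ask for release/acquire semantics, which compile to ordered instructions or barriers on ARM.

    // Message-passing sketch, assuming generic hardware; illustrative only.
    // Under x86 TSO the two stores in writer() stay ordered by the hardware anyway;
    // on a weakly ordered ARM core the release/acquire pair (stlr/ldar or explicit
    // barriers) is what guarantees the reader sees data == 42 once ready is set.
    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int>  data{0};
    std::atomic<bool> ready{false};

    void writer() {
        data.store(42, std::memory_order_relaxed);
        ready.store(true, std::memory_order_release);    // ordered after the data store
    }

    void reader() {
        while (!ready.load(std::memory_order_acquire)) { }   // pairs with the release
        assert(data.load(std::memory_order_relaxed) == 42);  // guaranteed to see 42
    }

    int main() {
        std::thread a(writer), b(reader);
        a.join();
        b.join();
    }

Which is presumably why a hardware TSO mode is handy for an x86 emulator: translated loads and stores can go through one-for-one instead of picking up barriers everywhere.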


It's switchable at runtime. Apple silicon can enable total store ordering on a per-thread basis while emulating x86_64, then turn it back off for maximum performance in native code.

Here's a kernel extension someone built to manipulate this feature: https://github.com/saagarjha/TSOEnabler


Couldn't Intel just come out with a new set of reduced-complexity instructions that are enabled per process by a bit flipped on context switches? Then legacy apps would run fine, but new software could use the new instructions too. This doesn't seem that hard to address.


As I understand it, the challenge in making wider x86 chips is the variable-length encoding of the instructions that already exist; adding new instructions can't help with that. But I'm just repeating what I heard elsewhere:

> Other contemporary designs such as AMD’s Zen(1 through 3) and Intel’s µarch’s, x86 CPUs today still only feature a 4-wide decoder designs (Intel is 1+4) that is seemingly limited from going wider at this point in time due to the ISA’s inherent variable instruction length nature, making designing decoders that are able to deal with aspect of the architecture more difficult compared to the ARM ISA’s fixed-length instructions.

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...
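Here's a toy C++ sketch of why fixed-length decode parallelizes more easily. The "encoding" is made up (first byte = instruction length) and just stands in for x86's real prefix/opcode/ModRM length rules: with fixed 4-byte instructions every decoder slot knows its start address up front, while with variable-length instructions each boundary depends on the length of the previous instruction.

    // Toy sketch of the decode-boundary problem; the encoding here is invented
    // (first byte of each instruction holds its length) and only illustrates the
    // serial dependency, not real x86 length decoding.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Fixed-length ISA: instruction i starts at 4*i, so all decoders can start at once.
    std::vector<std::size_t> fixed_boundaries(std::size_t count) {
        std::vector<std::size_t> offsets(count);
        for (std::size_t i = 0; i < count; ++i) offsets[i] = i * 4;  // independent per i
        return offsets;
    }

    // Variable-length ISA: instruction i+1 starts wherever instruction i ends,
    // so finding boundaries is a serial dependency chain.
    std::vector<std::size_t> variable_boundaries(const std::vector<std::uint8_t>& code,
                                                 std::size_t count) {
        std::vector<std::size_t> offsets(count);
        std::size_t pos = 0;
        for (std::size_t i = 0; i < count && pos < code.size(); ++i) {
            offsets[i] = pos;
            pos += code[pos];            // must know length i before locating i+1
        }
        return offsets;
    }

    int main() {
        std::vector<std::uint8_t> toy = {2, 0, 5, 0, 0, 0, 0, 1, 3, 0, 0};  // lengths 2,5,1,3
        for (std::size_t off : variable_boundaries(toy, 4)) std::printf("%zu ", off);
        std::printf("\n");  // prints: 0 2 7 8
    }

Real x86 decoders work around this with things like predecode marker bits and speculative length decoding, but that's extra hardware a fixed-length ISA simply doesn't need.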


I find that odd. Don't they have some sort of icache? Intel could decode into a fixed-width alternative instruction set inside the icache, then use a wider decoder when actually executing.


Yes, they have a cache for decoded operations. It holds a certain number of ops, but it's somewhat inefficient because the fixed-width decoded instructions are a lot larger than the variable-length originals, so it doesn't hold very many. And because it doesn't help on code with a large footprint that doesn't spend much time in inner loops, you don't necessarily want the number of ops you can get from it to be too much more than the width of the rest of the machine if you want a balanced design.
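A rough back-of-the-envelope sketch of that trade-off, with invented numbers (real micro-op caches are organized by lines and ways, and the sizes below are purely illustrative assumptions):

    // Illustrative arithmetic only; the byte counts are assumptions, not real figures.
    #include <cstdio>

    int main() {
        const double avg_x86_insn_bytes = 4.0;         // assumed average x86 instruction size
        const double decoded_uop_bytes  = 12.0;        // assumed storage per fixed-width decoded op
        const double storage_budget     = 32.0 * 1024; // assumed budget in bytes

        std::printf("x86 instructions that fit: %.0f\n", storage_budget / avg_x86_insn_bytes);
        std::printf("decoded ops that fit:      %.0f\n", storage_budget / decoded_uop_bytes);
    }

Same storage budget, noticeably smaller code footprint covered, which is the inefficiency being described.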


The ISA differences between ARM and x86 do not account for the difference in performance; there are multiple factors here (process, SSD, memory bandwidth, cache, thermal reservoir, etc.).

While this is wonderful for ARM in the near term, we just moved from walled-off ISAs to a plurality of ISAs; compute just became a bulk commodity in a way that it couldn't under an x86 duopoly.

Anyone can now take off-the-shelf RISC-V designs that currently score > 7.1 CoreMarks/MHz and get them fabbed at GloFo or TSMC. If you need integration help, you can use SiFive's design services.


There’s not a shred of evidence RISC-V can approach the levels of performance discussed in this thread. There’s a lot of “big implementations can potentially do X” hand waving in RISC-V land, and not much real silicon to show for it.



