
I'm no expert, but the only big architectural differences are a massively larger decoder and a reorder buffer that's several times as large as those in x86 designs.

If these are actually the reasons for the performance difference, and it's difficult to do the same on x86 because of the instruction set, it seems to this amateur that ARM64 really does have an advantage over x86.



Don't forget ARM's more relaxed memory model vs. x86's TSO.


One of the reasons Rosetta 2 works so well is that Apple silicon sticks to the more restricted x86 memory model.


Does it? Apple's documentation seems to disagree [1]:

"A weak memory ordering model, like the one in Apple silicon, gives the processor more flexibility to reorder memory instructions and improve performance, but doesn’t add implicit memory barriers."

[1] https://developer.apple.com/documentation/apple_silicon/addr...
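To make the difference concrete, here's a minimal, generic C++ sketch (nothing Apple-specific, and not how Rosetta actually works): x86's TSO keeps ordinary stores in program order at the hardware level, while a weakly ordered ARM core may make the two stores below visible out of order unless you ask for release/acquire semantics, which compile to ordered instructions or barriers on ARM.

    // Message-passing sketch, assuming generic hardware; illustrative only.
    // Under x86 TSO the two stores in writer() stay ordered by the hardware anyway;
    // on a weakly ordered ARM core the release/acquire pair (stlr/ldar or explicit
    // barriers) is what guarantees the reader sees data == 42 once ready is set.
    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int>  data{0};
    std::atomic<bool> ready{false};

    void writer() {
        data.store(42, std::memory_order_relaxed);
        ready.store(true, std::memory_order_release);    // ordered after the data store
    }

    void reader() {
        while (!ready.load(std::memory_order_acquire)) { }   // pairs with the release
        assert(data.load(std::memory_order_relaxed) == 42);  // guaranteed to see 42
    }

    int main() {
        std::thread a(writer), b(reader);
        a.join();
        b.join();
    }

Which is presumably why a hardware TSO mode is handy for an x86 emulator: translated loads and stores can go through one-for-one instead of picking up barriers everywhere.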


It's switchable at runtime. Apple silicon can enable total store ordering on a per-thread basis while emulating x86_64, then turn it back off for maximum performance in native code.

Here's a kernel extension someone built to manipulate this feature: https://github.com/saagarjha/TSOEnabler


Couldn't Intel just come out with a new set of reduced-complexity instructions that are enabled per process by a bit flipped on context switches? Then legacy apps would run fine, but new software could use the new instructions too. This doesn't seem that hard to address.


As I understand it, the challenge in making wider x86 chips is the variable-length encoding of the instructions that already exist; adding new instructions can't help with that. But I'm just repeating what I heard elsewhere:

> Other contemporary designs such as AMD’s Zen(1 through 3) and Intel’s µarch’s, x86 CPUs today still only feature a 4-wide decoder designs (Intel is 1+4) that is seemingly limited from going wider at this point in time due to the ISA’s inherent variable instruction length nature, making designing decoders that are able to deal with aspect of the architecture more difficult compared to the ARM ISA’s fixed-length instructions.

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...
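Here's a toy C++ sketch of why fixed-length decode parallelizes more easily. The "encoding" is made up (first byte = instruction length) and just stands in for x86's real prefix/opcode/ModRM length rules: with fixed 4-byte instructions every decoder slot knows its start address up front, while with variable-length instructions each boundary depends on the length of the previous instruction.

    // Toy sketch of the decode-boundary problem; the encoding here is invented
    // (first byte of each instruction holds its length) and only illustrates the
    // serial dependency, not real x86 length decoding.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Fixed-length ISA: instruction i starts at 4*i, so all decoders can start at once.
    std::vector<std::size_t> fixed_boundaries(std::size_t count) {
        std::vector<std::size_t> offsets(count);
        for (std::size_t i = 0; i < count; ++i) offsets[i] = i * 4;  // independent per i
        return offsets;
    }

    // Variable-length ISA: instruction i+1 starts wherever instruction i ends,
    // so finding boundaries is a serial dependency chain.
    std::vector<std::size_t> variable_boundaries(const std::vector<std::uint8_t>& code,
                                                 std::size_t count) {
        std::vector<std::size_t> offsets(count);
        std::size_t pos = 0;
        for (std::size_t i = 0; i < count && pos < code.size(); ++i) {
            offsets[i] = pos;
            pos += code[pos];            // must know length i before locating i+1
        }
        return offsets;
    }

    int main() {
        std::vector<std::uint8_t> toy = {2, 0, 5, 0, 0, 0, 0, 1, 3, 0, 0};  // lengths 2,5,1,3
        for (std::size_t off : variable_boundaries(toy, 4)) std::printf("%zu ", off);
        std::printf("\n");  // prints: 0 2 7 8
    }

Real x86 decoders work around this with things like predecode marker bits and speculative length decoding, but that's extra hardware a fixed-length ISA simply doesn't need.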


I find that odd. Don't they have some sort of icache? Intel could decode into a fixed-width alternative instruction set inside the icache, then use a wider decoder when actually executing.


Yes, they have a cache for decoded operations. It holds a certain number of ops, but it's somewhat inefficient because the fixed-width decoded instructions are a lot larger than the variable-length originals, so it doesn't hold very many. And because it doesn't help on code with a large footprint that doesn't spend much time in inner loops, you don't necessarily want the number of ops you can get from it to be too much more than the width of the rest of the machine if you want a balanced design.
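A rough back-of-the-envelope sketch of that trade-off, with invented numbers (real micro-op caches are organized by lines and ways, and the sizes below are purely illustrative assumptions):

    // Illustrative arithmetic only; the byte counts are assumptions, not real figures.
    #include <cstdio>

    int main() {
        const double avg_x86_insn_bytes = 4.0;         // assumed average x86 instruction size
        const double decoded_uop_bytes  = 12.0;        // assumed storage per fixed-width decoded op
        const double storage_budget     = 32.0 * 1024; // assumed budget in bytes

        std::printf("x86 instructions that fit: %.0f\n", storage_budget / avg_x86_insn_bytes);
        std::printf("decoded ops that fit:      %.0f\n", storage_budget / decoded_uop_bytes);
    }

Same storage budget, noticeably smaller code footprint covered, which is the inefficiency being described.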


The ISA differences between ARM and x86 do not account for the difference in performance; there are multiple factors here (process, SSD, memory bandwidth, cache, thermal reservoir, etc.).

While this is wonderful for ARM in the near term, we just moved from walled-off ISAs to a plurality of ISAs; compute just became a bulk commodity in a way that it couldn't under an x86 duopoly.

Anyone can now take off-the-shelf RISC-V designs that currently score > 7.1 CoreMarks/MHz and get them fabbed at GloFo or TSMC. If you need integration help, you can use SiFive's design services.


There’s not a shred of evidence RISC-V can approach the levels of performance discussed in this thread. There’s a lot of “big implementations can potentially do X” hand waving in RISC-V land, and not much real silicon to show for it.



