How can this be true? I'm not saying it isn't, but this makes it sound like ARM is just objectively better - lower power and higher performance? I assume it's more complicated.
Most of the time you can increase performance either by raising the clock frequency or by doing more work per clock. Raising the clock usually means raising the voltage too, so power grows much faster than linearly (roughly with the cube of frequency, to a first approximation). On a desktop that's usually the strategy anyway, because we can put decent cooling rigs on them.
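To put rough numbers on it, here's a minimal sketch (Python) of the standard first-order dynamic power model, P ≈ C·V²·f, assuming voltage has to scale roughly in step with frequency. Real DVFS curves are messier than this, but it shows why halving the clock saves far more than half the power:

```python
# Illustrative only: first-order dynamic power model, P ~ C * V^2 * f.
# Assumes voltage scales roughly linearly with frequency, which is a
# simplification of real voltage/frequency curves.

def relative_power(freq_scale: float) -> float:
    """Dynamic power relative to baseline when frequency (and voltage) scale together."""
    voltage_scale = freq_scale              # simplifying assumption: V tracks f
    return voltage_scale ** 2 * freq_scale  # P ~ V^2 * f  ->  roughly f^3

print(relative_power(2.0))   # ~8x the power for 2x the clock
print(relative_power(0.5))   # ~0.125x the power at half the clock
```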
Doing more per clock is harder on x86 than on ARM. The x86 instruction set is a hodgepodge of instructions with variable lengths and addressing modes. ARM64, on the other hand, has far fewer addressing modes and a fixed 32-bit instruction length. When an x86 core tries to decode ahead of the instruction stream, it can't know where the next instruction starts until it knows the length of the current one, so it either decodes in order or needs special logic to get around that, which makes it harder to stay ahead of the processor. You'll normally see an x86 chip described as having a certain number of complex decoders and a certain number of simple decoders, because some instructions are real pigs to decode. Simple decoders get the instructions that decode to 3 uops or fewer, while the complex decoders handle most of the rest. The real pigs might even be sent to the microcode sequencer, which generates a whole heap of uops and takes a while.
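Here's a toy sketch of why finding instruction boundaries is the painful part. The "ISA" here is invented (the first byte just encodes the instruction's total length) and real x86 decoding is far more involved, but the serial dependency it demonstrates is the same:

```python
# Toy sketch of why variable-length decode is serial. This is NOT a real x86
# decoder; the "ISA" is made up: the first byte of each instruction encodes its
# total length, loosely mimicking how x86 lengths depend on prefixes/opcode/ModRM.

def find_boundaries(code: bytes) -> list[int]:
    """Return the start offset of each instruction.

    Each iteration depends on the previous instruction's length, so the loop
    cannot be trivially parallelised - the core of x86's decode problem.
    """
    offsets = []
    pc = 0
    while pc < len(code):
        offsets.append(pc)
        length = code[pc]   # must decode THIS instruction to locate the next one
        pc += length
    return offsets

# Example stream: lengths 3, 1, 5, 2 -> instructions start at offsets 0, 3, 4, 9
stream = bytes([3, 0, 0, 1, 5, 0, 0, 0, 0, 2, 0])
print(find_boundaries(stream))    # [0, 3, 4, 9]
```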
In the case of ARM64, every 4 bytes you have an instruction, come hell or high water. On a chip like the M1, the front end takes 32 bytes of instructions, splits them every 4 bytes across its 8 decoders, and each decoder spits out uops in parallel. From there the chip issues those decoded instructions to the necessary execution ports. Because of the simpler decoding, the huge increase in decode throughput, and the huge reorder buffers, an M1 can keep more of its execution ports busy. If you can keep twice as many execution ports full, you can do the same amount of work in half as many clock cycles, and because you're only running at half the clock speed your power usage is way lower.
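And the contrast for a fixed-width ISA, again just an illustrative model of the idea rather than Apple's actual front end: slicing a 32-byte fetch window into 8 decode slots needs no information from neighbouring instructions, so nothing stops all 8 slots working at once:

```python
# Contrast sketch: with a fixed 4-byte instruction length, splitting a 32-byte
# fetch window into 8 decode slots is pure slicing. This models the idea
# described above, not the M1's real front-end implementation.

FETCH_BYTES = 32
INSN_BYTES = 4   # every ARM64 instruction is exactly 4 bytes

def split_fetch_window(window: bytes) -> list[bytes]:
    assert len(window) == FETCH_BYTES
    # Each slice is independent - in hardware these feed 8 parallel decoders.
    return [window[i:i + INSN_BYTES] for i in range(0, FETCH_BYTES, INSN_BYTES)]

window = bytes(range(32))
slots = split_fetch_window(window)
print(len(slots))          # 8 decode slots
print(slots[0], slots[7])  # each decoder sees exactly one 4-byte instruction
```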
The code density of ARM64 is not that much worse than x64, especially for anything generated by a modern compiler. You may get some small-scale gains with hand-tuned x86 code through careful instruction and register selection (i.e. where the REX prefix can be more easily avoided), but average binary density doesn't overcome the aforementioned differences in efficiency.
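For a feel of why density roughly washes out, here are a few illustrative instruction sizes. The x86-64 numbers assume the common encodings; this isn't a measurement of real binaries, just a reminder that variable length cuts both ways:

```python
# Illustrative byte counts only (not a benchmark of real-world code density).
# x86-64 sizes assume the usual encodings; ARM64 instructions are always 4 bytes.
example_sizes = {
    # simple register-register add
    "x86-64: add rax, rbx (REX.W + opcode + ModRM)": 3,
    "arm64:  add x0, x0, x1": 4,
    # loading a full 64-bit constant
    "x86-64: movabs rax, imm64": 10,
    "arm64:  movz/movk x0 sequence (up to 4 insns)": 16,
}
for desc, size in example_sizes.items():
    print(f"{size:>2} bytes  {desc}")
```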