CISC vs RISC is almost completely irrelevant these days.
They have borrowed so many ideas from each other that the architectures are now nearly identical.
Neither the modern ARM nor the modern x86 architecture deserves to be called RISC or CISC.
Still, there are a few areas where the A64 architecture is theoretically "better" than x64, due to the lack of legacy.
The first is instruction decoding: x86 has to deal with a whole bunch of weird instruction length modifiers, prefix bytes, weird ModRM encodings, and years of extensions. To decode the 4 or 5 instructions per cycle that modern out-of-order microarchitectures demand, Intel CPUs have to attempt decoding instructions at every single byte offset in a 16-byte buffer, throwing away the unwanted decodings. That's got to waste transistor and power budgets.
In comparison, when an ARM CPU is in A64 mode, all instructions are the same length, making decoding multiple instructions per cycle trivial.
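As a toy C sketch of why that is (names are mine for illustration, not a claim about how any real front end is built): with fixed 4-byte instructions the front end knows where instruction k starts before decoding anything, so all the decoders can work in parallel, whereas an x86 front end cannot know where instruction k+1 starts until it has worked out the length of instruction k (or has speculatively decoded at every byte offset, as described above).

    #include <stdint.h>
    #include <string.h>

    /* With A64's fixed 4-byte instructions, finding instruction k in a
       fetch window needs no decoding at all. */
    static uint32_t a64_fetch(const uint8_t *window, int k) {
        uint32_t insn;
        memcpy(&insn, window + 4 * k, sizeof insn);  /* instruction k starts at byte 4*k */
        return insn;
    }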
The second area is memory consistency guarantees. x64 has relatively strong guarantees, allowing simpler assembly code but at the cost of a more complex memory subsystem between cores.
A64 has much weaker ordering guarantees, which saves on hardware complexity, but requires the programmer (and/or compiler) to insert memory fences whenever stricter ordering is needed.
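For a concrete (if simplified) illustration, here's a minimal sketch using C11 atomics; the instruction mappings in the comments are the usual ones for these targets, but exact codegen depends on the compiler:

    #include <stdatomic.h>

    atomic_int data;
    atomic_int ready;

    void producer(void) {
        atomic_store_explicit(&data, 42, memory_order_relaxed);
        /* Release store: on x86-64 this is an ordinary MOV, because stores are
           already ordered; on A64 it needs a store-release (STLR) or a barrier. */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void) {
        /* Acquire load: ordinary MOV on x86-64, load-acquire (LDAR) or a
           load plus barrier on A64. */
        while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
            ;
        return atomic_load_explicit(&data, memory_order_relaxed);
    }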
This is all theoretical; I have no idea what difference this makes at the practical level.
> Still, there are a few areas where the A64 architecture is theoretically "better" than x64, due to the lack of legacy.
Every extant A64 processor needs to deal with ARMv8 instructions, Thumb instructions, Thumb-2 instructions, shift encodings, microcoded load/store-multiple instructions... It's true that the subset of the ISA that makes up the overwhelming bulk of instructions actually executed is very simple. But frankly that's true of x86 as well.
> when an ARM CPU is in A64 mode
That's... not the way hardware works. Those transistors are still there, it's not like you can make them go away by switching "modes". They still are ready to switch every cycle, and in any case having transistors that "aren't needed" doesn't make anything faster, because they all execute in parallel (or at worst in an extra pipeline stage or two) anyway.
Either you have a simple architecture or you don't. A64 is a simple ISA. Real CPUs have complicated legacy architectures.
It's not meaningful to distinguish Thumb and Thumb2 - Thumb2 was just a bunch of new instructions added to Thumb.
A typical ARMv8-A processor needs to decode the A64, A32 and T32 instruction sets, but that doesn't strike me as a significant burden. A32 and T32 are essentially just different encodings of the same instruction set - there are very few A32 or T32 instructions that don't have an equivalent in the other instruction set. A64 has more and wider registers, but is otherwise broadly similar in capabilities. I would expect that most ARMv8-A implementations unify the three instruction sets very early in the decoding process.
Retaining the system-level aspects of AArch32 strikes me as more expensive, especially support for short page tables, the subtly different system register layout, the more complex relationship between PSTATE.M and the security state, and the banked system registers between the secure and non-secure states. I'm surprised ARM-designed cores haven't pushed harder to eliminate AArch32 at the higher exception levels (although I'm aware of some cores designed by ARM Architecture licensees that do so). Perhaps the fact that ARM has retained AArch32 all the way up to EL3 is evidence that they believe doing so isn't very expensive.
> That's... not the way hardware works. Those transistors are still there, it's not like you can make them go away by switching "modes".
First, even in T32 mode, there are only two instruction lengths, 2 and 4 bytes. That requires far fewer transistors than modern Intel and AMD CPUs, which support single-cycle decoding of instruction lengths from 1 to 15 bytes.
Second, due to prefix bytes, the bits which control the x64 instruction length are spread all throughout the instruction, and they interact with each other. There are instructions where every single byte modifies the length. Decoding variable-length T32 is easy: there are only a few bits in the first halfword that select between the 16- and 32-bit encodings (see the sketch at the end of this comment).
Third, T32 support is technically optional. Apple has required all apps to be recompiled for A64 for years; I'm not even sure their latest CPUs still support switching to T32 mode.
Fourth, mode switching does use fewer transistors than a design that has to choose on a per-instruction basis. Dynamically detecting the instruction type takes a lot of extra transistors, which would have to be replicated 4 or 8 times across the decoders and are unnecessary here. All you need is a single-bit register to store the current state and a special instruction to switch between modes.
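To make the second point concrete, here's a minimal sketch of the T32 length check (the halfword values come from the ARM ARM's Thumb encoding rules):

    #include <stdint.h>

    /* Length in bytes of a T32 (Thumb-2) instruction, given its first
       (lowest-addressed) halfword: if the top five bits are 0b11101,
       0b11110 or 0b11111 it is a 32-bit encoding, otherwise 16-bit. */
    static int t32_length(uint16_t first_halfword) {
        unsigned top5 = first_halfword >> 11;
        return (top5 == 0x1D || top5 == 0x1E || top5 == 0x1F) ? 4 : 2;
    }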
I recall at least the (now discontinued) Qualcomm Centriq (Falkor) did away with 32-bit ARM (ARMv7, thumb*, etc.) and only supported the 64-bit ARMv8. If the market you're targeting doesn't value backwards compatibility, why bother?
Wouldn't surprise me if there are other A64-only chips as well.
Indeed. Both Falkor and ThunderX2 are A64-only. The backwards-compatible A32 mode is optional and not implemented by the ARM server vendors, except Ampere eMag.
iOS 11 removed support for 32-bit applications. Because the iPhone X requires iOS 11, I’m not aware of any way to run 32-bit code on A11 processors and above. A32 support is likely already gone at the hardware level.
Regarding power budget, that absolutely could be the way hardware works. Modern chips can and do power off pieces of silicon when they are not needed. Whether that would be worth it for instruction decoding, I don't know, but it could be.
Powering off pieces of silicon when they are not needed is done through "clock gating", where you stop feeding a clock to a block that is not needed.
That is only possible when you deal with isolated parts. You cannot, for example, power down an instruction decoder's ability to understand different syntaxes; you can only power down the entire instruction decoder. Trying to design it so that sub-features of a block like that can be powered down would not be productive.
A realistic clock-gating would be something like powering down the actual execution units ("We don't need AVX-512, so let's not waste power on the execution units"), but that doesn't help in saving power wasted on legacy.
You can absolutely design the instruction decoder into two parallel decoders that decode AArch32 and AArch64 respectively. Splitting Thumb from the rest of AArch32 probably doesn't make sense, and on x86 it probably doesn't make sense to break out 32 and 64 bit, but I can absolutely see the case for AArch32 vs AArch64.
> You can absolutely design the instruction decoder into two parallel decoders that decode AArch32 and AArch64 respectively.
You can design anything. The question is whether the added design complexity (which for silicon directly translates to increased power consumption) outweighs the benefits.
> There would be significant overhead to design a decoder such that it could switch between legacy and aarch64 only, but it could conceivably be done.
What you'd do then is to split the decoder into several blocks, so that there's a fan-out from a main decoder into the different sub-decoders, and then power down the sub-decoders. It's still entire blocks you power down.
Plus, I think the increased power consumption from this design (especially considering that the decoder now needs to stall on powered down sub-decoders) will outweigh the savings of powering down any sub-decoders.
> fyi clock gating isn't the same as power gating.
Of course not. Both clock gating and power gating are power-saving designs: both eliminate switching current entirely, while power gating also removes leakage current, at the cost of larger architectural changes than clock gating requires.
I'm out on a limb here, but I don't think power gating makes much sense outside extreme low-power devices.
I'm not a CPU guy, so forgive me for asking what is probably a really dumb or unclear question, but given that Apple controls the whole stack, couldn't they just remove the non-64-bit microcode to save space?
It seems like if it's doable, that Apple would do it - especially since they killed off 32-bit apps in iOS 11 already, and therefore were able to remove the 32-bit code from the iOS codebase too.
"Microcode" is the wrong word, it will be physical hardware on the chip, but yes - the sibling comment by jabl claims that some of the A64 chips had no 32-bit mode.
>jabl claims that some of the A64 chips had no 32-bit mode.
Fair enough. [1] states it is AArch64 only. I spent some time trying to find the official answer to AArch32 being optional; unfortunately nothing concrete has come up.
The reference you're looking for is likely to be the ARMv8-A Architecture Reference Manual. For example, it mentions in multiple places "if AARCH32 is not implemented".
> That's got to waste transistor and power budgets. In comparison, when an ARM CPU is in A64 mode, all instructions are the same length, making decoding multiple instructions per cycle trivial.
On the other hand, that wastes cache memory and fetch bandwidth. Instruction density is very important especially since caches are big and consume a lot of power too. I believe that if it weren't for the brief period in the 80s where memory speeds were higher than core speeds, what we know as "RISC" today would've never appeared.
This is true in an abstract comparison between fixed and variable length encodings, but not really true in the specific comparison of AArch64 and x86: the latter is far from an ideal fixed-length encoding, largely because of path dependent development of the current ISA.
A significant amount of space is wasted on old, rarely used instructions, and the x86-64 encoding was chosen mostly based on similarity with the earlier 32-bit ISA, which reflects decades-old instruction frequencies. The addition of many ISA extensions has become progressively more difficult: the newest AVX-512 instructions which use the EVEX encoding have 4 (!!) prefix bytes before the instruction even starts. So just the prefix is as long as any AArch64 instruction.
The net result is that AArch64 binary sizes are largely comparable with x86, and there is still a bit of juice to squeeze out, as the ARM compiler backends haven't had the same decades of heavy optimization that x86 has.
Also, ARM is one of the densest RISC-style ISAs. A64 less so than A32 but it still has things like conditional execution of some of the most common instructions, loading two registers in one instruction, etc.
"On the other hand, that wastes cache memory and fetch bandwidth..."
This. One of the reasons why modern x86 processors have such strong performance is that their external-facing CISC interface effectively acts as memory compression while they are RISC on the inside. In many ways it's a best-of-both-worlds outcome that was achieved through incremental evolution.
No, x86-64 is just as inefficient as AArch64 these days, because of all the REX prefixes. Almost every instruction needs one and it bloats the size tremendously, to the extent that most x86-64 instructions are just about 4 bytes. Measure binary sizes if you don't believe me.
> Almost every instruction needs one and it bloats the size tremendously, to the extent that most x86-64 instructions are just about 4 bytes.
That's only if your code happens to be particularly "64-bit-heavy", or the compiler isn't doing a good job at selecting registers; the original designers (at AMD, not Intel) decided on the prefixes (and defaulting to 32-bit for most ops) instead of defaulting all operations to 64-bit in 64-bit mode precisely because it would be better for size and performance --- using their carefully optimised compilers. Plus, what can be done with a single 4-byte instruction on x86 can require multiple 4-byte ARM instructions, and that adds up quickly.
I can't find it at the moment but one of the studies I remember comparing the binary sizes was using GCC, which is widely available and free, but probably one of the worst compilers at x86 size optimisation I've seen. I even recall a remark in that study about how it was generating mostly RISC-like instructions, so in other words they were comparing binaries generated for a RISC CPU using a RISC-oriented compiler with ones generated for a CISC CPU using a RISC-oriented compiler, failing to exploit the full capabilities of a CISC.
I've written x86 Asm for several decades (started with 16-bit --- dating myself here...), and done some occasional MIPS and ARM, and it's very difficult for me to believe that the RISCs have any intrinsic advantage in code density other than the fact that compilers for x86 aren't that great at it; you can write a Fibonacci calculator for x86 in 5 bytes and pushes and pops are single-byte instructions, while on the RISCs even a register-register move is 4 bytes.
> That's only if your code happens to be particularly "64-bit-heavy", or the compiler isn't doing a good job at selecting registers
No, that's true for basically all code. 6 or 7 registers isn't enough for basically anything interesting, so you end up pretty much always hitting the high registers.
> Plus, what can be done with a single 4-byte instruction on x86 can require multiple 4-byte ARM instructions, and that adds up quickly.
The only real difference is that you have memory load addressing modes in x86, while for load-store architectures like AArch64 you don't. But:
* On x86-64 you have two-address instructions, not three-address instructions. This means that AArch64 "sub x9,x10,x11", or "49 01 0b cb", becomes x86-64 "mov r9,r10; sub r9,r11", or "4d 89 d1 4d 29 d9": 4 bytes vs. 6, thanks to the doubled REX prefix.
* On x86-64 immediates are very inefficiently encoded, while they tend to be compressed on RISCs to fit in the 32-bit instruction word. The end result is that AArch64 "sub x9,x10,#1234", or "49 49 13 d1" in 64-bit mode becomes x86-64 "lea r9,[r10-1234]", which is "4d 8d 8a 2e fb ff ff": 4 bytes vs. 7.
> I can't find it at the moment but one of the studies I remember comparing the binary sizes was using GCC, which is widely available and free, but probably one of the worst compilers at x86 size optimisation I've seen.
LLVM is doing pretty well at x86-64 size optimization: for example, it prefers to select lower registers to reduce size. As I recall, Dan Gohman told me the code size win was something on the order of 2%. It really doesn't make a big difference: AArch64 and x86-64 have about the same code size.
> you can write a Fibonacci calculator for the latter in 5 bytes
But real code, again, hits the high registers.
> pushes and pops are single-byte instructions
Pushes and pops aren't used by most compilers, except in function prologs and epilogs. This is actually an example of inefficiency in the design of x86-64. The opcode space shouldn't go to operations that are only used to set up and tear down functions.
> on the former even a register-register move is 4 bytes
"mov r11,r12" is 3 bytes on x86-64. Not a big difference…
> I even recall a remark in that study about how it was generating mostly RISC-like instructions,
There is a very good reason for GCC's x86 backend to do this. Intel/AMD optimisation manuals provide a subset of x86 instructions that are worth using. Instructions that are actually fast in modern designs, that don't fall back to legacy microcode.
This subset looks very RISC-like.
Sure, x86 has CISC instructions that sometimes allow very dense code, but if you want your code to actually run fast you need to do it the RISC way.
I would love to see a newer CISC that didn't have all these required prefixes, and took a more Huffman-encoding perspective so that instructions like hlt aren't allocated a single byte. Memory-to-memory ops essentially let you encode physical registers without using architectural registers, saving additional bits too.
And RISC-V encodes programs into fewer bytes than x86, so it wins in that regard. But there are still no implementations that have all the other features needed for top performance. There are many factors that affect performance.
Not only do many different features contribute to performance, but performance also depends heavily on the use case. Two CPUs implementing the same arch might each win a benchmark that has a different instruction mix or memory access pattern, etc.
Nope, you don't need anything like the complexity of x86 decode in order to reach good instruction density. RISC-V matches x86 with a simple "compressed" extension involving 16-bit forms for some of the most common instructions (but still quite straightforward to decode in hardware).
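For what it's worth, the corresponding length check in RISC-V is even simpler (a minimal sketch covering only the standard 16/32-bit encodings; longer formats are reserved but unused by the ratified base ISA):

    #include <stdint.h>

    /* A RISC-V instruction whose two lowest bits are not 0b11 is a 16-bit
       "compressed" (C extension) instruction; 0b11 means a 32-bit one. */
    static int rv_length(uint16_t first_halfword) {
        return ((first_halfword & 0x3) == 0x3) ? 4 : 2;
    }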
I agree. It's almost always more performant to intentionally build a system to have the characteristics that another system incrementally evolved to have. This is why rewritten software (if actually delivered) often performs better than the original; it's usually this, and not new framework X or new language Y, that results in the win.
Not being any kind of expert in processor microarchitecture, I wonder if there is potential in just actually compressing instructions, as in with a Huffman table. Could hardware decoding be made fast enough that the cache and fetch savings would make up for it?
Modern ISAs like RISC-V come up with the space-conscious portions of their ISAs by taking a sort of Huffman-encoding perspective on the instruction stream. They definitely iterated on that base concept of allocating the number of bits based on frequency in the stream.
Apple is already doing this (transparent memory/cache compression) on the GPU side, and there is some speculation they are doing it for the LLC on the CPU side, or may start doing it soon, based on patents they have filed.
If you look at the benchmarks for x32 (ILP32) you see an improvement of up to 10% over x86-64. Granted, this mostly benefits pointer-heavy code, but it still doesn't make up for the lion's share of the performance difference.
That kind of benchmark shows the data path cost of big pointers but misses much of the instruction path cost of inefficient encoding, because the cost doesn't usually manifest as bottlenecking at the frontend. The cost is having a frontend that can keep up. On Intel cores this involves having a huge decoder that can deal with arbitrary alignment and unbounded sequences of prefixes; and having caches both for instructions and decoded uops (multiple types of the latter). Plus fundamental limitations on what the backend can do: there would be no point adding a 4th vector op x-port, because the frontend is miles away from being able to keep up with that many instruction bytes. All of that, and still the programmer/compiler walks a knife's edge avoiding frontend bottlenecks, trying to keep code tight and aligned so the important parts fit in loop buffer or at least uop cache and don't have to squeeze through the decoders. Use it or don't, REX is paid for.
> Instruction density is very important especially since caches are big and consume a lot of power too.
Which hurts x86 a lot, because x86-64 is very space inefficient for a variable length ISA. The REX prefixes add up to make x86-64 just as space-inefficient as AArch64.
Except that nearly all the RISCs have a 16-bit instruction extension which brings their instruction density close to that of the CISCs.
And the decoding remains much simpler than a CISC's.
That said, ForwardCom blurs the boundary between CISC and RISC...
In my experience, ARM Thumb (which is ARM's 32-bit take on dense encoding) makes code which is about 75% of the size of the equivalent x86 (i386) code. I am sure ARM, if there was enough demand, could make a 64-bit version of Thumb.
I'm actually sort of surprised that ARM doesn't do that. Given that they have Thumb sitting there for them to use and given that i-cache pressure is an important consideration sometimes I would tend to assume it would be a no-brainer. Given that ARM's architects know more about this than I do I have to conclude that there are more disadvantages to even simple variable length instruction encodings than I would assume.
OTOH, if you can move a couple billion transistors from instruction decode and reordering to the L1 cache and execution pipelines, you'll get less cache pressure and deeper pipelines.
There is no magic - down at the back-end, x86s and ARMs are doing the same thing, and getting the same performance will cost about the same chip area and power. Where they differ, however, is on the front-end: the pieces that decode instructions, issue micro-operations and schedule them through to the execution units. In that space ARM64 seems more promising.
Unless, of course, Intel decides to throw away a lot of the backwards compatibility and goes very low-transistor-budget for legacy instructions (maybe software traps), freeing a ton of chip area to implement the ones they care about in a fast and power efficient way. That x86 would be unable to run DOS, but I don't think many of us would care.
OTOOH, you can't just add a couple of MB of L1 cache without increasing delays, and even trying to add to other layers would result in much more complicated lookups.
You may be unable to significantly enlarge the caches, but you could assign a larger cache to each core or privilege level, a larger, less shared, L2 or L3 or even go like the IBM z and implement an L4 cache.
>CISC vs RISC is almost completely irrelevant these days.
This nonsense keeps coming up. No, it's not irrelevant. It matters. A lot.
A CISC design is complex, but it doesn't stop there. This complexity spreads down the chain. Implementations get complex, bugs happen. Making formal proofs of an implementation's correctness becomes impossible. Writing a compiler back end will be complex. Debugging it will be complex. Writing a proof that the machine code meets both the ISA specification and implements the same thing the higher level language does is also complex.
Now, where's the advantage of CISC to justify this complexity? Yeah, right.
Back when the terms CISC and RISC came out they were terms for whole collections of traits that inevitably all came in one package or the other. But since the 80s the world has become more complex. The coherent philosophy behind RISC has ceased to make sense for the highest end computers just as CISC had ceased to make sense for them when RISC rolled around.
We don't need to reduce the number of instructions to fit the whole processor on a single piece of silicon any more. With the decoupling of the processor and memory clocks and the introduction of caches load-store is a less pressing matter than it was. And the complexity of a processor is dominated by the fiendish complexity of executing operations out of order while still appearing in order to all outward appearances, even in the face of interrupts.
The legacy of these styles is still with us in the ISAs that were defined back then and the complexity of decoding an ISA can still make a small but noticeable difference. It can even have security implications when a sequence of bytes could be read as one of two valid instruction streams depending on where you start reading. But most of the architectural complexity of a modern processor has very little to do with the ISA and whether the architecture it was originally written for was RISC or CISC.
> load-store is a less pressing matter than it was.
I'd guess load-store with sufficient architectural registers is still an advantage if you're doing an in-order design, as that allows the compiler to schedule the load as early as possible? Sure, for an OoO design which splits a load-op into separate micro-ops this doesn't matter.
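Roughly the kind of thing a compiler can do on a load-store machine with registers to spare (a toy sketch; the scheduling benefit only really shows up on an in-order core):

    /* The load is issued well before its value is needed, so independent
       work hides the load latency on an in-order pipeline. A register-memory
       "add reg, [mem]" form fixes the load-to-use distance at zero. */
    int hoisted(const int *p, int x, int y) {
        int t = p[0];       /* load issued early                */
        int z = x * y;      /* independent work fills the gap   */
        return z + t;       /* loaded value consumed only here  */
    }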
You might want to look up the ARM instructions "FJCVTZS" and "AESE", and "SHA256H".
ARM is a CISC processor these days. AESE has a throughput of one per clock cycle, and there's also FJCVTZS, "Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero".
AESE is in particular a very "CISC" instruction, because it is usually macro-op fused with an AESMC instruction. The ARM decoder will look for AESE + AESMC instruction pairs and execute them as a single macro-op (kind of like how x86 joins "cmp" and "jnz" instructions together into a singular op).
None of these instructions follow the "RISC" paradigm. As soon as processors hit the real world, it turns out that CISC instructions that accelerate common tasks (like AES encryption, done by every web server and web browser today), or even Javascript-specific instructions, are a good thing.
---------
ARM's "CISC" roots go deeper than that. For a long time, ARM machines had a "Jazelle" mode which directly executed Java bytecode. When Java-for-phones became less popular, Jazelle support was dropped.
But in any case, the "CISC" advantage is that you get instructions designed for the applications that run on your system. And let's be frank here: AES acceleration just makes sense these days. Everyone uses a web browser or a web server.
Honestly, x86 is falling behind the CISC-wars. Intel should try to catch up and implement a Floating-point Javascript Convert assembly instruction as well, to improve those "Geekbench" scores.
-----------
CISC vs RISC is stupid. The purported advantages of RISC have been ported to CISC, and vice versa... ARM and Power9 now support CISC-like instructions such as AESE (ARM) and vcipher (Power9). All processors will have CISC instructions to accelerate their most common tasks these days: there's a lot of extra die space (especially because large portions of the die have to be kept 'off' to help distribute the heat), so rarely used specialized instructions are very useful for heat-distribution purposes.
The biggest advantage to "RISC-V" is the ability to add application-specific instructions to the core. That's innately a CISC-design: custom instructions to support everyone's favorite optimizations.
The original design of "RISC", that is REDUCED instruction set, is incompatible with today's cheap transistors. You can fit billions of transistors on modern systems, so there's almost no reason to reduce your instruction sets.
> You might want to look up the ARM instructions "FJCVTZS" and "AESE", and "SHA256H".
The AESE instruction is a perfect example of that.
The AESE instruction is a complex instruction that does not execute in the RISC pipeline, but it doesn't add major complexity to the decoder. That makes it very similar to the multiply and divide instructions on the original MIPS CPU: those operated outside of the canonical pipeline as well, with results stored in the HI and LO registers instead of the general purpose register file.
Yet I don't think anyone will argue that the 1985 MIPS was not RISC. ;-)
> but it doesn't add major complexity to the decoder
AESE + AESMC is seen as a singular 8-byte instruction from the decoder. The two instructions are decoded as one operation to maximize the throughput of AES. Yeah, web browsers + web server workloads demand an incredibly fast AES, to the point that ARM is willing to complicate the decoder just to make this one operation faster.
> Yet I don't think anyone will argue that the 1985 MIPS was not RISC. ;-)
1985 MIPS wasn't macro-op fusing together instructions for performance gains. The AESE + AESMC instruction pair is straight-up a CISC design (borrowed from Intel's cmp + jnz fusion), complicating the decoder severely but adding huge performance increases in practice.
> That makes it very similar to the multiply and divide instructions on the original MIPS CPU: those operated outside of the canonical pipeline as well, with results stored in the HI and LO registers instead of the general purpose register file.
AESE and AESMC operate on NEON registers. They coincide with the NEON Pipeline and NEON Registers. They're not "outside" the core by any stretch of the imagination.
As you can see in the diagram, the FP0 and FP1 pipelines are all that exist in A75. The AESE / AESMC instruction pair is decoded as one instruction. Its execution is in the FP0 pipeline, just like any other NEON instruction. These are all very CISC-like design decisions.
> AESE + AESMC is seen as a singular 8-byte instruction from the decoder. The two instructions are decoded as one operation to maximize the throughput of AES. Yeah, web browsers + web server workloads demand an incredibly fast AES, to the point that ARM is willing to complicate the decoder just to make this one operation faster.
But that's an implementation decision and not inherent to the ISA itself: AESE and AESMC can be implemented as individual instructions without fusing them.
The fact that RISC-V advocates fusing certain pure RISC opcode patterns for improved performance doesn't make it any more CISC either.
> AESE and AESMC operate on NEON registers. They coincide with the NEON Pipeline and NEON Registers. They're not "outside" the core by any stretch of the imagination.
The key part here is "NEON Pipeline and NEON Register". If you can make the split between the traditional pipeline and the additional pipelines, the complexity penalty is contained.
The issue with x86 CISCness has always been with having to deal with byte-aligned variable width instructions.
> But that's an implementation decision and not inherent to the ISA itself: AESE and AESMC can be implemented as individual instructions without fusing them.
On the contrary: the compiler and programmers have performance expectations. Just as modern compilers/programmers expect macro-op fusion between x86 cmp/jmp pairs, modern ARM compilers/programmers expect fusion between AESE and AESMC.
Especially when we're talking about articles like "ARM processors like A12X are nearing performance parity...", we have to understand the architectural decisions ARM has made to get to this point.
Some CISC-techniques are really, really good for performance. As such, ARM takes those CISC techniques. AESE + AESMC is an excellent example.
---------
You can't write a compiler optimizer unless you have an idea of what the CPU core is actually doing. The expectation for any cell-phone ARM chip is to have fusion between AESE and AESMC. That's just how it works these days. Modern compilers are going to work very hard to put AESE + AESMC instructions next to each other to maximize the potential for fusion.
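For instance, a minimal sketch using the NEON crypto intrinsics from <arm_neon.h> (this assumes a toolchain with the crypto extension enabled, e.g. -march=armv8-a+crypto; whether the pair actually fuses depends on the particular core):

    #include <arm_neon.h>

    /* One AES round: vaeseq_u8 emits AESE (AddRoundKey + SubBytes + ShiftRows)
       and vaesmcq_u8 emits AESMC (MixColumns). Compilers keep the two
       instructions adjacent so cores that fuse AESE+AESMC can issue them as
       a single operation. */
    static uint8x16_t aes_round(uint8x16_t state, uint8x16_t round_key) {
        state = vaeseq_u8(state, round_key);
        return vaesmcq_u8(state);
    }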
> The key part here is "NEON Pipeline and NEON Register". If you can make the split between the traditional pipeline and the additional pipelines, the complexity penalty is contained.
Ehhh? AMD Zen has a FP Pipeline and FP Register pools to implement x86. Basically the 64-bit pipeline+registers for normal instructions, and then 128-bit pipeline+registers for SSE / AVX / etc. etc. Does that make AMD Zen's implementation of AMD64 a RISC machine?
> The issue with x86 CISCness has always been with having to deal with byte-aligned variable width instructions.
ARM kind of has this instruction set called "Thumb2" you probably should get familiar with. Which, by the way, all ARMv8 chips still support if you put them into AArch32 mode.
So yeah, all modern ARM chips have a variable-length decoder implemented. It's part of their design. It turns out that a variable-length decoder is really, really good for code density. You can compress more data to fit into the limited L1 cache when it's variable length.
----------
EDIT: It's really, really difficult for me to consider ARM a real RISC machine. I'm sorry. It supports all of the PDP-11 addressing modes for goodness' sake. (post-decrement, pre-increment, etc. etc.).
Or is "LDR R0, [R1, r2, LSL #2]" really a RISC-style instruction?
The RISC vs CISC debate is dead, and has been for decades. ARM just takes good designs, just as all the other CPU designers do. If it's a good design, it goes into the chip. ARM even copies the CISC-style "shadow registers" approach to data dependencies. Intel Skylake has over 200 "shadow registers" internally that it assigns to RAX / RBX / etc. etc., and ARM cores do the same. The ACTUAL hardware registers no longer match the ISA on modern ARM designs.
> On the contrary: the compiler and programmers have performance expectations. Just as modern compilers/programmers expect macro-op fusion between x86 cmp/jmp pairs, modern ARM compilers/programmers expect fusion between AESE and AESMC.
I understand that compilers and programmers have expectations. I just don't see how that determines the CISC-ness of an ISA specification. (Again: see my RISC-V example.) That said, it's a bit of a side argument.
> Does that make AMD Zen's implementation of AMD64 a RISC machine?
No, it does not. The fact that you have different pipelines for different parts of the ISA enables maintaining a RISC-like architecture. That doesn't mean that it is RISC. For me, the complexity of the decoder is the determining factor.
> ARM kind of has this instruction set called "Thumb2" you probably should get familiar with. Which, by the way, all ARMv8 chips still support if you put them into AArch32 mode.
I thought Thumb2 instructions can either be 16-bit or 32-bit wide, like RISC-V compressed instructions.
x86 allows a free mix of 8-bit, 16-bit, 24-bit, 32-bit, 40-bit (and longer) instructions.
As a hardware designer, the former is minor decoder implementation nuisance. The latter is a nightmare.
I was not aware that THUMB2 allowed byte-aligned instructions of variable length.
> EDIT: It's really, really difficult for me to consider ARM a real RISC machine. I'm sorry. It supports all of the PDP-11 addressing modes for goodness' sake. (post-decrement, pre-increment, etc. etc.).
Nothing is ever black and white. If somebody considers mixing 16-bit and 32-bit instructions CISC, then CISC it is. My bar is just at a different level.
> Or is "LDR R0, [R1, r2, LSL #2]" really a RISC-style instruction?
I don't think the point of RISC ever was a minimal instruction set. Heck, there have been (theoretical AFAIK) Turing-complete single-instruction computers. Is anything more than that CISC? Of course not.
A more useful distinction is perhaps "death to microcode". If you have an AES instruction that is implemented with microcode running on the "normal" ALUs: yes, CISCy. OTOH, if your chip has hardware for doing AES, adding AES instructions for using that hw doesn't sound particularly "un-RISCy" to me. Then again, many ARM chips apparently microcode some instructions, so meh..
Power9 and ARM both have register renaming and micro-ops. In fact, the SVE extension to ARM provides variable-length vectors, so that the "inner loop" of SIMD compute is stored entirely in micro-code space and independent of the ISA.
ARM-SVE seems very non-RISC to me.
----------
The only thing that seems common to all RISC designs is the load/store architecture, which x86 implements under-the-hood with microops now.
> Power9 and ARM both have register renaming and micro-ops.
And? That's an implementation detail rather than a feature of the ISA. Then again, to some extent so is microcode, so I'm contradicting myself. Ugh. Well anyway, although these days there seems little common ground in what makes a design RISC, CISC, or whatnot, I think you'll be hard pressed to have much support for the idea the reg renaming or micro-ops would be such a defining feature.
> In fact, the SVE extension to ARM provides variable-length vectors, so that the "inner loop" of SIMD compute is stored entirely in micro-code space and independent of the ISA.
Huh? I thought the point was just that the vector width is not a compile-time constant, but rather there's some instruction like "vlmax foo", which calculates min(foo, implementation vector length), and then you use that as the loop increment rather than a compile-time constant.
Not that the "inner loop" is stored in micro-code space (what does that even mean?).
> ARM-SVE seems very non-RISC to me.
To an extent I agree, but I'd say the most un-RISCy thing of SVE is not the variable length but rather the presence of scatter/gather instructions. I mean, in many RISC definitions there's the limit of one memory op per load/store instr., which makes sense as it makes e.g. exception handling much easier. But here with scatter/gather we have memory instructions which not only load/store multiple consecutive values, but potentially load/store a bunch of values all from different pages! If that isn't non-RISC, then what is.
But then again, scatter/gather is awesome for some problems such as sparse matrix calculations. Practicality trumps ideological purity.
If we're talking about desktop processors, it's still irrelevant, honestly; any desktop processor that can compete in a modern world is extraordinarily complex in implementation at every level. POWER, ARM, RISC-V, x86 -- they all try to extract high performance using similar techniques in their execution engines. Out-of-order execution, pipelined decoding, etc. are microarchitecture design aspects and have little to do with the ISA. Front-end decoding, etc. traditionally takes an extremely small amount of area in the overall system, though it's still complex (it's kind of hard to talk about raw "area" metrics when cache is what dominates and is most important.) But the CPUs themselves are incredibly complex, no matter what way you cut it.
If you're talking about embedded CPUs or whatnot, such as picorv32, the frontend is a relatively bigger piece of the "pie" so having a simpler ISA is nice. But really, verifying any CPU design is complex as hell and the decoders/frontends are not the biggest problem (even in RISC-V a source of bugs I've seen float around and even fell into myself, for my own emulator, is the semantics of JALR clearing low bits, for instance -- but that's not a decoding problem, it's a semantics one, flat out. Also, let's be real here, the whole memory-model approach in RISC-V where there's a spectrum of extensions adding TSO on top of weak memory ordering etc is nice and cool but I'd hardly call it fitting in the form of what people think of as a "simple" RISC CPU. That's not even getting into the array of other extensions that will crop up -- bitmanip, a crypto extension will almost certainly pop up, the vector extension, etc etc...)
Programmers love speculating and talking about the ISA and attaching words like "RISC" and "CISC" to everything. Everyone says x86 is a RISC not a CISC because "microcode" or whatever, but honestly who cares? Maybe it is, maybe it isn't, but ultimately it's just a superficial moniker for what you're fundamentally interacting with, and nobody who designs CPUs thinks this way anymore. It's a descriptive moniker from a bygone era, when today most systems have converged very closely in many of their core design decisions, and most of the differentiating features are things like various security extensions, interconnect support (because the interconnect is king), and software support. For things like designing CPUs or formal verification it's a small part of the overall job and you have about 9,000,000 other problems on your plate.
CISC chips provide the fastest hardware and most highly optimized software currently, which is arguably more tangibly beneficial to most users than a formal proof. This is an empirical conclusion, of course, and isn’t evidence that RISC couldn’t be as beneficial.
Is that really true? Last I heard POWER, a RISC, held the single-threaded speed record and various RISCs had been trading for the single-socket throughput. But of course POWER is very expensive and power hungry so most people at the high end use x86.
That's not what RISC means. The fact that it's using a different microarchitecture on the inside doesn't in anyway eliminate the complexity and challenges that comes from the ISA, not least the complexity for the decoder and branch predictors.
What Intel (and IBM's Z9) has demonstrated is that with enough engineering and silicon, you can still make complexity go fast. What RISC (like RISC-V and AArch64) enables is the same kind of microarchitecture but with far, far lower complexity (= time to market, design team, etc). Intel is still doing really well because the physical design (custom cells) is hard and takes teams that are hard to come by.
Things are very interesting right now and for the next few years.
Load/store vs. register–memory can still be relevant. Moreover, x86 still generally allows sloppier code to perform better, which practically means that a lot of non-x86 relative performance depends on the compiler. The rise of better compilers has changed the calculus here substantially.
It makes a difference. x86 decode is difficult. The trace cache helps decode, but with diminishing returns. As for consistency, it makes things difficult: it increases the complexity of speculation and coherence.
But honestly, the way forward is quite obvious. Cores with self managed caches and ideally without coherence.
> The second area is memory consistency guarantees. x64 has relatively strong guarantees, allowing simpler assembly code but at the cost of a more complex memory subsystem between cores. A64 has much weaker ordering guarantees, which saves on hardware complexity, but requires the programmer (and/or compiler) to insert memory fences whenever stricter ordering is needed.
I'm not a fan at all of weak memory models. Especially when it results (is it caused by that? probably?) in atomics being crazy slow compared to those of state-of-the-art x86. Especially since multicore is crucially important and will only become more so, and there is no SW solution to HW providing slow atomics.
x64's changes to memory ordering guarantees that would theoretically give it more room for optimization compared to x86 are purely hypothetical. In practice, so much real-world code was written for and battle-tested on x86's extremely generous ordering that Intel does not avail itself of performance gains it could realize at the risk of completely breaking traditional software (even when not running in x86_64 mode).
Ok, where is the benchmark that shows the performance parity? All I've ever seen is comparisons to low power mobile chips. And then extrapolation from there.
You can't just extrapolate and assume your 2W (or whatever) phone CPU will be like a 95W desktop if you just stick on a heatsink & fan and feed higher voltage & clocks to it.
If it were that simple, the CPU manufacturers could fire a whole lot of engineers.
It's like, you know, my Honda Accord is reaching performance parity with your Bugatti. (If I extrapolate based on how much I think I'd get performance by sticking in a big turbo and new exhaust, intercooler, and higher RPM redline. It's that simple, right?! No, actually it's not..)
“In the space of one hundred and seventy six years the Lower Mississippi has shortened itself two hundred and forty-two miles. That is an average of a trifle over a mile and a third per year. Therefore, any calm person, who is not blind or idiotic, can see that in the Old Oölitic Silurian Period, just a million years ago next November, the Lower Mississippi was upwards of one million three hundred thousand miles long, and stuck out over the Gulf of Mexico like a fishing-pole. And by the same token any person can see that seven hundred and forty-two years from now the Lower Mississippi will be only a mile and three-quarters long, and Cairo [Illinois] and New Orleans will have joined their streets together and be plodding comfortably along under a single mayor and a mutual board of aldermen. There is something fascinating about science. One gets such wholesale returns of conjecture out of such a trifling investment of fact.” -- Mark Twain
Intel and ARM are just approaching the same points from different directions, that's all.
We can equally say that you can't extrapolate and assume that your 45W x86, with lowered clock and voltage, will fit in the power budget of a mobile phone and still give good performance. But all these companies are throwing a lot of engineering effort at it and making solid progress. There's not some reason why x86, the architecture, should be faster than ARM. Intel has been enjoying a process advantage for many years now, and an enviable R&D budget, but that R&D budget is spent fighting against diminishing returns and in the meantime, competitors are catching up. As long as everyone can buy roughly the same node (a big "if", but looks like we're close enough), diminishing returns will bring everyone closer to parity.
But, benchmarks show this a bit better. Geekbench (I'm looking at Single Core):
iPad Pro 11-inch (iPad8,1) has the A12X at 2.5 GHz, score ~5000.
iMac 27-inch retina (iMac18,3) has the i7-7700K at 4.2 GHz, score ~5700.
This is not a cherry-picked comparison... this is just a comparison of whatever happens to be top of the line in both categories. Note that multi-core benchmarks will paint a slightly different picture, but that's very natural, since you can get a Mac with 18 core Xeons. Presumably, adding more cores to a mobile processor when you switch to desktop and can handle proportionally higher TDP, while not trivial, is not especially difficult either.
I assume that different benchmarks give different results as well. This is just one benchmark I know has results for both platforms.
> Presumably, adding more cores to a mobile processor when you switch to desktop and can handle proportionally higher TDP, while not trivial, is not especially difficult either.
Multi-core scaling past a relatively low number is extremely difficult. AMD & Intel both have complex interconnect fabrics to handle this. You can see an example of this with Intel's shift from a ring bus to a router mesh design: https://www.anandtech.com/show/11550/the-intel-skylakex-revi...
Or AMD's new chiplet design with Epyc 2 of the IO die + core dies connected via InfinityFabric.
> iMac 27-inch retina (iMac18,3) has the i7-7700K at 4.2 GHz, score ~5700.
Be careful with those clock speeds because the Intel Core i7-8700B @ 3.2 GHz shows a nearly identical score.
And just below that, supposedly, the Intel Core i9-8950HK @ 2.9 GHz comes in just a bit lower at 5348.
And the Intel Xeon W-2170B @ 2.5 GHz scored 5100. Higher than the A12X at the same clocks. That'd suggest Intel has an IPC advantage.
Of course the reality here is that the clocks listed on that page are complete nonsense. Those aren't the clocks the CPU was running at while running the benchmark, which is a hugely important data point to have here given the variety in boost frequencies at different thermal/power situations. The i7-7700k wasn't running at 4.2ghz, it was probably running at 4.5ghz. And the i7-8700B wasn't running at 3.2ghz, it was probably running at 4.6ghz.
So depending on the actual clocks during the run the resulting scores may be more or less impressive. Similarly we don't know the actual power draw during those runs, which since we're talking about power efficiency here matters quite a bit.
You make a great point. I remember when I got my 12.9" iPad Pro (gen 2), and the graphics were fast as hell - Waaaaaay faster than my poor Mac Mini could do.
Based on that (and yes, both devices are now "old"), I'd trade my Mac Mini performance for the performance of my iPad without hesitation. Bring it, ARM !
I was going to post a text-heavy reply to you about that, but I decided not to. Essentially, my family got an iPad 4 and bluetooth keyboard case. A few years later, when Apple was prepping for their iPad Pro and positioning tablets as desktop replacements, I found that the work that I did was much easier on the iPad than it was on the Mac Mini (quad-core) that we had, including doing text editing and stuff of the sort.
Personally, if someone put an ARM laptop in front of me with a workable, non-spyware OS, I would take it in a heartbeat.
Just going to nitpick a bit here and point out that Apple has never used ARM's graphics. They started with imgtec and then Apple took over at some point, using largely the same architecture though.
Meanwhile your Mac Mini was using some form of Intel's integrated, which is notoriously trash tier. The very newest ones are OK sort of, but when push came to shove even Intel opted to ship AMD's Vega graphics in their own NUC instead.
Geekbench is, IIRC, a pretty non-representative benchmark. Nowhere is this more clear than looking at Samsung Exynos Geekbench results vs. their real-world performance. It's okay for comparing iterations on an architecture (e.g. A11 vs A12, SD845 vs SD855), but it falls apart a bit otherwise.
Five days a week you drive fifteen miles to work over roads constrained by traffic, getting up to 60MPH for two minutes and spending the rest of the trip at 0-30. On weekends you sometimes drive two hours to visit friends or go to a special event.
The fact is that the Honda is much more appropriate for your life than a Bugatti; 5 days a week you don't do anything that the Accord can't do, and on weekends you could take the money you saved by not buying the Bugatti and rent a Corvette or a Ferrari, and still come out ahead.
Most desktops and laptops spend 5 days a week being idle and waiting on RAM, SSDs, spinning disks, or network data... or worse, waiting for human input. During that time, you might as well have a cheap ARM. When you ask for peak power, mostly you could rely on an outboard GPU.
My company does some embedded work. The problem with your analogy is that edge-cases drive CPU purchase decisions far more often than you would normally think.
A simple example: You have designed an ATM using a lower powered ARM CPU. 99.9% of the use-cases never tax the CPU beyond 30%. However, in specific cases where the ATM must simultaneously access the bill dispenser, printer, and note acceptor the hardware interrupts overwhelm the CPU and leads to 2-3 secs of "CPU lock". After those 3 secs, everything returns to normal.
The problem is that the bill dispenser protocol only allows for 1000ms in delay in ack to messages, thus when this situation occurs, the lag is longer than the protocol allows, so it errors out and FUBARs the entire transaction.
In this case, even though this only occurs .1% of the time, the existence of this edge-case will mandate that you not use this specific CPU.
Spoken another way: The ongoing issue of certain edge-cases will cost much more in the long term, than the extra $50 for a different CPU.
Great. But you seem to have missed that I was just providing an example.
However, to further the requirements. These days many people throw around the term "ATM" sort of indiscriminately. So, they call everything from a simple 3rd party cash dispensing kiosk to full-on bank automation centers "ATM's".
The problem is that for a simple cash-dispensing kiosk, you may be totally correct. Linux + ARM may work fine. However, the more functionality this device is supposed to have (i.e. do 90% of the functions of a real bank teller: deposit checks, cash checks, etc), the more these edge-cases become an issue.
Because of this tight integration with the backend banking system, and regulatory requirements, these edge cases will become magnified (i.e. who says you can use Linux?). Additionally, since all other players in this market also have to deal with all these issues, the $50 difference in CPU cost is totally absorbed into the rather high-dollar price tag associated with this automated teller.
The title's original claim is reasonably supported in the article. I'm pointing out that even if ARM development mysteriously stops making gains against x86 development, it's entirely reasonable to think that Apple might announce that the next revision of OS X has an ARM version that runs on their new 16 hour Macbook Air and their new 14 hour Macbook.
I confess, I hadn't finished reading the original article when I posted my previous comment so I was operating under the assumption that ARM single-core performance was significantly further from parity than it apparently is. My original objection was that an external GPU can't shield you from lag and unresponsiveness in normal, everyday (ie. not designed to take advantage of GPU compute) software, but if the single core performance delta between an A12 and a reasonably recent desktop i7 is genuinely only 20% then that's not an issue.
I really think people put more focus in the RISC vs CISC dichotomy than it deserves. ARMv8 CPUs have microcode, a mixture of instruction sizes, instructions which take more than one clock cycle to complete, and a very large number of instructions overall - i.e. probably doesn't fit into most traditional definitions of RISC.
The challenges that ARM faces while competing with x86 are software maturity, and moving away from low-cost, low-power designs towards larger and more performant designs (which will consume more power and cost more than traditional ARM designs).
I am frequently asked about that. This is what I tell them: ARM feels free to do a complete ISA revision from the ground up once or twice a decade; x86, on the other hand, is still bound to incremental improvement on a frail foundation laid in the seventies. x86 has only had three major revisions: 16-bit to 32-bit, and then to 64-bit (the last one wasn't even done by Intel itself).
x86's biggest weakness is its dependence on Wintel, which precludes the very much needed major ISA revisions.
If you look at atom dies, the overcomplicated decoder and other x86 vestiges take more area than the rest of the core.
The A12X is remarkable in that it gets close to 15-watt Intel CPUs with LESS die area and lower power consumption. And if you remove the useless things like the NPU, DRM stuff, security coprocessor, and other peripherals from the calculation, the comparison will really begin to look dire for Intel.
Adding to that, even though Intel is still on 14nm and the A12X is a 7nm part, the A12X would still win even if Intel did a die shrink to 7nm. And you also have to consider that Intel has squeezed everything in terms of power efficiency out of the 14nm node after 5 years of active development on it, while Apple went with the very first baseline revision of TSMC 7nm.
Moreover, what I hear from the scene here in Shenzhen is that with the A12X Apple did not really put much into power saving: the A12X has nothing comparable to Intel's complex runtime power management, power and clock gating, separate power domains, and on-package smart DC-DC converters. If Apple commits itself to squeezing more power efficiency from its chips with equal zeal, I believe they'd add an additional 25-35% to their power advantage.
> If you look at atom dies, the overcomplicated decoder and other x86 vestiges take more area than the rest of the core.
The Atom is a bit of an edge-case since it has barely any cache (and the performance is exactly what you'd expect from that), yet it still takes up a significant amount of the die; in all other CPUs, the caches are far bigger.
The difficulty of decoding multiple x86 instructions in a single clock cycle is a small issue in the context of a big out-of-order core. But it's a millstone around Intel's neck when they try to make a simple superscalar in-order core like the Atom.
The biggest difference between x86 and ARM for a big out-of-order machine is probably the much stronger memory ordering guarantees that x86 provides. There are good points to be made on either side of that but it's clearly a very important consideration.
The second biggest difference is that, in a big OoO core, the cost of decoding the instruction stream is trivial on ARM as opposed to merely small on x86. Something like 5% of the core's power budget on a modern x86, I think? One upshot of that is that on Intel chips the decode stage is balanced with the overall execution width so that they are only seldom decode-limited. Whereas the attitude on high-end ARM chips is that you might as well just throw in more decoders than you think you need so you can stop worrying about it.
But overall ARM has traditionally been the CISCiest of the RISC architectures and x86 has been the RISCiest of the CISC architectures,[1] making me think they might have both ended up successful by being at a sort of happy medium.
> x86 has been the RISCiest of the CISC architectures
I agree completely if you mean "RISCiest" as in the complexity of decoding instructions. Look at the opcode map of a VAX, for example --- instructions were simply added where they fit, so there's no easily discernable pattern in the bits of the first opcode byte. x86 has an octal structure to its instructions[1] and the first quarter of the opcode map (000 through 077 octal) contains nearly all the commonly used ALU operations.
The 8086, along with the Z80 and its predecessors, had to be implemented on a single chip, which put constraints on how complex its instruction encoding could be; that could explain why a regular structure (but not fixed length) was adopted instead of a more "true CISC" approach of assigning opcodes arbitrarily as instructions were added and microcoding everything.
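To make that octal structure concrete, here's a minimal sketch covering just the classic ALU block (opcodes 000-077 octal, columns 0-5 of each row):

    #include <stdint.h>
    #include <stddef.h>

    /* In the first quarter of the one-byte opcode map, bits 5-3 select the
       ALU operation and bits 2-0 select the operand form (r/m,r8 ... eAX,imm).
       Columns 6 and 7 of each octal row hold other things (segment push/pop,
       the 0F escape, and so on). */
    static const char *x86_alu_op(uint8_t opcode) {
        static const char *ops[8] = {"add","or","adc","sbb","and","sub","xor","cmp"};
        if (opcode < 0x40 && (opcode & 7) < 6)
            return ops[(opcode >> 3) & 7];
        return NULL;  /* not part of the basic ALU block */
    }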
1. L1, L2, L3 cache sizes. Since these are implemented as S-RAM, they can drastically increase chip area, and hence cost.
2. ISA Acceleration of certain types of algorithms - like encryption, codecs and AI.
There's absolutely nothing inherently lacking in the ARM ISA or architecture to impact #1, and ARM has consistently been adding instructions for #2, including SIMD support.
Apple chips perform so well because Apple can afford more cache for the same dollar cost, as they make their own chips. Android phone manufacturers have to pay more so that Qualcomm can make its profit.
The A12X's advantage comes not only from gigantic caches, but from how efficiently the core uses them. Anybody well versed in microarchitecture design will tell you that at some point increasing cache size will actually begin to slow you down.
Cache topology, address lookup logic, prefetch, and dark magic like smart cache invalidation circuits all matter a lot. The goal is to flush as few bits as possible on a cache miss, find stored values in the cache faster, and prefetch efficiently. Only once you can do that does the cache size/core complexity tradeoff begin to work.
Any normal CPU benchmark will be measuring a non-accelerated workload, so measurements showing the A12X with a true lead thanks to its wider pipeline, bigger caches, and smarter cache management are very much expected.
Agree. But the presumption is that eventually, cache management algorithms will reach parity with the competition/bleeding edge. Any and all parameters that can be tweaked will be. All top CPU architecture companies hire from the same pool of PhDs/MS grads from top schools, after all.
Performance commoditization is truly upon us. ARM vs. x86 is quite literally a battle of existing software binary support. Things like WASM and JIT will further obviate the need to worry about ISAs when making product design choices. As of 2019, you can't go wrong picking an ARM-based design, provided you pick an SoC with sufficient cache, etc. for what your use case demands.
On your position, I don't agree. Purpose-built cores are here to stay. There is certainly no definite optimal cache configuration. And even inside the ARM space, approaches vary dramatically (in part because cache algorithms are a patent minefield second only to wireless:) Samsung went largely to non-deterministic algorithms for its OoO machinery, cache, and branch prediction; Qualcomm more or less went the Intel way by making fewer, but faster, execution units and adapting the cache for that; ARM always had size and cost in mind; and Apple did as aforementioned.
Even very minimal changes to the performance profile of the execution side may turn things upside down for the people engineering the caches.
I think the stakes are so high, and funding so deep, that eventually cache management will reach parity, regardless of the choices made to get there. I'm not an insider to dissect algorithms academically, but if in the worst case one needs to license the competition's patent, that'll happen.
Keep in mind that the cost of non-performance is irrelevance in the market, and that's a non-option for the likes of Qualcomm, Samsung, etc.
Maybe it's a competitor's, but more likely it's a patent troll's, of which there are as many in this area as I said there are in wireless. Cache algorithms today are said to be designed very specifically to work around known traps, even if it means going for suboptimal solutions.
About to go back to Shenzhen this year. Been canvassing all kinds of weird places for 14 months now as part of a job. Angola, Kenya, Malaysia, Kazakhstan, Pakistan (just trying to sell a freaking water vending machine there, nothing more.)
I'm asking because I am very curious about the industry.
You mean there's an upper limit to performance benefits of increasing cache sizes. The more important comparison is between a desktop CPU and ARM mobile. You'll quickly notice performance parity when cache sizes are increased on the mobile CPUs. And also notice Qualcomm's 845 and earlier lagged Apple's CPUs in cache size.
The performance gains certainly plateau, but mobile caches weren't even trying to approach this limit, till now - explaining a big part of the performance gap between desktop and mobile
> You mean there's an upper limit to performance benefits of increasing cache sizes.
A cache is a latency-hiding device over a working set of unknown size. The latency of the cache depends on its size. So picking a cache size embodies a huge set of unknowns (the programs that will be run, their working set sizes, latency and size of main and swap memories etc.). In any case, for a given workload you could draw up a graph of cache size (and implied cache latency) vs throughput / latency and see that there is a sweet spot (larger cache does not substantially increase performance, but rather degrades it, due to increased cache latency).
Furthermore, when you do a shrink, you could put the implied performance gains into making a bigger cache of the same (absolute) latency, or making a cache of the same size with lower latency; the latter might be implicitly required since you now probably have a faster core that wants a cache with lower absolute latency.
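A toy average-memory-access-time model illustrates the sweet spot described above; it's only a sketch, and every number in it (hit latencies, miss rates, miss penalty) is invented for illustration, since real curves depend entirely on the workload.

    #include <cstdio>

    // AMAT = hit latency + miss rate * miss penalty.
    // Growing the cache lowers the miss rate but raises the hit latency,
    // so beyond some size the total gets worse again.
    int main() {
        struct Point { const char* label; double hit_cycles; double miss_rate; };
        const double miss_penalty = 200.0;  // cycles to main memory (assumed)
        const Point points[] = {
            {"smaller, faster cache", 11.0, 0.060},  // AMAT = 11 + 12 = 23
            {"medium cache",          13.0, 0.045},  // AMAT = 13 +  9 = 22  <- sweet spot
            {"larger, slower cache",  19.0, 0.040},  // AMAT = 19 +  8 = 27
        };
        for (const Point& p : points)
            std::printf("%-22s AMAT = %.1f cycles\n",
                        p.label, p.hit_cycles + p.miss_rate * miss_penalty);
        return 0;
    }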
From a CloudFlare article: "In our analysis, we found that even if Intel gave us the chips for free, it would still make sense to switch to ARM, because the power efficiency is so much better."
Still no real benchmarks from the Qualcomm ARM boards Cloudflare is using, or availability to anyone else except Cloudflare. (Unlikely to ever happen now with the CPU being discontinued; seems pretty damn unlikely it was ever really very competitive.)
Impressive: "With the NGINX workload [Falkor/AMD] handled almost the same amount of requests as the Skylake server. [Falkor/AMD] managed to get 214 requests/watt vs the Skylake’s 99 requests/watt"
It's not entirely clear what the power consumption is actually measuring. If it's not whole-system draw then it could be highly misleading. What is considered "CPU power" and what isn't can vary (does it include the memory controller? Or the PCI-E controller? etc.)
Savings of 60W per CPU for equivalent performance work out to, say, $60 per year in electricity.
However if your data centers are capacity constrained by total power supply, then you can double the capacity of your data center by going for a CPU that is twice as efficient (2x is the approximate difference that they measured between AMD and Intel).
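A back-of-envelope check of that $60/year figure, as a small C++ calculation. The electricity price is my assumption, not something from the thread; at higher commercial rates, or once cooling overhead (PUE) is folded in, the number grows accordingly.

    #include <cstdio>

    int main() {
        const double watts_saved  = 60.0;                                 // per CPU, from the comment above
        const double hours_per_yr = 24.0 * 365.0;                         // 8760 h
        const double kwh_per_yr   = watts_saved * hours_per_yr / 1000.0;  // 525.6 kWh
        const double usd_per_kwh  = 0.11;                                 // assumed rate
        std::printf("%.0f kWh/year -> about $%.0f/year per socket\n",
                    kwh_per_yr, kwh_per_yr * usd_per_kwh);                // ~ $58
        return 0;
    }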
CloudFlare have unusual needs where ARM is very competitive, unlike desktop usage. From article:
“Every request that comes in to Cloudflare is independent of every other request, so what we really need is as many cores per Watt as we can possibly get,” Prince explained. “The only metric we spend time thinking about is cores per Watt and requests per Watt.” The ARM-based Qualcomm Centriq processors perform very well by that measure. “They've got very high core counts at very lower power utilization in Gen 1, and in Gen 2 they're just going to widen their lead.”
This. The ARM server-class chips were mostly developed by vendors of networking chips - Cavium, Qualcomm, Broadcom - for things like NAT boxes, programmable routers, Deep Packet Inspection, etc. They have a very high core count and proportionately less cache than standard server chips, but have been morphing to tackle more standard server loads.
So the ideal arch for that would be a non-CPU raw state machine, like the Silego GreenPAK5 ASM cores, but they would be better off asking Dialog to remove all the peripherals, make the state machines a bit bigger, and pack thousands of ASM cores into one chip. And then sign an NDA, get the low-level proprietary format docs, and write the state machine code generator in Haskell or Coq.
Sounds fun, but staying with your hypothetical scenario: all that complexity, those proprietary cores and NDAs would essentially tie them to the chip supplier, stifling the evolution of the whole CF architecture in the long run.
If the race is for requests per watt alone, then you're probably right, but there's always real-world grittiness that needs to be addressed. That's how Google succeeded, right? By leveraging common platforms and hardware paired with their highly specialized software.
For the average desktop PC user (excluding gamers), the electricity costs for the CPU itself are probably largely irrelevant. A well-built modern PC will use less than 10-15W while idle (i.e. while you are reading this), while a typical 27" display will consume 30W or more.
> For the average desktop PC user (excluding gamers), the electricity costs for the CPU itself are probably largely irrelevant.
Using average German electricity prices (which are some of the most expensive you can have) of 0.33 EUR / kWh, even if you are playing six hours a day with an average power consumption of 400 W (which you can only reach using a high-end PC and a big, bright monitor), you would still pay less than one Euro per day on electricity for that. The cost of the hardware makes that operating cost pretty much irrelevant.
You could strap together 3x ROCKPro64s (6 cores, 4GB, $80 each) to make a half decent docker swarm or kube, if you've got the right kind of parallelizable dev workload.
However, I'm not sure how it would perform compared to my somewhat obsolete AMD Athlon II X2 260, and other posts have convinced me the power savings wouldn't necessarily be significant enough to pay for it.
From what we are seeing with the AV1 video decoder dav1d, where we wrote a lot of assembly by hand, the A12X can do 40 fps where a desktop can go beyond 120 fps (4 cores).
So, it is getting closer than ever before, but we're still quite far from closing the gap.
This sort of very specific benchmark is never particularly useful, as single instructions on one architecture or another can yield order-of-magnitude differences.
The T2 security coprocessor in the Mac -- a minimalist ARM implementation to manage a few very specific parts of the Mac -- does HEVC encoding (which is dramatically more complex than decoding) thirty times faster than the Intel chip it sits beside. I can't say it's a 30x faster chip, however.
High TDP Intel chips definitely are much more powerful than Apple's A-series chips right now. It will be interesting to see what Apple can do with a larger power profile and active cooling, however. That's theoretical, but they should have an enormous amount of headroom to exploit.
It is important to realize that the T2 coprocessor is just an A10 CPU which didn't meet the binning criteria to be included in an iPhone 7 or 2018 iPad.
It happens to have hardware-accelerated HEVC encoding.
While it's clearly re-purposing of their core IP, I think the A10 assumption was a mistake on the part of iFixIt, who then corrected their mistake realizing the core size was not the same (and it definitely isn't just binning). The HEVC performance is still surprising though because they get much better performance out of the T2 (whatever its lineage) than they get out of a 6-core Intel processor, the Intel integrated GPU, or even the AMD dedicated GPU. That's the power of optimized silicon.
No, they said "some random chip" (just to the right of the T2) is too small to be an A10. [1]
I did once throw a photo of a mac motherboard into an image editor and estimated the T2 package dimensions. They were a very close match for the A10 Fusion, within the margin of error.
But that mistake is what led to the otherwise completely uncited claim that it was a binned A10.
Clearly Apple copy-pasted their core, but the T2 serves such a novel purpose in the Mac, and has some unique performance requirements, that it seems very unlikely that they sourced it from the A10 reject bin. Much more likely they simply used one of their ARMv8 cores with some custom IP particular to the Mac as they slowly moved the line to ARM. This is all just speculation though.
Taping out a new SoC design costs Apple hundreds of millions of dollars and years of engineering time. Reusing those pieces of A10 silicon that had defects affecting only part of the chip is essentially free.
Every other SoC/CPU manufacturer does this, selling the same piece of silicon at different clock speeds or with half the cores disabled.
The fact that Apple typically doesn't do this binning actually puts Apple at a disadvantage cost-wise.
> but the T2 serves such a novel purpose in the Mac, and has some unique performance requirements
Sure, but the performance requirements are all a subset of what the A10 can already do. There is no need for a large GPU like the A10 has, the T2 only drives a tiny, low animation display. Yet the GPU takes up ~35% of the A10 die. There is also no need for such powerful CPU cores.
If Apple were designing a custom SoC for inclusion in Macs to meet those requirements, it should logically be much smaller than the A10.
Yet we have this T2 chip which is basically the same size as the A10.
Binning is actually the exception. You can't find binned Snapdragon 855s, for instance. Or most Intel chips. Rejects are generally destroyed.
When Apple moved to the T-chips they also moved several system management chips to it as well (e.g. copy-pasting the design alongside the A10).
Software on a general purpose CPU is great, but it simply isn't nearly fast enough for many system management functions. For >4GB/second DMA to go through it (for security oversight, encryption/decryption, etc), for instance -- that simply eliminates it from being a binned A10 at the outset, which was never designed around such a high performance need. The specialized display controller for the touchbar is absolutely nothing like the very purpose-developed display controller in the A10, either. These are all differences that would make it a terrible hack for them to use an A10.
Until we have imaging of the T2 innards we can't say, but I'd say with 99.999%+ certainty it is simply not possible for it to be a binned A10. An A10 single core integrated on some new IP (with purpose-suited blocks that perfectly fulfill their roles concurrent with the general processor), sure, but not an A10.
Because that's the name of the chip in the Mac that Apple developed in house. They had that IP hanging around and put it in the only place they could - the security coprocessor.
See sibling comment that the security coprocessor is a repurposed A10. And then to answer the next question, because the HEVC encoder for iPhones has been optimised, and they can likely just use the same code that already exists for iOS.
I have wondered whether they should be using the GPU in the A10/T2 too, instead of the Intel integrated graphics. Would probably perform rather well!
I am not a hardware engineer, but my guess is that making a high-performing single-digit-watt processor may take different skills than making a 15-28-45-90-135-160W (or more) sustained-TDP CPU.
My only point is that Apple making the best low-power CPUs should not necessarily imply they can make good high-power CPUs. They may have to build up the competency over time just as they did in the initial iterations of the A series.
Once you're making wide out of order application processors the skills are pretty much the same for either. But it does take quite a while to do a new architecture from scratch and you would almost need to do just that to re-design the A12 for such high power targets.
On workloads that scale well with core count (like video) that should be trivial. You go with 4x the cores and 4x the power and you're there. The software industry has been struggling for decades to make it easier to write multi-threaded code and has been making inroads only recently for software that isn't trivial to parallelize.
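A quick Amdahl's-law sketch of why the "4x the cores and 4x the power" reasoning works for video but not for code that parallelizes poorly. The parallel fractions below are invented for illustration only.

    #include <cstdio>

    // Amdahl's law: speedup(N) = 1 / ((1 - P) + P / N),
    // where P is the fraction of the work that parallelizes.
    double speedup(double parallel_fraction, int cores) {
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
    }

    int main() {
        const int cores = 4;
        std::printf("video-like (P = 0.99): %.2fx on %d cores\n", speedup(0.99, cores), cores);  // ~3.88x
        std::printf("typical app (P = 0.70): %.2fx on %d cores\n", speedup(0.70, cores), cores); // ~2.11x
        return 0;
    }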
You'll find large variance in anything if you drill down to a single specific enough test case. For dav1d I imagine comparing performance between architectures is largely the same as asking if the SIMD is extremely wide on them which is largely irrelevant outside of media benchmarks.
I disagree. SIMD can be used in numerous cases, and not only in games/video. And to the end user, it does not matter which part of the CPU/GPU is used.
Just where do you think the SIMD instructions execute? It’s part of the CPU architecture like everything else and the quality of the implementation varies so comparisons are not only possible but critical: there’s a long history of apparent optimizations not panning out due to shared internal resources, mode switching overhead, etc.
If someone writes a program with AVX instructions and one without, saying the second is one third the speed is completely disingenuous.
SIMD instructions aren't the difficult part of matching Intel's CPU architecture. Fast cache hierarchies, prefetching, branch prediction and out of order instructions etc. are the parts that have to be matched.
Well of course SIMD performance is part of the "hard part", because it implies things like very wide cache access, including perhaps unaligned access, high bandwidth between cache levels, wide register files, etc.
Plus a sane ISA extension strategy that actually gets people to use your SIMD instructions helps a lot (Intel hasn't done well here, but so far neither has ARM e.g., with almost non-existent support for SVE).
SIMD performance is one place that Intel is still fairly far ahead of both AMD and the ARM competition (with Apple far out ahead in that group).
I'm not sure what a sane ISA strategy has to do with it here; the parent was comparing a SIMD-accelerated program to a non-SIMD program and using that to compare two CPU architectures. Does that seem right to you?
Perhaps not, but I was referring only to the assertion that wide SIMD is not in the "hard part" of designing a CPU.
I happen to agree with the GP that comparing SIMD accelerated to not can give a misleading picture, and so deserves to be noted - especially when one party has relevant SIMD instructions but they weren't used for some reason.
This depends entirely on what you’re measuring: if my goal is to make my work faster and a compiler can generate something which runs faster, most people don’t feel that it’s not fair to use extra CPU features - they just want to compare the results which they’ll actually get.
I remember people whining in the 90s x86/Alpha/POWER comparisons that some compilers were aware of fused Multiply-Add operations, too, but everyone who was trying to make a purchasing decision just tuned them out since they wanted to know realistic FLOPS/$ ratios.
There's quite a few fallacies in this article. The worst offenders being that performance scales with clockrate, that different CPU designs can scale to the same clock rate, and that power use/heat is anywhere near linear in respect to clock rate.
However it is still impressive that the A12X could hit 80% of a fairly aggressive Intel design at 60% of the clock.
Certainly active cooling would help ARM chips sustain the performance they can get for short periods of time without active cooling.
So Arm is closing the gap and the A12X is a pretty impressive chip. Certainly plenty for many use cases met today by Intel desktops/laptops.
But hitting the same clock speeds Intel is using might well require Apple to add an extra stage to the pipeline and/or run the cache at a lower fraction of the CPU clock. Not to mention that increasing clock speeds without decreasing memory latency will also hurt IPC. Any of these changes would hurt IPC and make it that much harder to reach 100% parity with Intel.
I don't think he is assuming linear scaling with any of those factors.
He is saying that despite using a small fraction of the power, and running at a significantly lower frequency, the Apple chips have only a small deficit on Spec2006 (and, not noted, but for several benchmarks, no deficit at all - a lot depends on whether SIMD plays a big role).
Under that scenario, it is reasonable to assume that if you, say, tripled the power budget and adjusted the frequency to match the larger power budget and "full size" cooling solution, there would be a jump in performance. They are not claiming it would be 3 (power) x 1.5 = 4.5x faster, i.e., that A-series chips would be many times faster than Intel chips.
I think it is entirely reasonable to assume it would be in the range of 20% or more, however. Certainly Intel chips scale up and down based on exactly those factors.
In my opinion, in the case where apples-to-oranges comparisons are possible (low TDPs), Apple's newest chips are already faster than Intel chips. In a high TDP scenario, the same would be true with basic re-targeting of voltages, frequencies, etc (no uarch changes). We don't have the latter yet, and by the time we do we will probably see Ice Lake and Sunny Cove from Intel, so the pendulum may swing back the other way.
s/ARM/Apple/ really. They have a great chip development team, but they are focusing, obviously, on mobile and do not seem to be interested yet in desktop, and even less in servers, except possibly as a byproduct of their mobile development.
It's not an obvious win like the last couple of architecture switches, because of the end of Moore's law. For instance, the PowerPC chips they were initially using ran 68k code quicker in a non-JITing emulator than any 68k they could buy. They had to switch to a JIT with Rosetta, but the same performance distinction still held at the high end during the PowerPC/Intel switch.
Running x86 code faster in an emulator than on a real chip might not ever happen.
And in not too long the x86-64 patents will have expired all the way through SSE4... I think Apple making their own x86 chips is just as likely as switching to ARM.
I'm sure they have prototypes and I'm sure they will consider releasing it if they have a design that works well, but I doubt that a laptop chip will ever be the focus of their main development team.
> There's quite a few fallacies in this article. The worst offenders being that performance scales with clockrate, that different CPU designs can scale to the same clock rate, and that power use/heat is anywhere near linear in respect to clock rate.
I kind of hate the way you throw this out there as if it invalidates the whole point of the article. You sort of walk it back with the rest of your comments, but the damage is done within the first line, overemphasizing nitpicks at the expense of the greater picture. I guess such things should be expected on a forum filled with pedantic engineers... I'm guilty of this myself.
I hope people still take the time to read what is an otherwise interesting, fair minded view that proves the central point it sets out -- that ARM is not an inherently inferior architecture and that recent designs from Apple prove this handily.
I really don't understand the logic behind this article. First you test two chips, one tuned for sustained loads and one tuned for short bursts. You pick tests that prioritize short bursts and do not require cooling, which then of course results in similar performance numbers. The gap in power/cooling requirements is then considered impressive, but then you turn around and extrapolate how much faster or better the ARM chip could be if it had the same level of cooling as the desktop chip. Except this completely defies the logic in the first part of the paragraph that the power and cooling do not affect the performance in the short burst tests. Those ARM chips won't get faster, they will just have the same speed as they have today.
>> Except this completely defies the logic in the first part of the paragraph that the power and cooling do not affect the performance in the short burst tests. Those ARM chips won't get faster, they will just have the same speed as they have today.
I think the logic is that with better cooling, you could have the same ARM chip running at higher voltage and clock speeds, under sustained load, and the result would compare favorably for the ARM chip both on 'burst performance' and 'sustained performance' metrics.
What makes you think this is not true? Is there anything in fast ARM chip designs that makes them only optimized for burst loads and hence inherently unusable for sustained loads? In a sense, you could make the same argument for x86 desktop CPUs, seeing as they are also not able to maintain boost clocks for very long under sustained load.
The article specifically addresses this point: current ARM chips are mostly held back by the passive cooling of phones and tablets, which is a property of the device itself and not of the possible performance you could theoretically get from the CPU.
No, the fact that small, passive-cooled devices have to be thermally limited, and the same chips with better cooling can run at high speeds longer is not "Netburst all over again".
That's all true in the context of trying to boost voltage/clock speed on a chip that is already running close to thermal limits when using a stock active heatsink (which are already huge). And yet, even in that context people are able to overclock parts rated for 3Ghz to over 5Ghz using extreme cooling solutions.
In the context of building desktop-grade ARM CPUs we're not talking about 5GHz overclocks, but about going from passive cooling in a cramped space without any airflow to something more like a laptop or workstation with active cooling. You also don't need to go all the way up to 5GHz; the A12X for example runs at a maximum clock speed that is 40% lower than the base clock of the i7 in the benchmarks the article is referring to, yet it manages to be already quite close in terms of performance.
Look at modern laptops/tablets/mini-PCs using Intel CPUs. The exact same part, with different cooling systems and accordingly set power profiles, is used in different devices, with performance differing accordingly. Variants of throttling (or temporary boost, but that's the same with different names) based on cooling performance happen in lots of compact devices (x86 or ARM), which better cooling can delay or completely remove. Of course there's limits to that, and a 5W part won't just scale up to a good 50W part, but there is room there. These approaches weren't really a thing for Netburst, which was firmly a desktop architecture from the start and got pushed higher and higher.
they have less cache on board, fewer cores, a different integrated gpu; they don't support higher frequency ddr and they have limits on the memory bandwidth.
are you sure you don't want to check in with any of those facts of yours before pursuing further conversation?
modern cpu architectures are built to fit their own constraints maximally. once you start changing voltages and making some part of the cpu hotter than the original spec you might very well find out whole parts of the chip need to be shifted around or redesigned to spread the load differently.
of course a slower chip is better performing watt for watt, the whole point is that the relationship is not linear! that doesn't mean you can just upclock the chip by adding cooling, nor that upclocking a chip won't require a significant redesign.
those parts are already pushing their envelope, or are you implying Apple is specifically wasting money on their chips?
>> those parts are already pushing their envelope, or are you implying Apple is specifically wasting money on their chips?
I would say the envelope Apple is pushing with their designs is currently almost exclusively bound by the working environment their SoC's run in: limited cooling and limited battery. They probably spend more time optimizing their software to make more efficient use of their chips than optimizing their cooling solution, because there simply is no room for fans and airflow.
That does not say anything about how suitable these chips could be with better cooling though. The fact that they are optimized for low power does not mean they cannot run at higher clock speeds, or be redesigned minimally to do so. In fact, that's exactly what Apple is already doing, by using virtually identical variations of their SoC's across iPhone, iPad and AppleTV, running at different clock speeds.
>> are you sure you don't want to check in with any of those facts of yours before pursuing further conversation?
>> this seems a good time to remind how clock speed and chip features are intertwined with yeld and impurities
I don't know why you need to be so dismissive and aggressive in your comments, especially since so far you have not brought up anything countering any of the arguments made by anyone else in this thread.
Maybe you can address the observation already made by dotaro about Intel making literally 20 different variations of the same CPU's, scaling from the ULV end with low clock speeds and limited cooling options, all the way up to the HPC end where clock speeds, TDP, etc. are large multiples of what goes into the ULV parts? What makes you think this is only possible with x86 chips and not with the ARM-based designs Apple uses? Do you think each variation of an Intel x86 chip from the same generation is a completely different design that was built from the ground up to fit that particular use case?
You seem to be stuck on equating having the option of a better cooling solution so you can push the design of e.g. an A12 chip to higher clock speeds and close the already small gap with x86 chips, with going full-scale Netburst: low IPC compensated for by crazy clock speeds and ultra-deep pipelines, by means of nothing more than pushing an imaginary turbo button and calling it a day. Nobody suggested that but yourself.
> Maybe you can address the observation already made by dotaro about Intel making literally 20 different variations of the same CPU's, scaling from the ULV end with low clock speeds and limited cooling options, all the way up to the HPC end where clock speeds, TDP, etc. are large multiples of what goes into the ULV parts?
Just because you can scale down an HEDT- or HPC-focused design to the point of making it run as a ULV chip under very challenging thermals, doesn't mean that the resulting chip will perform very well. We've seen this time and time again with x86 vendors trying and failing to enter the lucrative "mobile" segment. And it's not clear why we should expect a different outcome when mobile-focused vendors try the reverse play, by attempting to "scale up" their existing designs. One size very much doesn't fit all in the semiconductor industry.
Exactly. Part of the reason that Apple's chips do so well against Intel at a given power level is that Apple is operating at the frequencies their chips are designed for, whereas Intel is far away from its sweet spot. You can cover a variety of power targets with one microarchitecture, but you're going to do so less efficiently when you're far away from your design point.
And while I bet you could overclock an Apple core somewhat if you used liquid nitrogen or whatever, it will still have more logic between clock latches than an Intel processor does. That deeper pipelining means that Intel will be able to clock higher than you can for any given process/voltage/temperature combination. Apple has some very talented CPU architects and I'm sure they could design a high-performance chip. But it won't be the same one that runs in iPhones.
If you mean "they have less cache on board, less cores, different integrated gpu; they don't support higher frequency ddr and they have limits on the memory bandwidth" then no, that doesn't address the point. Intel does have different chips with all of those changed. But they also have SKUs with all of those the same but different power budgets and different frequencies. And really things like cache barely use any power at all in the overall context of the chip. SRAM, especially the 8T SRAM Intel uses, is very efficient.
> you can't just 'cool away' leakage currents
You certainly can: the leakage current is going to be proportional to (1 - e^(-qV/kT)), so leakage will tend to go down exponentially at lower temperatures. Yes, I know what you actually meant and you're still wrong there, because if you have better cooling you can accept more leakage, meaning you can use a process with a lower threshold voltage and accept the larger amount of leakage current that results.
> if you have better cooling you can accept more leakage
again, no, better cooling staves off the thermal runaway effect. you can think of it as generating less leakage per voltage unit, which is quite the opposite of accepting more leakage.
higher operating voltage allows for accepting more leakage.
and there's still no indication that cooling a specific chip design (because we're not talking about chips in general here) would prevent thermal issues within the chip, because you can only work on reducing the average surface temperature opposite the pins - or, to put it another way, you can't just pick a single part of the argument and disprove it in isolation, because the whole thermal issue is not a premise, but ties into the whole claim of "we could just make this chip run with cooling and it'll beat down intel's" - which it cannot, because you can't just cool away leakage currents; the chip has to be meant to be cooled and to be run at higher voltage at the design phase.
This part always annoyed me. In particular there was a guy on reddit after the iPad Pro announcement who was absolutely adamant that no one would need a home console because the iPad Pro had comparable GPU power and the demos proved that.
One thing he would constantly avoid though, was the fact that for short 10 minute sessions, that may hold true, but for extended play sessions the CPU and GPU would likely be throttled without any form of active cooling.
Otherwise that's a lot of heat applied directly to the back of a very expensive display and none of the components will last very long.
Correct, without active cooling basically every CPU or GPU out there will throttle after extended play sessions. The point isn't to compare an iPad to a workstation but rather an (Apple) ARM SoC to an (Intel) x86 CPU. The chip itself and the implementation in a device are different things. The intrinsic performance of the chip is what you see in those short bursts. The implementation performance is what you see in the long runs. Saying a chip is no good because it doesn't get enough cooling makes no sense.
Anecdata: I have a laptop and a desktop both with Intel i7 3770 CPUs. The laptop CPU always throttles even without extended use while the desktop never does it. What's the conclusion, that the i7 3770 is much better than the i7 3770 because the i7 3770 won't throttle after long play sessions while the i7 3770 will?
The article has its problems, but regardless of that, it seems clear that Apple has had the technology to launch desktop class hardware based on ARM for some time now. I'm not just talking about laptops - Apple's microarchitecture expertise makes it entirely likely that their fabled next Mac Pro is entirely ARM-based. If you vastly expand the power, die size and IO constraints on this microarchitecture, it seems to easily make it a better choice than Xeons based on sheer performance.
It's not as easy as it sounds, both technically and business wise (does Apple have enough economy of scale just on the high-end desktop compared to Intel? Doubtful.), but it's entirely feasible.
> Apple's microarchitecture expertise makes it entirely likely that their fabled next Mac Pro is entirely ARM-based
I would like to understand this statement better. In 2012, I read this article saying ARM chips were matching x86 chips and that we'd be seeing ARM desktops, ARM servers, and ARM laptops within 2-3 years. https://liliputing.com/2012/02/fastest-arm-chips-are-compara...
But it is now 2019. Aside from my phone, access point, and tablet, which are ARM based, everything else is still x86-64. Am I an outlier? What data leads you to believe that it's "entirely likely that their fabled next Mac Pro is entirely ARM-based"?
> does Apple have enough economy of scale just on the high-end desktop compared to Intel? Doubtful
I don't understand this statement. Is the volume of chips "manufactured" by Apple significantly lower than the volume of chips manufactured by Intel?
I would argue that x86's dominance is actually Intel's dominance. Intel just has a very good tactical position in terms of high-performance hardware. Intel is a better source for hardware reliability, platform reliability, logistics and being able to deliver in volume, mostly already has existing contracts in place, and is still one of the leaders in performance. Even AMD - which makes x86 hardware and should be the most easy to transition to - has trouble getting more than single digit market share gains despite making a mostly superior product performance wise for the first time in more than a decade.
Apple is affected the least by this Intel lock-in, since they are the biggest seller of high-performance hardware to consumers and are capable of doing their own support infrastructure. Moreover, they have been heavily investing in and building up a very successful processor unit of their own. Finally, Apple has famously transitioned their entire hardware platform multiple times already when it felt like their current hardware platform didn't suit them strategically. They've shown they're capable of supporting their own oddball hardware platforms before Intel, when they were a lot smaller still. Given Apple's level of control over their own technology and Intel's recent stagnation, I think it's very likely Apple wants to move in this direction if they are capable of doing it from a business perspective.
The Mac Pro might be a bold product to begin with since it's so focused on professional users, and Apple is really peculiar about having a bold vision on the high-end desktop. It would also be a clear signal of their strategy to transition the Mac Pro to this new architecture. The reason I threw in that sentence about economy of scale is that the high-end desktop is a small market for Apple, and it might not be worth it for them to make large workstation-class chips for such a market. Especially considering that Intel's high-end Xeon W line is based on the exact same silicon as Intel's high-end servers, and therefore Intel just makes a lot more high-end chips than Apple would.
Of course, Apple would be able to work around this with a chiplet architecture like AMD is doing, but I feel that we're already too far in speculation territory.
Anything serious on ARM runs the risk of throttling. Upping performance will mean more power consumption.
The even bigger problem is ARM's closed nature, with no support for the open, off-the-shelf culture that has made the PC industry what it is today. ARM is about closed SoCs, closed drivers and closed vendors, and this in effect closes up the driver and software ecosystem.
Something like Linux and the open source movement would not have happened with this hardware model, and it is a paradox of our times that it is Linux, developed because of the open culture of x86, that is used to support this closed model. There is something ironic, even parasitic, about this.
There are large forces of centralization and control currently in play and getting excited about a closed ecosystem becoming mainstream seems shortsighted for the tech ecosystem and consumers who have benefited from widespread choice, competition and the open source movement in x86.
One thing the author misses is standard cell vs full-custom design. Most ARMs that I'm aware of are standard cell. Intel and AMD do their x86s with full-custom design. Standard cell lets you write in a high-level language that's synthesized into low-level, logical form using combinations of building blocks ("standard cells"). Like high-level programming vs assembly, there are all kinds of performance costs to this vs making a custom, low-level solution ideal for the problem and the process it runs on. Like doing huge apps in assembly, you need specialists that might cost more, doing work that will take way, way, way longer with more difficulties in verification (more rework).
I don't know if Apple's ARM is fully custom. It wouldn't surprise me if the fast-path parts are. Standard cell designs can be pretty fast due to constant advances in synthesis. They'll always be behind full-custom on the same process node just because the latter puts more optimization effort in. Most choose standard cell since it's faster to develop (time to market) and cheaper. Those wanting max performance or lowest energy will be using full-custom if they can afford it. Also worth noting that the Apple A12 is 7nm vs the Core i7's 14nm per Intel's site. Apples to apples would compare that design on 14nm or a node with similar performance to it.
Btw, there’s detailed analysis below of the A12 with specs, parts breakdown, and die shot.
Great article! It certainly supports the idea that Apple switching to ARM processors for their laptops isn't crazy talk. Perhaps retaining an Intel-compatible CPU for a few generations to execute Intel binaries until the shift is complete.
BTW, the article uses the acronym IPC without explaining it. It stands for Instructions Per Cycle. CPUs can and do execute multiple instructions per clock cycle so this is just a measure of how many.
Apple are more likely to choose "fat binaries" again. They're possibly the only company who could announce an architecture switch, OS revision bump, and corresponding changes to development tools all at once.
Also, they're known for having pulled this off successfully, twice (68000 to PPC, PPC to Intel), which certainly would lend them credibility if they decided to go for it.
I would buy such a laptop with zero qualms about them bungling the migration. If it had a decent keyboard and got rid of the touch bar. I'm much more concerned about having my experience ruined by those.
Aside from missing the escape key when I'm in VIM, the Touch Bar hasn't "ruined" my experience with a Macbook Pro.
I'll readily admit the thing is more cool than genuinely useful. It's just not such a gimmick as to actually "ruin" an experience. And the gain from TouchID offsets my pain from not having a physical escape key.
In this respect, the Macbook Air gets things right.
WRT the keyboard... your mileage will vary. I hated it at first. Pretty used to it now.
I don't hate the keyboard. I tried it in a store and after getting over the surprise of the extremely short key travel distance, it is quite usable - but the failure rate, the annoyances, the potential out-of-warranty cost turn it into a complete dealbreaker. It is simply not reliable enough, despite the redesign. Maybe now at the second redesign it's finally OK but no way I'm paying for one until we're certain, and that certainty will take a couple years of real world usage to materialize.
Regarding the touch bar: Indeed the Escape key is what breaks the deal. Honestly if it began after ESC I'd tolerate it - expensive, close to useless, but tolerable.
Ironically you chose one of the examples that's already been over a porting barrier from a RISC architecture: Photoshop up to (I think) CS4 ran on PowerPC.
The A12X is a 10 billion transistor 7 nm 12W chip.
The i7-6700K is a 3 billion transistor 14 nm 95W chip.
If you assume linear scaling on all three metrics (bad assumption, but rough rule of thumb) you get 10/3 * 14/7 * 12/95 -> 85%, roughly in line with benchmark results.
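Spelling out that product as a quick sanity check (and only that; as the comment admits, none of these metrics actually scale linearly):

    #include <cstdio>

    int main() {
        const double transistors = 10.0 / 3.0;   // A12X vs i7-6700K transistor count
        const double node        = 14.0 / 7.0;   // process node ratio
        const double power       = 12.0 / 95.0;  // TDP ratio
        std::printf("%.0f%%\n", transistors * node * power * 100.0);  // ~84%
        return 0;
    }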
The i7 6700K is also a SoC and includes a GPU (weaker than the A12X's) and many other components that are also included in the A12X. It doesn't have quite the same level of integration as the A12X, but characterizing one as a 'SoC' and the other as a 'CPU' is inaccurate.
Power to performance is far from linear (and gets worse the higher you go). High TDP desktop chips are pushed to the point where a marginal increase in performance would require a massive increase in heat.
BTW ARM does not stand for “Advanced RISC Machines” as author says but “Acorn RISC Machines”. ARM was originally a joint venture between Acorn and Apple.
Yeah, I remembered some Apple folks working on the 610 at Acorn, but that was before the JV was formed. The JV I was talking about did, as you say, use "advanced".
Strictly speaking ARM hasn't stood for anything since 1998 or so when it rebranded itself as part of going public (before which it had a period of being Advanced RISC Machines, though it did start as Acorn RISC Machines). Indeed it's now "Arm" and not an initialism at all...
I meant the JV but in fact as noted upthread the JV itself always used "Advanced". I faked myself out by remembering Apple folks working with Acorn before the JV was formed.
I've recently found this interview from 2012 where Amazon's VP explains why he thinks mobile CPU architectures will take over server space. 7 years later - we are getting closer.
This is a really interesting video, not only for the content but because AWS ended up canceling their relationship with AMD to buy Annapurna labs to build their first ARM CPU.
It doesn't really matter whether they are comparable to desktop processors. What's important is whether they are good enough. Judging by the performance of the iPad Pro the answer is almost certainly yes for many or even most users.
The problem for me is they only support Windows currently - I'd love for a fast 8CX machine that ran Linux. The closest thing right now is the OP1/RK3399 in some of the mainline Chromebooks but I'm not sure if those have full Linux support yet.
The article already mentions this, but I want to re-state that Geekbench results do not correlate with real-world performance and systematically favor iOS.
The community never received an explanation from the authors on these discrepancies, so everybody should be cautious of the promises that these biases are eliminated in GB4.
> Geekbench results do not correlate with real-world performance and systematically favor iOS.
So what benchmark would be fair to you? Even Cinebench doesn't closely approach a daily workload for the typical Cinema 4D user, it's just a different way to tax the system at full.
Also, throttling is less of (or even completely not?) an issue on iPads than on iPhones. You see that on benchmarks that typically tend to reach the throttling limits on iPhones like AnTuTu. The performance gap is much wider while the SoC isn't that much more powerful.
A desktop system with an ARM processor should closely match the synthetic results of GeekBench since GeekBench measures peak performance, not sustained performance.
A12X has reached parity or better with Intel in terms of integer performance, but what about floating point? Even if they are way behind Intel in that regard, I could see Apple making a desktop version with a beefy fp unit that a phone SOC wouldn't need.
Floating point is not a totem animal of the computing world anymore. If you have a lot of FP math to do, a specialised accelerator will do it 100 times faster than a CPU.
Integer math, on the other hand, is what most of the programs you use every day are made of. And their complex, branch-heavy code is near impossible to feed to any specialised DSP.
Modern CPUs must be compared on integer math and logical operations performance, followed by their IO performance.
What an average user understands as performance today is really (integer perf + logic op perf) * effective I/O throughput.
Given that the A12X runs at < 10% of the TDP of an Intel i7 chip, would it be feasible to utilize many of these A12Xs on a 10+ socket motherboard to significantly speed up NUMA-compatible workloads?
Question: One of the points in the article is that in order to compare a desktop x86 chip to an ARM chip, you have to take clock speed into account. I assume both architectures are optimized for certain speeds at a low level (like literally the electricity flowing through the transistors), so you can't just overclock the hell out of an ARM (with cooling, etc.) to reach desktop speeds, nor vice versa to make a low-powered x86 chip.
Is this correct? Is the final output of the silicon from each architecture fundamentally different?
insight-free.
no one has cared about RISC vs CISC for decades; they care about TCO (which includes the network effects of their preferred software stack, etc).
ARM's big problem is that it's been overpromising (to the desktop and server world) for years. ultimately, joules-per-flop is going to be the same, no matter whether the FPU is wrapped inside x86 or ARM. so the question becomes: how cheap are decent/high-end ARM chips going to be? it's not nice to be "ahead" in a race to the bottom.
Now if Apple could be bothered to sell its fine CPUs for Linux machines. Not going to happen of course, so we have to wait for Qualcomm, AMD (K12), Huawei, ...
What about the impact of architecture-specific optimizations such as AVX vs NEON? I'm not sure, but I suppose many desktop applications currently assume their users are on x86, so they never think about ARM SIMD instructions.
Unlikely. Apple has tons of ARM expertise now, while RISC-V will still take lots of work to get into the same league, and they aren't so price-sensitive that the ARM license costs are a big problem.
Apple is performance sensitive though, which matters a few years down the road. RISC-V has some pretty compelling technical advantages in that department: overall simplicity, compressed code density. Now add easy customization and the fact that the open ecosystem will gravitate towards it, and it is likely a winner regardless.
So after 5-10 years Apple may have to abandon ARM.
Call me when there is something to show besides Geekbench synthetic scores (which heavily favor Apple iOS devices, going as far as utilizing ASICs and specialized instructions for stuff like compression and JavaScript) and JavaScript benchmarks which run for milliseconds and are super prone to highly specific tuning (see the general uselessness of stuff like SunSpider for actually measuring anything a user would consider to be performance).
from the article:
> The next issue I want to address are fallacies I've seen permeate discussions around ARM performance: that a benchmark is not at all useful because it is flawed or not objective in some way. I understand the reaction to a degree (I once wrote an article criticising camera benchmarks that reduce complex data into single scalar numbers), but I also believe that it's possible to understand the shortcomings of a benchmark, not to approach it objectively, and to properly consider what it might indicate, rather than dismissing it out of hand.
This is some pretty insipid hand-waving; they then go on to address exactly nothing about any shortcomings, and keep pretending that these benchmarks generalize meaningfully.