
> larger transistor budget

... because we (the chip designers) are okay with a larger footprint per core.

> specialized instructions are pretty much free

... only after we have fixed the footprint per core. But if we're willing to vary that parameter, then the specialized instructions are not free.

Not to mention, the main article of this thread is strong evidence that those specialized instructions are almost never used!

As for your point about 1024 cores, the whole point I'm trying to make is that whatever software does today on 4 cores of a 4-core processor could be done by 4 cores of a 1024-core processor, because those cores don't implement the instructions that aren't needed. That means you have 1020 cores sitting free in your microprocessor. Their presence can only make your computations faster, or at worst leave them at the same speed, never slower.

> simple CPUs are just slow

I would like to see a source for this claim. The only reason I can think of is that complex CPUs implement some instructions that help speed things up. But as we can see in the original article of this thread, software is not making use of those instructions. So I don't see how a simple CPU (one that picks the best 5-7 instructions for Turing completeness as well as performance) is any slower.

Note that by a simple CPU I'm not advocating eliminating pipelines, caches, etc. All I'm saying is that once you optimize a CPU design and eliminate redundancy as well as the requirement of backward compatibility, you can get a much better performing CPU than what we have currently.




In terms of die area, even for processors that implement the x86 instruction set, the instruction decode engine is smaller than the out-of-order execution logic (register renaming and the retire queue are quite expensive in space). The branch predictor is larger than both if you have dynamic branch prediction (i.e., if you want a branch predictor that works). Load/store units pretty much dwarf any other execution unit (turns out paging isn't cheap in die area), particularly when you include the L1 cache.

Here's an example of die area: http://farm6.staticflickr.com/5321/9104546631_4c7a4a023b_o.j... I don't think it's the best picture, but it's surprisingly hard to find these for recent x86 processors.

If you want to stick more cores on a single die, you have to shrink the size of a core. And looking at die space, the real big wins are the caches, out-of-order execution, and branch predictors--losing all of which will kill your IPC. The other problem with high core counts is memory bandwidth. The instruction fetch bandwidth alone on 1024 cores runs to GB/s. Cache coherency traffic and core communication traffic could likely suck up the rest of the bandwidth.
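
To put rough numbers on the fetch-bandwidth point, here's a back-of-envelope sketch; the clock speed, instruction width, IPC and cache hit rate below are illustrative assumptions, not measurements of any real chip:

    # Instruction fetch traffic for a hypothetical 1024-core chip.
    # Every parameter is an assumed round number, chosen only to
    # illustrate the order of magnitude.
    cores = 1024
    clock_hz = 1e9            # 1 GHz, modest for a simple core
    ipc = 1.0                 # one instruction fetched per cycle
    bytes_per_insn = 4        # fixed-width RISC-style encoding
    icache_hit_rate = 0.99    # optimistic instruction-cache hit rate

    fetch_per_core = clock_hz * ipc * bytes_per_insn        # bytes/s from L1
    miss_traffic = fetch_per_core * (1 - icache_hit_rate) * cores

    print(f"L1 fetch per core:        {fetch_per_core / 1e9:.1f} GB/s")
    print(f"Memory-side fetch, total: {miss_traffic / 1e9:.1f} GB/s")
    # Even at a 99% I-cache hit rate, instruction fetch alone pushes ~40 GB/s
    # toward memory, before any data traffic or coherency messages.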

Proposing to rip out most of the instructions leaves you with the problem that you're ripping out hardware-accelerated instructions, and you have nothing to compensate for the lost speed. You can't increase clock frequencies (power will kill you); you can't increase core counts (the memory bandwidth is insufficient, not to mention Amdahl's law).
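
On the Amdahl's law point, a minimal worked example (the 95% parallel fraction is an arbitrary illustrative figure):

    # Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the fraction
    # of the work that parallelizes and n is the core count.
    def amdahl_speedup(p, n):
        return 1.0 / ((1.0 - p) + p / n)

    for n in (4, 64, 1024):
        print(f"{n:5d} cores -> {amdahl_speedup(0.95, n):4.1f}x speedup")
    # Roughly 3.5x, 15.4x and 19.6x: with just 5% serial work, going from 64
    # to 1024 cores buys very little, and the ceiling is 20x regardless of
    # core count.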


Just for reference, IBM tried to do that with the POWER6. Dropped out-of-order execution and (I think) neutered branch prediction, while jacking up the clock frequency. The result was a steaming pile of shit. Performance on our code cratered with respect to the POWER5, Core2, and Nehalem. I ended up having to add ifdef blocks to rewrite some significant algorithms in order to get the POWER6 to be comparable. IBM fixed it somewhat with the POWER6+, but wisely reversed course with the POWER7.


Those 4 cores in your 1024-core processor would be embarrassingly underpowered.

In a recent, high frequency, aggressively OoO CPU, a large part of the die is used by caches, the register file, vector ALUs and the OoO machinery; by comparison, scalar ALUs and especially decoders don't take up much space. In particular, the microcode for all the legacy instructions takes only a tiny amount of space. Legacy support might have a cost on tiny embedded processors, but it just doesn't matter in a large desktop/server class CPU.

And yes, some of those specialized instructions are rarely used (there is a reason they are called dark silicon), but having them means that Intel (or ARM, or IBM) doesn't need to spin a new CPU for a specialized workload. Intel is even rumored to add custom instructions required by large customers (FB, Google) to all its CPUs, leaving them disabled for other customers.


The cases where such a CPU would be efficient are already the places where a normal CPU is extremely fast, so it would also fail in the same ways, pipeline stalls and all. If you need to wait for the answer to one computation to know what to do next, additional cores won't add any speed.
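
A toy illustration of that serial-dependency point (not from the article, just to make the contrast concrete):

    # Each iteration needs the previous result, so extra cores cannot overlap
    # the iterations; the chain is inherently sequential.
    def chained(x, n):
        for _ in range(n):
            x = (x * x + 1) % 1_000_003   # next value depends on the current one
        return x

    # By contrast, summing independent terms parallelizes trivially: each term
    # could be computed on any core.
    def independent(values):
        return sum(v * v for v in values)

    print(chained(2, 10))          # sequential: bounded by the dependency chain
    print(independent(range(10)))  # embarrassingly parallel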

A good part of the complexity of current CPUs comes from features like branch prediction, speculative execution and so on, so removing instructions wouldn't make them drastically simpler. Many of the truly rarely used instructions aren't built in hardware and therefore don't contribute to the complexity of the CPU anyway. Others are rarely used but add huge speed boosts for important special-purpose tasks; think OS kernel.


> But as we can see in the original article of this thread, software is not making use of those instructions

The article doesn't show that or even come close to it. To measure what instructions software uses, you'd need to measure it at runtime.

The metric of simply counting instruction frequency on disk ignores loops: you might have a single server binary that contains a single AES-NI instruction and conclude that the instruction is useless and never used, but if your server spends 50% of its time decrypting SSL connections at high speed, things look rather different.
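
For what it's worth, the static on-disk count is easy to reproduce with a sketch like the one below (it assumes a Linux box with binutils' objdump on the PATH; /bin/ls is just an arbitrary example binary). Note what it doesn't capture: a single AES-NI instruction inside a hot loop still shows up with a count of 1, and measuring runtime frequency would need a profiler or binary instrumentation instead.

    # Count static instruction mnemonics in a binary's disassembly.
    # This reproduces the "frequency on disk" metric only; it says nothing
    # about how often each instruction actually executes.
    import subprocess
    from collections import Counter

    def static_instruction_counts(path):
        out = subprocess.run(["objdump", "-d", path],
                             capture_output=True, text=True, check=True).stdout
        counts = Counter()
        for line in out.splitlines():
            parts = line.split("\t")
            if len(parts) >= 3 and parts[2].strip():
                mnemonic = parts[2].split()[0]   # e.g. "mov", "push", "aesenc"
                counts[mnemonic] += 1
        return counts

    if __name__ == "__main__":
        for insn, n in static_instruction_counts("/bin/ls").most_common(10):
            print(f"{insn:10s} {n}")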



