... because we (the chip designer) are okay with larger footprint per core.
> specialized instructions are pretty much free
... only after we have fixed the footprint per core. But if we're willing to vary that parameter, then the specialized instructions are not free.
Not to mention, the main article of this thread is strong evidence that those specialized instructions are almost never used!
As for your point about 1024 cores, the whole point I'm trying to make is that whatever software does today with 4 cores in a 4-core processor could be done by 4 cores in a 1024-core processor, because those 4 cores don't implement the instructions that are not needed. That means you have 1020 cores sitting free in your microprocessor. Their presence could only make your computations faster or, in the worst case, leave them at the same speed, not slower.
> simple CPUs are just slow
I would like to see a source for this claim. The only reason I can think of is that complex CPUs implement some instructions that help speed things up. But as we can see in the original article of this thread, software is not making use of those instructions. So I don't see how a simple CPU (one that picks the best 5-7 instructions that give Turing completeness as well as the best performance) is any slower.
Note, by a simple CPU, I'm not advocating eliminating pipelines, caches, etc. All I'm saying is that once you optimize a CPU design and eliminate redundancy as well as the requirement of backward compatibility, you can get a much better performing CPU than what we have currently.
In terms of die area, even for processors that implement the x86 instruction set, the instruction decode engine is smaller than the out-of-order execution logic (register renaming and the retire queue are quite expensive in space). The branch predictor is larger than both if you have dynamic branch prediction (i.e., if you want a branch predictor that works). Load/store units pretty much dwarf any other execution unit (turns out paging isn't cheap in die area), particularly when you include the L1 cache.
If you want to stick more cores on a single die, you have to shrink the size of a core. And looking at die space, the real big wins are the caches, out-of-order execution, and branch predictors--losing all of which will kill your IPC. The other problem with high core counts is memory bandwidth. The instruction fetch bandwidth alone on 1024 cores runs to GB/s. Cache coherency traffic and core communication traffic could likely suck up the rest of the bandwidth.
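To put rough numbers on the fetch bandwidth point, here is a back-of-the-envelope sketch. Every input is an assumption picked for illustration (1 GHz clock, one instruction per cycle, 4-byte instructions, 1% of fetched bytes coming from main memory rather than cache), and the function name is made up; real figures depend heavily on the design.

```python
def instruction_fetch_bandwidth(cores, clock_hz, ipc, bytes_per_instruction,
                                fraction_from_memory):
    """Bytes per second of instruction fetch traffic that reaches main memory.

    Deliberately crude model: it ignores cache line granularity, prefetching,
    and any sharing of code between cores.
    """
    # Total instruction bytes consumed by all cores each second.
    fetched = cores * clock_hz * ipc * bytes_per_instruction
    # Only the share that is not served by on-chip caches hits memory.
    return fetched * fraction_from_memory


# All values below are illustrative assumptions, not measurements.
total = instruction_fetch_bandwidth(
    cores=1024,
    clock_hz=1e9,
    ipc=1.0,
    bytes_per_instruction=4,
    fraction_from_memory=0.01,
)
print(f"{total / 1e9:.1f} GB/s of instruction traffic to memory")
```

Under these made-up inputs the cores consume about 4 TB/s of instruction bytes in total, and even with 99% of that served from cache, roughly 41 GB/s still goes to memory before any data traffic is counted.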
Proposing to rip out most of the instructions leaves you with the problem that you're ripping out hardware-accelerated instructions, and you have nothing to compensate for the lost speed. You can't increase clock frequencies (power will kill you); you can't increase core counts (the memory bandwidth is insufficient, not to mention Amdahl's law).
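For anyone who hasn't run the Amdahl's law numbers, a minimal sketch follows. The 95% parallel fraction is an assumption chosen for illustration, not a measurement of any real workload.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the work can run in parallel.

    speedup = 1 / ((1 - p) + p / n)
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed workload: 95% of the runtime parallelizes perfectly.
p = 0.95
for n in (4, 64, 1024):
    print(f"{n:5d} cores: at most {amdahl_speedup(p, n):.2f}x")
```

With 5% serial work the speedup can never exceed 20x, however many cores are added, so most of the 1020 extra cores would sit idle on such a workload.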
Just for reference, IBM tried to do that with the POWER6. Dropped out-of-order execution and (I think) neutered branch prediction, while jacking up the clock frequency. The result was a steaming pile of shit. Performance on our code cratered with respect to the POWER5, Core2, and Nehalem. I ended up having to add ifdef blocks to rewrite some significant algorithms in order to get the POWER6 to be comparable. IBM fixed it somewhat with the POWER6+, but wisely reversed course with the POWER7.
Those 4 cores in your 1024 core processor would be embarrassingly underpowered.
In a recent, high frequency, aggressively OoO CPU, a large part of the die is used by caches, the register file, vector ALUs and the OoO machinery; by comparison scalar ALUs and especially decoders do not take a lot of machinery. In particular microcode for all legacy instructions takes only a tiny amount of space. Legacy support might have a cost on tiny embedded processors, but it just doesn't matter in a large desktop/server class CPU.
And yes, some of those specialized instructions are rarely used (there is a reason they are called dark silicon), but it means that Intel (or ARM, IBM) doesn't need to spin a new CPU for a specialized workload. Intel is even rumored to add custom instructions required by large customers (FB, Google) on all its CPUs, which are left disabled for other customers.
The cases where such a CPU would be efficient are already the places where a normal CPU is extremely fast. So it would also fail in the same ways, pipeline stalls and all. If you need to wait for the answer to one computation to know what to do next, any additional cores won't add speed.
A good part of the complexity of current CPUs comes from features like branch prediction, speculative execution and so on, so removing instructions wouldn't make them drastically simpler.
Many of the truly rarely used instructions aren't built in hardware and therefore don't contribute to the complexity of the CPU anyway. Others are rarely used but add huge speed boosts for important special-purpose tasks; think OS kernel.
> But as we can see in the original article of this thread, software is not making use of those instructions
The article doesn't show that or even come close to it. To measure what instructions software uses, you'd need to measure it at runtime.
The metric of simply counting frequency on disk ignores loops: you might have a single server binary that contains a single AES-NI instruction and conclude this instruction was useless and never used, but if your server spends 50% of its time decrypting SSL connections at high speed, then things look rather different.
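A toy sketch of the difference between counting instructions on disk and counting them at runtime. The instruction sites and execution counts are invented for illustration, and executed-instruction share is only a rough proxy for time spent.

```python
from collections import Counter


def static_and_dynamic_share(sites):
    """Compare on-disk frequency with runtime frequency per mnemonic.

    `sites` is a non-empty list of (mnemonic, times_executed) pairs, one pair
    per instruction site in the binary, with at least one site executed.
    Returns two dicts: share of sites, and share of executed instructions.
    """
    static = Counter()
    dynamic = Counter()
    for mnemonic, executions in sites:
        static[mnemonic] += 1            # one occurrence in the binary
        dynamic[mnemonic] += executions  # weighted by how often it runs
    total_static = sum(static.values())
    total_dynamic = sum(dynamic.values())
    static_share = {m: c / total_static for m, c in static.items()}
    dynamic_share = {m: c / total_dynamic for m, c in dynamic.items()}
    return static_share, dynamic_share


# Hypothetical binary: 999 ordinary sites that each run once, plus a single
# AES instruction sitting in a hot loop that runs 999 times.
sites = [("mov", 1)] * 999 + [("aesdec", 999)]
static_share, dynamic_share = static_and_dynamic_share(sites)
print(f"aesdec on disk:    {static_share['aesdec']:.1%}")
print(f"aesdec at runtime: {dynamic_share['aesdec']:.1%}")
```

In this made-up binary the AES instruction is 0.1% of the sites on disk but half of everything executed, which is exactly the case a static count cannot see.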