In terms of die area, even for processors that implement the x86 instruction set, the instruction decode engine is smaller than the out-of-order execution logic (register renaming and the retire queue are quite expensive in space). The branch predictor is larger than both if you have dynamic branch prediction (i.e., if you want a branch predictor that works). Load/store units pretty much dwarf any other execution unit (turns out paging isn't cheap in die area), particularly when you include the L1 cache.
If you want to stick more cores on a single die, you have to shrink each core, and the real big die-area wins are exactly the caches, out-of-order execution, and branch predictors--losing which will kill your IPC. The other problem with high core counts is memory bandwidth: the instruction fetch bandwidth alone on 1024 cores runs into the thousands of GB/s, and cache coherency traffic and core-to-core communication would likely eat whatever bandwidth is left.
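For scale, here's a back-of-envelope sketch (my numbers, not anything measured: 1 GHz clock, 1 instruction per cycle, 4-byte fixed-width instructions):

    /* Rough aggregate instruction-fetch bandwidth for 1024 simple cores.
       Assumed numbers: 1 GHz, 1 IPC, 4-byte instructions. */
    #include <stdio.h>

    int main(void) {
        double cores = 1024.0, clock_hz = 1e9, ipc = 1.0, insn_bytes = 4.0;
        double fetch_bw = cores * clock_hz * ipc * insn_bytes; /* bytes/s */
        printf("%.1f TB/s\n", fetch_bw / 1e12);                /* ~4.1 TB/s */
        return 0;
    }

Even with those modest assumptions, instruction fetch alone wants around 4 TB/s, and a real DRAM interface delivers a small fraction of that--which is exactly why you can't afford to throw away the caches.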
Ripping out most of the instruction set means ripping out hardware-accelerated instructions with nothing to compensate for the lost speed. You can't raise clock frequencies (power will kill you), and you can't raise core counts (the memory bandwidth is insufficient, not to mention Amdahl's law).
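As a rough illustration of what "nothing to compensate" means (my example; no specific instructions were named above), compare a hardware population count against the software fallback you'd be left with:

    #include <stdint.h>

    /* With the instruction: one POPCNT, roughly a cycle on modern x86. */
    static inline int popcount_hw(uint64_t x) {
        return (int)__builtin_popcountll(x); /* GCC/Clang builtin */
    }

    /* Without it: the classic SWAR fallback, a dozen-plus ALU ops. */
    static inline int popcount_sw(uint64_t x) {
        x = x - ((x >> 1) & 0x5555555555555555ULL);      /* 2-bit sums */
        x = (x & 0x3333333333333333ULL)
          + ((x >> 2) & 0x3333333333333333ULL);          /* 4-bit sums */
        x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;      /* 8-bit sums */
        return (int)((x * 0x0101010101010101ULL) >> 56); /* add the bytes */
    }

Multiply that kind of gap across the crypto, string, and vector instructions and the "simpler core" stops looking cheap.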
Just for reference, IBM tried roughly that with the POWER6: it dropped out-of-order execution and (I think) neutered the branch predictor, while jacking up the clock frequency. The result was a steaming pile of shit. Performance on our code cratered relative to the POWER5, Core2, and Nehalem, and I ended up adding #ifdef blocks to rewrite some significant algorithms just to get the POWER6 into the same ballpark. IBM fixed it somewhat with the POWER6+, but wisely reversed course with the POWER7.
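To give a flavor of the kind of #ifdef rework that means (a hypothetical sketch; the actual algorithms aren't shown here), an in-order core like the POWER6 stalls on one long dependency chain, so you break the chain up by hand:

    #include <stddef.h>

    double dot(const double *a, const double *b, size_t n) {
    #ifdef TARGET_IN_ORDER /* hypothetical macro for POWER6-style cores */
        /* Four independent accumulators hide FP latency without OoO. */
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i]   * b[i];
            s1 += a[i+1] * b[i+1];
            s2 += a[i+2] * b[i+2];
            s3 += a[i+3] * b[i+3];
        }
        for (; i < n; i++) s0 += a[i] * b[i];
        return (s0 + s1) + (s2 + s3);
    #else
        /* One dependency chain; an out-of-order core hides the latency. */
        double s = 0;
        for (size_t i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    #endif
    }

On an out-of-order machine both versions run fine; on an in-order one, the naive loop serializes on the floating-point add latency every iteration.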
Here's an example of a die-area breakdown: http://farm6.staticflickr.com/5321/9104546631_4c7a4a023b_o.j... I don't think it's the best picture, but it's surprisingly hard to find these for recent x86 processors.