In terms of die area, even for processors that implement the x86 instruction set, the instruction decode engine is smaller than the out-of-order execution logic (register renaming and the retire queue are quite expensive in space). The branch predictor is larger than both if you have dynamic branch prediction (i.e., if you want a branch predictor that works). Load/store units pretty much dwarf any other execution unit (turns out paging isn't cheap in die area), particularly when you include the L1 cache.
If you want to stick more cores on a single die, you have to shrink each core, and the real big die-area wins are exactly the caches, out-of-order execution, and branch predictors--losing which will kill your IPC. The other problem with high core counts is memory bandwidth: the instruction fetch bandwidth alone on 1024 cores runs into the thousands of GB/s, and cache coherency traffic and core-to-core communication would likely eat whatever bandwidth is left.
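For scale, here's a back-of-envelope sketch (my numbers, not anything measured: 1 GHz clock, 1 instruction per cycle, 4-byte fixed-width instructions):

    /* Rough aggregate instruction-fetch bandwidth for 1024 simple cores.
       Assumed numbers: 1 GHz, 1 IPC, 4-byte instructions. */
    #include <stdio.h>

    int main(void) {
        double cores = 1024.0, clock_hz = 1e9, ipc = 1.0, insn_bytes = 4.0;
        double fetch_bw = cores * clock_hz * ipc * insn_bytes; /* bytes/s */
        printf("%.1f TB/s\n", fetch_bw / 1e12);                /* ~4.1 TB/s */
        return 0;
    }

Even with those modest assumptions, instruction fetch alone wants around 4 TB/s, and a real DRAM interface delivers a small fraction of that--which is exactly why you can't afford to throw away the caches.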
Ripping out most of the instruction set means ripping out hardware-accelerated instructions with nothing to compensate for the lost speed. You can't raise clock frequencies (power will kill you), and you can't raise core counts (the memory bandwidth is insufficient, not to mention Amdahl's law).
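As a rough illustration of what "nothing to compensate" means (my example; no specific instructions were named above), compare a hardware population count against the software fallback you'd be left with:

    #include <stdint.h>

    /* With the instruction: one POPCNT, roughly a cycle on modern x86. */
    static inline int popcount_hw(uint64_t x) {
        return (int)__builtin_popcountll(x); /* GCC/Clang builtin */
    }

    /* Without it: the classic SWAR fallback, a dozen-plus ALU ops. */
    static inline int popcount_sw(uint64_t x) {
        x = x - ((x >> 1) & 0x5555555555555555ULL);      /* 2-bit sums */
        x = (x & 0x3333333333333333ULL)
          + ((x >> 2) & 0x3333333333333333ULL);          /* 4-bit sums */
        x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;      /* 8-bit sums */
        return (int)((x * 0x0101010101010101ULL) >> 56); /* add the bytes */
    }

Multiply that kind of gap across the crypto, string, and vector instructions and the "simpler core" stops looking cheap.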
Just for reference, IBM tried roughly that with the POWER6: it dropped out-of-order execution and (I think) neutered the branch predictor, while jacking up the clock frequency. The result was a steaming pile of shit. Performance on our code cratered relative to the POWER5, Core2, and Nehalem, and I ended up adding #ifdef blocks to rewrite some significant algorithms just to get the POWER6 into the same ballpark. IBM fixed it somewhat with the POWER6+, but wisely reversed course with the POWER7.
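To give a flavor of the kind of #ifdef rework that means (a hypothetical sketch; the actual algorithms aren't shown here), an in-order core like the POWER6 stalls on one long dependency chain, so you break the chain up by hand:

    #include <stddef.h>

    double dot(const double *a, const double *b, size_t n) {
    #ifdef TARGET_IN_ORDER /* hypothetical macro for POWER6-style cores */
        /* Four independent accumulators hide FP latency without OoO. */
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i]   * b[i];
            s1 += a[i+1] * b[i+1];
            s2 += a[i+2] * b[i+2];
            s3 += a[i+3] * b[i+3];
        }
        for (; i < n; i++) s0 += a[i] * b[i];
        return (s0 + s1) + (s2 + s3);
    #else
        /* One dependency chain; an out-of-order core hides the latency. */
        double s = 0;
        for (size_t i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    #endif
    }

On an out-of-order machine both versions run fine; on an in-order one, the naive loop serializes on the floating-point add latency every iteration.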
Here's an example of a die-area breakdown: http://farm6.staticflickr.com/5321/9104546631_4c7a4a023b_o.j... I don't think it's the best picture, but it's surprisingly hard to find these for recent x86 processors.