Just to nitpick, backwards compatibility isn't really a huge issue for Intel. Most of the really old stuff that's a pain to maintain can be shoved in microcode; compilers won't emit those instructions.
There are obvious downsides to the architecture, but the need for backwards compatibility shouldn't hurt it too much.
GPU workloads are very different in that you generally don't have to look particularly hard to find a bunch of parallelism to exploit (if you did have to, your code would run terribly on a GPU), so you can usually gain a load of performance just by scaling up your design.
CPUs are super restricted by the single threaded, branching nature of the code you run on them, and this is what makes CPU performance a little more nuanced, and not directly comparable.
That's not really true; backwards compatibility on x86 architectures takes a tremendous amount of power and die space, and the 'throw it in microcode' solution only partially mitigates this issue.
I can't remember where I read it but something like 30+% of an Intel CPU die area/power consumption is due to the x86 ISA. Apparently the original Pentium CPU was 40% instruction decoding by die area. And the ISA has grown enormously since then.
"CPUs are super restricted by the single threaded, branching nature of the code you run on them, and this is what makes CPU performance a little more nuanced, and not directly comparable."
Ironically, to really hit peak performance of a modern AVX2 or later CPU, you have to embrace many of the design principles that lead to efficient GPU code:
1. Multiple threads per core to make use of the dual vector units introduced in Haswell
2. SIMD-like thinking to remap tasks into the 8-way and soon to be 16-way vector units
3. Running multiple threads across multiple cores
4. Micromanaging the L1 cache and treating the AVX/SSE registers as L0 cache
Where the CPU prevails is for fundamentally serial algorithms that cannot be mapped into a SIMD implementation. Mike Acton's Data-Oriented Design covers this case nicely IMO.