All of those features can help, but each has tradeoffs. Longer pipelines make branch mispredictions more expensive, caching gets in the way of write-heavy code that's mainly about moving data (like in games), and so on. I feel that putting the extra transistors towards large numbers of cores with short 4-stage pipelines (like in early PowerPC) would have been better.
This is one of the more concise benchmark comparisons, in this case between a 3.6 GHz i9 and a 1.4 GHz Pentium 3 (a line first released in 1999):
https://www.cpubenchmark.net/cpu.php?cpu=Intel+Pentium+III+1...
So this is 8 cores vs 1, at 2.57 times the clock speed. Normalizing for core count and clock speed, per-core performance has increased by:
(18892/299) * (1/8) * (1.4/3.6) = 3.07
A 3x increase in 20 years is admirable, but it's roughly 1/3000 of what would have been predicted if performance had followed Moore's Law (that figure assumes a doubling every 18 months, which gives about 10,000x over 20 years). To me, this indicates that per-core performance stopped really increasing sometime around 2005 at the latest. That's why fabs moved towards lower-cost mobile and embedded chips.
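For anyone who wants to check the arithmetic, here is a rough Python sketch of the normalization and the Moore's Law comparison. The scores, core counts and clock speeds are the ones quoted above. The function names and the two doubling periods (18 months and 2 years) are my own assumptions for illustration, and treating Moore's Law as a performance doubling rather than a transistor-count doubling is itself a simplification.

```python
# Figures quoted in the comment above (benchmark scores, cores, clock speeds).
I9_SCORE, P3_SCORE = 18892, 299
I9_CORES, P3_CORES = 8, 1
I9_GHZ, P3_GHZ = 3.6, 1.4
YEARS = 20


def per_core_per_clock_gain(new_score, old_score, new_cores, old_cores, new_ghz, old_ghz):
    """Score ratio normalized by core count and by clock speed."""
    return (new_score / old_score) * (old_cores / new_cores) * (old_ghz / new_ghz)


def moores_law_factor(years, doubling_years):
    """Growth factor if performance doubled every `doubling_years` years (assumption)."""
    return 2 ** (years / doubling_years)


gain = per_core_per_clock_gain(I9_SCORE, P3_SCORE, I9_CORES, P3_CORES, I9_GHZ, P3_GHZ)
print(f"per-core, per-clock gain: {gain:.2f}x")

# Compare against two commonly cited doubling periods; both are assumptions.
for doubling in (1.5, 2.0):
    predicted = moores_law_factor(YEARS, doubling)
    print(f"doubling every {doubling} years: {predicted:,.0f}x predicted, "
          f"shortfall {predicted / gain:,.0f}x")
```

With the 18-month doubling the shortfall comes out on the order of 3000x, which matches the estimate above. With a 2-year doubling it is closer to 300x, so the exact ratio depends heavily on which version of Moore's Law you pick.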