I agree, but in the absence of an accurate method this felt like the second-best option.
Then there's also the elephant in the room that I'd like to write an article about some time: The decoder/translator + uop-cache in the front end has a devastating effect on instruction fetch & decode performance. That silicon eats power, could be used for better things (larger L1I cache etc), adds latency, limits how wide you can decode, and so on.
Rationale: CISC is not just about density. With good RISC you can get much better fetch & decode bandwidth (all other things being equal). E.g. see Apple silicon.
I don't think that x86 implementations are transistor limited. In fact, Intel had to slap on giant vector ALUs just to find a use for them. And x86 L1 size is unfortunately limited by the page size, so you can't really trade one for the other.
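Presumably the page-size point is the usual VIPT constraint: a virtually-indexed, physically-tagged L1 has to take its set index from the page-offset bits, so its size is capped at associativity times page size. A quick sketch of that arithmetic, with assumed but typical numbers (not any specific core):

    # VIPT constraint sketch: max L1 size = ways * page size
    # (assumed, typical-ish numbers; purely illustrative)
    PAGE_SIZE = 4 * 1024  # baseline x86 page size in bytes

    def max_vipt_l1_kib(ways, page_size=PAGE_SIZE):
        return ways * page_size // 1024

    print(max_vipt_l1_kib(8))   # 8-way  -> 32 KiB (the classic x86 L1 size)
    print(max_vipt_l1_kib(12))  # 12-way -> 48 KiB (grow associativity to grow the cache)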
Complex decoders do consume power, of course, but I don't think they have a huge effect on the thermal budget. I also don't think they add much latency, and the uop L0 cache actually improves latency.
They make it harder to scale to higher decode widths, of course, but it seems that hasn't been a huge obstacle so far.
One of the problems with the decoder is that it's "always on" so it always draws power (unlike the SIMD unit, for instance).
I also think that the uop L0 cache sits roughly where the L1I cache sits in a fixed-width RISC implementation. The L1I cache of an x86 machine is quite far away from dispatch (in terms of pipeline stages). I think z/Arch has something like 5-10 stages between L1I and dispatch, for instance.
And if you start comparing the uop cache with the L1I$ of a fixed-width RISC machine, things don't look good for CISC (the uop cache is extremely inefficient in terms of capacity per silicon, only holding a handful of kuops). It's probably not an entirely fair comparison, but neither is comparing the L1I$ of a CISC machine with that of a RISC machine.
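To make the capacity comparison a bit more concrete, here's a rough sketch with assumed, ballpark numbers (not measurements of any particular core):

    # Illustrative capacity comparison (all values assumed)
    L1I_BYTES = 32 * 1024         # a 32 KiB L1I
    RISC_INSN_BYTES = 4           # fixed-width RISC instruction
    X86_AVG_INSN_BYTES = 4        # rough average x86 instruction length (assumed)
    UOP_CACHE_ENTRIES = 4 * 1024  # a few thousand uops, ballpark for recent x86 cores

    print(L1I_BYTES // RISC_INSN_BYTES)     # ~8192 RISC instructions in the L1I
    print(L1I_BYTES // X86_AVG_INSN_BYTES)  # ~8192 x86 instructions, still pre-decode
    print(UOP_CACHE_ENTRIES)                # ~4096 uops, each entry far wider than 4 bytes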
I don't think it has so much to do with being transistor limited as with keeping the latency-sensitive parts tight and avoiding unnecessary pipeline stages, etc. It's "easy" to throw transistors at L2 & L3 cache, but minimising branch misprediction penalties and keeping a wide pipeline 100% fed with instructions all the time is trickier.
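A minimal sketch of why the front-end depth matters for that, using made-up but plausible numbers:

    # Rough shape of the redirect cost (assumed numbers, not a specific core):
    # every stage between fetch and dispatch, times the machine width, is
    # potential issue slots lost on a branch misprediction.
    def lost_slots(front_end_stages, dispatch_width):
        return front_end_stages * dispatch_width

    print(lost_slots(5, 8))   # shallow front end, 8-wide -> 40 slots
    print(lost_slots(10, 8))  # deeper front end, 8-wide  -> 80 slots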
Why would x86 L1 be limited by the page size? The cache works at 64-byte cache-line granularity. If it's too big it might increase TLB pressure, I suppose, but it makes sense that increases in L1 size would be best accompanied by increases in TLB size.
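On the TLB-pressure side, the reach is just entries times page size, so a bigger L1 without a bigger TLB mostly moves the misses around. A tiny sketch with illustrative values (my numbers, not a specific core):

    # TLB reach = entries * page size
    def tlb_reach_kib(entries, page_bytes=4 * 1024):
        return entries * page_bytes // 1024

    print(tlb_reach_kib(64))   # 64-entry L1 TLB, 4 KiB pages -> 256 KiB of reach
    print(tlb_reach_kib(128))  # 128 entries -> 512 KiB, more headroom for a bigger L1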