I don't think that x86 implementations are transistor limited. In fact, Intel had to slap on giant vector ALUs just to find a use for them. And x86 L1 size is unfortunately limited by the page size, so you can't really trade one for the other.
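(For context, the page-size limit mentioned here is presumably the usual VIPT aliasing constraint: a virtually-indexed, physically-tagged L1 must keep its index bits within the page offset, which caps size at page size times associativity. A rough sketch of the arithmetic, with illustrative numbers and a function name of my own:)

```python
PAGE_SIZE = 4096  # x86 base page size in bytes


def max_vipt_l1_size(associativity: int, page_size: int = PAGE_SIZE) -> int:
    """Max size of a VIPT cache that avoids virtual aliasing.

    The index bits must fall entirely inside the page offset, so:
        size <= page_size * associativity
    Growing the L1 therefore forces growing the associativity.
    """
    return page_size * associativity


# 8-way caps out at 32 KiB; going to 48 KiB requires 12 ways.
print(max_vipt_l1_size(8))   # 32768
print(max_vipt_l1_size(12))  # 49152
```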
Complex decoders do consume power, of course, but I don't think they have a huge effect on the thermal budget. I also don't think they have a huge effect on latency, and the uop L0 cache actually improves latency.
They make it harder to scale to higher width of course, but it seems that it hasn't been a huge obstacle so far.
One of the problems with the decoder is that it's "always on" so it always draws power (unlike the SIMD unit, for instance).
I also think that the uop L0 cache sits closer to where the L1I cache sits in a fixed-width RISC implementation. The L1I cache of an x86 machine is quite far away from dispatch (in terms of pipeline stages). I think that z/Arch has something like 5-10 stages between L1I and dispatch, for instance.
And if you start comparing the uop cache with the L1I$ of a fixed-width RISC machine, things don't look good for CISC (the uop cache is extremely inefficient in terms of capacity per unit of silicon, holding only a handful of kuops). It's probably not an entirely fair comparison, but neither is comparing the L1I$ of a CISC machine with that of a RISC machine.
I don't think it has so much to do with being transistor limited as it has to do with keeping the latency-sensitive parts tight and avoiding unnecessary pipeline stages. It's "easy" to throw transistors at L2 & L3 cache, but minimising branch misprediction penalties and keeping a wide pipeline 100% fed with instructions all the time is trickier.
Why would x86 L1 be limited by the page size? The cache works at 64-byte granularity. If it's too big it might increase TLB pressure, I suppose, but it makes sense that increases in L1 size would be best accompanied by increases in TLB size.