I don't think that x86 implementations are transistor limited. In fact, Intel had to slap on giant vector ALUs just to find a use for them. And x86 L1 size is unfortunately limited by the page size, so you can't really trade one for the other.
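(For context, the page-size limit mentioned here is presumably the usual VIPT aliasing constraint: a virtually-indexed, physically-tagged L1 must keep its index bits within the page offset, which caps size at page size times associativity. A rough sketch of the arithmetic, with illustrative numbers and a function name of my own:)

```python
PAGE_SIZE = 4096  # x86 base page size in bytes


def max_vipt_l1_size(associativity: int, page_size: int = PAGE_SIZE) -> int:
    """Max size of a VIPT cache that avoids virtual aliasing.

    The index bits must fall entirely inside the page offset, so:
        size <= page_size * associativity
    Growing the L1 therefore forces growing the associativity.
    """
    return page_size * associativity


# 8-way caps out at 32 KiB; going to 48 KiB requires 12 ways.
print(max_vipt_l1_size(8))   # 32768
print(max_vipt_l1_size(12))  # 49152
```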
Complex decoders do consume power, of course, but I don't think they have a huge effect on the thermal budget. I also don't think they have a huge effect on latency, and the uop L0 cache actually improves latency.
They make it harder to scale to higher width of course, but it seems that it hasn't been a huge obstacle so far.
One of the problems with the decoder is that it's "always on" so it always draws power (unlike the SIMD unit, for instance).
I also think that the uop L0 cache sits closer to where the L1I cache sits in a fixed-width RISC implementation. The L1I cache of an x86 machine is quite far away from dispatch (in terms of pipeline stages). I think that z/Arch has something like 5-10 stages between L1I and dispatch, for instance.
And if you start comparing the uop cache with the L1I$ of a fixed-width RISC machine, things don't look good for CISC (the uop cache is extremely inefficient in terms of capacity per unit of silicon, holding only a handful of kuops). It's probably not an entirely fair comparison, but neither is comparing the L1I$ of a CISC machine with that of a RISC machine.
I don't think it has so much to do with being transistor limited as it has to do with keeping the latency-sensitive parts tight and avoiding unnecessary pipeline stages. It's "easy" to throw transistors at L2 & L3 cache, but minimising branch misprediction penalties and keeping a wide pipeline 100% fed with instructions all the time is trickier.
Why would x86 L1 be limited by the page size? The cache works at 64-byte granularity. If it's too big it might increase TLB pressure, I suppose, but it makes sense that increases in L1 size would be best accompanied by increases in TLB size.