Interesting read. One thing I don’t understand is how much space the loop buffer takes up on the die. With it removed, could future chips use that space for something more useful, like a bigger L2 cache?
I think most modern chips are routing constrained, not floorspace constrained. You can build tons of features, but getting power and normalized signals to all of them is an absolute chore.
My understanding is that it's a pretty small optimization on the front end. It doesn't have a lot of entries to begin with (144) so the amount of space saved is probably negligible. Theoretically, the loop buffer would let you save power or improve performance in a tight loop. In practice, it doesn't seem to do either, and AMD removed it completely for Zen 5.
Judging from the diagrams, the loop buffer is using the same storage as the micro-op queue that's there anyway. If that is accurate (and it does seem plausible), then the area cost is just some additional control logic. I suspect the most expensive part is detecting a loop in the first place, but that's probably quite small compared to the size of the queue.
It says 144 micro-op entries per core. Not sure how many bytes that is, but L2 caches these days are around 1MB per core, so assuming the loop buffer die space is mostly storage (sounds like it) then it wouldn't make a notable difference.
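To put rough numbers on that, here is a back-of-envelope sketch. The 144 entries and the roughly 1 MB L2 come from the thread; the bytes-per-micro-op widths are pure guesses on my part, since I don't know the real encoding size. It also only counts raw bits, and a multi-ported queue is likely less dense than cache SRAM, so treat it as an order-of-magnitude estimate at best.

```python
# Back-of-envelope: raw storage of the micro-op queue vs. an L2 cache.
ENTRIES = 144                # micro-op queue entries per core (figure from the thread)
L2_BYTES = 1 * 1024 * 1024   # roughly 1 MB of L2 per core


def storage_fraction(bytes_per_uop: int) -> float:
    """Fraction of L2 capacity that ENTRIES micro-ops would occupy."""
    return ENTRIES * bytes_per_uop / L2_BYTES


# ASSUMPTION: the micro-op width is unknown here; these widths are
# hypothetical values chosen only to bracket the estimate.
for width in (8, 16, 32):
    total = ENTRIES * width
    print(f"{width:>2} B/uop -> {total} B, {storage_fraction(width):.2%} of L2")
```

Even at the widest guess, the queue's raw storage comes out well under 1% of the L2 capacity, which fits the "wouldn't make a notable difference" conclusion.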