That doesn't make any sense. The ROB comes after instructions have been cracked into uops; the internal format and length of uops are "whatever is easiest for the design", since they're not visible to the outside world.
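
To illustrate (a purely hypothetical sketch, not any real chip's encoding): internally a uop can be whatever record the designers find convenient, since it never leaves the core, e.g. something like:

    #include <stdint.h>

    /* Illustrative only: nothing forces this to resemble the
       architectural instruction encoding, and its width (60 bits,
       120 bits, whatever) is a private design choice. */
    typedef struct {
        uint16_t opcode;          /* internal operation code */
        uint8_t  dst, src1, src2; /* physical register numbers */
        int32_t  imm;             /* immediate operand, if any */
        uint8_t  flags;           /* e.g. "sets condition codes" */
    } uop;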

This argument does apply to the L1 instruction cache, which sits before decode. (It does not apply to uop caches/L0 caches, though it is related to them: they are most useful for CISCy designs, whose instructions decode in complicated ways into many uops.)

Maybe it wasn't clear, but the article I linked is saying that compared to M1, x86 architectures are decode-limited, because parallel decoding with variable-length instructions is tricky. Intel and AMD (again according to the linked article) have at most 4 decoders, while M1 has 8.
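
A minimal sketch of why that's tricky (all names and the toy length rule here are made up for illustration): with fixed 4-byte instructions every decoder knows its start offset up front, whereas with variable-length instructions each decoder has to wait for the lengths of all earlier instructions in the fetch window:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define N_DECODERS 8

    /* Fixed-width (aarch64-style): decoder i reads offset 4*i,
       independent of every other instruction, so all decoders
       can run in parallel. */
    static void decode_fixed(const uint8_t *window) {
        for (int i = 0; i < N_DECODERS; i++)
            printf("decoder %d: byte 0x%02x\n", i, (unsigned)window[4 * i]);
    }

    /* Toy length rule standing in for x86's complex encoding. */
    static size_t toy_length(const uint8_t *insn) {
        return (insn[0] & 0x3) + 1; /* pretend low bits give 1..4 bytes */
    }

    /* Variable-length (x86-style): instruction i's start isn't known
       until lengths 0..i-1 are, so length-finding is inherently serial
       (real designs speculate lengths or cache instruction boundaries). */
    static void decode_variable(const uint8_t *window) {
        size_t off = 0;
        for (int i = 0; i < N_DECODERS; i++) {
            size_t len = toy_length(window + off); /* serial dependency */
            printf("decoder %d: offset %zu, len %zu\n", i, off, len);
            off += len;
        }
    }

    int main(void) {
        uint8_t window[64] = {0x03, 0x11, 0x22, 0x33, 0x00, 0x42,
                              0x01, 0x99, 0x02, 0xaa, 0xbb, 0x03};
        decode_fixed(window);
        decode_variable(window);
        return 0;
    }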

So yes, the ROB is after decoding, but surely there's little point in making the ROB larger than the decoders can keep reasonably full.
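
Back-of-the-envelope (illustrative numbers, not any real chip's specs): the useful ROB size scales with how many uops the front end sustains per cycle times how many cycles of latency you want to hide, which is why front-end width and ROB size tend to grow together:

    #include <stdio.h>

    int main(void) {
        int decode_width = 8;  /* uops delivered per cycle (made up) */
        int latency      = 80; /* cycles of cache-miss latency to hide */

        /* The ROB must hold everything issued while waiting, so: */
        printf("~%d ROB entries\n", decode_width * latency); /* ~640 */
        return 0;
    }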

Well-intentioned as that article may be, it makes plenty of mistakes. For a rather glaring one: no, uops are not tied to out-of-order (OoO) execution.

Secondly, it ignores the existence of the uop caches that x86 designs use precisely so they don't need such wide decoders. Some ARM designs also use uop caches, FWIW, since they can be more power-efficient.
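
Rough sketch of the uop-cache idea (structure and names are hypothetical): on a hit the front end streams already-decoded uops and the expensive variable-length decoders sit idle, which is why a design can get away with fewer of them and save power:

    #include <stdint.h>
    #include <string.h>

    typedef struct { uint32_t bits; } uop; /* opaque internal format */

    #define WAYS      64
    #define LINE_UOPS 6

    typedef struct {
        uint64_t tag;     /* fetch address this line was built from */
        int      valid, n;
        uop      uops[LINE_UOPS];
    } uop_line;

    static uop_line cache[WAYS];

    /* Stand-in for the real decoders (hypothetical). */
    static int decode_x86(uint64_t pc, uop *out) {
        out[0].bits = (uint32_t)pc; /* pretend we decoded one uop */
        return 1;
    }

    /* Hit: skip decode entirely. Miss: decode once, fill the line. */
    int fetch_uops(uint64_t pc, uop *out) {
        uop_line *line = &cache[(pc >> 4) % WAYS];
        if (!line->valid || line->tag != pc) {
            line->n     = decode_x86(pc, line->uops);
            line->tag   = pc;
            line->valid = 1;
        }
        memcpy(out, line->uops, line->n * sizeof(uop));
        return line->n;
    }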

That doesn't mean fixed-width decoding, as on aarch64, isn't an advantage; it certainly is. And the M1 is a very impressive design, though it also helps that it's fabbed on the latest and greatest process.
