I really don't see what you're getting at here. Even limited to only three named registers I don't think the example you provided would pose an issue on x86. (I'm not very familiar with ARM but I don't think it would pose any issue there either.)
In practice, x86_64 works just fine for HPC number crunching code. Outside of some serious number crunching, when are you going to have more live values than named registers, have instruction streams whose output depends on _all_ of those values (which is why they would be live), and also those streams complete so quickly that you stall on the next set of loads? And you have absolutely no other useful work to do? Honestly I think you're being silly.
Historically, I understand that the 32-bit version of x86 did have scheduling challenges surrounding function calls. The 64-bit version of the ISA expanded the number of named registers and (as far as I understand things) that largely resolved the issue.
Also note that typical hardware can sustain a surprisingly large number of loads per clock. You just need to find something useful to do while you wait for the load to complete. In case you really can't there's also SMT. Really though, the PRF and ROB are only so large.
> If you want to run 4 instructions at the same time, you need no reorder buffer.
You always need a reorder buffer if you want to achieve good performance. Among other issues, the compiler can't predict the latency for each load in advance due to caching behavior depending on the runtime state of the full computer system. I previously mentioned Itanium. It's directly relevant here.
> Imagine if b,c,d get modified inside the inner loop and spilled into memory, you have to track memory locations in order to do register renaming.
No. You can't just rename registers any longer. A store to memory means the memory model for the ISA gets involved. Things become significantly more complicated. The store buffer exists specifically to deal with such issues efficiently on an OoO core. Seriously, go read about it. It's astoundingly complicated for any OoO core regardless of the ISA.
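To make the store buffer point a bit more concrete, here is a deliberately tiny Python sketch of the idea. Everything in it (the `StoreBuffer` class, its method names, the dict standing in for memory) is made up for illustration and is not how any real core is built. It only shows that speculative stores wait in a queue, younger loads can be forwarded from that queue, and memory changes only when a store retires. Real hardware also has to cope with addresses that aren't known yet, partial overlaps, and the ISA's memory ordering rules, none of which is modeled here.

```python
class StoreBuffer:
    """Toy store buffer: hypothetical names, not a model of any real core."""

    def __init__(self, memory):
        self.memory = memory   # committed (architecturally visible) state
        self.pending = []      # speculative stores, oldest first: (addr, value)

    def store(self, addr, value):
        # A store executes speculatively: it is queued, memory is not touched.
        self.pending.append((addr, value))

    def load(self, addr):
        # Store-to-load forwarding: the youngest pending store to this
        # address wins; otherwise fall back to committed memory.
        for pending_addr, value in reversed(self.pending):
            if pending_addr == addr:
                return value
        return self.memory.get(addr, 0)

    def retire_oldest(self):
        # The oldest store becomes architecturally visible at retirement.
        addr, value = self.pending.pop(0)
        self.memory[addr] = value

    def squash(self):
        # Misspeculation: drop every pending store, memory is untouched.
        self.pending.clear()
```

Used like this, a store to `0x10` is visible to a later load straight away, but the dict passed in as `memory` keeps its old value until `retire_oldest()` runs, and `squash()` makes the store vanish as if it never happened.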
> the compiler can more easily interleave instructions, which can help reduce the number of instructions that need to be in the reorder buffer
Unless I have a serious misunderstanding (I don't design hardware, so I might) everything passes through the reorder buffer. Every instruction is speculative until all previous instructions have retired. (https://news.ycombinator.com/item?id=20165289)
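As a rough illustration of what I mean by in-order retirement, here's a toy Python sketch. The function name `simulate_rob` and the model itself are my own simplification: one instruction issues per cycle in program order, the instructions are assumed independent, each finishes after its own latency, and retirement width is unlimited. The only thing it demonstrates is that results can complete out of order while retirement still follows program order.

```python
def simulate_rob(latencies):
    """Toy model of in-order retirement (illustrative only).

    Instruction i issues at cycle i and finishes at cycle i + latencies[i].
    Returns (completion_order, retire_cycle), where completion_order lists
    instruction indices in the order their results become ready and
    retire_cycle[i] is the earliest cycle instruction i can retire.
    """
    finish = [issue + lat for issue, lat in enumerate(latencies)]

    # Results become ready in whatever order the latencies dictate.
    completion_order = sorted(range(len(finish)), key=lambda i: (finish[i], i))

    # An instruction can only retire once it has finished AND every
    # older instruction has already retired.
    retire_cycle = []
    earliest = 0
    for done in finish:
        earliest = max(earliest, done)
        retire_cycle.append(earliest)

    return completion_order, retire_cycle
```

For example, `simulate_rob([1, 200, 1, 1])` models a cache-missing load in slot 1. Instructions 2 and 3 finish long before it does, but in this model none of them can retire until cycle 201, so they stay speculative the whole time.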