The operations are somewhat different though. Store-to-load forwarding is more complicated and doesn't completely eliminate the operation; it just significantly reduces the cycle count when successful.
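To make that concrete, here's a rough C sketch (hypothetical names; real codegen depends on the compiler and flags) of the store/reload pattern that forwarding targets:

    /* Hedged sketch: the value round-trips through a stack slot.  The
       reload shortly after the store can be satisfied by store-to-load
       forwarding instead of waiting for the store to reach the cache,
       but it still costs a few cycles, unlike a register-to-register
       move that renaming can often make effectively free. */
    long round_trip(long v) {
        volatile long slot;   /* forces the value through memory */
        slot = v;             /* store to the stack slot */
        return slot + 1;      /* reload soon after: forwarding candidate */
    }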
Since all registers are used, and all but two instructions are dependent, the blocks have to follow one another in the assembly. There's also spilling of the b, c, d variables, which then have to be reloaded from memory (loads that could otherwise be elided). Assuming no re-order buffer, these instructions run in three cycles (the first two are independent), even though the top-level statements are independent.
If you want to run all of the statements 4 instructions at a time, you need a reorder buffer that covers the whole sequence (12 instructions). (Imagine if b,c,d get modified inside the inner loop and spilled into memory, you have to track memory locations in order to do register renaming.)
Now let's assume you have 6 registers. All variables fit in registers and the compiler can easily interleave the code, giving a sequence of 3 or 4 independent instructions at a time. If you want to run 4 instructions at the same time, you need no reorder buffer.
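For illustration, a hedged sketch of the shape I mean (not the exact example, and the instruction counts will differ):

    /* Illustrative only: three independent statements, each a short
       dependent chain over b, c, d. */
    long f(long b, long c, long d) {
        long x = (b + c) * d;   /* statement 1 */
        long y = (c + d) * b;   /* statement 2 */
        long z = (d + b) * c;   /* statement 3 */
        return x + y + z;
    }
    /* With only ~3 usable registers the temporaries of one statement
       occupy them all, so the statements are emitted back to back and
       b, c, d may get spilled around them.  With 6 registers the
       compiler can schedule the three adds together and then the three
       multiplies, so independent instructions sit next to each other
       without any hardware reordering. */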
This is a fairly specific example, but it shows that if you have more registers (e.g. ARM vs x86), the compiler can more easily interleave instructions, which can help reduce the number of instructions that need to be in the reorder buffer. Or, with the same size re-order buffer, it's easier to find more independent instructions and keep all the execution units fed. Or, when jumping to code that's not yet in the pipeline or icache, it lets the core run more instructions in parallel sooner, while only a small number of instructions have been decoded into the re-order buffer.
I really don't see what you're getting at here. Even limited to only three named registers, I don't think the example you provided would pose an issue on x86. (I'm not very familiar with ARM but I don't think it would pose any issue there either.)
In practice, x86_64 works just fine for HPC number-crunching code. Outside of some serious number crunching, when are you going to have more live values than named registers, have instruction streams whose output depends on _all_ of those values (which is why they would be live), and also have those streams complete so quickly that you stall on the next set of loads? And have absolutely no other useful work to do? Honestly I think you're being silly.
Historically, I understand that the 32-bit version of x86 did have scheduling challenges surrounding function calls. The 64-bit version of the ISA expanded the number of named registers and (as far as I understand things) that largely resolved the issue.
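A rough illustration of why (hedged; exact codegen depends on the compiler, the flags, and the calling convention):

    /* Hedged sketch: a small call-heavy function.  Under the common
       32-bit x86 conventions the arguments travel through the stack and
       there are only a handful of registers to begin with, so values
       tend to bounce through memory around each call site.  Under the
       x86-64 SysV convention the first several integer arguments arrive
       in registers, which removes much of that traffic. */
    long add3(long a, long b, long c) {
        return a + b + c;
    }

    long call_twice(long a, long b, long c) {
        return add3(a, b, c) + add3(c, b, a);
    }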
Also note that typical hardware can sustain a surprisingly large number of loads per clock. You just need to find something useful to do while you wait for the load to complete. In case you really can't, there's also SMT. Really though, the PRF and ROB are only so large.
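As a hedged sketch of "find something useful to do" (whether the compiler does this for you depends on flags; FP reassociation usually needs something like -ffast-math):

    #include <stddef.h>

    /* Sketch: two independent accumulator chains give the core useful
       arithmetic to overlap with loads that are still in flight, rather
       than serializing every iteration behind a single dependency chain. */
    double dot(const double *a, const double *b, size_t n) {
        double s0 = 0.0, s1 = 0.0;
        size_t i = 0;
        for (; i + 1 < n; i += 2) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
        }
        if (i < n)                    /* odd leftover element, if any */
            s0 += a[i] * b[i];
        return s0 + s1;
    }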
> If you want to run 4 instructions at the same time, you need no reorder buffer.
You always need a reorder buffer if you want to achieve good performance. Among other issues, the compiler can't predict the latency of each load in advance, because caching behavior depends on the runtime state of the full computer system. I previously mentioned Itanium. It's directly relevant here.
> Imagine if b,c,d get modified inside the inner loop and spilled into memory, you have to track memory locations in order to do register renaming.
No. You can't just rename registers any longer. A store to memory means the memory model for the ISA gets involved. Things become significantly more complicated. The store buffer exists specifically to deal with such issues efficiently on an OoO core. Seriously, go read about it. It's astoundingly complicated for any OoO core regardless of the ISA.
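The textbook illustration is the store-buffering litmus test; a minimal sketch with C11 atomics (relaxed ordering on purpose, and each function is meant to run on its own thread):

    #include <stdatomic.h>

    atomic_int x, y;          /* both start at 0 */
    int r1, r2;

    void thread_a(void) {
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
    }

    void thread_b(void) {
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
    }

    /* Each thread's store can sit in its store buffer while that
       thread's own load executes, so even on x86 the outcome
       r1 == 0 && r2 == 0 is allowed unless you use seq_cst stores or
       an explicit fence.  Deciding which in-flight loads may legally
       see which buffered stores is part of what makes this machinery
       so hairy. */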
> the compiler can more easily interleave instructions, which can help reduce the number of instructions that need to be in the reorder buffer
Unless I have a serious misunderstanding (I don't design hardware, so I might), everything passes through the reorder buffer. Every instruction is speculative until all previous instructions have retired. (https://news.ycombinator.com/item?id=20165289)
Although apparently Zen 2 changed this and can pull off zero-latency store forwarding. (https://www.agner.org/forum/viewtopic.php?t=41)
Some general background: (https://travisdowns.github.io/blog/2019/06/11/speed-limits.h...)