Your example is incorrect. But even if it were correct, a much better way of visualizing this is by unrolling the loop (as is done in my pdqsort implementation): https://godbolt.org/g/qWkmG4
The odd writing style of first doing all comparisons is intentional - it increases the parallelism and reduces the data dependency in the generated code. Note how all the colors are mixed through eachother: this is a good sign.
Also note that there are no branches at all in the generated code. It's 'straight' code that the CPU can just march through at full speed. This is what makes it so stupidly fast.
I think you misunderstood the OP (or maybe I have?) - it wasn't about parallelism, just about that one use of a comparison inside the addition. I showed that CPUs don't need to generate conditional jumps for that, that's all.
EDIT: To clarify, my example was completely contrived, I just reused your variable names for some sense of familiarity.
Exactly, that's what I asked about. I didn't know about setl and thus couldn't figure out how to get around the conditional branch. Thanks for the explanation!
The odd writing style of first doing all comparisons is intentional - it increases the parallelism and reduces the data dependency in the generated code. Note how all the colors are mixed through eachother: this is a good sign.
Also note that there are no branches at all in the generated code. It's 'straight' code that the CPU can just march through at full speed. This is what makes it so stupidly fast.