The whole point of this thread is realizing the opposite. The slower algorithm executes only 2 instructions per loop iteration, but the second one depends directly on the result of the first (which induces a pipeline stall), while the fast version brute-forces a ton of instructions at full CPU IPC.
If you look carefully at the generated assembly shown in the article, the vectorized loop actually executes fewer instructions in total: each iteration is longer, but it more than makes up for that by iterating fewer times.
Edit: btw, while the fast version has no loop-carried dependencies and uses 4x SIMD, it is only twice as fast.
I wonder if a manually unrolled (with 4 accumulators) and vectorized version of the strength-reduced one could be faster still.
If you look carefully you will find that the slower algorithm uses fewer instructions, and the faster algorithm uses 2x SIMD and more than twice as many instructions.
And yes, unrolling the loop-carried dependency in the strength-reduced version will certainly make it faster, as that dependency is the only reason it's slower to begin with.
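For anyone who wants to try it, here is a rough sketch of what "4 accumulators" could look like. It assumes the loop being discussed fills an array with a quadratic A*i*i + B*i + C; the function names and the quadratic itself are my own illustration, not code from the article. Each of the 4 lanes runs its own difference chain with stride 4, so consecutive additions no longer wait on each other. I have not benchmarked this.

```c
#include <stddef.h>

/* Direct evaluation: every element is computed independently,
 * so there is no loop-carried dependency. */
void poly_direct(double *out, size_t n, double A, double B, double C) {
    for (size_t i = 0; i < n; i++) {
        double x = (double)i;
        out[i] = A * x * x + B * x + C;
    }
}

/* Strength-reduced: two additions per element, but each addition
 * depends on the value produced by the previous iteration. */
void poly_sr(double *out, size_t n, double A, double B, double C) {
    double y = C;       /* p(0) */
    double z = A + B;   /* p(1) - p(0) */
    for (size_t i = 0; i < n; i++) {
        out[i] = y;
        y += z;         /* p(i+1) = p(i) + first difference */
        z += 2.0 * A;   /* first difference grows by 2A per step */
    }
}

/* Strength-reduced with 4 independent chains (hypothetical variant).
 * Lane k produces elements k, k+4, k+8, ... */
void poly_sr4(double *out, size_t n, double A, double B, double C) {
    double y[4], z[4];
    for (int k = 0; k < 4; k++) {
        y[k] = A * k * k + B * k + C;           /* p(k) */
        z[k] = A * (8.0 * k + 16.0) + 4.0 * B;  /* p(k+4) - p(k) */
    }
    const double dz = 32.0 * A;  /* second difference at stride 4 */

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (int k = 0; k < 4; k++) {
            out[i + k] = y[k];
            y[k] += z[k];
            z[k] += dz;
        }
    }
    /* Tail: y[k] already holds p(i + k) for the leftover elements. */
    for (int k = 0; i < n; i++, k++) {
        out[i] = y[k];
    }
}
```

The inner k loop has a fixed trip count of 4 and no dependency between lanes, so a compiler is free to unroll it or map it onto SIMD adds. Whether it actually does depends on the compiler and flags. Note also that with floating point the accumulated results can drift from the direct evaluation due to rounding, which may matter for large n.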
I might be miscounting somehow, but it seems to me that a loop iteration of the slow implementation issues 6 instructions, while one of the fast implementation issues 21. But the fast loop is unrolled 4 times (compare the loop counter increments), so it iterates a quarter as many times. That works out to 21/4 = 5.25 instructions per element versus 6, so over the whole loop it actually ends up issuing slightly fewer instructions.
edit: to be clear, I'm only arguing about two things that the original parent questioned: whether vectorization is free (it is, because wider ALUs need fewer instructions) and whether the second loop uses more instructions (it does not, as it is unrolled by 4).
I somehow got tripped up by the 4x SIMD. I was assuming you meant it's using 4x 64-bit SIMD there, which it doesn't. mulpd and addpd are 2x 64-bit, which is also visible from the use of xmm rather than ymm registers.
I got sloppy on the difference between all instructions including the loop logic vs just the instructions necessary to do the main computation. Obviously the first is the correct measure and I was sloppy.
I think the confusion might be coming from mixing "instructions" with "operations", which are not equivalent especially since we are discussing SIMD which packs in multiple operations (single instruction operating on multiple data). If you re-read this thread and replace instructions with operations it makes more sense.