
Vectorisation is not free. There is one other dimension to optimise for: power.

The suggested "slower" optimisation fundamentally uses fewer instructions. Chucking more hardware at parallelisable problems makes them run faster, but it does not necessarily reduce the power requirement much, because fundamentally the same number of operations is performed; the same power is just gobbled up over a shorter period of time. The "slower" serial algorithm uses fewer instructions and, in theory, less power in total. Disclaimer: I mean power vs speed fundamentally, in theory; in practice there may be no measurable difference depending on the particular CPU and code.



Do you have a reference for that? I googled and I couldn't find anything good.

While it might sound intuitive that SIMD instructions consume more power, I don't think that's necessarily true to a relevant degree in practice. My understanding is that CPU power consumption is mostly tied to inefficiencies that cause energy loss via heat, while the actual computation doesn't consume any energy per se. So electrons travelling a more complex path probably cause somewhat more energy loss, as there is more wire/transistor area to pass through, but most of the total loss doesn't actually occur in the ALU. Empirically, from what you can see operating systems do, the most effective way of consuming less power is running at a slower clock speed, and the most effective way to achieve that is getting work done faster; that's not tied to the number of instructions.

The Stack Overflow question here [1] seems to suggest that SIMD vs no SIMD has a negligible overhead compared to entering a lower power state sooner.

[1] https://stackoverflow.com/questions/19722950/do-sse-instruct...


I think in summary what you are alluding to is instruction decoding and scheduling as the sibling comment points out, which is indeed a large cost in both speed and power.

> SIMD vs no SIMD has a negligible overhead compared to entering a lower power state sooner.

Yes, on the same CPU, as I suggested, the real-world difference may be unmeasurable. However, note that this particular case is interesting because it's not comparing fewer serial multiplies to more SIMD multiplies; it's comparing SIMD multiplies to no multiplies at all, but with a serial constraint due to variable dependence. I.e. it's SIMD vs no multiply, without any other difference in the number or type of ALU ops. Again, that could make no difference on a big x86 in practice, but it would be interesting to know.

All of this changes if you are coding for a lower-power device and have a choice of hardware.


But you’re not counting static power of the CPU and the whole system. It is possible for the vectorized code to consume less power overall because it can complete faster, using less static power.


We mean energy, right? The surprising fact is that the energy cost of executing an instruction (scheduling etc.) is much higher than the actual operation. Thus SIMD amortizes that per-instruction overhead over multiple elements, leading to 5-10x gains ( https://pdfs.semanticscholar.org/f464/74f6ae2dde68d6ccaf9f53...).

Even in this example, which apparently has 4x vector instructions vs 1x scalar, AVX-512 (and probably even AVX2) would reduce energy usage because they can do 8 (or 4 for AVX2) 64-bit elements per cycle.


> The surprising fact is that the energy cost of executing an instruction (scheduling etc.) is much higher than the actual operation.

Good point, decoding and scheduling are expensive and SIMD certainly eliminates a lot of that. However the alternative algorithm has even less decoding and scheduling to do, since it completely eliminates the multiplication operations without increasing the number of additions. But even then I wouldn't be surprised that it makes no difference to energy on any x86, as I said it was more of a fundamental observation - for a different application this is useful when selecting the actual hardware if you are not already constrained to a particular chip.


The slower algo will use more instructions, as it will have to run for more iterations. The faster algo will use wider ALUs that are more power hungry, so it appears a wash. But because fewer instructions are in flight, less power needs to be spent keeping track of them or renaming registers, caches need to be powered for a shorter amount of time, and so on.


>The slower algo will use more instructions

The whole point of this thread is the realization of the opposite. The slower algo executes only 2 instructions in the loop, but the second one directly depends on the result of the first (inducing a pipeline stall), while the fast version brute-forces a ton of instructions at full CPU IPC.


If you look carefully at the generated assembly shown in the article, the vectorized loop actually executes fewer instructions in total: while an iteration is longer, it more than makes up for it by iterating fewer times.

Edit: btw, while the fast version has no loop-carried dependencies and uses 4x SIMD, it is only twice as fast.

I wonder if a manually unrolled (with 4 accumulators) and vectorized version of the strength reduced one could be faster still.

The compiler won't do it without fast math.


If you look carefully you will find that the slower algorithm uses fewer instructions, and the faster algorithm uses 2x SIMD and more than twice as many instructions.

And yes unrolling the loop carried dependency on the strength reduced version will certainly make it faster as it's the only reason it's slower to begin with.

I encourage you to read my other comment here https://news.ycombinator.com/item?id=31551375


I might be somehow miscounting, but it seems to me that for the slow implementation a loop iteration issues 6 instructions, while for the fast one it issues 21; but the fast one is unrolled 4 times (compare the loop counter increment), so it iterates one fourth of the times, and for the whole loop it ends up actually issuing slightly fewer instructions.

edit: to be clear, I'm only arguing about two things that the original parent questioned: whether vectorisation is not free (here it is free, because wider ALUs require fewer instructions) and whether the second loop uses more instructions (it does not, as it is unrolled by 4).


oh you're right, my bad. Sloppy on my part!

I somehow got tripped up by the 4x SIMD. I was assuming you meant it's using 4x 64-bit SIMD there, which it doesn't: mulpd and addpd are 2x 64-bit, also visible by the xmm instead of ymm registers.

I got sloppy about the difference between all instructions, including the loop logic, vs just the instructions needed for the main computation. Obviously the first is the correct measure.


I think the confusion might be coming from mixing up "instructions" with "operations", which are not equivalent, especially since we are discussing SIMD, which packs multiple operations into one instruction (single instruction operating on multiple data). If you re-read this thread and replace "instructions" with "operations", it makes more sense.


Unless we're talking extremes (AVX-512...), optimizing for power on modern CPUs is almost always optimizing the race to idle.

I.e. the CPU doesn't use much less power if most ALUs are idle, so you might as well use all of them to compute redundant results -> and then turn ~everything off sooner.



