Why does this code execute more slowly after strength-reducing multiplications? (stackoverflow.com)
434 points by ttsiodras on May 29, 2022 | hide | past | favorite | 148 comments


In the post, multiplications and 2 additions are not faster than 2 additions.

The post compares (1) loop code that can be vectorized, as loop rounds are independent and do not depend on the result from the previous round, and (2) an "optimization" that makes calculations shorter, but also makes each loop round depend on the result of the previous round, so this cannot be vectorized.
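
For readers who haven't clicked through, the two shapes being compared look roughly like this (a reconstructed sketch; the exact code, with data and LEN defined as globals, is in the linked post):

    /* (1) independent iterations: each data[i] is a pure function of i,
       so the compiler can vectorize and the CPU can pipeline freely */
    void compute(void)
    {
        const double A = 1.1, B = 2.2, C = 3.3;
        for (int i = 0; i < LEN; i++)
            data[i] = A*i*i + B*i + C;
    }

    /* (2) strength-reduced: fewer arithmetic ops per element, but Y and Z carry
       state from one iteration to the next (a loop-carried dependency) */
    void compute_strength_reduced(void)
    {
        const double A = 1.1, B = 2.2, C = 3.3;
        double Y = C, Z = A + B;       /* f(0), and f(1) - f(0) */
        for (int i = 0; i < LEN; i++) {
            data[i] = Y;
            Y += Z;                    /* depends on the previous Y and Z */
            Z += 2*A;                  /* the second difference is constant */
        }
    }

The rest of the thread is about why (2), despite doing less arithmetic per element, loses to (1) on a modern out-of-order CPU.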


More generally, “stop storing state, unless it makes the program less complicated.”

The first version is simple. The second version is more complicated. The simplicity is because the first version can be represented as a formula; imagine trying to write out the second version as a formula, and the complexity becomes obvious.

(The addition loop would have to be written as a recurrence relation, which of course means every step depends on the previous step.)

Complexity isn’t always obvious. It’s common in C programs to write a loop like dst[index++] = src[i], particularly in deeply nested for-loops. In my experience it’s almost always worth rewriting it so that the indices are computable entirely from the loop iteration variables, with no state. It helps you build a mental map of the operations, because you can think of it in terms of geometry (memcpy = copying rectangles onto other larger rectangles) whereas when the index is stateful it becomes quite a lot harder to visualize in higher dimensions. At least for me.
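
A tiny illustration of the difference, using a hypothetical 2-D copy (all names and dimensions are made up; the source tile is SW x SH, the destination image is DW x DH):

    #include <string.h>

    /* stateful: a running source index -- the mapping from (y, x) to memory
       only exists in the counter's history */
    void blit_stateful(double *dst, const double *src, int DW, int SW, int SH)
    {
        int s = 0;
        for (int y = 0; y < SH; y++)
            for (int x = 0; x < SW; x++)
                dst[y * DW + x] = src[s++];
    }

    /* stateless: every index is a pure function of the loop variables */
    void blit_stateless(double *dst, const double *src, int DW, int SW, int SH)
    {
        for (int y = 0; y < SH; y++)
            for (int x = 0; x < SW; x++)
                dst[y * DW + x] = src[y * SW + x];
    }

    /* the stateless form makes the geometry explicit: each row is just a memcpy */
    void blit_rowwise(double *dst, const double *src, int DW, int SW, int SH)
    {
        for (int y = 0; y < SH; y++)
            memcpy(&dst[y * DW], &src[y * SW], SW * sizeof *dst);
    }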

We’ve been building a compiler at Groq for our custom ML hardware. (Roughly, “turn pytorch into our ISA with no extra programmer effort.”) I used to think of posts like this as “Well, modern compilers are so fast, who really cares about autovectorization?” — it turns out you care when you need to write your own compiler. :)

MLIR is pretty cool. It makes a lot of transformations like this pretty easy. The MLIR “krnl” dialect can also automatically transform nested loops into tiled iteration. Graphics devs will know what I mean — no need to loop over 8x8 blocks of pixels manually, just write “for x in range(width): for y in range(height): …” and set the block size to 8.
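
For anyone who hasn't seen tiling before, the hand-written version of that transformation looks something like this (a sketch; image, its dimensions, and the 8x8 block size are placeholders, and width/height are assumed to be multiples of 8):

    /* untiled: visit pixels in plain row-major order */
    void brighten_untiled(float *image, int width, int height)
    {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                image[y * width + x] += 1.0f;
    }

    /* tiled by hand into 8x8 blocks -- the transformation the dialect can apply
       automatically once you declare the block size */
    void brighten_tiled(float *image, int width, int height)
    {
        for (int by = 0; by < height; by += 8)
            for (int bx = 0; bx < width; bx += 8)
                for (int y = by; y < by + 8; y++)
                    for (int x = bx; x < bx + 8; x++)
                        image[y * width + x] += 1.0f;
    }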


> stop storing state, unless it makes the program less complicated

I had hoped that functional programming languages would lead us to automatic high-level parallelization. That hasn't happened much in the real world. But what we got is "write code like a functional programmer would even in a low-level language and the compiler and the ISA will parallelize it for you." That surprises me, but I'll take it.


Yeah I mean any functional operation on a list (I'm thinking of JS mainly) should be parallelizable, assuming the interpreter can determine that the order of execution does not matter.

I recall that Scala was the first language that introduced FP to me, and it required the developer to explicitly mark a functional operation as parallelizable, after which the compiler and runtime would take care of the rest.

This was implemented in the extreme in a project that allowed you to write a functional style series of operations, which were translated to Hadoop map and reduce jobs, each of which could be executed in parallel at large scale. So very compact and succinct code, but very powerful.


It's because under the hood modern processors are embarrassingly parallel, so chip manufacturers like Intel were incentivized to put money into solving the hard problem "turn code written by devs who were trained on old serial-style languages (like C) into something that can pass a benchmark or we can't justify selling new chips because programs won't be faster."

The relative novelty and paucity of functional languages in the ecosystem means that incentive story hasn't yet played out to make the end-to-end path from compiler through to execution as tight. I expect it'll happen, and when it does I expect the result will trigger fewer questions like this one, because a newer generation of programmers will be less inclined to assume serial execution.


I think it's that it's easier to detect a lack of data dependencies in C than it is to write an efficient Haskell compiler. Well, that and the fact that there are at least an order of magnitude more people working on the former than on the latter.


memcpy is more like copying a single line; to copy a rectangle you need to call it in a loop or use specialized blit hardware/software.

How is MLIR different from LLVM IR?

And tile/block processing is not that novel for generic compute, graphics use it for other reasons - mostly how the pixel data is in memory and how the pipeline scales with tile size.


LLVM is geared towards providing a single toolchain for frontends to target, but is essentially made to target von Neumann/serial-ish/homogenous machines. You can get an overview of its philosophy in Lattner’s original 2004 paper [1]. MLIR (Lattner 2021 [2]) aims to generalize the compiler toolchain to support different compute paradigms and heterogeneous targets (polyhedral compilation/parallel kernels/affine loops) and so on, by defining a declarative framework of IR _dialects_ and transforms between them. I’m writing my MSc thesis on using C/C++ as the IR, assuming it has enough compiler support for different targets to provide a suitable platform for optimisation transforms using a source-to-source compiler, but MLIR seems like a better first-principles approach, maybe suffering from the impracticality of using it _right now_.

[1] "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation", Lattner and Adve. https://llvm.org/pubs/2004-01-30-CGO-LLVM.pdf

[2] “MLIR: Scaling Compiler Infrastructure for Domain Specific Computation”, Lattner et al. https://ieeexplore.ieee.org/document/9370308


> In the post, multiplications and 2 additions are not faster than 2 additions.

Reminded me of when I tried to optimize some code using assembly on x86, maybe 8-10 years ago. The compiler spit out a DIV [EDI] instruction which was inside a hot loop. I saw how I could easily rewrite it to use a register instead of the indirect memory access.

So I did, and it was significantly slower, like 10-15%... I even took care to align the loop target address and so on. If I changed my code to use the memory indirection it was measurably faster.

And with that I decided hand-optimizing x86 assembly was definitely a skill to leave behind.


It would be interesting to test it on 486, Pentium, and Pentium Pro, as you get much different execution and caching behavior between those architectures.


You probably added an extra instruction or two to put the value in a register? The CPU can already split memory accesses into uops and cached accesses are fast, so there's no point in doing that because it'll just waste an additional register (vs. using one of the many the renamer generates) and add instructions to decode.

x86 is fundamentally a CISC; if you treat it like a RISC, it will definitely disappoint.


Not really, the compiler did the round-trip into the local variable (memory), I did it via register.

I asked around at the time and someone mentioned that I might have overtaxed certain execution ports or something like that, but yeah that just cemented my belief that x86 optimization is not my cup of tea anymore. Better to spend time learning how to write code the compiler can optimize well.


It's hard to know exactly without staring at the code and knowing the exact CPU, because nothing stands out as a red flag. If you used an extra register, maybe you caused a spill (pushed something else out of registers into memory). Maybe you made the loop code bigger and it no longer fit in the loop stream buffer (if that model had one). Maybe you hit a weird frontend decode issue and it could only decode N-1 instructions per clock in that line instead of N, and that was critical to the loop's performance. Maybe your code layout changed for another reason, or the memory layout, and you got some bad cache aliasing.

These things are knowable if you have enough curiosity and maybe masochism :)


What made me give up was when I found that an optimization for one x86 processor was an impairment on another. Your DIV example might have had a different outcome on an AMD chip, or on the next- or previous-generation part from Intel.

Nowadays, counting cycles is a game for monks who don't actually have to get anything done.


The compiler doesn’t know anything about optimizing x86 code either. The actual details there are too secret for Intel to want to accurately describe them in gcc, they’re different across different CPUs, and compilers just aren’t as good as you think they are.

(But CPUs are usually better than you think they are.)


It's not so secret, TBH. Usually the intel microarchitecture manuals are detailed enough to describe how many and what type of execution ports there are, how many stages in the pipeline, the size of the reorder buffer, latency of most u-ops, and any frontend hazards. The super secret stuff are things like the design of the branch predictors, memory disambiguation, etc, as well as the low-level tricks to optimize each of these down to the fewest gate delays (for high clockspeeds, etc), as well as where and how they figure out placement, etc.


The front end (the decoder stage and branch predictors) is what would theoretically be important for compilers, as they’re the bottleneck. But Intel’s optimization advice doesn’t say much about branches anymore; they pretty much want you to rely on them to take care of it.

That’s only part secrecy and part to give them freedom to change it. It is of course somewhat described in their patents.


There are sometimes vague hints about things to avoid, e.g. putting too many branches on the same cache line, and they usually publish the size of their tables, typically 4K, 8K entries these days? But the actual predictors are wicked devils; they clearly are doing some tournament predictors, using tiny ML modules (perceptrons), and god knows what else. I studied this carefully when trying to make good Spectre gadgets, but it is very very difficult to 100% trick (or utilize!) a branch predictor these days--they just learn in interesting ways...and entries alias :-)

I honestly don't know if it's worth it to try to optimize branch prediction in compilers these days, beyond the obvious step of putting the highest probability target next (for fallthrough prediction) and generally laying out hot parts of the code together. TurboFan and most other dynamically-optimizing compilers put rare code at the end of functions, and that's a huge boost.


Would it not be in Intel's best interest to have popular compilers be able to squeeze the most performance out of its own line of CPUs?

I'm wondering how the incentives play out to keep this stuff private?


Intel sells a compiler. I've only used it briefly a long time ago, but its code generation was well ahead of MSVC at the time even for scalar (non-SIMD) stuff, and I remember GCC was far behind too (it would generate roughly the same performance in microbenchmarks, but far more bloated.)


Intel has released at least some of their software suite for free (as in beer, not as in speech).

https://www.intel.com/content/www/us/en/developer/articles/n...

I think software is not a huge profit center for them.

The original comment presents:

> The actual details there are [1] too secret for Intel to want to accurately describe them in gcc, [2] they’re different across different CPUs, [3] and compilers just aren’t as good as you think they are.

2 and 3 could just be the whole story. Although we haven't actually accumulated any evidence here for 3, given that the original story was about surprisingly getting beat by a compiler, despite performing a seemingly obvious optimization.


You can just look at gcc and llvm source. GCC’s x86 backend is maintained by an Intel employee and it doesn’t have especially detailed descriptions of any cpu in -mtune.

Actually, compiler optimizations like scheduling tend to be neutral to negative on x86 because they increase register pressure. You’d probably want to do “anti-scheduling” and hope the CPU decoder takes care of it if anything.


https://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler

Interesting that it generates suboptimal code for non-Intel processors.


Cached accesses are not fast anymore. Modern chips can retire a dozen or even more instructions in the time it takes to move a word in from L1 cache.


The other issue is instruction-level parallelism, as another poster in TFA pointed out. Even within a single loop iteration the "unoptimized" code is more likely to exploit multiple ALUs if they exist, regardless of vectorization instructions.


Ok, we've reverted the title to what the article says.

(Submitted title was "Multiplications and 2 additions are faster than 2 additions")


To be precise, they're both "vectorized" in the sense that both versions are using SSE vector instructions (and in fact, Clang will even generate AVX instructions for the first version if you use -mavx2). The difference is really the data dependency which has a massive effect on the ability of the CPU to pipeline the operation.

For the first version w/ AVX I get:

$ perf stat ./a.out
[-] Took: 225634 ns.

Performance counter stats for './a.out':

            247.34 msec task-clock:u              #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             2,009      page-faults:u             #    8.122 K/sec                  
       960,495,151      cycles:u                  #    3.883 GHz                    
     2,125,347,630      instructions:u            #    2.21  insn per cycle         
        62,572,806      branches:u                #  252.982 M/sec                  
             3,072      branch-misses:u           #    0.00% of all branches        
     4,764,794,900      slots:u                   #   19.264 G/sec                  
     2,298,312,834      topdown-retiring:u        #     48.2% retiring              
        37,370,940      topdown-bad-spec:u        #      0.8% bad speculation       
       186,854,701      topdown-fe-bound:u        #      3.9% frontend bound        
     2,242,256,423      topdown-be-bound:u        #     47.1% backend bound         

       0.247734256 seconds time elapsed

       0.241338000 seconds user
       0.004943000 seconds sys

For the second version with SSE and the data dependency I get:

$ perf stat ./a.out
[-] Took: 955104 ns.

Performance counter stats for './a.out':

            975.30 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             2,010      page-faults:u             #    2.061 K/sec                  
     4,031,519,362      cycles:u                  #    4.134 GHz                    
     3,400,341,362      instructions:u            #    0.84  insn per cycle         
       200,073,542      branches:u                #  205.140 M/sec                  
             3,192      branch-misses:u           #    0.00% of all branches        
    20,091,613,665      slots:u                   #   20.600 G/sec                  
     3,110,995,283      topdown-retiring:u        #     15.5% retiring              
       236,371,925      topdown-bad-spec:u        #      1.2% bad speculation       
       236,371,925      topdown-fe-bound:u        #      1.2% frontend bound        
    16,546,034,782      topdown-be-bound:u        #     82.2% backend bound         

       0.975762759 seconds time elapsed

       0.967603000 seconds user
       0.004937000 seconds sys
As you can see the first version gets nearly 3x better IPC (2.21 vs 0.84) and spends half as much time being backend bound.


> To be precise, they're both "vectorized" in the sense that both versions are using SSE vector instructions

No, just because it's using SSE doesn't mean it's vectorized. SSE has both scalar and vector instructions.


The first comment on the code does say precisely this.


SIMD vectorization is like quantum mechanics, it becomes classical if you look.


Tldr: autovectorizer transforms the first code into SIMD instructions, but the second into 64 bit instructions.

SIMD is very powerful, and modern compilers can sometimes simd-ify your code automatically.

-------

The second code can probably become SIMD as well, but it's beyond GCC's ability to autovectorize it in that form. I kinda want to give it a go myself but don't have time today...

Autovectorizers are mysterious, often harder to see through and use than explicit SIMD code (like OpenCL or CUDA).


The other essential insight is that the second version has data dependencies between loop iterations. Without that, the tldr is incomplete.


That is the 'why' autovectorization fails. But it seems solvable in this case pretty easily actually.

EDIT: Solvable by a human. I dunno much about compilers though, so I dunno if there's some kind of language issue that would prevent the unrolling of this dependency.


I think the point GP is making is that even without vectorization, the data dependency causes stalls even in normal, single data instructions. That is, a data dependency between iterations of loops will hurt performance even for non-vectorizable calculations (or on CPUs with high ILP but no really good vector instructions, which granted is probably a small pool since both those things came about at about the same time).


I think that is the point Peter Cordes is trying to make (quite politely but firmly) over there on stackoverflow. It's not only about autovectorization. The main point there is the loop-carried dependency that prevents both the compiler (autovec) and the processor (ILP) from doing their thing.

Loop-carried dependency is the big culprit here. I wish we had a culture of writing for loops with the index a constant inside the loop, as in the Ada for statement, and not the clever C while loops or for loops with two running variables... Simpler loop syntax makes so many static analyses 'easier' and kind of forces the brain to think in bounded independent steps, or to reach for higher level constructs (e.g. reduce).


I wonder if it becomes a math problem to optimize then, like Euler solving the sum of 1-100 by adding 1 to 100 and multiplying by the 50 pairs of numbers that operation created to get 5050?


It's a different issue altogether. Even in vectorized code artificial data dependencies are an issue. E.g., for fast dot products (not actually usually applicable on most input sizes because of memory bandwidth, but related problems can be CPU bound) you'll roughly double your execution speed by alternating between two running totals and merging those at the end, and that gain is mostly orthogonal to the question of whether those running totals are vectorized.

Edit: And many modern compilers have been doing that particular optimization for a few years anyway, but it's still an important idea to keep in mind for any non-trivial graph of operations.
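
A sketch of that multi-accumulator idea on a scalar dot product (the same trick applies per SIMD lane; n is assumed even for brevity):

    /* one accumulator: every add waits on the previous add's result */
    double dot(const double *a, const double *b, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

    /* two independent accumulators: the adds into s0 and s1 can overlap in the
       pipeline; merge them at the end */
    double dot2(const double *a, const double *b, int n)
    {
        double s0 = 0.0, s1 = 0.0;
        for (int i = 0; i < n; i += 2) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
        }
        return s0 + s1;
    }

Note that this reorders the floating-point additions, which is exactly the kind of transformation the compiler is not allowed to make on its own without relaxed-FP flags (see the -0.0/fast-math discussion below).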


> The second code can probably become SIMD as well, but it's beyond GCC's ability to autovectorizer it in that form.

Usually, for floating point operations the compiler simply has no chance to do anything clever. NaNs, infinities and signed zero mean that even the most "obvious" identities don't actually hold. For example, x + 0 == x (where the == is bitwise) does not hold for x = -0.


> For example, x + 0 == x (where the == is bitwise) does not hold for x = -0.

But when operating on floating point numbers, == is not bitwise, and -0 == +0.


If the user provides code where the result should be +0.0 and the compiler emits code that results in -0.0 (and the user has not explicitly enabled relaxed IEEE754 conformance), that's a bug in the compiler.


That is correct. But we were talking about how the == operator compares zeros. According to IEEE 754, -0 == +0 even though their bit patterns are not identical. And, by the way, inf != inf even if their bit patterns are identical.


> But we were talking about how the == operator compares zeros

Well, you were and I wasn't, and it wasn't clear whether you meant to contradict my point. I guess I could've made it more clear that I meant "using == to denote bitwise equality here only" and not "btw, == is bitwise for floats", but either way it should be clear now.


You mean nan right?


Yes. I mean nan != nan. I don’t know why I wrote inf. Thank you for correcting.
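
Both of those properties are easy to check (assuming IEEE 754 doubles and the default rounding mode, which is what any mainstream platform gives you):

    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        double pz = 0.0, nz = -0.0, x = NAN;

        printf("-0.0 == +0.0     : %d\n", nz == pz);                     /* 1: compares equal */
        printf("same bit pattern : %d\n", !memcmp(&nz, &pz, sizeof pz)); /* 0: the sign bit differs */
        printf("-0.0 + +0.0      : %g\n", nz + pz);                      /* +0 under round-to-nearest */
        printf("NaN == NaN       : %d\n", x == x);                       /* 0: NaN never compares equal */
        return 0;
    }

The sign of zero is only observable through the bit pattern or through operations like 1.0/x (-inf vs +inf), which is why the compiler can't silently rewrite x + 0.0 to x by default.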


That doesn’t matter for the case where a compiler would like to replace x + y by x because it knows y == 0.

The compiler would have to test for the x == -0 case, and doing that ¿rarely/never? is faster than computing x + y.


TurboFan does reason about whether a value can be -0 and will strength reduce if it's known a -0 will not occur. Further, it reasons about observations of a value; if it is known that a -0 is never observed, e.g. the value is only ever converted to an integer, or compared, or otherwise only flows into expressions that treat 0 == -0, then it will do more aggressive optimizations.


There is a compiler switch for that too: -fassociative-math -funsafe-math-optimizations


I like how the flag combines in that second one to create "fun safe", pretty much flipping its meaning.


so x+0 is 0?


No. -0.0 + 0 is not necessarily bitwise-equal to 0.0. Though it arguably could be, depending on details.


-0.0 + x = x, even if x should be -0.0 or +0.0.

+0.0 + x, on the other hand, is not always equal to x.


Note that IEEE 754 defines exactly which zero you get in every combination of operands. -0 + 0 is actually 0.


How can one go about making one's code apt for a compiler to be able to do these kinds of things?


A good way is to have your data arranged in structs of arrays rather than in arrays of structs. This allows the compiler to generate code which just loads in linear sections of memory to SIMD registers. It’s also just more cache efficient in general.

Check out data oriented design if you aren’t already familiar.
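
A minimal sketch of the layout difference (the particle struct and field names are made up for illustration):

    #define N 1024

    /* array of structs: the x's this loop wants are strided through memory */
    struct particle { double x, y, z; };
    struct particle aos[N];

    /* struct of arrays: each field is contiguous, so a loop over one field is a
       linear scan the compiler can load straight into SIMD registers */
    struct particles { double x[N], y[N], z[N]; } soa;

    void move_x(double dt, double vx)
    {
        for (int i = 0; i < N; i++)
            soa.x[i] += vx * dt;   /* contiguous: trivially vectorizable */
        for (int i = 0; i < N; i++)
            aos[i].x += vx * dt;   /* strided by sizeof(struct particle) */
    }

Whether this layout wins depends on the access pattern, as the replies below discuss.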


Note that sometimes it's sufficient to have arrays of structs of arrays. That is, you don't necessarily need to go whole hog.


That very much depends on access patterns. If you’re performing an operation over every object on a certain field, struct of arrays makes sense. If you’re doing an operation which uses many fields on some arbitrary, dynamically and randomly ordered subset of objects, then array of structs will perform better, because at least you recover some memory locality.


You study autovectorizers, then you enable autovectorization warning flags, and carefully read your compilers output.

If the compiler says autovectorization failed, you rewrite the code until the autovectorizer works.

https://docs.microsoft.com/en-us/cpp/build/reference/qvec-re...
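
For example, the GCC and Clang counterparts of that MSVC report (flag spellings as documented; output format varies by version; compute.c is just a placeholder file name):

    # GCC: report loops that were vectorized and loops where vectorization failed
    gcc -O3 -fopt-info-vec-optimized -fopt-info-vec-missed compute.c

    # Clang: optimization remarks from the loop vectorizer
    clang -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize compute.c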


hm, I have had limited success with such nudging - it's certainly not a programming model with a predictable outcome. And as soon as the compiler or its flags are updated, maybe the situation changes.


I also wonder if there's any compiler that allows you to hint that you expect certain optimizations to occur (like vectorization), and if they do not, fails to compile at all.


Don’t bother. Autovectorization almost fundamentally doesn’t work.

You can write code using explicit vectors and it’s not too bad, though less cross platform than you’d like.

An actual answer involves things like restrict and alignment keywords.


Explicit vectors can be made portable (to SSE4/AVX2/AVX-512/NEON/SVE/SVE2/RISC-V V/WASM SIMD) using the Highway library which gives you wrapper functions over the platform's intrinsics, emulating missing functionality where required: github.com/google/highway. Disclosure: I am the main author.


The problem is not the ability of the compiler; it's that the transformation is not numerically equivalent, so the compiler isn't allowed to do that optimization.


How would the second approach be vectorized given each iteration's input has dependence on previous iteration's output??


Unroll the dependency until you are longer than the SIMD width.

Ex: as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other, you can vectorize to SIMD-width 8.

Or in other words: i+7 can depend on i-1 no problems.


> Unroll the dependency until you are longer than the SIMD width.

> Ex: as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other, you can vectorize to SIMD-width 8.

Do you mean like this? I get this to about as fast as the first "unoptimized" version in the SO post, but not faster.

    void compute()
    {
        const double A = 1.1, B = 2.2, C = 3.3;
        const double A128 = 128*A;
        double Y[8], Z[8];
    
        Y[0] =               C;
        Y[1] =     A +   B + C;
        Y[2] =   4*A + 2*B + C;
        Y[3] =   9*A + 3*B + C;
        Y[4] =  16*A + 4*B + C;
        Y[5] =  25*A + 5*B + C;
        Y[6] =  36*A + 6*B + C;
        Y[7] =  49*A + 7*B + C;
        Z[0] =  64*A + 8*B;
        Z[1] =  80*A + 8*B;
        Z[2] =  96*A + 8*B;
        Z[3] = 112*A + 8*B;
        Z[4] = 128*A + 8*B;
        Z[5] = 144*A + 8*B;
        Z[6] = 160*A + 8*B;
        Z[7] = 176*A + 8*B;
    
        int i;
        for(i=0; i<LEN; i+=8) {
            data[i  ] = Y[0];
            data[i+1] = Y[1];
            data[i+2] = Y[2];
            data[i+3] = Y[3];
            data[i+4] = Y[4];
            data[i+5] = Y[5];
            data[i+6] = Y[6];
            data[i+7] = Y[7];
            Y[0] += Z[0];
            Y[1] += Z[1];
            Y[2] += Z[2];
            Y[3] += Z[3];
            Y[4] += Z[4];
            Y[5] += Z[5];
            Y[6] += Z[6];
            Y[7] += Z[7];
            Z[0] += A128;
            Z[1] += A128;
            Z[2] += A128;
            Z[3] += A128;
            Z[4] += A128;
            Z[5] += A128;
            Z[6] += A128;
            Z[7] += A128;
        }
    }


> Do you mean like this? I get this to about as fast as the first "unoptimized" version in the SO post, but not faster.

Yeah, something like that. I haven't double-checked your math, but the idea is what I was going for.

I'm always "surprised" by the fact that CPUs care more about bandwidth than latency these days. A lot of CPUs (Intel, AMD, ARM, etc. etc.) support 1x or even 2x SIMD-multiplications per clock tick, even though they take 5 clock ticks to execute.

I guess the original "simple" code may have had a multiply in there, but that's not a big deal these days (throughput-wise), even though it's a big deal latency-wise.

So getting rid of those multiplies and cutting down the latency (ie: using only add statements) barely helps at all, maybe with no measurable difference.

One of these days, I'll actually remember that fact, lol.


On my machine, your code is faster for smaller LEN values. I'm not sure why this is though.


8x 64-bit is 512-bit, which is designed for AVX512. You'll probably need AVX512 to fully benefit from unrolling x8.

4x 64-bit is 256-bit, which requires special compiler flags for 256-bit AVX2, but most x86 CPUs should support them these days.

2x64-bit is 128-bit, which fits in default SSE 128-bit SIMD with default GCC / Visual Studio compiler flags.


If they were integer variables, I guess the compiler would have done that, but you can't really do that with floats because i+A+A is not necessarily i+2*A. (Of course, in this particular example, the difference doesn't matter for the programmer, but the compiler doesn't know that!)

I think there's some gcc option that enables these "dangerous" optimizations. -ffast-math, or something like that?


No, the compiler would have been unlikely to figure out the math to coalesce 8 recursive additions into one operation.


> as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other

I really don't see how that works in improving this.

You can only calculate i+8; for calculating i+9 you depend on i+8. And you can't go in strides either, since i+16 depends on i+15, which you've not calculated so far, unless you want to intermix the stateful and non-stateful code. I'd rather not go there.


Both versions use SSE and are pipelined; the problem with the second one is the data dependency: only two adds, but the second one directly depends on the first one's result = stall.


SSE includes "scalar" adds (addsd), which are a 1x floating point instruction. These are "non-SIMD" instructions, serving as a replacement for the legacy x87 instructions.

There is also "parallel" adds (addpd).

Carefully look at the assembly language: the 1st version uses parallel adds (addpd) and parallel multiplies. The 2nd version uses scalar adds (addsd).

The other major point is that the 2nd version uses a singular move qword (64-bit) per loop iteration, while the 1st version is using the full 128-bit move per loop iteration.

---------

SSE is used for scalar double-precision these days, because scalar-SSE is faster than x87 instructions... and better matches the standards (x87 had "higher precision" than the IEEE specs, so it has different results compared to other computers. SSE is closer to the specs)


> Tldr: autovectorizer transforms the first code into SIMD instructions, but the second into 64 bit instructions.

That's not the tldr. It's actually more wrong than right.

The actual TLDR is: loop dependencies prohibit the compiler _and_ your CPU from parallelizing. The SIMD part is the compiler, the more important part here is the CPU though.

Long explanation:

Let's start with the compiler. Yes, the first version can be vectorized and you can see that in the included screenshots of the assembly output: it uses addpd [1] (add 2 pairs of 64-bit doubles simultaneously) instead of addsd [2] (add 1 pair of 64-bit doubles). Both operations have identical throughput/latency on any architecture that isn't too ancient (e.g. check the information for the corresponding intrinsic here [3]; unfortunately Agner's instruction tables don't include ADDSD for most architectures). So while you can crunch 2 operations using SIMD at the same time, you are still doing 3 multiplications and 2 additions instead of 2 additions. The multiplications and additions have the same throughput, at least on Skylake. So we can use simple math and 5/2 is still more than 2!

Now to the CPU part. The loop dependency prevents the CPU from properly utilizing instruction pipelining [4], and now the difference between throughput and latency comes into play. In a pipelined scenario the CPU can work on multiple steps of the instruction cycle in parallel (from fetching the instruction, then decoding it, to executing the operation and eventually writing the result back to the register) and thus achieve higher throughput. However, if your next instruction depends on the one before, the CPU has to finish all the steps of the pipeline from start to finish before the next instruction can begin. This is the latency of an instruction. For example, mulsd/mulpd have 8 times higher latency than (reciprocal) throughput on Intel Skylake.

So while SIMD comes in at a factor of 2, the loop dependency stops the CPU from realizing a potential factor of 8.

Peter Cordes also correctly points this out in his comments and answer to the question on stackoverflow.

[1] https://www.felixcloutier.com/x86/addpd

[2] https://www.felixcloutier.com/x86/addsd

[3] https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

[4] https://en.wikipedia.org/wiki/Instruction_pipelining


In short, the CPU can execute up to 6 or 8 instructions at once, but only if there are no dependencies between instructions. In the second version every operation depends on the previous result.


I think it's worth pointing out that the reason why these two examples execute at different speed is due to how compiler translated code AND because CPU was able to parallelize work. Compilers take knowledge about target platform (e.g. instruction set) and code and translate it into executable code. Compiler CAN (but doesn't have to) rewrite code only if it ALWAYS produces the same result as input code.

I feel like for the last 10-15 years (the majority of) people have stopped thinking about the specific CPU and only think about the ISA. That works for a lot of workloads, but in recent years I have observed that there is more and more interest in how a specific CPU can execute code as efficiently as possible.

If you're interested in the kind of optimizations performed in the example you should check out polyhedral compilation (https://polyhedral.info/) and halide (https://halide-lang.org/). Both can be used to speed up certain workloads significantly.


Seems like you can have the cake and eat it too, by manually parallelizing the code to something like this:

    double A4 = A+A+A+A;
    double Z = 3*A+B;
    double Y1 = C;
    double Y2 = A+B+C;

    int i;
    // ... setup unroll when LEN is odd...

    for(i=0; i<LEN; i++) {
        data[i] = Y1;
        data[++i] = Y2;
        Y1 += Z;
        Y2 += Z;
        Z += A4;
    }
Probably not entirely functional as written, but you get the idea: unroll the loop so that the data dependent paths can each be done in parallel. For the machine being considered, a 4 step unroll should achieve maximum performance, but of course, you get all the fun things that come with hard-coding the architecture in your software.


The idea is right, but some details are wrong. You need a separate Z for each Y. But even if that's done, it is indeed faster.
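
For reference, a minimal sketch of that fix: a 2-way unroll with a separate increment per accumulator (assumes LEN is even, and data/LEN as defined in the post):

    void compute_unroll2(void)
    {
        const double A = 1.1, B = 2.2, C = 3.3;
        const double A8 = 8*A;          /* second difference for a stride of 2 */
        double Y1 = C;                  /* f(0) */
        double Y2 = A + B + C;          /* f(1) */
        double Z1 = 4*A + 2*B;          /* f(2) - f(0) */
        double Z2 = 8*A + 2*B;          /* f(3) - f(1) */

        for (int i = 0; i < LEN; i += 2) {
            data[i]   = Y1;
            data[i+1] = Y2;
            Y1 += Z1;  Z1 += A8;        /* two independent dependency chains */
            Y2 += Z2;  Z2 += A8;
        }
    }

Extending the same pattern to 4 accumulators is the 4-step unroll the grandparent comment suggests for this particular machine.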


I'm shocked and aghast to hear you found a bug in my code - I assure you it compiled and ran flawlessly in my brain.


Looks like someone wrote pretty good parallelizable code on the original question, here: https://stackoverflow.com/a/72333152


That code isn't faster for me while Manholios's is. And his gets even faster with 4x parallelization.


The part about data dependencies across loop iterations is fascinating to me, becuase it's mostly invisible even when you look at the generated assembly. There's a related optimization that comes up in implementations of ChaCha/BLAKE, where we permute columns around in a kind of weird order, because it breaks a data dependency for an operation that's about to happen: https://github.com/sneves/blake2-avx2/pull/4#issuecomment-50...


The pipelining issue is interesting because my reaction becomes “shouldn’t the CPU just come with a larger vector size and then operate on chunks within the vector to optimize pipelining?” but then I realize I’m just describing a GPU.


I doubt repeatedly adding floating point numbers is a good idea. With every addition the sum increases further away from the addend, and with their growing relative difference problems grow as well.

Just because the algebra works it doesn't guarantee that the code does, too.


Your points are factually correct, but in practice not a big concern, if your floating point value is much more precise than necessary for the numbers used.

E.g. if you use doubles for the range of 32-bit integers, even adding 1.0 2 billion times to 2 billion still ends up at 4 billion. Even adding 1.0 20 billion times to 20 billion ends up at 40 billion. Now, adding 0.1 20 billion times to 2 billion ends up 1908 short on my CPU, i.e. about 19080 absorptions/rounding errors occurred. You need some serious differences and amounts of operations to actually trigger errors.


You're talking about the best case, but we need to take the reasonable worst cases into account, where x + y nearly cancel etc.


The author of the post is aware of this. They explicitly say that is not the point of the question.


The algebra works for reals, binary floating point is a different beast!


It’s almost as if managing the state yourself on a modern CPU is a complicated task. Automatic vectorisation, reordering, etc. can all be more easily done to a pure declarative program. Imperatively managing the state and the control flow with an ancient language like C really no longer reflects the underlying hardware, which is significantly more advanced.


What would be the difference in power consumption from each method? (Would it be always better to multiply? If so why not multiply by one?)


The general rule to follow in power consumption on CPUs is to do your work quickly and then get to sleep. Propagating clock is going to eat the bulk of your power. The mild difference between multiply and add in actual usage is inside the noise (orders of magnitude smaller). The bigger penalty in this case is the inter-iteration dependency, which, vectorized or not, runs the risk of holding up the whole show due to pipelining.

As a performance rule on modern processors: avoid using the result of a calculation as long as you reasonably can (in tight loops... You don't want to be out of cache.).

Have fun threading the needle!


Usually the faster version consumes less power, as this allows the core more time in sleep. This is known as the race-to-sleep or race-to-idle paradox.


The problem here is that the additions depend on values computed in the previous iteration of the loop. The version with multiplication is faster because there are no dependencies on the previous iteration, so the CPU has more freedom scheduling the operations.

The power consumption is a good question.


Scheduling plays a part, but it is definitely more about vectorization.


It's almost certainly more about scheduling than vectorization. The data dependencies are going to constantly stall the CPU pipeline, so it's just not able to retire instructions very quickly. The SIMD part is almost certainly a red herring. It's helping, but it's far from why it's so much faster. Tiger Lake can retire 4 plain ol' ADD operations per clock[1] - you don't need SIMD / vectorization to get instruction level parallelism. But you do need to ensure there are no data dependencies. The data dependency here is 90% of the cost. The SIMD is just the cherry on top.

1: https://www.agner.org/optimize/instruction_tables.pdf


In this case since the slower one is spending most of its time stalled out, the faster one will probably be more power efficient. This isn't an AVX512 power-gobbling monstrosity issue.


Besides the autovectorization, the second version also has two additional assignments. Depending on how the storage is used for these, whether it’s registers, L1/L2, or the stack, there might be a performance hit.


Why would it be in anything but registers?


It's fairly obvious: the rewrite prevents parallelization because floating-point isn't associative.

You'd need to parallelize it explicitly (which can be done by just unrolling the loop).


First: we need to finally stop the harmful myth that floating point multiplication is slower than addition. This has not been true for a long while.

Second: why are so many people insisting that the loop is auto-vectorised? Is there any evidence to that? Data dependencies alone explain the observed performance delta. Auto-vectorization would have resulted in a higher speedup.


Vectorisation is not free. There is one other dimension to optimise for: power.

The suggested "slower" optimisation does fundamentally use fewer instructions. Chucking more hardware at parallelisable problems makes them run faster but does not necessarily reduce the power requirements much, because there are fundamentally the same number of instructions; it's just gobbling up the same power over a shorter period of time. The "slower" serial algorithm uses fewer instructions and, in theory, less power in total. Disclaimer: I mean power vs speed fundamentally, in theory; in practice there may be no measurable difference depending on the particular CPU and code.


Do you have a reference for that? I googled and I couldn't find anything good.

While it might sound intuitive that SIMD instructions consume more power, I don't think that's necessarily true to a relevant degree in practice. My understanding is CPU power consumption is mostly tied to inefficiencies that cause energy loss via heat, while the actual computation doesn't consume any energy per se. So electrons traveling a more complex path probably cause somewhat more energy loss, as there are more wires/transistors to pass. But most of the total loss doesn't actually occur in the ALU. Empirically, from what you can see operating systems do, the most effective way of consuming less power is actually running at a slower clock, and the most effective way to achieve that is getting work done faster, and that's not tied to the number of instructions.

The Stackoverflow question here [1] seems to suggest that SIMD vs no SIMD has a negligible overhead compared to entering a lower power state sooner.

[1] https://stackoverflow.com/questions/19722950/do-sse-instruct...


I think in summary what you are alluding to is instruction decoding and scheduling as the sibling comment points out, which is indeed a large cost in both speed and power.

> SIMD vs no SIMD has a neglectable overhead compared to entering a lower power state sooner.

Yes, on the same CPU, as I suggested, the real-world difference may be unmeasurable. However note that this particular case is interesting because it's not comparing fewer serial multiplies to more SIMD multiplies; it's comparing SIMD multiplies to no multiplies but with a serial constraint due to variable dependence... i.e. it's SIMD vs no multiply without any other difference in the number or type of ALU ops... which again could make no difference on a big x86 in practice, but it would be interesting to know.

All of this changes if you are coding for a lower power device and have choice of hardware.


But you’re not counting static power of the CPU and the whole system. It is possible for the vectorized code to consume less power overall because it can complete faster, using less static power.


We mean energy, right? The surprising fact is that the energy cost of executing an instruction (scheduling etc.) is much higher than the actual operation. Thus SIMD amortizes that per-instruction overhead over multiple elements, leading to 5-10x gains ( https://pdfs.semanticscholar.org/f464/74f6ae2dde68d6ccaf9f53...).

Even in this example, which apparently has 4x vector instructions vs 1x scalar, AVX-512 (and probably even AVX2) would reduce energy usage because they can do 8 (or 4 for AVX2) 64-bit elements per cycle.


> The surprising fact is that the energy cost of executing an instruction (scheduling etc.) is much higher than the actual operation.

Good point, decoding and scheduling are expensive and SIMD certainly eliminates a lot of that. However the alternative algorithm has even less decoding and scheduling to do, since it completely eliminates the multiplication operations without increasing the number of additions. But even then I wouldn't be surprised that it makes no difference to energy on any x86, as I said it was more of a fundamental observation - for a different application this is useful when selecting the actual hardware if you are not already constrained to a particular chip.


The slower algo will use more instructions, as it will have to run for more iterations. The faster algo will use wider ALUs that are more power hungry, so it appears a wash. But because fewer instructions are in flight, less power needs to be spent to keep track of them or to rename registers, caches need to be powered for a smaller amount of time, and so on.


>The slower algo will use more instructions

The whole point of this thread is the realization of the opposite. The slower algo executes only 2 instructions in a loop, but the second one directly depends on the result of the first (inducing a pipeline stall), while the fast version brute-forces a ton of instructions at full CPU IPC.


If you look carefully at the generated assembly shown in the article, the vectorized loop actually executes fewer instructions in total: while an iteration is longer, it more than makes up for it by iterating fewer times.

Edit: btw, while the fast version has no loop-carried dependencies and uses 4x SIMD, it is only twice as fast.

I wonder if a manually unrolled (with 4 accumulators) and vectorized version of the strength reduced one could be faster still.

The compiler won't do it without fast math.


If you look carefully you will find that the slower algorithm uses fewer instructions and the faster algorithm uses 2x SIMD and more than twice as many instructions.

And yes unrolling the loop carried dependency on the strength reduced version will certainly make it faster as it's the only reason it's slower to begin with.

I encourage you to read my other comment here https://news.ycombinator.com/item?id=31551375


I might be somehow miscounting, but it seems to me that for the slow implementation a loop iteration issues 6 instructions, and for the fast one it issues 21, but it is unrolled 4 times (compare the loop counter increment), so it iterates one fourth as many times and for the whole loop it ends up actually issuing slightly fewer instructions.

edit: to be clear, I'm only arguing about two things that the original parent questioned: whether vectorization is not free (it is, because wider ALUs require fewer instructions for the same work) and whether the second loop uses more instructions (it does not, as it is unrolled by 4).


oh you're right, my bad. Sloppy on my part!

I somehow got tripped up by the 4xSIMD. I was assuming you meant it's using 4x 64bit SIMD there which it doesn't. mulpd and addpd are 2x 64bit, also visible by the xmm instead of ymm registers.

I got sloppy on the difference between all instructions including the loop logic vs just the instructions necessary to do the main computation. Obviously the first is the correct measure and I was sloppy.


I think the confusion might be coming from mixing "instructions" with "operations", which are not equivalent especially since we are discussing SIMD which packs in multiple operations (single instruction operating on multiple data). If you re-read this thread and replace instructions with operations it makes more sense.


Unless we're talking extremes (AVX512...), optimizing for power in modern CPUs is almost always optimizing the race to idle.

I.e. the CPU doesn't use much less power if most ALUs are idle, so you might as well use all of them to compute redundant results -> and then turn ~everything off sooner.


I'm curious about how an optimiser determines that a block of code can be vectorised. It's trivial to see this in the initial version of compute() but I'm not sure how an optimiser does this. Is it as simple as checking that A, B, and C are const?

And how far would an optimiser typically take its analysis? For example, if B was defined inside the loop as (non-const) A * 2? Or as A * f(), where f() always returns 2, maybe using a constant or maybe a calculation, etc.

Seems like a very hard problem.


Oh man I have been smoked by data-dependency like this so many times. I've gotten a lot better over the years at "seeing" data dependencies in godbolt or whatever, but it still slips through by my code and that of my colleagues way more often than I'd like.

Is anyone aware of good tooling for automatically catching this sort of thing even some of the time?


I'm not sure if there are existing rules for it, but you could write a CodeQL query looking for data dependencies in loops. Obviously dependencies are sometimes required, but it at least could tell your they were there.


Don’t use state in a loop unless you have to? Avoiding such use of state also makes code a lot easier to read and debug - the performance benefits are really just a bonus.


I'm running into an interesting result -- on my machine, the fast version runs at the same speed at LEN=1000000 (the default in the sample), but starts running 1.5 times faster at LEN=500000 and ends up twice as fast at LEN=100000 and lower. This is with gcc -O3. Why could this be?


cache size pushing you out to high latency memory


For fun, I tried this in go on M1 Mac (basically because benchmarking in go is so easy)

And... the two codes runs with exactly the same speed, on M1 Mac, with go.

edit: of course, with go, you can manually parallelize the faster option with goroutines... but that does something else, doesn't it. (and it's 500x faster.)


Does the Go compiler auto-vectorise? Vectorisation appears to have been the cause of the weird performance.


No, I don't think so.


Funny that this question gained 180 upvotes in 10 days, when it could just as easily have received the reverse for being quite lacking in effort by the author to figure out the (rather obvious) data dependency on his own.


I'll put my hand up and say none of the post or answers were obvious to me. I found it all very interesting.


Given that in terms of “absolute work” done, the optimization does hold true, is there any situation where it would be beneficial to implement this?

(Super low energy processors, battery powered, etc.?)


Certainly! It makes sense in processors that don't do SIMD or speculative execution. There are a lot of those, but mostly for embedded stuff.


> or speculative execution

This isn't actually taking advantage of speculative execution that much. The only speculation here would be in predicting that the loop repeats, which loop unrolling would mostly negate for CPUs that don't do speculative execution.

The data dependency issue, however, would still be a punishing factor. You'd need a CPU that isn't superscalar, which does exist but is increasingly less common (even 2014's Cortex-M7 was superscalar, although it kinda sounds like ARM backed off on that for later Cortex M's?)

Also many low-end / embedded CPUs that are in-order will still do branch prediction.


Those are also CPUs were multiplication is most likely to be significantly more expensive, or not implemented in hardware at all (though almost everything has a multiplier these days).


How would the energy cost of the two methods compare?


Is the implication here that tail call optimizations don't work anymore? They might seem to do the proper thing on the language level but the CPU just can't think that way.


Tail calls will work the same as loops. So if you have data dependencies between loop iterations or between tail-call-eliminated stack frames, then it will be slower than if you do not have those dependencies.


At least in some languages like Elixir and probably most FP languages, tail calls are practically only used when said dependencies exist, so their usage can perhaps be a marker for when some optimizations are not possible.


How does this relate to tail calls?


Tail calls have a collector (accumulator) argument that accumulates the result, a very similar pattern to the one in the example. It's an optimization in LISP and similar languages.
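
For example, the classic accumulator-passing style (a hypothetical sketch in C, since that's the language used elsewhere in the thread):

    /* the running result rides along as a parameter, so each tail call depends on
       the previous one -- the same loop-carried dependency discussed above, just
       spelled as recursion */
    static double sum_squares(int i, int n, double acc)
    {
        if (i == n)
            return acc;
        return sum_squares(i + 1, n, acc + (double)i * i);  /* tail call */
    }

With optimization enabled, GCC and Clang will typically turn this into a plain loop, but the dependency on acc is still there either way.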


On older archs there is also a cost for casting int to float. Not like float to int, but anyway.

The general advice when I worked with CG was to use deltas in iterations.


Is this kind of parallelization only possible in compiled languages? Or is it also possible for interpreted languages like JavaScript?


The same considerations apply, though not always. JavaScript can be JIT compiled, though the JIT might not make the same optimisations as a C compiler. It's running on the same CPU though.


If we ignore the details, then this is a perfect example of how simpler code should be preferred _by default_ over complex code. Both for performance and reasoning. In this case and often elsewhere, these are related things.

I'm taking the definition of simple and complex (or complected) from this talk[0]. In short: Simple means individual pieces are standing on their own, not necessarily easy or minimal. Complex means individual pieces are braided together into a whole.

The first code is simpler than the second. Just have a look at the two solutions and you see what I mean: The individual instructions in the first stand on their own and clearly declare what they mean as well. The second example is more complex, because each line/subexpression cannot be understood in isolation, there is an implicit coupling that requires the programmer _and_ the machine code interpreter to understand the whole thing for it to make sense. The CPU apparently fails at that and cannot optimize the machine instructions into more efficient microcode.

The example illustrates performance benefits of simpler code, but there are others too:

For example a type checker might be able to infer things from simpler code, but not from complex code. Again, complexity is about coupling, which can for example arise from making assumptions about things that are out of bounds or logic that happens in some other part of your program, which _this_ part relies on.

Things like these can either be outright rejected by a type checker or simply ignored and sometimes we can provide additional ceremony to make it happy. But there is always an underlying question when these issues arise: Can I express this in a simpler way? It's sometimes possible to describe rich semantics with simpler, more primitive types and explicit coordination.

Another thing that comes to mind are rich ORMs. They can be very convenient for common cases. But they have footguns for the general case, because they _complect_ so many things, such as validation, storage, domain rules, caching etc. And they are leaky abstractions because we have to do certain things in certain ways to avoid bloated SQL, N+1 etc.

They are not simple (and certainly not declarative) so we program against a black box. There are simpler "ORMs" that I very much like, but they typically only provide a minimal set of orthogonal features, such as query builders, and mapping results into plain data structures. It is simpler to use these kind of orthogonal tools.

Last but not least: Simple code can be pulled apart and changed more easily without having to worry about introducing problems. The substitution rule or referential transparency enables this by guaranteeing that each expression can be replaced by the value it will evaluate to. This also implies that other expressions that evaluate to the same value can be substituted freely.

[0] Simple Made Easy - Rich Hickey at Strangeloop 2011 https://www.youtube.com/watch?v=LKtk3HCgTa8


You can't ignore the details if you want the fastest performance. Unfortunately sometimes the solution that's the fastest is also more complex. Simplicity tends to win and also is less likely to be buggy, easier to maintain etc but sometimes the fastest code deliberately introduces coupling, hardware specific cache shenanigans or unreadable hand written asm and/or inlined SIMD. This is often counter to your hypothesis.


> sometimes the fastest code deliberately introduces coupling, hardware specific cache shenanigans or unreadable hand written asm and/or inlined SIMD.

>> this is a perfect example of how simpler code should be preferred _by default_ over complex code.

Extra emphasis on *by default*.

You are both on the same side of the coin. This is the essence of "premature optimization is the root of all evil". Simplicity should be the default. Once you know your implementation is solid, then you can go back and optimize.

TFA is quite contrived, but imagine a much more complex process. You see some code:

    for i in loop: 
        foo[i] = myfunc(foo[i], bar[i])
You have to go in and dig into why exactly foo is being fed back in at every step. But if instead you saw:

    for i in loop:
        foo[i] = myfunc(bar[i])
You immediately know you have more options at your disposal to optimize that loop. Further, the compiler immediately knows it has more options. Maybe this is a good option for automatic unrolling or Duff's device. Maybe you wanna send this to a gpu shader, or use a macro to parallelize the loop. But it's a lot harder to get to that point if your starting point is the complected, stateful one.


Agreed. But can we stop using "premature optimization is the root of all evil". It has jumped the shark. I have more grief in my life because people adhere to this tenet. This is why we end up with the software equivalent of concrete airplanes. Ok it's your turn now. Make it fly!


Did you measure? Because if you didn't measure, you aren't optimising, you are just wanking.

And of course one reason we say premature is that most likely until the project is mostly finished you can't really measure because you don't have anything to measure.


Obviously measuring is the only way to optimize. Goodbye.


The addition based version should be hand vectorizable if you do the math to figure out the increment based on vector sizes.


Is this relevant to interpreted languages as well? I’m thinking perhaps Python's bytecode could have similar optimisations.


TLDR: Due to loop-carried dependencies preventing parallelized execution: https://en.wikipedia.org/wiki/Loop_dependence_analysis#Loop-...

In the 2 additions version, computation of the next iteration depends on the results of the preceding iteration. In the multiplication version, the computations are independent for each iteration, enabling parallel execution (by SIMD and/or pipelined/superscalar execution).


You could try to prefill the first cell with -(A+B), then each cell gets A+B+C, and in a loop for (j=0; j<i; j++) add (A+A).

But I'm too lazy to test that.



