I find it pretty interesting that you see a lot of people say 'within an order of magnitude of C' when talking about performance for dynamic languages, JITed VMs, etc., but they rarely take this kind of thing into account. If you wrote this in Python, you'd be trying to get your implementation close to that original number - 1733.4 MFLOPS - when the real target, which needs very specific coding, is 17985.4 MFLOPS, and there's not much discussion of that. I think a lot of the code we write today is more inefficient than we admit.
Yes. And for the same reason, you sometimes have to do things in assembly. No compiler is going to go against its own grain when generating code. Yet, sometimes, that's exactly what you must do to eke out the necessary performance.
As an example, while writing code for a 6809-based video game (see [1] for the full war story), I had to use every single register on the processor, including the system-stack register and the direct-page register, to speed up a graphics routine. This meant that, while my code was running, the system effectively had no stack and couldn't use certain addressing modes. It also meant that interrupts would corrupt whatever memory the stack register happened to be pointing to.
All these problems were solvable – and the pain we had to go through to solve them was, for us, absolutely worth the 70% speed-up it bought us – but no compiler writer is going to incorporate tactics this weird into a compiler's optimization logic. If you need to go "full weird," assembly is often the only way.
I agree that 17985.4 MFLOPS needs very specific coding.
But the 1733.4 MFLOPS solution is a really naive one. Data locality and cache friendliness are two of the first things taught in high-performance classes, with an obvious trick: iterating in row-major order, as in the second case.
This trivial optimization gives 9419.8 MFLOPS of throughput. I don't think "everyday code" like that is so inefficient given the maximum result that can be obtained (17985.4 MFLOPS).
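To make the trick concrete, roughly (array name and sizes made up here):

/* C stores 2-D arrays in row-major order, so the right-most index
   should vary fastest to get sequential, cache-friendly accesses. */
#define ROWS 1024
#define COLS 1024
static float a[ROWS][COLS];

/* cache-friendly: walks memory sequentially */
float sum_fast(void)
{
    float s = 0.0f;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}

/* cache-hostile: jumps COLS * sizeof(float) bytes between accesses */
float sum_slow(void)
{
    float s = 0.0f;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}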
Going up and down the abstraction layers like this is not done enough. I like articles where the authors iterate between high-level code and low-level results. I've seen some in Clojure and was happy to have a conscious notion of CPU cycles floating next to the s-expressions (of course I mean this comment to apply to any kind of abstraction layer).
Interestingly, Rust does not have the pointer aliasing hazard in the third example: the compiler ensures that all mutable array slices cannot overlap, unlike C, and so the introduction of the temporary should not be necessary. (Unfortunately we don't communicate that information to LLVM today, and we'd also have to optimize out the bounds checks, so more work is required...but from a language design perspective I think we're good.)
That's true, although Rust provides for sound enforcement (so if you try to alias the equivalent of two restrict pointers the compiler will catch it at compile time).
(restrict is also not technically part of C++, although in practice the same effect can be achieved through compiler extensions.)
C++'s standard library has valarray, which is supposed to avoid aliasing effects, and is allowed to use expression templates and other tricks. Something like:
should, in theory, achieve the same kind of performance as the C version (but none of this is guaranteed, sadly). A smart implementation would actually end up calling BLAS routines, which is what libraries like Eigen do.
Which reminds me: does Rust have infrastructure that would allow something like expression templates to exist?
That is interesting... Can you point me to any information about Rust and how it avoids array aliasing? Is it because Rust disallows pointer manipulation with arrays?
It's a shame that C prevents these optimisations by default. I suppose a quick runtime check could see whether the arrays overlap and then switch to the faster loop style, but the compiler would have to decide whether the loop was important enough to be worth generating the two code paths for.
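Something like this, roughly (function names made up, untested):

#include <stddef.h>
#include <stdint.h>

/* Fast path: restrict promises no overlap, so the compiler can vectorize freely. */
static void add_disjoint(float *restrict dst, const float *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* Conservative path for possibly overlapping arrays. */
static void add_any(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

void add(float *dst, const float *src, size_t n)
{
    /* Cheap runtime overlap check, then dispatch to one of the two code paths. */
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src, bytes = n * sizeof(float);
    if (d + bytes <= s || s + bytes <= d)
        add_disjoint(dst, src, n);
    else
        add_any(dst, src, n);
}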
No, Rust has pointer arithmetic, though it's bounds checked with slices. The restriction is that two mutable references cannot point to the same location. This is enforced by the borrow check: http://pcwalton.github.io/blog/2013/01/21/the-new-borrow-che...
Bounds checked at runtime (unless you use the unsafe sublanguage). Without dependent types (which I feel would exceed our complexity budget) we can't do much better.
However, for simple iteration, if you use iterators idiomatically instead of C-style for loops then you won't incur any bounds checks.
Has anyone thought through how *mut and &mut interact with each other in terms of aliasing (since they could definitely alias: f(&mut x as *mut u8, &mut x)), and also how this interacts with things like TLS and global variables?
Nah, I think we can handle it without undefined behavior by telling LLVM that `*mut` can alias anything. In effect, we would only tell LLVM that `&mut T` cannot alias `&mut T` if there is no `*mut T` in scope.
Of course, if you transmute stuff into `&mut T`, then you'd better make sure that those two `&mut T`s don't overlap. But `*mut T` should be fine.
Does the LLVM TBAA information work for non-function arguments? I.e., one can pull a `*mut` out of a `static mut` or out of TLS inside a function; can we still tell LLVM what that (possibly) aliases?
This is interesting, but the first comment on that site is apt:
> Meanwhile, the Fortran programmer writes y = dot_product (x, x) and moves on to the interesting bits. Plus, if auto-parallelization is on, or if this is in an OpenMP workshare section…
This is titled "optimizing loops in C" but this level of optimization is actually programming in assembly language, specifically x86 with SSE extensions.
The C programmer will actually just write y = cblas_sdot(n, x, 1, x, 1) and move on to the interesting bits.
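i.e., roughly (assuming a CBLAS implementation is installed and linked in):

#include <cblas.h>

/* Sum of squares as a dot product of x with itself, stride 1 for both operands. */
float sum_of_squares(const float *x, int n)
{
    return cblas_sdot(n, x, 1, x, 1);
}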
However, someone had to implement dot_product and cblas_sdot at some point (either in the compiler or in the library), and they need to be rewritten from time to time for new architectures. More to the point, most programmers, most of the time, aren't just doing a dot product. They're doing some other more sophisticated computation for which these techniques may be quite relevant. The dot product is just a convenient example.
Agreed on both counts: someone had to write those library functions, and this was just an example.
However, the author starts by talking about a "holy war between C and Fortran" and then proceeds to write... well, assembly language using the C compiler. So the summary could be "assembly language can be made more efficient than Fortran, and C lets you coax the compiler into writing the assembly language you want". I guess this could be seen as a win for C, but I'm not so sure...
As the reply in the original article explains, it is discussing the implementation details of such a function (which could equally exist in a C library). Simply saying 'call a library function' doesn't help anyone understand what is going on behind the scenes.
I really enjoy reading tips like this. The mnemonic I use for looping over multi-dimensional arrays in C is: the right-most index should be the inner-most loop:
for (i = 0; i < M; i++) {
    for (j = 0; j < N; j++) {
        x[i][j] = ...;
    }
}
It does, but unrolling loops without any runtime data to base the heuristics on is very much a hit-or-miss affair (and missing is expensive), which is why no compiler I know of (GCC included) enables -funroll-loops or its equivalent by default at any of the standard optimization levels (-On).
In GCC, the only option which enables -funroll-loops (apart from explicitly enabling it) is -fprofile-use, the consuming half of GCC's profile-guided optimization.
It enables -funroll-loops because the runtime statistics gathered during the profiling run give the compiler enough information to unroll loops accurately without risking performance degradation.
If you declare your pointers with the restrict keyword and tell the compiler to allow reassociation of floating-point additions, it will perform exactly the same optimizations, without any hand-holding from the programmer. In some C compilers, these are even the defaults.
At university, one of the exercises in the HPC class was to optimise a piece of Fortran code. I managed to beat the TA's code by other means, but missed one of his significant optimisations: manually unrolling a dot product. This was on Intel's Fortran compiler, though admittedly almost 10 years ago. (It did already support autovectorisation etc. at the time)
Anyway, goes to show you still need to look at the disassembly and do plenty of profiling with Fortran binaries too. No such thing as a free lunch.
I find the example incredibly confusing. I expected this to be an optimization of a dot product or matrix multiplication or similar, but as it stands it seems to be an element-wise square of a VECTOR_LEN * N matrix with subsequent accumulation of the columns (not the most common operation in typical signal processing).
It should also be noted that this can be trivially implemented in direct SIMD using GCC/clang vector extensions, without having the compiler 'guess' the SIMD part, which, by the way, would also make the code much easier to read.
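Roughly along these lines with the vector extensions (untested; assumes n is a multiple of 4 and the rows are 16-byte aligned):

typedef float v4sf __attribute__((vector_size(16)));

/* acc[i] += v[j][i] * v[j][i], done four floats at a time. */
void accumulate_squares(float **v, float *acc, int n, int vc)
{
    for (int j = 0; j < vc; j++)
        for (int i = 0; i < n; i += 4) {
            v4sf x = *(v4sf *)&v[j][i];
            *(v4sf *)&acc[i] += x * x;
        }
}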
Any idea what optimization flags were used? It's rather strange that they're not reported. I would be surprised if GCC 4.8 was that far from optimal with -O3.
Do you say this as someone familiar with assembly and GCC? My usual guess would be that you can often hope for a 50% speedup in a tight loop by dropping from C to assembly, and that a 2x speedup over GCC is not uncommon.
The original author's code isn't available for this example, but I put together something I think is comparable. I may still have silly bugs, but my initial results on Sandy Bridge look something like this:
icc 13.0.1 -O3 -march=native -fno-inline wrong-loop: 1.35 s
icc 13.0.1 -O3 -march=native -fno-inline right-loop: 0.78 s
icc 13.0.1 -O3 -march=native -finline-functions wrong-loop: 0.22 s
icc 13.0.1 -O3 -march=native -finline-functions right-loop: 0.22 s
gcc 4.8.0 -O3 -march=native -fno-inline wrong-loop: 1.38 s
gcc 4.8.0 -O3 -march=native -fno-inline right-loop: 1.14 s
gcc 4.8.0 -O3 -march=native -finline-functions wrong-loop: 1.35 s
gcc 4.8.0 -O3 -march=native -finline-functions right-loop: 1.14 s
There are all sorts of things I might be doing differently (or wrong), but I'm printing out a total-of-totals so I know it's at least going through the loops. It's possible the difference comes down to a fast-math optimization, but I wouldn't bet on GCC -O3 being close to optimal.
I made a simple test
void nsum(float **v, float *acc, int n, int vc)
{
    int j, i;
    for (i = 0; i < n; i++)
        for (j = 0; j < vc; j++)
            acc[i] += v[j][i] * v[j][i];
}
And then I tested the same function with a different declaration
void nsum(float *restrict *v, float *restrict acc, int n, int vc)
The version without the restrict qualifiers had a 1.01 s runtime; the version with restrict had a 0.45 s runtime. Both were compiled with identical flags (just -O3) using the ancient gcc 4.4.5 (the vectorizer is enabled by default at -O3 even in this version).
That's a 2x speedup from a simple change to the pointer declarations.
Normally I'd use restrict and float pointers, but since I was trying to repeat what the original poster did, I used fixed arrays instead. Because of this, I did not see a difference with 'restrict'. But I might be missing something, or might have messed up with the array indexing. The generated GCC optimized function is 500 instructions long, and thus difficult to scan. I put my untested test code up here: http://pastebin.com/qB0DfkXN
As you can see, it stores dst[y] on each iteration. With a function definition of:
void sum_of_squares_1(float dst[restrict ROWS], float src[restrict ROWS][COLS])
The disassembly becomes completely different. However the speed of the end result did not really change that much.
Could you throw objdump -d of the best icc output to pastebin? I'm interested to see what kind of code it produces.
> My usual guess would be that you can often hope for a 50% speedup in a tight loop by dropping from C to assembly
The problem with inline assembler is that it is almost untouchable by the optimizer. By adding some inline asm, you may inhibit a lot of optimization that could give better perf overall.
For this kind of task it is often a lot better to use intrinsics (e.g. xmmintrin.h for SSE) or compiler extensions such as __attribute__((vector_size(16))). This way you can utilize the CPU features you have available while still allowing the optimizer to do high-level optimizations.
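For example, a rough sketch of the SSE-intrinsics route (untested; assumes n is a multiple of 4):

#include <xmmintrin.h>

/* Unlike an inline-asm block, the optimizer can still inline, schedule,
   and unroll around these intrinsics. */
float sum_of_squares_sse(const float *x, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(&x[i]);             /* unaligned load of 4 floats */
        acc = _mm_add_ps(acc, _mm_mul_ps(v, v));    /* per-lane acc += v * v */
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}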
While there is lots to be said for the maintainability of intrinsics, I have found inline assembly to be significantly better for performance. And this is precisely because it inhibits the compiler from blindly performing 'optimizations' in the section of code you've already optimized. This thread offers an example and some numbers: http://software.intel.com/en-us/forums/topic/480004
I was under the impression that the parts of performance-oriented programs that typically get converted to assembly are small profiled hotspots like very tight loops; as such, I doubt there's much real performance to be had from high-level optimizations interacting with that code, as intrinsics/extensions would allow.
But I'm certainly no expert in this area, so take my opinion with a large grain of salt.
I get no significant difference with -Ofast for either icc or gcc. My code is still untested and quite possibly buggy, but I put it up here: http://pastebin.com/qB0DfkXN
Was anyone able to locate the speed of the Fortran implementation on the same hardware? It'd be quite interesting to see a similar optimization approach done by someone experienced in Fortran, with results from the simple implementation through to more optimized ones.