I find it pretty interesting that you see a lot of people say 'within an order of magnitude of C' when talking about performance for dynamic languages, JITed VMs, etc., but they rarely take this kind of thing into account. If you wrote this in Python, you'd be trying to get your implementation close to that original number - 1733.4 MFLOPS - when the real target, which needs very specific coding, is 17985.4 MFLOPS, and there's not much discussion of that. I think a lot of the code we write today is more inefficient than we admit.
Yes. And for the same reason, you sometimes have to do things in assembly. No compiler is going to go against its own grain when generating code. Yet, sometimes, that's exactly what you must do to eke out the necessary performance.
As an example, while writing code for a 6809-based video game (see [1] for the full war story), I had to use every single register on the processor, including the system-stack register and the direct-page register, to speed up a graphics routine. This meant that, while my code was running, the system effectively had no stack and couldn't use certain addressing modes. It also meant that interrupts would corrupt whatever memory the stack register happened to be pointing to.
All these problems were solvable – and the pain we had to go through to solve them was, for us, absolutely worth the 70% speed-up it bought us – but no compiler writer is going to incorporate tactics this weird into a compiler's optimization logic. If you need to go "full weird," assembly is often the only way.
I agree that 17985.4 MFLOPS needs very specific coding.
But the 1733.4 MFLOPS solution is a really naive one. Data locality and cache friendliness are two of the first things taught in high-performance classes, with an obvious trick: iterating in row-major order, as in the second case.
This trivial optimization gives 9419.8 MFLOPS of throughput. I don't think "everyday code" like that is so inefficient given the maximum result that can be obtained (17985.4 MFLOPS).
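To make the trick concrete, roughly (array name and sizes made up here):

/* C stores 2-D arrays in row-major order, so the right-most index
   should vary fastest to get sequential, cache-friendly accesses. */
#define ROWS 1024
#define COLS 1024
static float a[ROWS][COLS];

/* cache-friendly: walks memory sequentially */
float sum_fast(void)
{
    float s = 0.0f;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}

/* cache-hostile: jumps COLS * sizeof(float) bytes between accesses */
float sum_slow(void)
{
    float s = 0.0f;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}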
Going up and down the abstraction layers like this is not done enough. I like articles where the authors iterate between high-level code and low-level results. I've seen some in Clojure and was happy to have a conscious notion of CPU cycles floating next to the s-expressions (of course I mean this comment to apply to any kind of abstraction layer).
Interestingly, Rust does not have the pointer aliasing hazard in the third example: the compiler ensures that all mutable array slices cannot overlap, unlike C, and so the introduction of the temporary should not be necessary. (Unfortunately we don't communicate that information to LLVM today, and we'd also have to optimize out the bounds checks, so more work is required...but from a language design perspective I think we're good.)
That's true, although Rust provides for sound enforcement (so if you try to alias the equivalent of two restrict pointers the compiler will catch it at compile time).
(restrict is also not technically part of C++, although in practice the same effect can be achieved through compiler extensions.)
C++'s standard library has valarray, which is supposed to avoid aliasing effects, and is allowed to use expression templates and other tricks. Something like:
should, in theory, achieve the same kind of performance as the C version (but none of this is guaranteed, sadly). A smart implementation would actually end up calling BLAS routines, which is what libraries like Eigen do.
Which reminds me: does Rust have infrastructure that would allow something like expression templates to exist?
That is interesting... Can you point me to any information about Rust and how it avoids array aliasing? Is it because Rust disallows pointer manipulation with arrays?
It's a shame that C prevents these optimisations by default. I suppose a quick runtime check could see whether the arrays overlap and then switch to the faster loop style, but the compiler would have to decide whether the loop was important enough to be worth generating the two code paths for.
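Something like this, roughly (function names made up, untested):

#include <stddef.h>
#include <stdint.h>

/* Fast path: restrict promises no overlap, so the compiler can vectorize freely. */
static void add_disjoint(float *restrict dst, const float *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* Conservative path for possibly overlapping arrays. */
static void add_any(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

void add(float *dst, const float *src, size_t n)
{
    /* Cheap runtime overlap check, then dispatch to one of the two code paths. */
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src, bytes = n * sizeof(float);
    if (d + bytes <= s || s + bytes <= d)
        add_disjoint(dst, src, n);
    else
        add_any(dst, src, n);
}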
No, Rust has pointer arithmetic, though it's bounds checked with slices. The restriction is that two mutable references cannot point to the same location. This is enforced by the borrow check: http://pcwalton.github.io/blog/2013/01/21/the-new-borrow-che...
Bounds checked at runtime (unless you use the unsafe sublanguage). Without dependent types (which I feel would exceed our complexity budget) we can't do much better.
However, for simple iteration, if you use iterators idiomatically instead of C-style for loops then you won't incur any bounds checks.
Has anyone thought through how *mut and &mut interact with each other in terms of aliasing (since they could definitely alias: f(&mut x as *mut u8, &mut x)), and also how this interacts with things like TLS and global variables?
Nah, I think we can handle it without undefined behavior by telling LLVM that `*mut` can alias anything. In effect, we would only tell LLVM that `&mut T` cannot alias `&mut T` if there is no `*mut T` in scope.
Of course, if you transmute stuff into `&mut T`, then you'd better make sure that those two `&mut T`s don't overlap. But `*mut T` should be fine.
Does the LLVM TBAA information work for non-function arguments? I.e., one can pull a `*mut` out of a `static mut` or out of TLS inside a function; can we still tell LLVM what that (possibly) aliases?
This is interesting, but the first comment on that site is apt:
> Meanwhile, the Fortran programmer writes y = dot_product (x, x) and moves on to the interesting bits. Plus, if auto-parallelization is on, or if this is in an OpenMP workshare section…
This is titled "optimizing loops in C" but this level of optimization is actually programming in assembly language, specifically x86 with SSE extensions.
The C programmer will actually just write y = cblas_sdot(n, x, 1, x, 1) and move on to the interesting bits.
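i.e., roughly (assuming a CBLAS implementation is installed and linked in):

#include <cblas.h>

/* Sum of squares as a dot product of x with itself, stride 1 for both operands. */
float sum_of_squares(const float *x, int n)
{
    return cblas_sdot(n, x, 1, x, 1);
}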
However, someone had to implement dot_product and cblas_sdot at some point (either in the compiler or in the library), and they need to be rewritten from time to time for new architectures. More to the point, most programmers, most of the time, aren't just doing a dot product. They're doing some other more sophisticated computation for which these techniques may be quite relevant. The dot product is just a convenient example.
Agreed on both counts: someone had to write those library functions, and this was just an example.
However, the author starts by talking about a "holy war between C and Fortran" and then proceeds to write... well, assembly language using the C compiler. So the summary could be "assembly language can be made more efficient than Fortran, and C lets you coax the compiler into writing the assembly language you want". I guess this could be seen as a win for C, but I'm not so sure...
As the reply in the original article explains, it is discussing the implementation details of such a function (which could equally exist in a C library). Simply saying 'call a library function' doesn't help anyone understand what is going on behind the scenes.
I really enjoy reading tips like this. The mnemonic I use for looping over multi-dimensional arrays in C is: the right-most index should be the inner-most loop:
for (i = 0; i < M; i++) {
    for (j = 0; j < N; j++) {
        x[i][j] = ...;
    }
}
It does, but unrolling loops without any runtime data to base the heuristics on is very much a hit-or-miss affair (and missing is expensive), which is why no compiler I know of (GCC included) enables -funroll-loops or its equivalent by default at any of the standard optimization levels (-On).
In GCC, the only option which enables -funroll-loops (apart from explicitly enabling it) is -fprofile-use, the consuming half of GCC's profile-guided optimization.
It enables -funroll-loops because the runtime statistics gathered during the profiling run give the compiler enough information to unroll loops accurately without risking performance degradation.
If you declare your pointers with the restrict keyword and tell the compiler to allow reassociation of floating-point additions, it will perform exactly the same optimizations, without any hand-holding from the programmer. In some C compilers, these are even the defaults.
At university, one of the exercises in the HPC class was to optimise a piece of Fortran code. I managed to beat the TA's code by other means, but missed one of his significant optimisations: manually unrolling a dot product. This was on Intel's Fortran compiler, though admittedly almost 10 years ago. (It did already support autovectorisation etc. at the time)
Anyway, goes to show you still need to look at the disassembly and do plenty of profiling with Fortran binaries too. No such thing as a free lunch.
I find the example incredibly confusing. I expected this to be an optimization of a dot product or matrix multiplication or similar, but as it stands it seems to be an element-wise square of a VECTOR_LEN * N matrix with subsequent accumulation of the columns (not the most common operation in typical signal processing).
It should also be noted that this can be trivially implemented in direct SIMD using GCC/clang vector extensions, without having the compiler 'guess' the SIMD part, which, by the way, would also make the code much easier to read.
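Roughly along these lines with the vector extensions (untested; assumes n is a multiple of 4 and the rows are 16-byte aligned):

typedef float v4sf __attribute__((vector_size(16)));

/* acc[i] += v[j][i] * v[j][i], done four floats at a time. */
void accumulate_squares(float **v, float *acc, int n, int vc)
{
    for (int j = 0; j < vc; j++)
        for (int i = 0; i < n; i += 4) {
            v4sf x = *(v4sf *)&v[j][i];
            *(v4sf *)&acc[i] += x * x;
        }
}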
Any idea what optimization flags were used? It's rather strange that they're not reported. I would be surprised if GCC 4.8 was that far from optimal with -O3.
Do you say this as someone familiar with assembly and GCC? My usual guess would be that you can often hope for a 50% speedup in a tight loop by dropping from C to assembly, and that a 2x speedup over GCC is not uncommon.
The original author's code isn't available for this example, but I put together something I think is comparable. I may still have silly bugs, but my initial results on Sandy Bridge look something like this:
icc 13.0.1 -O3 -march=native -fno-inline wrong-loop: 1.35 s
icc 13.0.1 -O3 -march=native -fno-inline right-loop: 0.78 s
icc 13.0.1 -O3 -march=native -finline-functions wrong-loop: 0.22 s
icc 13.0.1 -O3 -march=native -finline-functions right-loop: 0.22 s
gcc 4.8.0 -O3 -march=native -fno-inline wrong-loop: 1.38 s
gcc 4.8.0 -O3 -march=native -fno-inline right-loop: 1.14 s
gcc 4.8.0 -O3 -march=native -finline-functions wrong-loop: 1.35 s
gcc 4.8.0 -O3 -march=native -finline-functions right-loop: 1.14 s
There are all sorts of things I might be doing differently (or wrong), but I'm printing out a total-of-totals so I know it's at least going through the loops. It's possible the difference comes down to a fast-math optimization, but I wouldn't bet on GCC -O3 being close to optimal.
I made a simple test
void nsum(float **v, float *acc, int n, int vc)
{
    int j, i;
    for (i = 0; i < n; i++)
        for (j = 0; j < vc; j++)
            acc[i] += v[j][i] * v[j][i];
}
And then I tested the same function with a different declaration
void nsum(float *restrict *v, float *restrict acc, int n, int vc)
The version without the restrict qualifiers had a 1.01 s runtime; the version with restrict had a 0.45 s runtime. Both were compiled with identical flags (just -O3) using the ancient gcc 4.4.5 (the vectorizer is enabled by default at -O3 even in this version).
That's a 2x speedup from a simple change to the pointer declarations.
Normally I'd use restrict and float pointers, but since I was trying to repeat what the original poster did, I used fixed arrays instead. Because of this, I did not see a difference with 'restrict'. But I might be missing something, or might have messed up with the array indexing. The generated GCC optimized function is 500 instructions long, and thus difficult to scan. I put my untested test code up here: http://pastebin.com/qB0DfkXN
As you can see, it stores dst[y] on each iteration. With a function definition of:
void sum_of_squares_1(float dst[restrict ROWS], float src[restrict ROWS][COLS])
The disassembly becomes completely different. However the speed of the end result did not really change that much.
Could you throw objdump -d of the best icc output to pastebin? I'm interested to see what kind of code it produces.
> My usual guess would be that you can often hope for a 50% speedup in a tight loop by dropping from C to assembly
The problem with inline assembler is that it is almost untouchable by the optimizer. By adding some inline asm, you may inhibit a lot of optimization that could give better perf overall.
For this kind of task it is often a lot better to use intrinsics (e.g. xmmintrin.h for SSE) or compiler extensions such as __attribute__((vector_size(16))). This way you can utilize the CPU features you have available while still allowing the optimizer to do high-level optimizations.
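For example, a rough sketch of the SSE-intrinsics route (untested; assumes n is a multiple of 4):

#include <xmmintrin.h>

/* Unlike an inline-asm block, the optimizer can still inline, schedule,
   and unroll around these intrinsics. */
float sum_of_squares_sse(const float *x, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(&x[i]);             /* unaligned load of 4 floats */
        acc = _mm_add_ps(acc, _mm_mul_ps(v, v));    /* per-lane acc += v * v */
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}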
While there is lots to be said for the maintainability of intrinsics, I have found inline assembly to be significantly better for performance. And this is precisely because it inhibits the compiler from blindly performing 'optimizations' in the section of code you've already optimized. This thread offers an example and some numbers: http://software.intel.com/en-us/forums/topic/480004
I was under the impression that the parts of performance-oriented programs that typically get converted to assembly are small profiled hotspots like very tight loops; as such, I doubt there's much real performance to be had from high-level optimizations interacting with that code, as intrinsics/extensions would allow.
But I'm certainly no expert in this area, so take my opinion with a large grain of salt.
I get no significant difference with -Ofast for either icc or gcc. My code is still untested and quite possibly buggy, but I put it up here: http://pastebin.com/qB0DfkXN
Was anyone able to locate the speed of the Fortran implementation on the same hardware? It'd be quite interesting to see a similar optimization approach done by someone experienced in Fortran, with results from the simple implementation through to more optimized ones.