I'm disappointed it says literally nothing about auto-vectorization. I had to do an incredible amount of hand-holding to get GFortran 4.9's auto-vectorization to work worth anything. A few experiments with C code showed that GCC 4.9 was just plain bad at it compared to even Clang, much less ICC. If I work on that project again, I'll probably look into getting DragonEgg [1] to work so I can get decent optimization without contorting the code as much.
The OP mentions an important case with loads and stores of 3D data, but GCC needs to up its game in more ways than that.
Only loosely related, but I'm curious: What compiler optimizations have the biggest impact on scientific / floating point computation? Integer (audio/image) ops? With modern CPUs performing speculative execution, register renaming, and all the other magic they do, the CPU is acting like an optimizer in its own right. x64 is mostly just a byte code that gets JIT compiled at runtime. I'd be interested in seeing how much specific optimizations (including classic ones) contribute to speeding things up.
This depends entirely on the specific code you're talking about. If the compiler finds that a specific optimization is applicable to a certain block of code, then it uses it.
So your question can't really be answered; the only possible answer is "it depends". If you want a guess anyway: anywhere from nothing to an order of magnitude, with most cases probably around 5-20%.
Yeah, I realize it was an open-ended question. I'm just curious if one was starting a (C-like) compiler from scratch, where the biggest bang for the buck comes from. It seems like peephole optimizations are already handled at runtime by the CPU itself.
> What compiler optimizations have the biggest impact on scientific / floating point computation?
In my experience, auto-vectorization is the big one. Modern CPUs do 2-, 4-, or 8-wide operations, but it can be hard to convince the compiler to use them. Next is loop tiling, to keep things in cache(s) where possible. These are both hard to do by hand.
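To give a rough idea of what tiling means in practice, here's a minimal sketch (not anything from the article; B_TILE is just an illustrative constant, chosen so one block of the working set fits in cache):

    #define B_TILE 64  /* illustrative block size; tune per cache level */

    /* Blocked matrix multiply; C must be zero-initialized by the caller. */
    void matmul_tiled(int n, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += B_TILE)
            for (int kk = 0; kk < n; kk += B_TILE)
                for (int jj = 0; jj < n; jj += B_TILE)
                    /* work on one B_TILE x B_TILE block at a time so it stays in cache */
                    for (int i = ii; i < ii + B_TILE && i < n; i++)
                        for (int k = kk; k < kk + B_TILE && k < n; k++) {
                            double aik = A[i*n + k];
                            for (int j = jj; j < jj + B_TILE && j < n; j++)
                                C[i*n + j] += aik * B[k*n + j];
                        }
    }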
Loop interchange to get better locality is nice, but can be done by hand without much trouble by someone who understands computers.
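For anyone unfamiliar, the by-hand interchange looks something like this (a toy sketch; the array size is arbitrary):

    /* C stores arrays row-major, so walking the last index fastest is sequential in memory. */
    void scale_strided(double a[1024][1024], double s)
    {
        for (int j = 0; j < 1024; j++)      /* outer loop over columns: strided, cache-hostile */
            for (int i = 0; i < 1024; i++)
                a[i][j] *= s;
    }

    void scale_interchanged(double a[1024][1024], double s)
    {
        for (int i = 0; i < 1024; i++)      /* interchanged: sequential accesses, easy to vectorize */
            for (int j = 0; j < 1024; j++)
                a[i][j] *= s;
    }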
Does anyone know if -O3 (or -O2?) -march=native should be enough to get reasonable optimization for running on the same CPU as the gcc host? Or is one better off tweaking options manually? (Note: I know that knowing the details of how gcc works will always beat not knowing them -- I'm just wondering whether -march=native is currently considered stable/"good enough" for reasonable values of "enough" ;-)
I think yes. A couple of weeks ago I tested compilation for a project on Haswell and Sandy Bridge using various recent versions of Clang, GCC, and ICC to determine whether it was safe to use just "-O3 -march=native" in combination with "#include <x86intrin.h>" instead of more specific versions.
While my testing was far from rigorous, my conclusion was that this is now sufficient and acceptable to get appropriate platform-specific SIMD optimization.
You'll find some recent arguments for "-Ofast" instead of "-O3" or "-O2", but compiler versions that don't support it are still recent enough to be common. Others occasionally argue that "-Os" is a better modern default, but I haven't found this to be true. Although more debatable, I'd suggest "-g -Wall -Wextra" also be part of the default compiler options for most projects.
It's probably fine. Back in the day (early 00s) I used Gentoo and I think pretty much everyone used march (because why not?) with no real negative side effects. It may be a little more buggy with uncommon or new architectures.
I did have the occasional weird breakage with -Os, though (once it broke make...). Anyways, -O2, -O3, and -Os should be pretty reliable as I imagine they are the most used optimization flags these days.
You generally don't have to tweak options manually unless a small increase in performance is very important to your application, or you are running specialized programs that would benefit greatly from a specific optimization. Remember -- most Linux distributions use pretty basic CFLAGS and are reasonably performant.
Come to think of it, the Gentoo project has probably been useful in rooting out weird cflag bugs in GCC. :)
I remember that -flto sometimes adds a few percent to overall speed.
If you are doing a lot of floating point math you can experiment with the various math modes. The first thing to try is always -Ofast, which turns on the -ffast-math flag. From the gcc page:
-ffast-math
Sets -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range.
This option causes the preprocessor macro __FAST_MATH__ to be defined.
This option is not turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
I also remember that gcc started producing significantly faster code for my projects at around 4.7 or 4.8 version (can't remember which one).
`-ffast-math` is a double-edged sword. It can frequently optimize away stuff like NaN checks and the like, breaking code that uses NaN as a missing value. This, specifically, has bitten me before, but there are certainly other areas that it can bite you.
OTOH it gives the compiler license to do a great deal of algebraic simplification, including expression reordering. This will probably bring what the compiler can actually do more in line with what you think it should be able to do.
Basically, if you're going to use `-ffast-math`, you need to test with it enabled. You might also have good luck with turning on only a subset of its flags. For example, IIRC in the project with the NaN checks, using all of them except `-ffinite-math-only` fixed the problem.
Some of them are obvious to turn on though. `-fno-math-errno` should be the default for most programs, if you ask me. I've never seen anybody check `errno` to see if their call to, e.g. `sqrt`, was invalid, and I hope I never do.
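To make the NaN pitfall above concrete, here's a minimal sketch of the kind of code that breaks (a hypothetical function, not from my actual project):

    #include <math.h>

    /* Sum values, treating NaN as "missing". Under -ffast-math
     * (specifically -ffinite-math-only) the compiler may assume x[i] is
     * never NaN and drop the isnan() check, so NaNs leak into the sum. */
    double sum_ignoring_missing(const double *x, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            if (!isnan(x[i]))
                s += x[i];
        }
        return s;
    }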
I'm so glad to see auto-vectorization happening more and more often. However, I wonder whether a language that had built-in support for primitive floating-point vector types (e.g., GLSL's vec3, vec4, mat3, mat4) could help the compiler with performing these sorts of optimizations.
I spent a lot of the past month improving the RenderScript (Android data-parallel compute, using C99 plus vectors) codegen to better work with the LLVM vectorizers, so I have a fair amount of experience with this exact question.
The vector types help in some ways--if you use only vectors and do arithmetic only on entire vectors, the SIMD codegen already works great--but they don't really help more than an intelligent vectorizer could. For example, the LLVM loop vectorizer can only handle loops where the induction variable changes by 1 or -1 each iteration. As a result, you couldn't set A[i], A[i+1], and A[i+2] in a for loop where i += 3 each iteration. If you could do that, you wouldn't really need vec3 in the first place. (Also, the presence of any vector types whatsoever prevents any vectorization in LLVM right now, so...)
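Concretely, a loop like this (a hypothetical example of what I mean) won't be handled by the loop vectorizer because the induction variable steps by 3:

    /* Scale interleaved xyz triples in place; i += 3 defeats the LLVM loop vectorizer. */
    void scale_xyz(float *A, int n, float s)
    {
        for (int i = 0; i + 2 < n; i += 3) {
            A[i]     *= s;   /* x */
            A[i + 1] *= s;   /* y */
            A[i + 2] *= s;   /* z */
        }
    }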
Another issue is that using vectors for storage and vectors for computation are very different things. I don't think anyone actually likes using vectors for computation, but vectors for storage make a lot of sense in some fields like image processing. However, as soon as you start operating on single channels, things get messy. Let's say you have a for loop where each iteration operates on a float4 as individual channels. Let's also say you want to turn each of those single channel operations into its own float4 operation, vectorizing across four iterations of the loop. Given most current SIMD ISAs, you're going to have to do four 32-bit loads of A[i].x, A[i+1].x, etc., then pack that into a vector, do your arithmetic ops, unpack the vector, and do four 32-bit writes to memory. Unsurprisingly, this is not particularly fast, and if you're doing one mul or FMA per channel, you shouldn't be vectorizing at all. This is why you see cost models in vectorizers (to prevent this sort of packing/unpacking from killing your performance when you enable your vectorizer) as well as why you see newer SIMD ISAs like AVX2 in Haswell including support for scatter/gather and permute.
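A hypothetical example of the per-channel case I mean (standard BT.601 luma weights, purely for illustration): vectorizing this across four iterations means gathering four .x values, four .y values, and so on, which is exactly the packing/unpacking cost described above.

    typedef struct { float x, y, z, w; } float4;

    /* Per-channel arithmetic on AoS data: each iteration touches one float4. */
    void luma(const float4 *rgba, float *out, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = rgba[i].x * 0.299f + rgba[i].y * 0.587f + rgba[i].z * 0.114f;
    }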
The last issue is that it's trivial to break a vectorizer because of easy-to-overlook things like aliasing. Missing a single restrict will prevent vectorization if the compiler can't prove that pointers won't alias, for example (why do you think people still use Fortran in HPC?). There's actually been a lot of great work here in LLVM over the past few months with new aliasing metadata for LLVM IR (http://llvm.org/docs/LangRef.html#noalias-and-alias-scope-me...), which is what I used to make the SLP vectorizer work with RenderScript in a way similar to ISPC (except in the compiler back end instead of the front end, because we don't even know the target ISA at the time the RS source is compiled to LLVM IR). I'll probably get the patch in AOSP in the next week or two if you want to keep an eye on that; it needs a newer LLVM version than what shipped in L and we're finishing up that rebase.
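The classic example (a minimal sketch): drop either `restrict` below and the compiler has to assume `dst` and `src` might overlap, so it either refuses to vectorize the loop or emits runtime overlap checks.

    /* With restrict, the compiler knows dst and src don't alias and can vectorize freely. */
    void axpy(float *restrict dst, const float *restrict src, float a, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] += a * src[i];
    }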
(honestly, I think that if you're trying to get good server CPU performance and you know exactly what CPU you're going to be using at compile time, you should be looking at ispc instead of doing SSE intrinsics yourself: https://ispc.github.io/ )
Fabulous comment, please keep posting more and sharing your knowledge/experience. And thanks for the reminder about 'ispc'. A couple of tiny additions:
While loading each element with an individual 32-bit load usually negates the advantage of vectorization, if you are using 8-bit data, doing a single vector load and shuffling bytes within a 128-bit vector is really fast with AVX. And shuffling double- and quad-words within 256-bits with AVX2 can be a useful alternative even if you have to do it twice and combine. But even better is to rearrange your data layout to match the operations you are going to be doing.
AVX2 doesn't actually offer scatter support, although AVX-512 will. And AVX2 gather on Haswell normally isn't any faster than multiple individual loads, although it's no worse, and the expectation is that future generations will improve on this significantly. Also worth pointing out is that 256-bit YMM vectors are conceptually split in two, so apart from the slightly slower 32/64/128-bit permute instructions, it's not possible to move data from one 128-bit 'lane' to the other.
sorry, the desktop Intel SIMD ISAs are generally not something I use on a day-to-day basis, so I get them mixed up a lot. (LRBni was the last one I looked at for any serious length of time.)
It's the age-old argument: "low level has less abstraction, helping the compiler" versus "high level allows you to express what you actually mean, helping the compiler".
High level isn't really helping, it is giving the compiler room to either screw up or shine.
It is great if the compiler shines, but frustrating if it screws up.
And yes, that applies to other tools, too. TeX, for instance, does a great job most of the time given only a few specific directions, but sometimes it doesn't quite do what you think is possible. Then you may have a long fight ahead to make it do what you want.
Intel's ISPC compiler is an alternative approach that in my experience is easy to code for and produces impressive results. Many CUDA-style algorithms port to it with mostly syntactic changes, and you can interleave CUDA-style code with the usual x86-friendly control flow and data structures.
The article mentions two: unpacking 24-bit values and matrix multiplication; both are things you might find in a software video (de)compressor or a computer game.
Games, at least the ones I've seen, usually just don't use 3x(8|16|32) bits per pixel, even when the fourth channel isn't needed as alpha. It's pretty rare to pack 24/48/96-bit pixels in memory. Traditionally, unaligned loads just haven't been worth it.
Sometimes packed formats were worth it. I remember doing RLE schemes for fast "alpha blending" (anti-aliased blitting really!) back in 2000 or so, before MMX or GPUs were common. The scheme I used had three different types of run lengths. Completely transparent, pre-multiplied alpha blended and completely opaque. 16-bit pre-multiplied values were in memory in 32-bit format like this: URGBU00A. Groups of 5 bits, except U denotes unused single bit. Frame buffer in RAM was in 15 bit format, 0RRRRRGGGGGBBBBB. With this scheme, alpha-blending needed totally just one multiply per pixel [1], compared to three with usual pre-multiplied alpha or six with normal alpha without pre-multiplication.
[1]: Frame buffer pixels were masked and shifted into 000000GGGGG000000RRRRR00000BBBBB, multiplied by A and shifted back in place. Because all values were 5 bit entities, no overflow (overlap) could occur in multiply, thus one multiply yielded 3 multiplied frame buffer pixel values, for R, G and B!
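In case the bit-twiddling isn't obvious, here's roughly how that one-multiply scaling works (a reconstruction from memory, so field positions are illustrative; `a` is the 5-bit factor applied to the framebuffer pixel, i.e. the complement of the source alpha, and the pre-multiplied source pixel gets added on top afterwards):

    #include <stdint.h>

    /* dst is a 15-bit 0RRRRRGGGGGBBBBB pixel, a is in 0..31. */
    static inline uint16_t scale_pixel_one_mul(uint16_t dst, uint32_t a)
    {
        /* Spread into 000000GGGGG000000RRRRR00000BBBBB so the three
         * 5-bit x 5-bit products (at most 10 bits each) can't overlap. */
        uint32_t spread = ((uint32_t)(dst & 0x03E0) << 16) | (dst & 0x7C1F);

        /* One multiply scales R, G and B at once; >>5 keeps the high
         * bits of each 10-bit product. */
        uint32_t scaled = (spread * a) >> 5;

        /* Collapse back to 0RRRRRGGGGGBBBBB. */
        return (uint16_t)(((scaled >> 16) & 0x03E0) | (scaled & 0x7C1F));
    }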
1. Image conversion (RGB structure to some other format)
2. N-dimensional coordinates (normalizing an array of XYZ points; see the sketch after this list)
3. Multiplication of vectors by a constant matrix
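For case 2, a sketch of the kind of loop I mean (assuming an SoA layout with separate x/y/z arrays, which is the vectorizer-friendly way to store the points):

    #include <math.h>

    /* Normalize n points to unit length; with SoA storage this loop is
     * straightforward for an auto-vectorizer. */
    void normalize_points(float *x, float *y, float *z, int n)
    {
        for (int i = 0; i < n; i++) {
            float inv = 1.0f / sqrtf(x[i]*x[i] + y[i]*y[i] + z[i]*z[i]);
            x[i] *= inv;
            y[i] *= inv;
            z[i] *= inv;
        }
    }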
People contribute to compilers because they need features or performance improvements, not "to kick that other compiler's ass". Sometimes (as here) contributions come from CPU vendors who want to make certain that compilers show off the features of their new hardware. There were tons of great performance improvements in every new version of GCC before clang, and there continue to be today. You just can't model an open source compiler the way you model some closed source word processor :(.
The real question that should be asked is whether, now with two compilers to contribute features to instead of one, the progress of either compiler is actually faster than that of the single compiler before. Again: people aren't going to stop improving the compiler just because there is only one of them, as improvements come from contributors who need the better compiler, not from some abstract notion of "have the bestest compiler". Having multiple open source projects in the same space is a detriment--it is a cost--that hopefully is justified by the benefits.
What happens in a world of "competition", is that now Intel has to ask whether it is valuable to add this feature to GCC or to clang; maybe they have time for both, but probably not. That this work ends up in one compiler and some other work ends up in a different compiler is unfortunate. It isn't even clear who is "competing" or what they are "competing" for. It certainly isn't the performance optimization itself: you can't credit the GCC team (whatever that might even mean) for that, as this work is from Intel. GCC is an open source project that is contributed to by a vast number of independent actors, not centrally developed by a small handful of people.
The argument for clang and LLVM being helpful is therefore not "competition" (which would make absolutely no sense), but instead "GCC has accumulated years of entrenched architecture that, in 2010 as opposed to 1980, we would design differently: LLVM is thereby designed in a way that is easier to modify; so, even though effort is now forked (meaning fewer contributions to each compiler), a smaller number of contributions will have a more powerful effect on the state of compilers". If this is true (and many think it is), then that's great, but it isn't "competition" in the way people normally consider that concept.
Stallman's reaction¹ to LLVM was noisy and possibly childish, but it gave the FSF another important goal: not only should it supply a "libre" compiler but it should also be worth using over the first alternative that comes along. That second goal is a moving target, unlike the first, and it supplies a real motivation to make the best compiler they can.
One thing that is not clear to me is whether OOP code such as copy constructors will benefit from the vector registers. Just having vectorized copies of objects may turn out to make a big difference.
Otherwise, examples with 4-element arrays are nice to show, but in most of these cases it is also easy to just use intrinsics.
EDIT: I was looking at this for __builtin_mul_overflow, which is apparently in clang already, for testing overflow of arbitrary types.
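For reference, usage looks something like this (the builtin returns true if the product overflowed the result's type; the wrapper name here is just made up):

    #include <stdbool.h>
    #include <stddef.h>

    /* Returns true and stores count*elem in *out only if it fits in size_t. */
    bool checked_mul_size(size_t count, size_t elem, size_t *out)
    {
        return !__builtin_mul_overflow(count, elem, out);
    }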