To put this in context, this particular bug makes your binary fatter and slower in the same way that eating a tic tac makes you fatter and slower. A single register save/restore through memory on a slightly (and only slightly) slower path is going to be blown away by every other operation in all but the most trivial programs (imagine LOTS of recursion through very simple, loop-free functions). Do a single I/O operation and the difference amounts to nothing.
I agree that it's a bug and should be fixed. I just don't agree with the implication of the title that it's a super huge deal. I mean, all we're doing is swapping out mov for push and leave. The functionality is completely unaffected, and the slowdown is so minor that no one would ever notice without running a ridiculous microbenchmark.
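For anyone who hasn't looked at the generated code, here's a rough sketch of the kind of difference in question (a made-up function, not the testcase from the bug report):

    /* Made-up function, not the testcase from the bug report.
     *
     * The conventional 32-bit prologue/epilogue with a frame pointer:
     *
     *     push %ebp          # 1 byte
     *     mov  %esp, %ebp    # 2 bytes
     *     ...body...
     *     leave              # 1 byte (mov %ebp,%esp; pop %ebp)
     *     ret
     *
     * The complaint is that gcc was saving and restoring the register
     * with explicit movs through a stack slot instead, which costs a
     * few extra bytes and a marginally slower path, but computes
     * exactly the same thing. */
    int add_one(int x)
    {
        return x + 1;
    }

Either way, add_one(41) returns 42; only the size and speed of the prologue/epilogue change.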
Fix the bug, but don't make a mountain out of a molehill.
In most cases it should make things faster. Any kind of optimization is tricky, because you can't predict what the user is going to do in the field. Take sorting algorithms, for example: quicksort degenerates to O(n^2) if you lay out the elements to be sorted in an order that always produces bad pivots. I'm not sure if you are familiar with C and x86 assembly, but the test harness code was linked to in the article. It is a contrived example specifically designed to defeat gcc's optimization.
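To make the quicksort point concrete, here is a minimal sketch (not from the article, and nothing to do with the gcc bug) of a first-element-pivot quicksort; already-sorted input hands it a bad pivot on every partition, so it does O(n^2) work:

    #include <stdio.h>

    /* Minimal sketch, not from the article: quicksort that always picks
     * the first element as the pivot.  On already-sorted input every
     * partition leaves 0 elements on one side and n-1 on the other, so
     * the recursion is n levels deep and the total work is O(n^2). */
    static void quicksort(int *a, int lo, int hi)
    {
        if (lo >= hi)
            return;
        int pivot = a[lo];                       /* fixed pivot choice: the weak spot */
        int i = lo;
        for (int j = lo + 1; j <= hi; j++) {
            if (a[j] < pivot) {
                int t = a[++i]; a[i] = a[j]; a[j] = t;
            }
        }
        int t = a[lo]; a[lo] = a[i]; a[i] = t;   /* put the pivot in place */
        quicksort(a, lo, i - 1);
        quicksort(a, i + 1, hi);
    }

    int main(void)
    {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};     /* already sorted: worst case */
        quicksort(a, 0, 7);
        for (int i = 0; i < 8; i++)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }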
Unless you are eating 1,000 tic-tacs a minute, in which case they really start to add up.
We're talking about the potential to slow down every single function by inserting unnecessary register save/restores. The compiler is operating at a level where this sort of thing matters, and will make a difference in your code.
If you're dominated by function calls, you're already wrong. Function calls are way expensive, because you're resetting like 4 registers and screwing over the instruction cache.
Give me a non-contrived example where this tiny amount of function call overhead makes an actual difference, and I'll eat my words.
A "non-contrived" example would require a whole-app benchmark; a micro-benchmark is already provided by the post author.
Nobody said the sky is falling (or that we're making a "mountain" out of it), but it is slower, and if it wasn't important to optimize function prologues/epilogues and register allocation, why exactly do we bother doing it at all?
To turn it around, give me a non-contrived example of a large code base where this "tiny" amount of function call overhead doesn't make an actual wall-clock difference, and I'll eat my words.
To put your complaint in context, you seem to be saying that any micro-optimization of function call overhead is not worth noting because it's noise amidst I/O, syscall overhead, or a million other things that are also expensive. That position ignores the fact that the cheaper you make everything else, the more time you have left over for the things you can't make cheaper.
No, that's not really correct. The call into the function and the return carry the added cost of two extra instructions. But the trade-off is that the function body itself now has an extra register to work with, so you should see an improvement in the execution speed of the function. The compiler is in fact operating at a level where this sort of thing matters, and in the real world function execution time will dominate call and return time.
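A rough illustration of the register-pressure point (made-up function, not from the benchmark): ia32 has only eight general-purpose registers, %esp is the stack pointer and %ebp is normally the frame pointer, so freeing %ebp takes the allocator from six usable registers to seven.

    /* Made-up example, not from the benchmark.  With this many values
     * live at once (four accumulators, two pointers, the index and the
     * bound), one extra register can be the difference between keeping
     * everything in registers and spilling to the stack on ia32. */
    int dot4(const int *a, const int *b, int n)
    {
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i + 3 < n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }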
It's nice to see someone tell you that their benchmark is flawed and why. Most people try to pass off their benchmark as the be-all and end-all measurement of whatever they are testing.
If the scales were flawed, the flaw was in -fomit-frame-pointer's defense: according to his benchmark, it reduces the cycle count by ~0.3% in 32b, but blows it up by nearly 30% (29.7%) in 64b.
I have put both graphs on the same image and scaled them consistently, then extended the canvas to the full scale and finally shrunk the graph back to the original 600px height; this is the result: http://imgur.com/yG3R9
64b on the left, 32b on the right; blue is without -fomit and red is with it. On the 32b graph you can barely discriminate between with and without, whereas on the 64b graph you can see it very clearly.
If he lied at all, it was by dismissing his own findings as less important than they are.
- It's hard to reproduce the benchmarking results, as source code for the benchmarks is not provided.
- The original bug report was for 32-bit code, not the 64-bit code the post assumes throughout.
- If you compile the code given in the original bug report as 64-bit code with and without -fomit-frame-pointer, there's no difference in the generated code.
- It's not clear to me that the "potential pieces of code" in the article can actually be generated from real-world C code. Again, not having actual source code available hurts.
- You shouldn't be using -fomit-frame-pointer on 64-bit code anyway, as you don't need the frame pointer for debugging/unwinding purposes on x86-64 like you do on x86. If the poster had read the x86-64 ABI, this would have been apparent.
Maybe you should provide both graphs on a full scale (from 0 to 4.8) to show just how little, in your benchmark, -fomit-frame-pointer brings to the table in 32b (under half a percent using your mean cycle count) versus how much is lost because of it in 64b (nearly +30% cycles).
Testing harness, plus scripts to build it and run it: http://gist.github.com/483524 -- yes, I was too lazy to make a makefile.
You will still need some command-line fu to separate the results into different files so you can load them into whatever maths program you want.
But if you add .align 8 right before the definition of test2 in the assembly file (i.e. make it 8-byte aligned, just like test1), I get the following numbers:
FWIW, the code that uses movs rather than pushes and pops ought to be faster since (generally speaking for larger prologues and epilogues) you can execute a series of movs in parallel, whereas your pushes and pops are serialized, since they're all updating a common resource (the stack pointer). Empirical testing on benchmarks like SPEC2k has borne this out, both on x86 and x86-64. (You ought to be able to see this effect with gcc, depending on what cpu you use for the -mtune switch.) As you noted, this strategy carries a size penalty, since movs are somewhat larger than pushes and pops.
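Sketched out (hypothetical sequences, not taken from any particular gcc output), the dependency difference for a function that wants three callee-saved registers looks roughly like this:

    /* Hypothetical sequences, not actual gcc output.
     *
     * push/pop style -- every instruction implicitly updates %esp, so
     * the saves form a dependency chain on the stack pointer:
     *
     *     push %ebx
     *     push %esi
     *     push %edi
     *     ...body...
     *     pop  %edi
     *     pop  %esi
     *     pop  %ebx
     *
     * mov style -- one stack-pointer adjustment, then independent
     * stores/loads the core can schedule in parallel, at the cost of
     * larger encodings:
     *
     *     sub  $12, %esp
     *     mov  %ebx,  0(%esp)
     *     mov  %esi,  4(%esp)
     *     mov  %edi,  8(%esp)
     *     ...body...
     *     mov   0(%esp), %ebx
     *     mov   4(%esp), %esi
     *     mov   8(%esp), %edi
     *     add  $12, %esp
     */
    int sum3(const int *a, const int *b, const int *c, int n)
    {
        int s = 0;
        for (int i = 0; i < n; i++)   /* enough live pointers to want ebx/esi/edi */
            s += a[i] + b[i] + c[i];
        return s;
    }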
I'll also note that on my machine, with gcc saying it's:
I get identical assembly for compiling the testcase from the PR with and without -fomit-frame-pointer (I should have noted the gcc version I was using, just as you did. My bad.) Furthermore, for:
which is a somewhat patched version of GCC circa 4.3.2, I get identical assembly. So with four different flavors of GCC, there's no difference on the testcase in the PR with and without -fomit-frame-pointer. I'd be willing to bet that there's no difference with 4.5.x and mainline GCC either. It looks like Debian may just have a peculiar set of patches to its version of GCC.
It sometimes pays to run your tests at different levels of optimization. There are optimizer and code-generation bugs in compilers, and running your tests at several optimization levels can catch some of them.
Definitely. Even better, run them inside valgrind or a similar tool, as this type of bug is often laughably easy to locate that way, and sometimes devilishly difficult to find with cruder methods.
The analysis is flawed because it merely accounts for the fact that ebp is callee-saved versus using a different register that is caller-saved. But ultimately the register still needs to be spilled somewhere, so you're not changing the overall amount of work done (see the sketch below).
Furthermore, it doesn't take into account the fact that by using ebp as a general-purpose register you've got an additional register to work with, which can eliminate other spills.
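To make the first point concrete, here is a hedged sketch (hypothetical code, not the article's): a value that stays live across a call has to be preserved either way; the only question is whether the callee saves the register once or the caller spills around each call.

    /* Hypothetical code, not the article's.  'total' is live across
     * every call to helper().  If the compiler keeps it in a
     * callee-saved register (say %ebx), accumulate() pays one
     * save/restore of %ebx in its own prologue/epilogue.  If it keeps
     * it in a caller-saved register (say %ecx), it must instead store
     * and reload the value around each call.  The save/restore work
     * moves around; it doesn't disappear. */

    /* stand-in callee; noinline so the call stays a real call */
    __attribute__((noinline)) static int helper(int x) { return x + 1; }

    int accumulate(const int *a, int n)
    {
        int total = 0;
        for (int i = 0; i < n; i++)
            total += helper(a[i]);
        return total;
    }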