To put this in context, this particular bug makes your binary fatter and slower in the same way that eating a tic tac makes you fatter and slower. A single register save/restore through memory on a slightly (and only slightly) slower path is going to be blown away by every other operation in all but the most trivial programs (imagine LOTS of recursion through very simple, loop-free functions). Do a single I/O operation and the difference amounts to nothing.
I agree that it's a bug and should be fixed. I just don't agree with the implication of the title that it's a super huge deal. I mean, all we're doing is swapping out mov for push and leave. The functionality is completely unaffected, and the slowdown is so minor that no one would ever notice without running a ridiculous microbenchmark.
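For anyone who hasn't looked at the generated code, here's a rough sketch of the kind of difference in question (a made-up function, not the testcase from the bug report):

    /* Made-up function, not the testcase from the bug report.
     *
     * The conventional 32-bit prologue/epilogue with a frame pointer:
     *
     *     push %ebp          # 1 byte
     *     mov  %esp, %ebp    # 2 bytes
     *     ...body...
     *     leave              # 1 byte (mov %ebp,%esp; pop %ebp)
     *     ret
     *
     * The complaint is that gcc was saving and restoring the register
     * with explicit movs through a stack slot instead, which costs a
     * few extra bytes and a marginally slower path, but computes
     * exactly the same thing. */
    int add_one(int x)
    {
        return x + 1;
    }

Either way, add_one(41) returns 42; only the size and speed of the prologue/epilogue change.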
Fix the bug, but don't make a mountain out of a molehill.
In most cases it should make things faster. Any kind of optimization is tricky, because you can't predict what the user is going to do in the field. Take sorting algorithms, for example: quicksort degenerates to O(n^2) if you lay out the elements to be sorted in an order that always produces bad pivots. I'm not sure if you are familiar with C and x86 assembly, but the test harness code was linked to in the article. It is a contrived example specifically designed to defeat gcc's optimization.
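To make the quicksort point concrete, here is a minimal sketch (not from the article, and nothing to do with the gcc bug) of a first-element-pivot quicksort; already-sorted input hands it a bad pivot on every partition, so it does O(n^2) work:

    #include <stdio.h>

    /* Minimal sketch, not from the article: quicksort that always picks
     * the first element as the pivot.  On already-sorted input every
     * partition leaves 0 elements on one side and n-1 on the other, so
     * the recursion is n levels deep and the total work is O(n^2). */
    static void quicksort(int *a, int lo, int hi)
    {
        if (lo >= hi)
            return;
        int pivot = a[lo];                       /* fixed pivot choice: the weak spot */
        int i = lo;
        for (int j = lo + 1; j <= hi; j++) {
            if (a[j] < pivot) {
                int t = a[++i]; a[i] = a[j]; a[j] = t;
            }
        }
        int t = a[lo]; a[lo] = a[i]; a[i] = t;   /* put the pivot in place */
        quicksort(a, lo, i - 1);
        quicksort(a, i + 1, hi);
    }

    int main(void)
    {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};     /* already sorted: worst case */
        quicksort(a, 0, 7);
        for (int i = 0; i < 8; i++)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }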
Unless you are eating 1,000 tic-tacs a minute, in which case they really start to add up.
We're talking about the potential to slow down every single function by inserting unnecessary register save/restores. The compiler is operating at a level where this sort of thing matters, and will make a difference in your code.
If you're dominated by function calls, you're already wrong. Function calls are way expensive, because you're resetting like 4 registers and screwing over the instruction cache.
Give me a non-contrived example where this tiny amount of function call overhead makes an actual difference, and I'll eat my words.
A "non-contrived" example would require a whole-app benchmark; a micro-benchmark is already provided by the post author.
Nobody said the sky is falling (or that we're making a "mountain" out of it), but it is slower, and if it wasn't important to optimize function prologues/epilogues and register allocation, why exactly do we bother doing it at all?
To turn it around, give me a non-contrived example of a large code base where this "tiny" amount of function call overhead doesn't make an actual wall-clock difference, and I'll eat my words.
To put your complaint in context, you seem to be saying that any micro-optimization of function call overhead is not worth noting because it's noise amidst I/O, syscall overhead, or a million other things that are also expensive. That position ignores the fact that the cheaper you make everything else, the more time you have left over for the things you can't make cheaper.
No, that's not really correct. The call into the function and the return carry the added cost of two extra instructions. But the trade-off is that the function body itself now has an extra register to work with, so you should see an improvement in the execution speed of the function. The compiler is in fact operating at a level where this sort of thing matters, and in the real world function execution time will dominate call and return time.
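A rough illustration of the register-pressure point (made-up function, not from the benchmark): ia32 has only eight general-purpose registers, %esp is the stack pointer and %ebp is normally the frame pointer, so freeing %ebp takes the allocator from six usable registers to seven.

    /* Made-up example, not from the benchmark.  With this many values
     * live at once (four accumulators, two pointers, the index and the
     * bound), one extra register can be the difference between keeping
     * everything in registers and spilling to the stack on ia32. */
    int dot4(const int *a, const int *b, int n)
    {
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i + 3 < n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }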
It's nice to see someone tell you that their benchmark is flawed and why. Most people try to pass off their benchmark as the be-all and end-all measurement of whatever they are testing.
If the scales were flawed, the flaw was in -fomit-frame-pointer's defense: according to his benchmark, it reduces the cycle count by ~0.3% in 32b, but blows it up by nearly 30% (29.7%) in 64b.
I have put both graphs on the same image and scaled them consistently, then extended the canvas to the full scale and finally shrunk the graph back to the original 600px height; this is the result: http://imgur.com/yG3R9
64b on the left, 32b on the right; blue is without -fomit and red is with it. On the 32b graph you can barely discriminate between with and without, whereas on the 64b graph you can see it very clearly.
If he lied at all, it was by dismissing his own findings as less important than they are.
- It's hard to reproduce the benchmarking results, as source code for the benchmarks is not provided.
- The original bug report was for 32-bit code, not the 64-bit code the post assumes throughout.
- If you compile the code given in the original bug report as 64-bit code with and without -fomit-frame-pointer, there's no difference in the generated code.
- It's not clear to me that the "potential pieces of code" in the article can actually be generated from real-world C code. Again, not having actual source code available hurts.
- You shouldn't be using -fomit-frame-pointer on 64-bit code anyway, as you don't need the frame pointer for debugging/unwinding purposes on x86-64 like you do on x86. If the poster had read the x86-64 ABI, this would have been apparent.
Maybe you should provide both graphs on a full scale (from 0 to 4.8) to show just how little, in your benchmark, -fomit-frame-pointer brings to the table in 32b (under half a percent using your mean cycle count) versus how much is lost because of it in 64b (nearly +30% cycles).
Testing harness, plus scripts to build it and run it: http://gist.github.com/483524 -- yes, I was too lazy to make a makefile.
You will still need some command-line fu to separate the results into different files so you can load them into whatever maths program you want.
But if you add .align 8 right before the definition of test2 in the assembly file (i.e. make it 8-byte aligned, just like test1), I get the following numbers:
FWIW, the code that uses movs rather than pushes and pops ought to be faster since (generally speaking for larger prologues and epilogues) you can execute a series of movs in parallel, whereas your pushes and pops are serialized, since they're all updating a common resource (the stack pointer). Empirical testing on benchmarks like SPEC2k has borne this out, both on x86 and x86-64. (You ought to be able to see this effect with gcc, depending on what cpu you use for the -mtune switch.) As you noted, this strategy carries a size penalty, since movs are somewhat larger than pushes and pops.
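Sketched out (hypothetical sequences, not taken from any particular gcc output), the dependency difference for a function that wants three callee-saved registers looks roughly like this:

    /* Hypothetical sequences, not actual gcc output.
     *
     * push/pop style -- every instruction implicitly updates %esp, so
     * the saves form a dependency chain on the stack pointer:
     *
     *     push %ebx
     *     push %esi
     *     push %edi
     *     ...body...
     *     pop  %edi
     *     pop  %esi
     *     pop  %ebx
     *
     * mov style -- one stack-pointer adjustment, then independent
     * stores/loads the core can schedule in parallel, at the cost of
     * larger encodings:
     *
     *     sub  $12, %esp
     *     mov  %ebx,  0(%esp)
     *     mov  %esi,  4(%esp)
     *     mov  %edi,  8(%esp)
     *     ...body...
     *     mov   0(%esp), %ebx
     *     mov   4(%esp), %esi
     *     mov   8(%esp), %edi
     *     add  $12, %esp
     */
    int sum3(const int *a, const int *b, const int *c, int n)
    {
        int s = 0;
        for (int i = 0; i < n; i++)   /* enough live pointers to want ebx/esi/edi */
            s += a[i] + b[i] + c[i];
        return s;
    }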
I'll also note that on my machine, with gcc saying it's:
I get identical assembly for compiling the testcase from the PR with and without -fomit-frame-pointer (I should have noted the gcc version I was using, just as you did. My bad.) Furthermore, for:
which is a somewhat patched version of GCC circa 4.3.2, I get identical assembly. So with four different flavors of GCC, there's no difference on the testcase in the PR with and without -fomit-frame-pointer. I'd be willing to bet that there's no difference with 4.5.x and mainline GCC either. It looks like Debian may just have a peculiar set of patches to its version of GCC.
It sometimes pays to run your tests at different levels of optimization. There are optimizer and code-generation bugs in compilers, and running your tests at several optimization levels can catch some of them.
Definitely. Even better, run them inside valgrind or a similar tool, as this type of bug is often laughably easy to locate that way, and sometimes devilishly difficult to find with cruder methods.
The analysis is flawed because it merely accounts for the fact that ebp is callee-saved versus using a different register that is caller-saved. But ultimately the register still needs to be spilled somewhere, so you're not changing the overall amount of work done (see the sketch below).
Furthermore, it doesn't take into account the fact that by using ebp as a general-purpose register you've got an additional register to work with, which can eliminate other spills.
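To make the first point concrete, here is a hedged sketch (hypothetical code, not the article's): a value that stays live across a call has to be preserved either way; the only question is whether the callee saves the register once or the caller spills around each call.

    /* Hypothetical code, not the article's.  'total' is live across
     * every call to helper().  If the compiler keeps it in a
     * callee-saved register (say %ebx), accumulate() pays one
     * save/restore of %ebx in its own prologue/epilogue.  If it keeps
     * it in a caller-saved register (say %ecx), it must instead store
     * and reload the value around each call.  The save/restore work
     * moves around; it doesn't disappear. */

    /* stand-in callee; noinline so the call stays a real call */
    __attribute__((noinline)) static int helper(int x) { return x + 1; }

    int accumulate(const int *a, int n)
    {
        int total = 0;
        for (int i = 0; i < n; i++)
            total += helper(a[i]);
        return total;
    }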