Switching off the compiler optimizer may just "hide" the bug because memory accesses are reordered in a way that makes the bug (for instance a memory corruption or some allocation problem) less likely to trigger a segmentation fault.
It is not a good idea to assume that every bug that goes away when you switch off the optimizer is a compiler bug. 99.99999% of the time the problem is in your code.
Experienced programmers usually will continue to think the bug is in their own code unless they can prove otherwise.
If the program works at some compiler optimization levels and not others, then think about what the optimizer is doing and how this may change the circumstances under which the bug appears. I agree that it is probably a memory corruption issue and that by turning off optimizations, you are hiding the symptom and not fixing the bug.
I think there should be a law that states: if you use a language like C or C++, you must ensure it compiles cleanly with all warnings turned on AND runs without error under a tool like Valgrind.
There are simply too many places where bugs may creep in to leave it to chance. The tools exist - use them!
Very good advice. I assume my C code does have memory blunders until I have run it extensively under valgrind, after which I might begin to believe any other analysis I have done that suggests the code is correct.
I also tend to test a build linked with gcc’s mudflap:
gcc -g -fmudflap -lmudflap
Your program will run much faster than under valgrind. I have had bugs that have been missed by valgrind but caught with mudflap and vice-versa. Don’t try to link with mudflap and run under valgrind at the same time though, valgrind won’t work.
I worked at a shop where I wrote C for almost 2 years and we ran into this case twice. It was a compiler bug in the optimizer, and only happened when it tried to also optimize the way specific structures were laid out in memory. Using the zero-length array at the end of a struct to get a pointer to the following buffer in this case caused the offset to be wrong and we were overrunning our buffer.
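For anyone who hasn't met the idiom being described, here is a minimal sketch of a trailing-array struct. It uses the C99 flexible array member rather than the older zero-length array extension, and `struct message`, `message_new` and `message_byte` are hypothetical names made up for illustration, not the code from the shop in question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical message type: a fixed header followed by a variable payload. */
struct message {
    size_t len;            /* number of payload bytes that follow */
    unsigned char data[];  /* C99 flexible array member */
};

/* Allocate header and payload in one block; returns NULL on failure. */
struct message *message_new(const void *payload, size_t len)
{
    struct message *m;

    if (len > SIZE_MAX - sizeof *m)   /* guard the size computation */
        return NULL;
    m = malloc(sizeof *m + len);
    if (m == NULL)
        return NULL;
    m->len = len;
    if (len > 0)
        memcpy(m->data, payload, len);
    return m;
}

/* Read byte i with an explicit bounds check; returns -1 when out of range. */
int message_byte(const struct message *m, size_t i)
{
    assert(m != NULL);
    return (i < m->len) ? m->data[i] : -1;
}
```

The point of the sketch is that the payload address comes from the member itself (`m->data`), so the compiler computes the offset, rather than from hand-written pointer arithmetic past the end of the header.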
"Experienced programmers usually will continue to think the bug is in their own code unless they can prove otherwise."
s/Experienced/Mature/ Because ESR has plenty of experience. The kind of reasoning in the OP comes from arrogance and the narcissism of valuing one's self-concept as a competent individual over getting the job done efficiently.
Actually, that's pretty much accurate. I initially didn't even realise this was written by ESR and just assumed the author was very inexperienced. ESR however should know better, though to be honest I don't even know if the guy still writes code. I hate to stoop to ad hominem attacks, but re-reading it knowing that it's him I get the impression this was a petty reaction to some feud with a GCC dev. In any case I'd love to know who the hell is upvoting this "story".
Yes, when I started reading, my first thought was, "wow, sounds like someone pretty inexperienced." After I noticed it was written by ESR, that changed to, "wow, sounds like someone pretty arrogant."
The fact of the matter is that a compiler like gcc is used by thousands (tens of thousands? more?) of people almost daily. Usually you have to be doing some pretty crazy stuff to find a bug in it. Bugs that go away when you turn off optimizations are usually either race-condition or memory-access related.
Or uninitialised local variables, which are affected by the difference in register allocation, but really you should be enabling the corresponding warning.
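A small sketch of the uninitialised-local case, under the assumption of a made-up function `parse_mode` (hypothetical, not from anyone's code in this thread). The buggy shape is left in a comment so the block stays free of undefined behaviour. As far as I know GCC's -Wall covers -Wuninitialized and -Wmaybe-uninitialized, though some of those diagnostics only fire when optimization is enabled.

```c
#include <assert.h>
#include <string.h>

/*
 * Buggy shape (do not use): `result` is only assigned on some paths, so an
 * unrecognised `mode` returns whatever happened to be in that register or
 * stack slot, and that changes with the optimization level.
 *
 *     int result;
 *     if (strcmp(mode, "on") == 0)  result = 1;
 *     if (strcmp(mode, "off") == 0) result = 0;
 *     return result;
 */

/* Fixed shape: every path assigns a value before the return. */
int parse_mode(const char *mode)
{
    int result = -1;              /* explicit "unknown" default */

    assert(mode != NULL);
    if (strcmp(mode, "on") == 0)
        result = 1;
    else if (strcmp(mode, "off") == 0)
        result = 0;
    return result;
}
```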
Turning off the optimizer can also hide memory barriers you forgot you'd need, by doing extra loads or stores rather than keeping an outdated value in a register.
That said, I have seen exactly one optimizer bug that I know of. Back in 1993, Borland C++ completely omitted one of my inline destructors from the binary. I had to review the assembly to convince myself I wasn't imagining things.
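The forgotten-barrier case mentioned above can be sketched with C11 atomics. With a plain `bool` flag, an optimized build is free to keep the flag in a register and never see the update, while an unoptimized build happens to reload it every time. The names `payload`, `ready`, `publish` and `try_consume` are hypothetical, and this is only one way to do it, assuming a C11 compiler with `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;                 /* plain data handed from writer to reader */
static atomic_bool ready = false;   /* flag that publishes `payload` */

/* Writer: store the data first, then publish it with release ordering. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: returns true and fills *out once the writer has published. */
bool try_consume(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;               /* nothing published yet */
    *out = payload;                 /* ordered after the acquire load */
    return true;
}
```

The release store paired with the acquire load is what guarantees the reader sees `payload` as written; the atomic type is what stops the compiler from caching the flag.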
I've seen my fair share of compiler (and even assembler) bugs, but they were almost all in customised, proprietary compilers for game consoles. Those get nowhere near the number of eyeballs that GCC does. 99 times out of 100, weird bugs are my own stupid fault. The ratio is even higher for multithreaded code.
Well, the percentage is a way of saying: You do not encounter a compiler bug in day-to-day-programming. 'four compiler bugs in 12 years' is just another way of saying: You do not encounter a compiler bug in day-to-day-programming.
(Note the 'day-to-day'. Day-to-day-programming and 'all programming you do in x*10 years of C-programming' are very different sets of code written)
I once wrote some MPI code for class. It ran properly at -O0, though the compiler warned of a variable that was declared but never used. Compiling at a level other than -O0 or removing that variable declaration from the source code caused the program to segfault immediately. It turned out to be a memory error somewhere else in the program (I forget exactly what, but it was over my misunderstanding of some part of the message passing calls).
It depends on the context as well. It is much more likely to find a bug in gfortran for solaris sparc than gcc on x86 linux, for example. The only time I keep the compiler bug in mind is when working with gcc 4.x compiler on windows.
I don't have the experience of ESR, but I find the advice a bit dangerous if taken as a general one. Especially, the idea that a heisenbug is often caused by a compiler: most likely, the heisenbug is not a heisenbug at all, but just less visible depending on the compiler flags. That was the case for the vast majority of "heisenbugs" I have encountered in C.
While it's not an everyday occurrence, it happens a lot more often than you would believe. For the Linux kernel, there is actually an official list of compiler versions that produce faulty code.
My experience goes in a different direction: when you see different behavior when compiled with and without optimization, suspect a memory allocation error or overrun. While compiler optimization bugs do exist, I've more often found that the problem is real.
If you are working with your own code and care if it works:
1) Turn on all compiler warnings.
2) Change your code so it compiles clean.
3) Run under Valgrind (or equivalent).
4) Address all reported errors, specifically whitelisting them if necessary.
5) If you find a bug, don't stop until you've found the cause. You're done when you understand what caused the bug to appear, not when the symptoms go away.
6) Use open source tools, since otherwise you'll be tempted to blame some unspecified 'bug in the compiler' (not that ESR would be using any other).
7) If it is a compiler bug, report it, along with the smallest test case you can generate.
This is definitely along the right path. One proviso that I'd note: Most of the time in my own debugging when I've run into something that goes away at a different level of optimization it's uninitialized variables / memory. Valgrind is the shizzle.
It seems strange to me to instantly suspect that your compiler's optimizer is at fault. Which is more likely: you've found a bug in a compiler used the world over, or you've screwed up memory access?
Since C makes it easy to overrun memory, it's pretty easy to make horrible mistakes and have those have seemingly random consequences.
The fact that the bug changes when you change optimizer settings, add trace statements, or add debug code would make me suspect memory corruption first.
In fact, I think it's a good assumption to always begin suspecting your own code.
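To make the "easy to overrun memory" point concrete, here is a sketch of the classic off-by-one, with the broken shape kept in a comment and a corrected helper as the actual code. `copy_string` is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Buggy shape (do not use): the buffer is one byte short, so strcpy writes
 * the terminating '\0' past the end of the allocation. Whether that byte
 * lands on something important depends on heap layout, which is why the
 * symptom can come and go with build settings.
 *
 *     char *copy = malloc(strlen(s));
 *     strcpy(copy, s);
 */

/* Fixed shape: reserve room for the terminator and copy exactly that much. */
char *copy_string(const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s) + 1;     /* +1 for the terminating '\0' */
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;                  /* caller frees; NULL on allocation failure */
}
```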
I have an example of a bug I found that looked like an optimization bug but was really a user bug. Someone had the bright idea of sprintf-ing to a string and using the string itself as one of the format arguments to the sprintf call. When compiled with gcc without optimization the code would still run fine, but when compiled with -O1 the string would end up garbled. The problem (and advantage, performance-wise) of C is that most of the time you do something wrong, the behavior is undefined, whereas most higher-level languages will spend the CPU cycles to protect you from yourself.
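A sketch of that sprintf pattern and one way around it. The C standard leaves sprintf undefined when the destination overlaps one of its sources, so the broken call is shown only in a comment. `append_suffix` is a hypothetical helper, and I am assuming the original intent was to append to the existing string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Undefined shape (do not use): `buf` is both the destination and a source.
 *
 *     sprintf(buf, "%s%s", buf, suffix);
 */

/*
 * Append `suffix` to the string already in `buf` (capacity `cap` bytes).
 * Returns 0 on success, -1 if the result would not fit; `buf` is left
 * unchanged on failure.
 */
int append_suffix(char *buf, size_t cap, const char *suffix)
{
    assert(buf != NULL && suffix != NULL);
    size_t used = strlen(buf);
    size_t extra = strlen(suffix);

    if (used >= cap || extra >= cap - used)
        return -1;                /* no room for suffix plus terminator */
    /* Source and destination no longer overlap: we write after the old text. */
    memcpy(buf + used, suffix, extra + 1);
    return 0;
}
```

Usage would look like `char buf[8] = "abc"; append_suffix(buf, sizeof buf, "-x");`, after which `buf` holds "abc-x".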
Another example I can't remember the details of, but it was related to the fact that gcc adds code to zero-initialize your stack on first access unless optimizations are turned on. Code that checked for null pointers worked fine until optimizations were turned on, at which point it was discovered that a variable was being used uninitialized.
Except in multi-threaded programming, I haven't found any bugs related to the compiler's optimizer. The truth is, most heisenbugs I have found are somehow related to memory access. It is just too risky to assume it is a compiler bug. I always think the other way around: unless you can prove otherwise (by generating the assembly code and a possible scenario), the bug is in your code.
-O3 is indeed riskier. All experienced embedded systems developers know this: embedded compilers tend to be much buggier than compilers for desktop platforms. But desktop compilers are buggy too. Over the last couple years my group has reported 190 bugs to compiler development teams. A lot of these bugs turn up only at higher optimization levels. If you search on my email address "regehr@cs.utah.edu" as bug reporter in either GCC or LLVM's bugzilla, you can see plenty of examples.
If he's still teaching it, I recommend his advanced embedded systems class to anyone at utah.edu who wants to gain practical experience with such compiler errors.
It was actually this class that motivated the whole compiler bug-finding project. The quality of the average embedded compiler is appalling; students trip on codegen bugs all the time.
Of course as many people are pointing out in this thread, most of the time the compiler is not to blame when changing optimization options changes program behavior.
Embedded compilers are much worse. You have to pick and choose which stable version of gcc-4.x you can safely use. My AVR projects have been broken by compiler changes.
But it's much more than the optimizer. Even code generation at -O0 can be broken by assumptions about alignment, insn size, etc. This usually happens when you're using a very new or very old part and the gcc developers make assumptions based on their limited dev board setups.
All appreciation should be paid to those gcc developers as it is a very difficult job they do for free. Thanks!
I know little about C compilers, but if -O3 resulted in segfaults while other settings did not, how is it not riskier? Are the different optimization levels independent such that this bug could have appeared anywhere, but just happened to be in one of the strategies used by -O3?
It's not quite the same as the 'heisenbug' but I've seen a couple of cases over the years where a mysterious problem was resolved by moving to the latest update of the C/C++ runtime. It's weird how many enterprises are fine with running years behind on maintenance on that.
I found an optimizer bug once. I can't remember the exact circumstances, but it had to do with some fancy inline incrementing I was doing. The very concept that I'd found a bug in somebody else's code floored me.
The terminally curious may download a file containing the assembler output, and the C source, of the offending file from http://www.hercules-390.org/esamebug.zip . This corresponds to revision 5627 of the Hercules emulator as found in the Subversion repository at svn://svn.hercules-390.org/hercules/trunk . The emulator itself is at http://www.hercules-390.org .
The routine is in the generated assembler as z900_load_multiple_long.
The test has only been shown to fail on Mac OS X Snow Leopard with gcc 4.2.1. I haven't been able to make it fail on any other platform. Because of the code involved, I suspect it won't fail at all on anything but 32-bit Intel.
I've never run valgrind...it'll be interesting to see just what it does to Hercules execution speed. mudflap, too. Getting that built into the code might get even more interesting.