This is a variation of an unfixed 2004 GCC bug. Clang detects the issue when the right warnings are enabled. Those warnings are disabled currently. A developer is auditing and fixing all such instances found by Clang so that they can reenable those warnings.
This is not a correct interpretation
(I worked on CCP and related optimizations for years :P)
It does in fact prove that it is zero, because it is the meet of lattice values (undefined, 0).
You can actually solve this particular case by interpreting phi nodes differently than CCP does. CCP does not generally care about what blocks things occur in - it doesn't have to, all definitions dominate all uses, so it is safe to evaluate things in any order. The only thing you stand to lose is optimality because things only go one direction on the lattice (to ensure that it reaches a fixed point).
However, in this particular case, you can see if that if you treated the phi nodes as the equivalent form (assignments occurring on the edge of the previous block), and then processed the loop blocks in dominance order, it is obvious the use is uninitialized.
It is only when looking at phi nodes as an unordered black box (as ccp does) that you get this particular variant of the issue.
This is, of course, just a very cheap form of flow sensitivity.
You can also get the same effect by using SSI form in a lot of cases.
Most compilers do not bother doing real expensive flow sensitive analysis to try to analyze stuff like this, because almost everything you can do comes with its own fun source of false positives and negatives.
I had no idea that the meet/join operations are used in compiler analysis, and I'm curious to find out more. Where can I find out how meet and join are defined on C values, in the context of GCC?
Just curious. I’m learning about compilers and would be interested in which compilers most make the trade-off in favor of optimal code generation over compilation speed.
I'm not sure about C, but I know of a few for other languages.
Some compilers perform whole program optimisation, rather than separate compilation of components/modules. This takes longer, but can find some extra optimisations (e.g. if a module provides some functionality, but it's not used in the resulting program, it can be deleted as dead code). MLton does this for StandardML ( http://mlton.org ) and Stalin does this for Scheme ( https://en.wikipedia.org/wiki/Stalin_(Scheme_implementation) )
At the very extreme end is the idea of superoptimisation ( https://en.wikipedia.org/wiki/Superoptimization ). Rather than translating the given code in some way, like a compiler, a superoptimiser treats the given code like a specification or test suite. It performs a brute-force search through the space of all possible programs, looking for any that match the specification. This can find truly optimal code, but it's ridiculously slow. So far its only real-world uses are to find small "peephole" (find/replace) optimisations that can be added to real compilers like LLVM.
A supercompiler can be thought of as running the given code at compile-time. If we know the values of all the functions and variables in a given piece of code, we can just run it to find out what the answer is. Interestingly, we can also "run" code involving values we don't know: we just pass those values around as opaque black-boxes. This is really useful for collapsing layers of indirection, resolving dynamically dispatched targets, baking-in configurable parameters, etc. The downside is that it can lead to bloated executables. Roughly speaking, supercompilation replaces a few long paths containing many conditionals, with many short paths containing fewer conditionals. It also specialises general-purpose functions into many single-use functions which have particular arguments baked-in.
What I don't understand is why it doesn't optimize away the check on (inode->i_nlink), since it is assuming the code in the dependent block always runs.
It has to be assuming that, because it is the only path that doesn't lead to undefined behavior.
The optimizer isn't required to eliminate code paths encountering dynamic UB; it is allowed to. It also doesn't have to be consistent in its "assumptions", although inconsistency may be a sign of missing efficiency somewhere
> The optimizer isn't required to eliminate code paths encountering dynamic UB; it is allowed to.
If I understood correctly, Linus had some colorful thoughts[1] on that:
If you know something is undefined, you warn about it. You don't silently generate random code that doesn't match the source code just because some paper standard says you "can".
Also, it is common to use bracketing with unusual
characters to signify emphasis. The asterisk is most
common, as in “What the *hell*?” even though this
interferes with the common use of the asterisk suffix
as a footnote mark. The underscore is also common,
suggesting underlining
Back in the days of usenet, and way before markdown, we always used bold, /italic/, and _underline_. You were just supposed to mentally interpret them in your head.
Edit: Ironically, HN is translating my asterisk-bolds into italic.
They were always meant for emphasis, which takes on variable definitions over time. Terminals were bad at italics due to character box layouts and so the literary approach of emphasis through italics mutated into emphasis through underlines. Now that we’re all free of monospaced terminal grids, it looks like we’re drifting back to emphasis through italics.
Aside: static analysis is a great initial effort but to really get confidence you should use techniques like sanitizers (MSan in this case) or valgrind. (This is general advice, not always applicable to kernels).
You should really use both, however they are only a tool: I don't have any numbers but I imagine Sanitizers have a good detection rate, but one should still write code defensively. There are too many horror stories, be they lethal or expensive, to be risky about memory.
Not a kernel developer: Can you test Kernel code with a sanitizer (if that sanitizer doesn't work in a compiled kernel) by compiling into and using the code as blob in userspace code with mocks (Or whatever, unless it's just utility that doesn't need anything other than an interface)?
I'm not sure what this means. You should not write your code any differently regardless of whether you used a sanitizer to help gain confidence in your implementation.
You should not use sanitizers in production (at least not the current generation).
> Can you test Kernel code with a sanitizer
In fact you can! Both UBSan [1] and ASan [2] are in use to root out bugs in linux.
Linux has support for kernel ASAN and I think UBSAN too. User mode linux is a also thing and does roughly what you described, and I think you can run it under valgrind too.
To be fair, UML runs at such a low level (raw syscalls rather than calls to, for instance, malloc and free) that it really limits what valgrind can tell you.
As far as I know Kostya he started by improving Helgrind as valgrind plugin in 2008: the first ThreadSanitizer (as project data-race-test) https://storage.googleapis.com/pub-tools-public-publication-... which was a harder problem than oob checks.
But it was still too slow, and they started with tsan and asan as clang plugin. Years later they ported it to gcc. gcc started then becoming more plugin friendly.
A simple way to avoid this problem is prohibit, in that code base, local variable declarations without initializers. If the initializer truly was unneeded the compiler would be able to eliminate it in almost all cases. Risking undefined behavior for hypothetical microspeedups is just dumb. If a case occurs where leaving out the initializer is both useful and safe then the initializer could be removed there, after deliberate review.
It look like a bug to me, since inode->i_nlink may be zero, and then ret is uninitialized (optimizing the final "return ret" to "return 0" should be safe (but is not necessarily worth it, depending on the target instruction set and ABI), although if warnings are enabled, it seem like the warning should still be displayed).
The compiler is allowed to do that.
Since it would be undefined behavior to read the undefined variable, the compiler gets to pick whatever value it wants for it.
So after the if, the compiler sees that ret is either 0, or an undefined value.
It picks 0 as the undefined value since that's a good optimization, so ret is now always 0, and (in this case) there is no warning.
> It picks 0 as the undefined value since that's a good optimization.
No it isn't; it costs an extra instruction to initialize that register or memory location to zero.
How C compilers optimize situations involving uninitialized variables is by leaving those variables alone. They do so with the standard's blessing, which grants them that uninitialized automatic variables are "indeterminately-valued" and that if the program uses an indeterminate value, its behavior is undefined. Correctly written programs that perform their own delayed initialization by later assignment benefit from the fact that the compiler didn't add wasteful code to initialize those objects, only to have that compiler-generated initial value be overwritten.
> ret is now always 0, and (in this case) there is no warning.
It's not correct for diagnostics to pertain to versions of the program that were logically altered by the compiler, rather than to the original program. It may be ISO-conforming, only because ISO C doesn't require those diagnostics in the first place, let alone requiring them to be reliable.
>No it isn't; it costs an extra instruction to initialize that register or memory location to zero.
If you're talking about the "xor eax, eax" that was emitted, then absolutely not. GCC is doing the most efficient thing possible with eax here.
That xor is not inserted because GCC kindly initializes the variable for you to save you from your mistakes, but because it has to initialize eax before returning (look, there's a function call before in the code example).
And after deciding the return value to be 0 it knows "xor eax, eax" is always going to be faster than actually loading the real value of "ret". (In fact it's a zero-idiom, even a NOP instruction is slower than that!)
What I was really trying to talk about is the SSA semantics and not the actual emitted code.
Now in fairness I'm not familiar with GCC's IR, but I'm going to blithely assume what GCC does with unitialized variables somewhat ressembles LLVM's undef semantics.
When I say that picking 0 is a good optimization, I mean that at an SSA level, the compiler is going have a Phi between Undef and the range [0, 0].
If by "leaving those variables alone" you mean that the compiler should have returned Undef as the result of that Phi, that would be a horrible miscompilation! And of course picking something else than 0 would just be silly.
>It's not correct for diagnostics to pertain to versions of the program that were logically altered by the compiler, rather than to the original program.
Agreed. As a user, I am not happy about this lack of diagnostics and I'm happy to call that a bug if you want. And furthermore, I don't think performing the optimizations above is fundamentally incompatible with good diagnostics. Probably just hard to implement for GCC.
Disclaimer: I am not actually a compiler engineer and this probably contains mistakes.
> GCC is doing the most efficient thing possible with eax
here.
It seems that the most efficient thing with %eax is not to mention it in an instruction.
There may be some quirk/feature of modern Intel processors that clearing a register will dis-entangle it from considerations of prior hazards. So that is to say, when we execute 'xorl %eax, %eax', the processor knows that any prior value in `%eax` is no longer required by subsequent code. A new lifetime has begun for that register. Therefore, there is no need to execute a pipeline stall to wait for that prior value, if it so happens that that prior value is not yet ready. Internally to the processor, the ISA-level register `%eax` can be assigned to a different internal register at that point, under the discipline of "register renaming".
In the absence of any such hardware considerations, the clearing of %eax is pure waste.
> it has to initialize eax before returning
Where is that requirement coming from? Not from ISO C, certainly. The behavior is undefined. It was the programmer's job to ensure that the return value is calculated by a well-defined expression, not that of the compiler. From that standpoint alone, it's perfectly conforming to emit code that just leaves the existing content in %eax and returns.
> There may be some quirk/feature of modern Intel processors
not so much of a quirk, but basic behaviour of any OoO processor due to register renaming. This is true on the vast majority of intel (and AMD) cpus (in fact probably all the currently sold ones as even atom has acquired a limited OoO engine and the knight variants are no more).
It is so fundamental that, as described elsethread zeroing a register is almost free (it still costs decode bandwidth and icache size).
Bottom line in the vast majority of cases, zeroing the registers is almost always a win.
edit: having said that I doubt that gcc is trying to optimize the UB case, probably it the optimizer just requires that the variable has a value at that point.
>It seems that the most efficient thing with %eax is not to mention it in an instruction.
I think what you're missing is that UB is a property of the execution of the program, and not just of the code that was compiled.
If in practice the if branch is always taken, then there is NO undefined behavior (surprisingly, perhaps)!
That means GCC has to be prepared to handle that case, and has to set eax to 0 when the if branch is taken.
>Where is that requirement coming from? Not from ISO C, certainly. The behavior is undefined.
Not so! You don't know until the program is run.
-
Edit: (removed some speculation about assuming the branch is always taken, that was not relevant)
>There may be some quirk/feature of modern Intel processors that clearing a register will dis-entangle it from considerations of prior hazards. So that is to say, when we execute 'xorl %eax, %eax', the processor knows that any prior value in `%eax` is no longer required by subsequent code. A new lifetime has begun for that register.
That's absolutely right, by the way. I think this was introduced with Sandy Bridge, in 2011.
>Look, GCC (what I have here: 7.3.0) is doing this even for the following trivial function:
You're absolutely right that in this case, it's wasteful.
In fact, if you try Clang you will see that it only emits a ret.
Maybe GCC is doing you a courtesy (or maybe nobody taught it to completely omit an Undef return value, so it just picks zero!).
Either way, it's unconditionally UB, GCC can do whatever it likes.
>do we still know until run-time that this has UB?
Well, there is no branch here, is there? I'm clearly not going to disagree with you on this one =]
To be clear: UB is still a property of the execution of the program, but here it's not hard to statically determine that all executions of this program are Undefined Behavior.
Actuallym it is: You're right to need an extra instruction, but you gain a register. Storing and reloading this register would take instructions too.
I threw this in godbolt, so play around, e.g. initialize ret or invert the condition. Linus' version has 36 lines, most other things I tried end up with 40 or more.
How do you gain a register by emitting xorl %eax, %eax versus simply not emitting that and simply letting the code access the garbage value that's already in %eax?
So, if the if branch is taken, then the function could really return 0, right?
So GCC has to have some instruction somewhere that ensures eax is 0 at the end of the function in case the if was taken.
That's why it can't just leave a garbage value in eax. Sometimes it's not actually garbage.
And when it really is garbage? Well who cares, might as well return 0.
I don't follow. If the branch on the inode link count is taken, then the external function shmem_reserve_inode is called; ret becomes its return value then. The only way the tail of the function runs with ret == 0 is if that external function happens to return zero. What I don't somehow see is the advantage of, or requirement for, preparing a zero value in ret that didn't come from shmem_reserve_inode, for a branch of the flow where UB is invoked which causes all behavioral requirements to be off the table.
Oh, I see. It's really a behavior of x86 that's causing the confusion then. The gist of it is that ret is not eax. Let me explain.
When shmem_reserve_inode is called, under the x86 C ABIs (all of them!), it will place its return value in eax.
So after the call, at this precise moment, "ret" is "eax", and we don't know it's value.
Then we do "if (ret) return ret", after this line "ret" is still "eax", and this time we know for sure that its value is 0.
But eax is not bound to ret until the end of the function. EVERY function we call before returning is allowed to modify eax (whether that function call returns a result or not)!
And since we know that ret has the value 0, well we don't even need to keep the value in register eax, do we? We can just use eax for something else and not worry about storing "ret" anywhere, it's just a constant now.
>What I don't somehow see is the advantage of, or requirement for, preparing a zero value in ret that didn't come from shmem_reserve_inode
That's the subtelty. We don't prepare ret, we prepare eax.
Conceptually, at the C language level, at the tail of the function "ret" still contains 0.
But at the x86 level, eax DOES NOT necessarily contain 0.
(Because we call d_instantiate and other functions before returning, and they're allowed to modify eax).
So ret has the right value, but we still need to load the value of ret, into eax before returning. And the most efficient way to do that? Just set eax to 0.
I agree; that does look indeed a good reason to set it to zero. Once the end of the function code is reached, ret is either zero or undefined, so there is no need to keep track of the value of ret; just assume that it is zero.
Now I can see how x86 C ABIs is working; thank you for explaining because I did not know much about that before, but now I know, and indeed it is making sense. (On a different instruction set, something else might make sense; I don't really know.)
I do understand that in all cases when the tail part of the function is executed such that it's well-defined (the only cases we care about), the return value is zero, which means it's effectively a constant. Since we have to prepare that return value in %eax (dictated by the ABI), then at some point we clear %eax; that's not semantically the same as initializing ret. If we do this later in the function, like just before returning, we can use %eax for other purposes in the meanwhile.
>I do understand that in all cases when the tail part of the function is executed [...] then at some point we clear %eax;
>If we do this later in the function, like just before returning, we can use %eax for other purposes in the meanwhile.
Okay, I think we completely agree here!
>that's not semantically the same as initializing ret.
Hmm, so is it that you're asking why we're initializing ret? Well, that sounds like a good excuse as any for another incredibly long and boring wall of text =]
If I'm extra lucky I'll get called out as the amateur I am and learn something in the process (!)
---
So, here's the thing. Talking about the compiler initializing ret as a variable, separate from its storage in eax is not really what the compiler is trying to do, not as I understand it.
We really aren't "initializing it" so much as trying to guess it's value in all possible executions (because that's just what compilers do these days).
The way the compiler works is that first it gets rid of the idea of variables that can be assigned multiple times, it switches to something called SSA form where every "variable" is initialized exactly once. Want to reassign a variable? Just declare one with a new name instead and use it going forward.
The funny part of SSA is that when there's a branch, after it joins back you end up having to define a variable that could have two possible values (branch taken, not taken).
That's not something you should normally be able to do with the SSA rules, so it's represented by a special Phi "value" in the compiler's intermediate representation.
A Phi basically just says "either we came from path A and we have value Va, or from B and it's Vb".
But what the compiler really wants is to forget about the branch and the Phi business, and just deduce a plain value for ret so it can move on with cold hard numbers in mind.
Since we have undefined behavior here, we can simplify Phi<Undef, [0, 0]> into just [0, 0]. And that's how the compiler "forgets" that ret was uninitialized in the first place. It just optimizes the undefinedness away while trying to guess values, if you will.
So long story short we're not so much trying to intialize ret than we're trying to guess it's value, and forget that there were multiple branches in the first place.
No, it picks 0 because that's required for the code path where the "if (ret) return ret;" doesn't return, for all the other code paths where it reaches the end of the subroutine ret is undefined so 0 is good for all possible paths to that point
Since I'm on a comment spree today: I think we mostly agree, actually, but note that the compiler doesn't have to pick 0 for the undefined value at all. It only does that because that's the most convenient value to pick.
(And for the code path where it's deduced to be 0, well it's not really picking, it doesn't have a choice =] )
A compiler with less optimizations could chose to spill the "ret" variable and not pick a value at all, for example. It would just return whatever happens to be lying on the stack if you took the unitialized path.
yes I agree, and it's optimised out a stack location for ret so no stack goop to default to
My point really is that there are two interesting paths here, one where it's undefined and one where it's 0 ... I guess my point is that 0 is more than just 'convenient' because it's required for one of those paths
Right, the bug in the kernel code snippet is like GP describes, but the gcc bug isn't that the compiled code is incorrect, it's that people expect an "uninitialized variable" warning and the compiler doesn't produce one.
I was wondering this as well, as I haven't been following things for a few months. But indeed he's back maintaining the kernel and seems far more civil than previously.
New Linus? He is a guy who sends dozens of mails every month and has cursed in a few handpicked ones. That doesn't make him especially "cursy". I'd say he's about average.
And Howard Dean was a fairly soft-spoken politician until one particularly excited outburst at a single rally happened to have been captured on mic.
The "actual" frequency of the outbursts is somewhat secondary to the perception of frequency. Every time Linus writes a profanity-laced rant, even if it's email number 83/175 for the day, that particular piece is preserved, archived, and passed around the internet for giggles, further reinforcing whatever pre-existing conception the person referencing the email is arguing for.
This is a variation of an unfixed 2004 GCC bug. Clang detects the issue when the right warnings are enabled. Those warnings are disabled currently. A developer is auditing and fixing all such instances found by Clang so that they can reenable those warnings.