First of all: fantastic article. In-depth, insightful, and the examples are absolutely top-notch. On the razor's edge between accessible and profound. Hats off to the author.
I will say that the problems seem to lie in a few interesting interlanguage quirks, and not so much in the language specs themselves. For example, LLVM and C have different definitions of "undefined behavior"[1] -- this comes up when looking at the `poison` value. In LLVM this might just be a weird side-effect we don't care about, but actually having an integer overflow is a big deal in C that introduces UB. Given the introduction, LLVM (incorrectly) introduces (C) UB at times.
The second section, on pointer provenance, is much trickier, as there's no "obvious" UB introduced there. Further, I would argue that pointer provenance is a meta-language (or meta-computational) concept (which the Rust docs hint at), and logically speaking, might be intractable in the general case (I'd have to think about this more). Pointer arithmetic can fiddle with meta-state, whereas IRs like LLVM tend to focus purely on symbol manipulation.
As a side note, I'd be curious what happens when LLVM sees something like: `void *p = &p` which is a self-referential pointer. What does it deduce about it? Does it optimize it away?
> In LLVM this might just be a weird side-effect we don't care about, but actually having an integer overflow is a big deal in C that introduces UB. Given the introduction, LLVM (incorrectly) introduces (C) UB at times.
This doesn't matter. LLVM IR is not C, and it is not convertible to C, as it implements a superset of C semantics, which includes some cases where behavior that is undefined in C is well-defined in LLVM IR. Once C is lowered to LLVM IR, it doesn't matter what C says is or isn't undefined behavior, because there is no longer any C code being considered.
Maybe I expressed myself a bit too literally, but I wouldn't say that how LLVM handles UB "doesn't matter." Look at how Clang compiles undefined behavior vs GCC (for example)[1]. A bit tautological, but different ways of handling UB can lead to different ways UB behaves.
> LLVM IR is not C, and it is not convertible to C...
This isn't strictly true. There used to be an "official" C(pp) backend, which has been recently resurrected[2]. You can definitely go from LLVM to C -- probably won't be very readable, though.
> As a side note, I'd be curious what happens when LLVM sees something like: `void *p = &p` which is a self-referential pointer. What does it deduce about it? Does it optimize it away?
Why would it do anything special here? This is not really different from something like this:
struct a { struct a *p; };
struct a x = { &x };
The only real difference is that the void version involves a pointer conversion (from `void **` to `void *`), because a truly self-referential pointer type is otherwise impossible to express in C.
> Why would it do anything special here? This is not really different from something like this:
Yeah, I suppose you're right. I was Googling for potentially interesting cases and ran across this one[1], which made me think of weird edge cases w.r.t. self-reference.
There isn’t anything weird about this. A circularly linked list is a common data structure that, when empty, has a pointer pointing to itself.
Self-referential structures can be optimized in accordance with all the usual rules. If there are no other references to them, they can be optimized away. (This optimization is implemented by using a flood fill to mark everything that’s used, and then removing everything else, so that circular references need no special treatment.)
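For illustration, a minimal sketch of such a structure (my example, not from the article): the empty list is just a sentinel node whose links point at itself, so nothing needs to special-case NULL.

```
/* Circular doubly linked list: the empty state is a self-referencing sentinel. */
struct node {
    struct node *next;
    struct node *prev;
};

static void list_init(struct node *head) {
    head->next = head;   /* self-reference: the empty list */
    head->prev = head;
}

static int list_is_empty(const struct node *head) {
    return head->next == head;
}

static void list_insert_after(struct node *pos, struct node *n) {
    n->next = pos->next;
    n->prev = pos;
    pos->next->prev = n;
    pos->next = n;
}
```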
I'm pretty sure there is no UB in the second part. It's just an LLVM bug. The pointer comparison is valid (a pointer to an object and a pointer one past the end of another object MAY compare equal). The write is valid.
Reads and writes through char* always alias everything. Unless the compiler can prove that no write happened, it has to emit the read. So if it "forgets", due to the uintptr_t cast, where the char* came from, then it must assume the pointer can be pretty much arbitrary and must emit the read.
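To make that concrete, here is a sketch of the kind of pattern being discussed (not the article's exact example): a char* is round-tripped through uintptr_t, so the compiler should keep treating it as something that may alias x.

```
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int x = 1;
    char *p = (char *)&x;
    uintptr_t ip = (uintptr_t)p;  /* the compiler may "forget" where p came from here */
    char *q = (char *)ip;

    *q = 2;            /* a char* store: may alias x */
    printf("%d\n", x); /* so x must be reloaded, not assumed to still be 1 */
    return 0;
}
```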
Everybody seems to agree it is an LLVM bug, but "UB" is overloaded here. There is no "C/C++ UB" in the original program, but the optimizer can introduce "LLVM UB" in the IR, which then results in something substantially equivalent to what the mock transformed source code shown for illustration would probably compile to. And introducing UB where there was none in the source code means there is a compiler bug.
On your remark about char*, it is a universal type alias, but I don't think it is a universal provenance alias; sadly, I don't think such a thing even exists de facto, and it will not exist more formally when considering the PVNI-ae-ud model that is being cooked up. If you ask me, it is probably unwise to lean, as usual, toward aggressive optimisation without even proving that the perf gain is that interesting, or evaluating the security / quality and education impact. And it is even more problematic without even providing an escape hatch (1). But I know very well that that state of mind has won for now and I have to cope with it. Even more C programs will be retroactively declared completely incorrect (and not merely non-portable) because the compilers went too aggressive at one point and the standard just standardized their aggressiveness instead of telling implementers to stop being crazy.
(1) Beyond, that is, exposing all the pointers, for some programs that would otherwise be impossible to express in C/C++. Exposing all the pointers would be cute, but the result would not be much different from not having that kind of "opti" at all. Plus you would have to actually expose all your target pointers, so it is not really source-compatible with unmodified existing codebases. So a per-access (or at least per-accessing-pointer) universal provenance alias is needed to get a serious escape hatch.
I'm also not extremely sure we can actually implement a memory allocator in C or C++ today (with whatever the modern compilers are "optimizing" using their partly informal provenance rules), nor that we will be able to with PVNI-ae-ud (broadly the same thing, except slightly debugged). (Or maybe it will merely constrain the implementation? Not sure anyway.)
> On your remark about char*, it is a universal type alias, but I don't think it is a universal provenance alias
C doesn't have a concept of provenance in this fashion, or alternatively the compiler must assume the pointers may alias. This is why we have the restrict keyword.
The only cases where the compiler can assume things do not alias are when pointers are to different types (except for char).
Naturally the compiler is allowed to prove that some pointers cannot alias and optimize based on that. But if it messes up, it's a compiler bug, pure and simple.
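For reference, a small sketch of what restrict changes (my example, not from the thread):

```
/* With restrict, the compiler may assume p and q don't alias, so the two
   loads of *p can be folded into one. Without restrict, the store through
   q might have modified *p, forcing a reload. */
int restrict_example(int *restrict p, int *restrict q) {
    int a = *p;
    *q = 7;         /* cannot alias *p under the restrict contract */
    return a + *p;  /* *p may be reused without reloading */
}
```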
Neither does C++. IIRC the compilers invented the notion because tons of operations are forbidden between pointers to different objects, so by tracking provenance you can e.g. detect conditions leading to UB and trim them -- because, as usual, why would the programmer ever write a bug (from the point of view of strict conformance to the standard)? So if it is not a bug, thanks to our perfect programmer who of course knows the standard by heart even better than the compiler authors apparently do, then this condition must always be false, the code path is dead, etc.
Hilarity ensues when compiler authors themselves introduce bugs in compliant programs thanks to this line of thought, which they often take way too far.
So again: of course this is a compiler bug. But it is caused by an attempt to use provenance analysis for aliasing (which could indirectly be allowed, at least in a non-buggy form, because of some arcane C or C++ rules) that was not implemented correctly. Type-based aliasing is simpler because the rules lead to it slightly more directly.
As usual: if a compiler knows about undefined behavior, I would much rather it throw an error than optimize in a way the programmer didn't intend, based on the compiler out-smarting a human's ability to be specific.
I've heard this proposal thrown around a lot on HN, but how would that even work? Can you describe how the compiler would do this, what patterns it would look for, and what error messages it would produce? For example, consider this:
int factorial(int n) {
    int r = 1;
    for (int i = 1; i <= n; i++) {
        r *= i;
    }
    return r;
}
Since there are two instances of possible UB in the above function, what error messages would you like the compiler to produce?
First of all, the compiler can assume that the loop terminates. This is a reasonable assumption from a human perspective—the loop is almost certainly intended to run N times. However, if n = INT_MAX, then there’s UB… and there’s no real way to know that n ≠ INT_MAX.
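For concreteness, here is the same function with the two potential UB sites marked (a sketch, assuming a 32-bit int):

```
int factorial(int n) {
    int r = 1;
    for (int i = 1; i <= n; i++) {  /* UB site 2: if n == INT_MAX, i++ overflows,
                                       so the compiler may assume the loop terminates */
        r *= i;                     /* UB site 1: signed overflow, e.g. for n >= 13
                                       with a 32-bit int */
    }
    return r;
}
```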
You could argue that you want -fsanitize=undefined, where the compiler inserts checks at runtime.
I don’t think this proposal is well formulated enough to be viable.
What I want is for the compiler to have a mode (which ideally would be enabled by default, at least with modern compile targets) that never optimizes based on undefined behavior.
It would do this by treating any UB as an error: informing the programmer where undefined behavior has been proven by the compiler, and thereby making the programmer write a correct program that lacks UB. This would improve both the code and the programmer's skills, rather than trying to make each out-think the other.
The big gap is that compilers don't optimize on "having UB" but rather on "having a potential for UB".
It's not about places where "undefined behavior has been proven by the compiler", it's about all the many places in completely correct code where (as far as the compiler can reason) the code might have UB but should not if it is correct code.
The question is about what to do when the programmer has written a completely correct program that lacks UB. The current approach is to assume that it actually does lack UB and that the code path that would contain the UB is unreachable (though in a manner that the compiler can't prove). The other approach is to insert run-time checks in every such place, which is a heavy performance hit as there are many such places (e.g. every integer addition).
Requiring programmers to eliminate every place in a C program that has a potential for UB is not feasible - cleanly proving for every integer increment and every pointer dereference that the input data can't cause UB is impossible in the general case, and assuming that all of these can cause UB and pushing warnings means that every second line of perfectly correct code will have a warning that you can't fix, because the code is correct as-is.
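To make the cost concrete, here is a sketch of what a run-time check on a single signed addition looks like; the point above is that these would be needed at every such site (GCC and Clang also provide __builtin_add_overflow as a cheaper alternative):

```
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: the portable checked form of one signed addition. */
int checked_add(int a, int b) {
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b)) {
        fprintf(stderr, "signed overflow in checked_add\n");
        abort();
    }
    return a + b;
}
```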
The compiler isn't __changing any programmer-dictated behavior__. There are no UB-sourced 'optimizations' being implicitly disabled, and none have been expressly enabled. As long as valid code compiles, it's on the humans that wrote the code (or that triggered its writing in some higher-level synthesis tool).
Expanding on this: I don't want compilers _making_ optimizations based on UB. I want them educating the programmers so that the UB can be eliminated at a human level and correct outcomes are the result, optimized if it makes sense.
But this is a misunderstanding. The “compiler changing behavior” isn’t actually happening in those cases. The program’s behavior never changed; we just have a wrong understanding of the program’s behavior as humans reading the source. If we were to codify and write down all of the expectations we might have about what behavior a source file represents, so that the compiler could match them exactly... that’d be the standard, and we’d be back where we started.
That's one way, but I don't want the compiler attempting to out-think whatever wrote the code. If the code has undefined behavior, or if the code has a non-obvious transformation that would make it faster (but that, e.g., changes the ordering, referenced variables, etc.), that should be presented for integration into the source code, NOT silently performed for that single compilation. Never _make_ the optimization for the programmer. It's fine to flag it as an error and suggest the optimal variant instead __for human review__.
"I don't want compilers _making_ optimizations based on UB. I want them educating the programmers so that the UB can be eliminated at a human level and correct outcomes are the result, optimized if it makes sense."
What you are describing is formal methods. You can take a subset of C, use a formally-verified toolchain, and write your own formally-verified C code using proof systems.
Or perhaps what you want is a different language.
What you’re describing—having the compiler detect UB and warn you about it, while still letting you write C code, without the burden of also writing correctness proofs in some formalism—just isn’t a viable option.
From a usability perspective, if you turn on aggressive static analysis / warning settings like this, what you usually end up with is programmers that just start ignoring warnings or turning them off, except for greenfield projects. C’s semantics don’t make this kind of thing easy. If you have a greenfield project and you are fighting the language this hard, that’s precisely the time when using a different language is the most viable option.
That's impossible in the general case, since it's trivially equivalent to the halting problem (take the UB-detecting compiler, wrap it in a program that performs UB iff the compiler reports no UB in its input, and run that program on its own source code).
What we can do is a combination of formal methods, heuristics and sanitizers. Plus software engineering techniques, such as running tests to high coverage with sanitizers, fuzzing, and so on.
One challenge is that there are lots of cases where a program might have UB for a certain input, but a full-program analysis reveals that this UB is not actually possible to hit with any execution.
Should it still report UB? What if the full program analysis is beyond what the compiler is able to reason about?
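A toy illustration of that situation (hypothetical code): scale() has potential UB when looked at in isolation, but whole-program analysis shows no execution can reach it.

```
/* Taken alone, scale() has potential UB: division by zero, and
   INT_MIN / -1. But the only caller always passes divisor == 4,
   so no execution of the whole program can trigger it. Should a
   UB detector still flag scale()? */
static int scale(int value, int divisor) {
    return value / divisor;
}

int shrink(int value) {
    return scale(value, 4);
}
```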
So, do you mean that in the above loop example, it would require you to somehow prove that the loop always finishes, and if the programmer fails to do that, fail with a "possibly infinite loop" error? In what language/system would you require that proof?
Detecting UB is far harder than the compiler simply assuming there is no UB. That said, there are already tools to do it for some cases (but far from all). They're not often used in C/C++, perhaps because they're slow.
For your first example, I think most people want integer overflow to be unspecified behavior instead of undefined behavior - this is how most other languages treat it, it is how all C compilers behaved for a long time, and it is unreasonably difficult to actually write C code that makes sure not to cause integer overflow.
Your example is in fact a perfect illustration of why that should be the case: consider the following code:
int n = 0;
scanf("%d", &n);
factorial(n); // using your definition of factorial(int)
A standard-compliant C compiler is apparently free to optimize this program to format your hard-drive.
For your second example, I would say that, rather than omitting the NULL check, a helpful compiler could do one of two things:
1. Don't reason from strlen(s) to s != NULL, so the NULL check must be left in (unless it has even more context and can see a non-NULL assignment to s)
2. Wherever trying to optimize away a comparison to NULL based on UB reasoning, put in a new comparison to NULL at the place you applied that reasoning. For this example, optimize the program as if it looked like this:
// first, the inline version
void puts_with_len(const char *s) {
    size_t s_len = strlen(s);
    printf("len = %zu\n", s_len);
    const char *puts_arg = s == NULL ? "(null)" : s;
    puts(puts_arg);
}

// after optimization
void puts_with_len(const char *s) {
    size_t s_len = strlen(s);
    const char *puts_arg;
    if (s != NULL) {
        puts_arg = s;
    } else {
        puts_arg = "(null)"; // could also explicitly signal a segfault or assert here instead
    }
    printf("len = %zu\n", s_len);
    puts(puts_arg);
}
In this case we didn't gain anything, but if puts_with_len were itself inlined the check would be moved further back, potentially replacing many NULL checks with a single one.
I would note that there is a third option here that goes in a different direction: now that compilers are extremely aggressive with NULL check removal optimizations, a lot of unsafe C functions could be made safe by manually adding the missing NULL checks to the stdlib and other major libraries. This wouldn't affect the semantics, and it wouldn't hurt performance assuming the optimizer really is doing its business.
For example, strlen() could itself raise an assertion/exit on strlen(NULL). If called from a context where it is known that s != NULL, the null check can be optimized away by the aggressive optimizer; if not, better safe than sorry.
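A sketch of that idea as a wrapper (hypothetical, not an actual stdlib change):

```
#include <assert.h>
#include <string.h>

/* The check is explicit; in any context where the compiler can already
   prove s != NULL, the aggressive optimizer is free to delete it. */
size_t strlen_checked(const char *s) {
    assert(s != NULL);  /* or a hard abort() in release builds */
    return strlen(s);
}
```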
If signed integer overflow is implementation defined rather than undefined then it isn’t an error and we cannot make compiler features that warn or reject when we can prove it will occur. In your case we’ve managed to get the worst of both worlds (a buggy program and no capacity for the compiler to stop you).
For a long time in C's history, for most platforms, int overflow was actually treated as well-defined behavior, with many examples suggesting tests like x + y < x (assuming positive integers) to detect overflow.
In modern C there is simply no portable way to easily check for integer overflow for 64-bit values, even though the vast majority of programs are running on a processor that defines exactly what happens with integer overflow, and even sets a flag that can be tested for in a single jump.
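The portable pre-check does exist, it is just clunky compared to testing the flag the hardware already sets (a sketch; C23's ckd_add in <stdckdint.h> and the GCC/Clang __builtin_add_overflow builtin finally make this easier):

```
#include <stdbool.h>
#include <stdint.h>

/* Portable overflow pre-check for 64-bit signed addition. */
bool add_would_overflow(int64_t a, int64_t b) {
    if (b > 0) return a > INT64_MAX - b;
    if (b < 0) return a < INT64_MIN - b;
    return false;
}
```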
People often cite for loops over arrays as an example of places where treating integer overflow as UB helps with optimizations. This despite the fact that the recommended, standards compliant portable way to iterate over the range of indices in an array is to use a size_t index variable, which is an unsigned type.
> even though the vast majority of programs are running on a processor that defines exactly what happens with integer overflow, and even sets a flag that can be tested for in a single jump
Widths matter. Platforms that do this don't overflow both 32 and 64 bit signed integers. So if you want to define signed integer overflow for all signed integer widths then for one (or both) of these widths you need to stick in runtime checks.
It would sometimes work, but I think this (very common comment) misses the general point: there's absolutely fine, non-buggy, non-warning-worthy code that can be optimized away if the compiler relies on UB. A very simple example:
do_stuff calls a subroutine that does a NULL pointer check on a pointer, then do_stuff itself unconditionally dereferences that same pointer.
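A sketch of that scenario (the function names follow the comment; the bodies are made up for illustration):

```
void do_substuff(int *p) {
    if (p != NULL) {  /* defensive NULL check in the callee */
        *p = 10;
    }
}

void do_stuff(int *p) {
    do_substuff(p);
    (*p)++;           /* unconditional dereference in the caller */
}
```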
From this the compiler, if it decides to inline that code, can decide to optimize the NULL check away since if the pointer is NULL it's guaranteed to trigger UB. The logic being "clearly the programmer assumes that the pointer can't be NULL here, so thanks to UB rules I can too".
There's nothing wrong with this code, there's no reason to emit a warning.
Distinguishing between "of course that's a reasonable optimization, that's probably what the developer intended" and "wow this compiler is so dumb, obviously I never meant for that to happen" is a very tough nut to crack.
At this point you can push the blame to the C language not being expressive enough, in this case not giving us a way to express nullable vs. non-nullable pointers within the language, which forces the compiler to lean onto these UB heuristics to optimize.
Even in your heavily contrived example, I can think of cases where the optimization isn't what the programmer wants. For instance, I might have a special handler via userfaultfd(2) that detects if I'm doing an increment of the null pointer and handles it in some special way, but can't handle just setting it to 10. For a more real example, I might have acquired a lock around do_substuff, and I might be okay with the thread segfaulting only if the lock wasn't held.
I don't find my example contrived at all; having functions that assume non-NULL pointers call other subroutines that defend against NULL is definitely a routine thing in my experience. Maybe "do_substuff" is called in contexts where the pointer could be NULL, or maybe its developer was paranoid.
I don't think it's reasonable to expect compilers to issue less optimized code to defend against coders doing smart things that are well beyond the scope of the standard. If you want to play fast and loose with segfault handlers then be my guest, but if you want to play with fire you better know what you're doing.
Note that many of C's UBs are also generally effectively unportable, different systems will react differently to things like NULL pointer dereferences (segfault, access to some mapped memory, access to nothing at all or a special handler like the one you described) and signed overflow (overflow to 2s complement, saturation, trap etc...).
I think blaming UBs is frankly the wrong target. The problem with C is that it doesn't let you express those constraints to let the compiler enforce them and let you know when you're doing something wrong. I can't tell the compiler "this pointer is not nullable" and have it keep track of it for me, warning me if I mess up. In contrast in a language like Rust I could use an Option<> type to encode this, and I get a compilation error if I have a mismatch.
That's what I want to see in a more modern C, not a babysitting compiler that emits suboptimal code because it's trying to make my coding mistakes somewhat less mistakeful.
If the programmer really needs reads and writes through particular pointers to happen in a particular sequence, because the target memory follows different rules than the language ordinarily allows the compiler to assume, then it’s the programmer’s responsibility to use the annotation provided by the language for exactly that purpose: volatile. If the compiler had to assume that every pointer needs to be treated as volatile, just about all generated code would be slowed down dramatically.
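For reference, a sketch of that annotation in its usual habitat (the register address is made up):

```
#include <stdint.h>

/* Every read of a volatile-qualified object is an observable access, so the
   compiler cannot hoist the load out of the loop or cache the value. */
#define UART_STATUS (*(volatile uint32_t *)0x40001000u)

static void wait_until_ready(void) {
    while ((UART_STATUS & 1u) == 0) {
        /* spin until the device sets the ready bit */
    }
}
```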
As for locks, the language already has rules about which memory access are allowed to be reordered across lock acquire and release calls. Otherwise locks wouldn’t work correctly at all.
To be extra pedantic, that's not what volatile does. Volatile ensures that access through the variable must behave strictly according to the rules of the C abstract machine, but the definition of access is implementation defined. A compiler author could define "access" to be "reads from the variable" or "writes to the variable", or neither, and make the entire keyword useless. As long as they document that somewhere, it's compliant with the standard. You think it means "reads and writes" but it doesn't have to.
It's tempting to write a "malicious compliance C compiler" that complies with the standard but makes all the most perverse possible choices.
Your comment is technically correct, but the original description of “volatile” in the parent comment is much clearer. It may be fun to imagine what would happen if the compiler writer were malicious, but it doesn’t really help us understand C.
Just like C compilers assume that the program is correct, the writers of the C standard assume that the people reading it are not insane or malicious.
> A compiler author could define "access" to be "reads from the variable" or "writes to the variable", or neither, and make the entire keyword useless. As long as they document that somewhere, it's compliant with the standard.
Nobody is going to accept that as a compliant implementation.
The compiler very rarely knows that UB exists in the program.
To give a concrete example: the compiler sees this:
```
while (<expr>) {
    if (p == 0) { do_thing(); }
    *p = 42;
}
```
In this case, the compiler wants to move `do_thing()` outside the loop because it's expensive. It also knows that if this branch ever executes (i.e. p == 0), the `*p = 42` that follows would be UB.
Therefore the compiler can assume that if the loop runs at least once, then `p != 0`. At no point does the compiler have any information about whether the program is in fact UB.
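A concrete sketch of that transformation (loop condition and signatures made up, since the original uses a placeholder):

```
void do_thing(void);

/* Before: the branch guards the expensive call, but *p = 42 runs on
   every iteration anyway. */
void before(int *p, int n) {
    for (int i = 0; i < n; i++) {
        if (p == 0) { do_thing(); }
        *p = 42;
    }
}

/* After: since *p = 42 would be UB when p == 0, the compiler may assume
   p != 0 whenever the body runs, so the branch and the call are dead. */
void after(int *p, int n) {
    for (int i = 0; i < n; i++) {
        *p = 42;
    }
}
```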
> Further, I would argue that pointer provenance is a meta-language (or meta-computational) concept (which the Rust docs hint at), and logically speaking, might be intractable in the general case (I'd have to think about this more).
Unless we define provenance to only work within certain easy-to-calculate boundaries. But for the general case, sure, just don't optimize when there's uncertainty.
[1] https://www.cs.utah.edu/~regehr/llvm-ub.pdf