Pointers Are Complicated II, or: We need better language specs

dvt · on Dec 14, 2020

First of all: fantastic article. In-depth, insightful, and the examples are absolutely top-notch. On the razor's edge between accessible and profound. Hats off to the author.

I will say that the problems seem to lie in a few interesting interlanguage quirks, and not so much on language specs. For example, LLVM and C have different definitions of "undefined behavior"[1] -- this is pointed out when looking at the `poison` identifier. In LLVM this might just be a weird side-effect we don't care about, but actually having an integer overflow is a big deal in C that introduces UB. Given the introduction, LLVM (incorrectly) introduces (C) UB at times.

The second section, on pointer provenance, is much trickier, as there's no "obvious" UB introduced there. Further, I would argue that pointer provenance is a meta-language (or meta-computational) concept (which the Rust docs hint at), and logically speaking, might be intractable in the general case (I'd have to think about this more). Pointer arithmetic can fiddle with meta-state, whereas IRs like LLVM tend to focus purely on symbol manipulation.

As a side note, I'd be curious what happens when LLVM sees something like: `void *p = &p` which is a self-referential pointer. What does it deduce about it? Does it optimize it away?

Cool thought-provoking article.

[1] https://www.cs.utah.edu/~regehr/llvm-ub.pdf

jcranmer · on Dec 15, 2020

> In LLVM this might just be a weird side-effect we don't care about, but actually having an integer overflow is a big deal in C that introduces UB. Given the introduction, LLVM (incorrectly) introduces (C) UB at times.

This doesn't matter. LLVM IR is not C, and it is not convertible to C, as it implements a superset of C semantics, which includes some cases where behavior that is undefined in C is well-defined in LLVM IR. Once C is lowered to LLVM IR, it doesn't matter what C says is or isn't undefined behavior, because there is no longer any C code being considered.

dvt · on Dec 15, 2020

> This doesn't matter.

Maybe I expressed myself a bit too literally, but I wouldn't say that how LLVM handles UB "doesn't matter." Look at how Clang compiles undefined behavior vs GCC (for example)[1]. A bit tautological, but different ways of handling UB can lead to different ways UB behaves.

> LLVM IR is not C, and it is not convertible to C...

This isn't strictly true. There used to be an "official" C(pp) backend, which has been recently resurrected[2]. You can definitely go from LLVM to C -- probably won't be very readable, though.

[1] https://blog.llvm.org/posts/2011-05-21-what-every-c-programm...

[2] https://github.com/JuliaComputing/llvm-cbe

lambdatronics · on Dec 15, 2020

> different ways of handling UB can lead to different ways UB behaves.

Ha! If something depends on how 'undefined behavior' behaves, then it's not really undefined, is it? :D

klodolph · on Dec 14, 2020

> As a side note, I'd be curious what happens when LLVM sees something like: `void <asterisk>p = &p` which is a self-referential pointer. What does it deduce about it? Does it optimize it away?

Why would it do anything special here? This is not really different from something like this:

    struct a { struct a *p; };
    struct a x = { &a };

The only real difference is that the void version involves a cast, because the type is otherwise impossible to express in C.

dvt · on Dec 14, 2020

> Why would it do anything special here? This is not really different from something like this:

Yeah, I suppose you're right. I was Googling potential interesting cases and ran across this one[1] which made me think of weird edge cases w.r.t. self-reference.

[1] https://stackoverflow.com/questions/20596856/self-referentia...

anderskaseorg · on Dec 15, 2020

There isn’t anything weird about this. A circularly linked list is a common data structure that, when empty, has a pointer pointing to itself.

Self-referential structures can be optimized in accordance with all the usual rules. If there are no other references to them, they can be optimized away. (This optimization is implemented by using a flood fill to mark everything that’s used, and then removing everything else, so that circular references need no special treatment.)

sharpneli · on Dec 14, 2020

I'm pretty sure there is no UB in the second part. It's just LLVM bug. The pointer comparison is valid (pointer to and object and to an another object that's one past the end MAY compare equal). The write is valid.

Reads and writes to char* always alias everything. Unless the compiler can prove that no writes happened it has to emit a read. So if it "forgets" due to the uintptr_t cast where the char* came from then it must assume that it can be pretty much arbitrary and must emit a read.

temac · on Dec 15, 2020

Everybody seems to agree it is an LLVM bug, but "UB" is overloaded. There is no "C/C++ UB" in the original program, but the optimizers can introduce "LLVM UB" in IR form, which then results in something substantially equivalent to what the mock transformed source codes shown for illustration would probably be compiled to. And introducing UB when there was none in the source code means there is a compiler bug.

On your remark about char*, it is an universal type alias, but I don't think it is an universal provenance alias, sadly I don't think such a thing even exist de-facto, and it will not even exist more formally when considering the PVNI-ae-ud model that is being cooked. Probably unwise to lean as usual on the aggressive optimisations side without even proving that the perf impact is that much interesting, evaluating the security / quality and education impact, if you ask me. And even more problematic without even providing an escape hatch (1). But I know very well that state of mind has just won for now and I have to cope with it. Even more C programs will be retroactively declared completely incorrect (and not even just not-portable) because the compilers went too agressive at one point and the standard just standardized their aggressiveness instead of telling implementers to just stop being crazy.

(1) beyond, for some potential programs that would be otherwise impossible to express in C/C++, exposing all the pointers. Well exposing all the pointers would be cute but the result would not be that much different from not having that kind of "opti". Plus you would have to actually expose all your target pointers, so it is not really source compatible with unmodified existing codebases. So a per-access or at least per-pointer-doing-the-accesses universal provenance alias is needed to get a serious escape hatch. I'm also not extremely sure we can actually implement a memory allocator in C or C++ today (with whatever the modern compiler are "optimizing" with their partly informal provenance rules), nor that we will be able to with PVNI-ae-ud (broadly same thing except slightly debugged). (Or maybe it will merely constrain the implementation? not sure anyway)

sharpneli · on Dec 15, 2020

> On your remark about char*, it is an universal type alias, but I don't think it is an universal provenance alias

C doesn't have a concept of provenance in this fashion, or alternatively the compiler must assume the pointers may alias. This is why we have the restrict keyword.

The only cases where the compiler can assume things do not alias are when pointers are to different types (except for char).

Naturally the compiler is allowed to prove that some pointers cannot alias and optimize based on that. But if it messes up it's a compiler bug pure and simply.

temac · on Dec 15, 2020

Neither does C++, IIRC the compilers invented the notion because tons of operations are forbidden between pointers to different objects, so by tracking the provenance you can e.g. detect conditions leading to UB and trim them, because as usual why would the programmer ever write a bug (from the point of view of strict conformity to the standard)? So if it is not a bug thanks to our perfect programmer that of course knows the standard by heart even better than compiler authors apparently do, that must be that this condition is always false, code path is dead, etc.

Hilarity ensues when compiler authors themselves introduce bugs in compliant programs thanks to this line of thought that they often take way to far.

So again: of course this is a compiler bug. But it is caused by an attempt to use provenance analysis for aliasing (that could indirectly be allowed, at least in a non-buggy form, because of some arcane C or C++ rules) that was not implemented correctly. Type based aliasing is more simple because the rules lead to it slightly more directly.

rwj · on Dec 15, 2020

Your comment about char* aliasing everything is for C, not LLVM.

mjevans · on Dec 14, 2020

As usual; if a compiler knows about undefined behavior I would much rather it throw an error rather than optimize something the programmer didn't intend based on the compiler out-smarting a human's ability to be specific.

klodolph · on Dec 14, 2020

I've heard this proposal thrown around a lot on HN, but how would that even work? Can you describe how the compiler would do this, what patterns it would look for, and what error messages it would produce? For example, consider this:

    int factorial(int n) {
        int r = 1;
        for (int i = 1; i <= n; i++) {
           r *= i;
        }
        return r;
    }

Since there are two instances of possible UB in the above function, what error messages would you like the compiler to produce?

First of all, the compiler can assume that the loop terminates. This is a reasonable assumption from a human perspective—the loop is almost certainly intended to run N times. However, if n = INT_MAX, then there’s UB… and there’s no real way to know that n ≠ INT_MAX.

You could argue that you want -fsanitize=undefined, where the compiler inserts checks at runtime.

I don’t think this proposal is well formulated enough to be viable.

Or consider something like this:

    void puts_safe(const char *s) {
      puts(s == NULL ? "(null)" : s)
    }

    void puts_with_len(const char *s) {
      printf("len = %zu\n", strlen(s));
      puts_safe(s);
    }

Do you want something like,

    Warning: puts_safe replaced with puts, because s != NULL, because strlen(NULL) is undefined

I suspect there would be a mountain of false positives, and I would rather just use a static analyzer.

mjevans · on Dec 15, 2020

What I want is for the compiler to have a mode (which ideally would be enabled by default, at least with modern compile targets) that never optimizes based on undefined behavior.

It would do this by handing any UB as an error and making the programmer write a correct program that lacks UB through informing the programmer where undefined behavior has been proven by the compiler. This would improve both the code and the programmer's skills, rather than trying to make each out-think the other.

PeterisP · on Dec 15, 2020

The big gap is that it's not that compilers optimize on "having UB" but rather on "having a potential for UB".

It's not about places where "undefined behavior has been proven by the compiler", it's about all the many places in completely correct code where (as far as the compiler can reason) the code might have UB but should not if it is correct code.

The question is about what to do when the programmer has written a completely correct program that lacks UB. The current approach is to assume that it actually does lack UB and that the code path that would contain the UB is unreachable (though in a manner that the compiler can't prove). The other approach is to insert run-time checks in every such place, which is a heavy performance hit as there are many such places (e.g. every integer addition).

Requiring programmers to eliminate every place in a C program that has a potential for UB is not feasible - cleanly proving for every every integer increment and every pointer dereference that the input data can't cause UB is impossible in the general case, and assuming that all of these can cause UB and pushing warnings means that every second line of perfectly correct code will have a warning that you can't fix, because the code is correct as-is.

saagarjha · on Dec 15, 2020

This is untenable in the general case. For example, consider this function:

  int dereference(int *p) {
      return *p;
  }

What should the compiler do in this case? If I pass (int *)0x1283412 into this function, that's undefined behavior…should it just always warn?

mjevans · on Dec 15, 2020

The compiler isn't __changing any programmer dictated behavior__. There are no UB sourced 'optimizations' being implicitly disabled and none have been expressly enabled. As long as valid code compiles it's on the humans that wrote the code (or that triggered it's writing in some higher level synthesis tool).

Expanding on this; I don't want compilers _making_ optimizations based on UB. I want them educating the programmers so that the UB can be eliminated at a human level and correct outcomes are the result, optimized if it makes sense.

UncleMeat · on Dec 15, 2020

But this is a misunderstanding. The “compiler changing behavior” isn’t actually happening in those cases. The program’s behavior never changed. We just have a wrong understanding of the program’s behavior as humans reading the source. If we were to codify and write down all of the expectations we might have about what behavior a source file might represent so that the compiler can match it exactly.... that’d be the standard and we are back where we started.

klodolph · on Dec 15, 2020

So, you want something like,

    WARNING: p may be NULL (undefined behavior)

mjevans · on Dec 18, 2020

That's one way, but I don't want the compiler attempting to out-think whatever wrote the code. If the code has undefined behavior, if the code has a non-obvious transformation that would make it faster (but IE changes the ordering, referenced variables, etc) that should be presented for integration to the source code, NOT silently performed for that single compilation. Never _make_ the optimization for the programmer. It's fine to have it as an error and suggest the optimal variant instead __for human review__.

"I don't want compilers _making_ optimizations based on UB. I want them educating the programmers so that the UB can be eliminated at a human level and correct outcomes are the result, optimized if it makes sense."

klodolph · on Dec 15, 2020

What you are describing is formal methods. You can take a subset of C, use a formally-verified toolchain, and write your own formally-verified C code using proof systems.

Or perhaps what you want is a different language.

It’s just that what you’re describing—having the compiler detect UB and warn you about it, but still letting you write C code—without the burden of also writing correctness proofs using some formalism—it’s just not a viable option.

From a usability perspective, if you turn on aggressive static analysis / warning settings like this, what you usually end up with is programmers that just start ignoring warnings or turning them off, except for greenfield projects. C’s semantics don’t make this kind of thing easy. If you have a greenfield project and you are fighting the language this hard, that’s precisely the time when using a different language is the most viable option.

scatters · on Dec 15, 2020

That's impossible in the general case, since it's trivially equivalent to the halting problem (take the UB-detecting compiler, have it perform UB iff the program has no UB, run it on its own source code).

What we can do is a combination of formal methods, heuristics and sanitizers. Plus software engineering techniques, such as running tests to high coverage with sanitizers, fuzzing, and so on.

aliceryhl · on Dec 15, 2020

One challenge is that there are lots of cases where a program might have UB for a certain input, but a full-program analysis reveals that this UB is not actually possible to hit with any execution.

Should it still report UB? What if the full program analysis is beyond what the compiler is able to reason about?

GolDDranks · on Dec 15, 2020

So, do you mean that in the above loop example, it would require you to prove somehow that the loop always finishes, or if the programmer fails to do that, fail with "possibly infinite loop" error? In what language/system you would require that proof?

ChrisSD · on Dec 15, 2020

Detecting UB is far harder than the compiler simply assuming there is no UB. That said, there are already tools to do it for some cases (but far from all). They're not often used in C/C++, perhaps because they're slow.

tsimionescu · on Dec 15, 2020

For your first example, I think most people want integer overflow to be unspecified behavior instead of undefined behavior - this is how most other languages treat it, it is how all C compilers behaved for a long time, and it is unreasonably difficult to actually write C code that makes sure not to cause integer overflow.

Your example is in fact perfect for why that should be the case: consider the following code:

  int n = 0;
  scanf("%d", &n)
  factorial(n); //using your definition of factorial(int)

A standard-compliant C compiler is apparently free to optimize this program to format your hard-drive.

For your second example, I would say that, rather than omitting the NULL check, a helpful compiler could do one of two things:

1. Don't reason from strlen(s) to s != NULL, so the NULL check must be left in (unless it has even more context and can see a non-NULL assignment to s) 2. Wherever trying to optimize away a comparison to NULL based on UB reasoning, put in a new comparison to NULL at the place you applied that reasoning. For this example, optimize the program as if it looked like this:

  //first, the inline version  
  void puts_with_len(const char *s) {
    s_len = strlen(s);
    printf("len = %zu\n", s_len);
    const char* puts_arg = s == NULL ? "(null)" : s
    puts(puts_arg)
  }
  //after optimization
  void puts_with_len(const char *s) {
    s_len = strlen(s);
    const char* puts_arg;
    if (s != NULL) {
      puts_arg = s;
    } else {
      puts_arg = "(null)"; //could also explicitly signal a segfault or assert here instead
    }
    printf("len = %zu\n", s_len);    
    puts(puts_arg);
  }

In this case we didn't gain anything, but if puts_with_len were itself inlined the check would be moved further back, potentially replacing many NULL checks with a single one.

I would note that there is a third option here that goes in a different direction: now that compilers are extremely aggressive with NULL check removal optimizations, a lot of unsafe C functions could be made safe by manually adding the missing NULL checks to the stdlib and other major libraries. This wouldn't affect the semantics, and it wouldn't hurt performance assuming the optimizer really is doing its business.

For example, strlen() could itself raise an assertion/exit on strlen(NULL). If called from a context where it is known that s != NULL, the null check can be optimized away by the aggressive optimizer; if not, better safe than sorry.

UncleMeat · on Dec 15, 2020

If signed integer overflow is implementation defined rather than undefined then it isn’t an error and we cannot make compiler features that warn or reject when we can prove it will occur. In your case we’ve managed to get the worst of both worlds (a buggy program and no capacity for the compiler to stop you).

tsimionescu · on Dec 15, 2020

For a long time in C's history, for most platforms, int overflow was actually treated as well defined behavior, with many examples suggesting to use tests like x + y < x (assuming positive integers) to detect overflow.

In modern C there is simply no portable way to easily check for integer overflow for 64-bit values, even though the vast majority of programs are running on a processor that defines exactly what happens with integer overflow, and even sets a flag that can be tested for in a single jump.

People often cite for loops over arrays as an example of places where treating integer overflow as UB helps with optimizations. This despite the fact that the recommended, standards compliant portable way to iterate over the range of indices in an array is to use a size_t index variable, which is an unsigned type.

UncleMeat · on Dec 16, 2020

> even though the vast majority of programs are running on a processor that defines exactly what happens with integer overflow, and even sets a flag that can be tested for in a single jump

Widths matter. Platforms that do this don't overflow both 32 and 64 bit signed integers. So if you want to define signed integer overflow for all signed integer widths then for one (or both) of these widths you need to stick in runtime checks.

simias · on Dec 14, 2020

It would sometimes work but I think this (very common comment) misses the general point, there's absolutely fine and non buggy on warning-worthy code that can be optimized away if the compiler relies on UB. A very simple example:

    void do_stuff(int *some_ptr) {
        do_substuff(some_ptr);

        *some_ptr += 2;
    }

    static void do_substuff(int *some_ptr) {
        if (some_ptr != NULL) {
            *some_ptr = 10;
        }
    }

do_stuff calls a subroutine that does a NULL pointer check, then unconditionally dereferences the same pointer.

From this the compiler, if it decides to inline that code, can decide to optimize the NULL check away since if the pointer is NULL it's guaranteed to trigger UB. The logic being "clearly the programmer assumes that the pointer can't be NULL here, so thanks to UB rules I can too".

There's nothing wrong with this code, there's no reason to emit a warning.

Distinguishing between "of course that's a reasonable optimization, that's probably what the developer intended" and "wow this compiler is so dumb, obviously I never meant for that to happen" is a very tough nut to crack.

At this point you can push the blame to the C language not being expressive enough, in this case not giving us a way to express nullable vs. non-nullable pointers within the language, which forces the compiler to lean onto these UB heuristics to optimize.

dooglius · on Dec 14, 2020

Even in your heavily contrived example, I can think of cases where the optimization isn't what the programmer wants. For instance, I might have a special handler via userfaultfd(2) that detects if I'm doing an increment of the null pointer and handles it in some special way, but can't handle just setting it to 10. For a more real example, I might have acquired a lock around do_substuff, and I might be okay with the thread segfaulting only if the lock wasn't held.

simias · on Dec 14, 2020

I don't find my example contrived at all, having functions assuming non-NULL pointers call other subroutines that defend against such a thing is definitely a routine thing in my experience. Maybe "do_substuff" is called in contexts where the pointer could be NULL, or maybe its developer was paranoid.

I don't think it's reasonable to expect compilers to issue less optimized code to defend against coders doing smart things that are well beyond the scope of the standard. If you want to play fast and loose with segfault handlers then be my guest, but if you want to play with fire you better know what you're doing.

Note that many of C's UBs are also generally effectively unportable, different systems will react differently to things like NULL pointer dereferences (segfault, access to some mapped memory, access to nothing at all or a special handler like the one you described) and signed overflow (overflow to 2s complement, saturation, trap etc...).

I think blaming UBs is frankly the wrong target. The problem with C is that it doesn't let you express those constraints to let the compiler enforce them and let you know when you're doing something wrong. I can't tell the compiler "this pointer is not nullable" and have it keep track of it for me, warning me if I mess up. In contrast in a language like Rust I could use an Option<> type to encode this, and I get a compilation error if I have a mismatch.

That's what I want to see in a more modern C, not a babysitting compiler that emits suboptimal code because it's trying to make my coding mistakes somewhat less mistakeful.

anderskaseorg · on Dec 14, 2020

If the programmer really needs reads and writes through particular pointers to happen in a particular sequence, because the target memory follows different rules than the language ordinarily allows the compiler to assume, then it’s the programmer’s responsibility to use the annotation provided by the language for exactly that purpose: volatile. If the compiler had to assume that every pointer needs to be treated as volatile, just about all generated code would be be slowed down dramatically.

As for locks, the language already has rules about which memory access are allowed to be reordered across lock acquire and release calls. Otherwise locks wouldn’t work correctly at all.

SAI_Peregrinus · on Dec 15, 2020

To be extra pedantic, that's not what volatile does. Volatile ensures that access through the variable must behave strictly according to the rules of the C abstract machine, but the definition of access is implementation defined. A compiler author could define "access" to be "reads from the variable" or "writes to the variable", or neither, and make the entire keyword useless. As long as they document that somewhere, it's compliant with the standard. You think it means "reads and writes" but it doesn't have to.

It's tempting to write a "malicious compliance C compiler" that complies with the standard but makes all the most perverse possible choices.

klodolph · on Dec 15, 2020

Your comment is technically correct, but the original description of “volatile” in the parent comment is much clearer. It may be fun to imagine what would happen if the compiler writer were malicious, but it doesn’t really help us understand C.

Just like C compilers assume that the program is correct, the writers of the C standard assume that the people reading it are not insane or malicious.

Dylan16807 · on Dec 15, 2020

> A compiler author could define "access" to be "reads from the variable" or "writes to the variable", or neither, and make the entire keyword useless. As long as they document that somewhere, it's compliant with the standard.

Nobody is going to accept that as a compliant implementation.

Diggsey · on Dec 15, 2020

The compiler very rarely knows that UB exists in the program.

To give a concrete example: the compiler sees this:

    ```
    while (<expr>) {
        if (p == 0) { do_thing(); }
        *p = 42;
    }
    ```

In this case, the compiler wants to move `do_thing()` outside the loop because it's expensive. It also knows that it would be UB for this branch to execute if `*p = 42` executes, because that would result in UB.

Therefore the compiler can assume that if the loop runs at least once, then `p != 0`. At no point does the compiler have any information about whether the program is in fact UB.

Dylan16807 · on Dec 14, 2020

> Further, I would argue that pointer provenance is a meta-language (or meta-computational) concept (which the Rust docs hint at), and logically speaking, might be intractable in the general case (I'd have to think about this more).

Unless we define provenance to only work within certain easy-to-calculate boundaries. But for the general case, sure, just don't optimize when there's uncertainty.

ralfj · on Dec 15, 2020

Thank you so much for the feedback :) . I spent more time on this than on my usual post, so it is great to hear that that has paid off.

ChrisSD · on Dec 14, 2020

This is a standalone sequel to "Pointers Are Complicated, or: What's in a Byte?" (https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html).

Which has been discussed on HN

2020: https://news.ycombinator.com/item?id=24376797

2018: https://news.ycombinator.com/item?id=17604402

ralfj · on Dec 14, 2020

Oh, there was a repost in 2020, that explains the sudden spike in page hits that I saw earlier this year and couldn't trace back. ;)

TrianguloY · on Dec 14, 2020

I'm sorry but I don't fully understand the problem and that 'provenance' thing. For me the third optimization is the wrong one, for the same reason as this

  char i,j='0';
  *(&i+1)='1';
  cout << j;

can't be optimized to

  cout << '0';

even though the j variable is also never overwritten directly.

This reminds me of paralelization of nested loops with pragmas, where you need to specifically say that two pointers will never point to the same space, otherwise the compiler won't optimize the loop because 'it could happen'.

Someone · on Dec 14, 2020

If you think that, you’re giving up lots and lots of optimization opportunities.

There’s zero guarantee that i and j are adjacent on the stack or even on the stack (there isn’t even a guarantee that there is a stack, but that’s a different subject); a compiler can decide to keep j in a register. That is very common in short functions, and essential for performance.

It also would mean the compiler would have to load data from memory way more often, as ¿about? every pointer write might overwrite any memory.

dataflow · on Dec 14, 2020

But this is LLVM IR, not C. It feels like a trick question (hear me out). Either with IR semantics the code is already UB (as in C), or it isn't. If the code isn't UB, then why? The obvious assumption is that it's because LLVM IR treats this as a model of a concrete machine (where provenance isn't a thing), not as a C-like abstract machine with UB. In which case, the optimization #3 is clearly invalid. OTOH, if it is already UB, then the optimizations are fine, right?

The only way I can see that we can implicate optimization #1 or #2 is by claiming LLVM IR has neither C semantics nor machine semantics, but with some kind of in-between wobbly semantics. There's nothing fundamentally wrong with that, but that's not what a reader would assume in the absence of other information. Especially not when the author writes "this program has two possible behaviors"—this clearly tells us there's no UB and that we should assume this is based on concrete machine semantics (where provenance isn't a thing). So, I would also say, given the info in the post prior to the conclusion, #3 has to be the culprit. I just can't implicate #1 or #2 based on the preceding text; to implicate those you'd have to pull assumptions out of thin air.

lmm · on Dec 15, 2020

> Either with IR semantics the code is already UB (as in C), or it isn't. If the code isn't UB, then why?

With C-like semantics the code isn't UB. Why would it be? Taking a one-past-the-end pointer is legit, comparing it is legit, casting a pointer to an integer is legit.

dataflow · on Dec 15, 2020

Comparing pointers derived from entirely different objects (or arrays) isn't legit. See here: https://www.viva64.com/en/b/0576/

"Pointer arithmetic is only defined for pointers pointing into array objects or one past the last element. Comparing pointers for equality is defined if both pointers are derived from the same (multidimensional) array object. Thus, if two pointers point to different array objects, then these array objects must be subaggregates of the same multidimensional array object in order to compare them. Otherwise this leads to undefined behavior."

For the purpose of comparisons, C++ lets you partially work around this via std::less et al, which are nevertheless guaranteed to implemented in such a way as to give you a total ordering among all pointers. Even then, I don't think you can freely interchange pointers that compare "equal" in that sense; the same object may have pointers with different representations, all of which compare "equal" for ordering purposes but which don't have an identical representation. (See segmented memory models, for instance.) I also don't recall if C has a similar workaround (it might not).

lmm · on Dec 15, 2020

> Comparing pointers derived from entirely different objects (or arrays) isn't legit.

The quote from the standard in your article contradicts the part you quoted:

> Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.

But in any case the program in the article doesn't compare pointers; it casts the pointers to integers and then compares those integers, which is always legitimate.

dataflow · on Dec 15, 2020

They don't contradict to my understanding (see below).

First, regarding comparing pointers vs. integers, that goes back to what I said about identical pointer representations as an integer. Just because two pointers point to the same memory location, it does not mean they have an identical representation as an integer. This might be difficult to understand if you think of a linear address space, but it's easier to understand once you're familiar with segmentation. Pointers derived in different ways can ostensibly have different representations due to nonlinear (e.g., segmented) memory models, and yet nevertheless point to the same memory location. Hence it is actually undefined whether they compare equal. There is no canonicalization procedure that ensures they result in the same integer representation.

One pointer (not sure if pun intended) that might help you see this: consider words like "follows" aren't even necessarily well-defined for nonlinear address spaces—IIRC they're defined not with respect to the physical machine, but respect to struct/object layout in the language, implying there is a larger object/array containing both of these adjacent arrays to begin with. That's what they're talking about in that quote: if you start from a pointer to an array and go to another array that follows it, then it's fine to compare the pointer. To do that, you have to be able to first prove (via the language's abstract machine semantics—not by putting your linear DRAM under a microscope!) one follows the other one, and you cannot do that unless you know they're part of the same aggregate structure.

Anyway, all of this is my understanding of what's going on (and apparently others', as you see on that page). The entire page is a longer discussion on the meaning of these quotes. You can't ignore the rest of the entire page and just quote the first part, since this is exactly what they're dissecting. Read the whole thing, and look at the links they point to. What I quoted is the conclusion they come to based on the preceding quotes (including yours). The standard is not 100% explicit on the issue (hence the discussion and requests for clarification), but many of those who've looked into the issue so far have come to the conclusion I quoted. You need to read it extremely carefully, and (again) if you look at nonlinear address spaces, this should make more sense.

lmm · on Dec 15, 2020

> Just because two pointers point to the same memory location, it does not mean they have an identical representation as an integer.

Yes, I know. That's irrelevant here for two reasons: firstly, the pointers might compare equal (which is all that matters for this argument), and secondly, once again, the comparison isn't being made between pointers, the comparison is being made between integers.

> The entire page is a longer discussion on the meaning of these quotes. You can't ignore the rest of the entire page and just quote the first part, since this is exactly what they're dissecting. Read the whole thing, and look at the links they point to.What I quoted is the conclusion they come to based on the preceding quotes (including yours). The standard is not 100% explicit on the issue (hence the discussion and requests for clarification), but many of those who've looked into the issue so far have come to the conclusion I quoted.

Yeah, no, their conclusion is wrong (and frankly I don't consider those guys a reputable source; they've said plenty of false things before, and they're prone to exaggerating the extent of undefined behaviour because they have a vested interest in portraying it as more common than it is). They claim that making equality comparisons is undefined behaviour but their only basis for that is a quote from the section on relational operators which is obviously only talking about relational operators. Read the actual standard quotations in context and there is no confusion.

dataflow · on Dec 15, 2020

> Yes, I know. That's irrelevant here for two reasons: firstly, the pointers might compare equal (which is all that matters for this argument), and secondly, once again, the comparison isn't being made between pointers, the comparison is being made between integers.

You know what, I think you're right, at least on the second point. I guess the code isn't UB after all then. Somehow I kept reading it as if it's comparing pointers.

Regarding the other point about following something in memory:

> I don't consider those guys a reputable source [...] they claim that making equality comparisons is undefined behaviour but their only basis for that is a quote from the section on relational operators which is obviously only talking about relational operators.

There are multiple entities here, in agreement with each other. The author links to bug #61502 in GCC, where you'll see the compiler devs agree with the author of that page: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61502#c15

You might disagree with their stances for whatever reasons you might have, but firing reputational shots at them when there are clearly other experts disagreeing with them isn't really helping make your position more convincing.

> Read the actual standard quotations in context and there is no confusion.

There was no "confusion" on this point to begin with (on my end anyway). Again, as I explained: they might not be intuitive if you're assuming a linear address space, but they're quite fine to interpret. You just need to avoid bringing your own extra assumptions into the table. The notion of "following" in memory hinges on your assumption that the address space is linear. This is pretty similar to how to be able to say your house "follows" another house, you need to first make nontrivial assumptions (e.g., that they're on the same street). To my knowledge C does not make such assumptions globally; all it does is lets you satisfy that criterion by putting them in the same aggregate structure ("street") to begin with, but if you can't do that, you can't satisfy that assumption.

lmm · on Dec 15, 2020

> There are multiple entities here, in agreement with each other. The author links to bug #61502 in GCC, where you'll see the compiler devs agree with the author of that page: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61502#c15

No they don't. The previous comment from that compiler dev explicitly says that this comparison is not undefined behaviour. There is nothing in either the standard or that bug report that supports the claim that you originally posted ("Comparing pointers for equality is defined if both pointers are derived from the same (multidimensional) array object. Thus, if two pointers point to different array objects, then these array objects must be subaggregates of the same multidimensional array object in order to compare them. Otherwise this leads to undefined behavior."). Equality comparisons (not relational comparisons) of pointers to (unrelated) arrays are not and never have been undefined behaviour, and anyone claiming they are is at best confused.

dataflow · on Dec 15, 2020

Interesting. So I went back to look at the standard. Like you said, for the relational operators <, <=, >=, and >, they do explicitly say comparing unrelated objects is undefined behavior. But as you suggest, for == and != equality operators, they don't say anything about undefined behavior for comparing pointers to unrelated objects; in fact they do say that unrelated pointers can point to adjacent memory locations if the implementation "chooses" to place unrelated objects in that manner.

However, they also throw in this wrinkle: "The == and != operators are analogous to the relational operators except for their lower precedence."

Does the analogy extend to the UB-ness for different objects? (If not, wouldn't that seem to contradict this sentence?)

On the one hand, it would seem not, since they do say "if and only if" in that clause (which lacks any mention of UB).

On the other hand, I'm not sure how else to interpret that sentence... that sentence would seem superfluous if I just ignored it, since everything else seems to be spelled out already.

So I'm kind of at a loss now. Maybe you're right then, I don't know... at least you seem to be right about the fact that it is confusing. Possibly later standards clarify this? I haven't checked.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf#pa...

lmm · on Dec 15, 2020

"Analogous" generally means similar but not the same; if it's a normative description it's underspecified, whereas the "if and only if" description of the semantics of == and != for pointers is definitely normative. I'm sure the standard could be clearer but I do think there's only one defensible interpretation.

dataflow · on Dec 15, 2020

It's underspecified, but it is normative. While it's not mathematically inconsistent for the sentence to be redundant/superfluous, it would be rather silly and misleading. It's reasonable to assume it's there for a reason (i.e. it makes a difference whether it's there or not). Given they spell out pretty much everything else anyway, it would seem this point is where it would make a difference.

Moreover, I have a hard time justifying this on first principles. It seems rather arbitrary to make relational operators UB but make equality operators well-defined for unrelated objects. After all, if you can tell if two pointers to unrelated objects are the same, that means you're willing and able to canonicalize them to a common-denominator representation, in which case you already have an ordering based on that same canonicalization too... why forbid that? It would make sense to prevent comparing pointers derived from unrelated objects (due to future performance & extensibility considerations), and it would certainly also make sense to allow it (more convenient/intuitive/etc.)... but to allow one but not the other just doesn't make sense as far as I can see it. You end up with something that's both unintuitive for the user and restrictive for the implementation. So it's a little tough to swallow that that's really what they intended.

lmm · on Dec 15, 2020

> After all, if you can tell if two pointers to unrelated objects are the same, that means you're willing and able to canonicalize them to a common-denominator representation, in which case you already have an ordering based on that same canonicalization too... why forbid that?

UB wasn't meant to forbid anything, it was meant to allow implementers to offer different behaviour. Implementers who want to represent pointers internally as integers and have comparisons just compare the underlying integers can do that. Implementers who have something more complicated e.g. segmented architectures are required to support equality comparison (which they can do easily by just having pointers to different segments always compare nonequal) but are not required to define an ordering between different segments.

To me it seems like the kind of compromise the standard makes all the time (which is not to say I agree with it). Requiring everyone to define an ordering on all kinds of pointers would be too burdensome for implementers (or maybe some implementers wanted to trap on unrelated pointer comparison, because if you're doing pointer arithmetic with unrelated pointers you've probably got a bug). Whereas making equality comparisons UB would be way too burdensome for users; it would mean you couldn't e.g. have a list of pointers and check whether a given pointer was in the list by searching through its elements.

dataflow · on Dec 15, 2020

> UB wasn't meant to forbid anything, it was meant to allow implementers to offer different behaviour.

That's unspecified behavior, not undefined behavior. Unspecified behavior can literally do anything or nothing at all, including aborting the program nondeterministically. That effectively forbids invocation of UB for the programmer since you can't reason about the program after it's invoked, unless you've verified your implementation has actually defined the behavior for you (despite not being required to). That's quite different from unspecified behavior where the implementation is required to pick some sane behavior (often among a set of acceptable behaviors) and stick with it in a self-consistent manner.

> Requiring everyone to define an ordering on all kinds of pointers would be too burdensome for implementers

I don't see how it'd be more burdensome, and I don't think burden is the issue here anyway. Implementers already have to implement canonicalization for equality comparisons, and they already have to implement casting to uintptr_t too. Just put the two together you get comparisons. One reason to not want to do this is that it might be too expensive to canonicalize to an integer under the hood (maybe it's complicated arithmetic or whatever), but if they're already willing and able to jump through all the hoops just for the sake of equality comparisons, ordering is hardly any different.

> (or maybe some implementers wanted to trap on unrelated pointer comparison, because if you're doing pointer arithmetic with unrelated pointers you've probably got a bug).

Not at all. You want to linear-search in a list and want equality to work for that? Well I want to binary search in an array/BST and need comparisons to work for that. This was such an obvious thing to want to do that C++ made it work out-of-the-box for std::less and whatnot.

lmm · on Dec 16, 2020

> That's unspecified behavior, not undefined behavior. Unspecified behavior can literally do anything or nothing at all, including aborting the program nondeterministically. That effectively forbids invocation of UB for the programmer since you can't reason about the program after it's invoked, unless you've verified your implementation has actually defined the behavior for you (despite not being required to). That's quite different from unspecified behavior where the implementation is required to pick some sane behavior (often among a set of acceptable behaviors) and stick with it in a self-consistent manner.

It's very hard to make trapping an acceptable implementation for unspecified behaviour while allowing the implementation to do the usual kind of reordering, so the standard doesn't try. That's the original intention behind e.g. null pointer dereference being undefined behaviour - implementations should be permitted to make null pointer dereference trap, but should also be permitted to reorder or optimize out pointer dereferences.

> Implementers already have to implement canonicalization for equality comparisons

No they don't - they can just make comparison return false for pointers of different types or from different segments. Canonicalisation is not the only way to do equality-comparison!

> and they already have to implement casting to uintptr_t too

But the result of that isn't required to have the same comparison semantics as pointers. E.g. if some of your pointers are aligned and you internally represent those without trailing zeroes (like the JVM does) and use that as the integer cast, then some different types of pointer end up casting to the same int, which is fine. But the standard requires those pointers to not compare equal as pointers, so you can't implement pointer comparison like that. (Well, you can implement >= and <= like that, because those comparisons are undefined behaviour. But you'll have cases where a <= b and b <= a but a != b).

> You want to linear-search in a list and want equality to work for that? Well I want to binary search in an array/BST and need comparisons to work for that.

Look, I'm not saying I agree with the standards committee here, I'm saying that it's a plausible compromise position for them to have taken. Not being able to do equality comparisons would be way more limiting for users than not being able to do relational comparisons. Having to implement relational comparisons would have been a bit more work for implementers than only having to implement equality comparisons.

Dr_Emann · on Dec 14, 2020

Your block of code can be (and absolutely is) optimized to output zero directly (in c, because the assembly is easier to read):

https://c.godbolt.org/z/qe3f4K

`*(&i+1)` dereferences the pointer "one past the end" which is UB.

TrianguloY · on Dec 14, 2020

And I agree, with an UB the original code is undefined and so any optimization is not right nor wrong, simply keep that UB.

But then, why such a long article for a code whose second line is a UB ('p+1' is undefined)?

fanf2 · on Dec 14, 2020

No, p+1 is OK because you are allowed to make a pointer one past the end in C (so that common pointer loops work as expected) but you cannot dereference them.

gizmo686 · on Dec 14, 2020

    uintptr_t ip = (uintptr_t)(p+1);

Is perfectly defined. What is undefined is converting ip back to a pointer and writing to it, which the original program never does.

TrianguloY · on Dec 14, 2020

Hmm, I think I understand now.

So now I think it's the first optimization the one that is wrong. If you replace a variable with another, don't you need to keep information of the original variable?

  I mean, if a==b and you do *a=1 you can replace it with *b=1, but then you need to keep the information that 'a' was written to, so other optimizations (the third one) don't think it wasn't. Or am I missing something else here?

Edit: sorry for the code block, otherwise the asterisks are removed.

Dylan16807 · on Dec 14, 2020

> If you replace a variable with another, don't you need to keep information of the original variable?

Note that ip and iq are integers. If you can't replace integers freely, it makes it really hard to optimize arithmetic.

Dr_Emann · on Dec 14, 2020

Because there is no UB in the original program.

It constructs a pointer to the "one past the end" element, but that is fine, and the original program never dereferences that pointer. Again: there is no UB in the original program.

magicalhippo · on Dec 15, 2020

There's clearly potential for pointer aliasing in the program, isn't that UB?

fanf2 · on Dec 14, 2020

“Provenance” is about the object that a pointer refers to. In your example, i and j are different objects and the language does not permit you to modify j using a pointer derived from i.

Your comparison with parallel loops is interesting because in the case of big arrays, all the pointers refer to the same object, so the compiler can’t use provenance or strict aliasing to prove that they don’t overlap. So you have to use the pragmas to give the compiler permission to treat them as non-aliasing.

sheepfleece · on Dec 14, 2020

``` *(&i+1)='1'; ```

This line is UB, no? You are referencing some random, if adjacent, block of memory. Anything can happen.

vlovich123 · on Dec 14, 2020

Does this have implications about the safety guarantees even safe Rust can make? It seems like incorrect optimization passes could result in bugs/security issues that are a byproduct of the compiler rather than the language itself. I wonder how big of an issue this is for Rust (relatively probably a bigger issue than for C/C++ where basic resource ownership bugs are more common).

jswrenn · on Dec 14, 2020

Yes, the pointer issues Ralf raises has implications about what operations Rust can make safe. An example Ralf posed to the safe transmute working group: a transmutation from a reference (i.e., a pointer with provenance) to one without (e.g., a "raw" pointer) or to a `usize` number cannot be safe. (Rust can safely provide these conversions via the `as` keyword, just not via the more general framework of transmutations).

More broadly: Rust is very sensitive to LLVM bugs that inadvertently generate UB. The longstanding issues with `noalias` optimizations is the prime example of this: https://stackoverflow.com/a/57259339/7501301

Dylan16807 · on Dec 14, 2020

How could transmutation from a reference to a usize not be safe? Do you mean "and back"?

steveklabnik · on Dec 14, 2020

Yes, absolutely. https://github.com/rust-lang/rust/issues/28728 is probably the most famous bug here, but many I-unsound bugs are similar.

This would be the case even without this. Compilers are programs. Programs have bugs.

gpm · on Dec 14, 2020

This isn't the first soundness/compilations bug, it won't be the last. This bug can be fixed, hopefully with some work this whole class of bugs can be made rarer. If you were relying on your safety guarantees to include a bug free compiler - rust isn't there yet and you should stop. If you're relying on a bug free happy path, rust is as close as any other language implementation (with one or two exceptions pointed out in sibling comments).

So, I mean, if you're arguing that C/C++ programmers are culturally less repulsed by bugs... you might be right about that. Other than that I think it's about the same scale for C/C++ and rust - a bug is a bug no matter what language you wrote it in.

The only fundamental implications I think it has is that maybe we need to put more of an emphasis on formalizing the memory model if we want to reduce the number of soundness bugs (something that a number of people are already doing some awesome work towards).

dbaupp · on Dec 14, 2020

It’s enough of an issue for C at least that a fully formally verified compiler exists: https://en.m.wikipedia.org/wiki/CompCert . That is, there’s proof that its (limited) optimisation passes are correct.

There’s other approaches to assurance of compiled code too, such as seL4 which has proofs that the binary itself is correct, I believe.

astrange · on Dec 15, 2020

Note that fuzzing CompCert has still found bugs in it - because the entire compiler actually isn't verified. They've extended the proof over time.

(Note that "formally verified" still does not mean "bug free", it just means it's a refinement of a proof that is also a program which can have bugs in it. And "formally" sounds like a weasel word.)

ralfj · on Dec 15, 2020

Yeah, there were some bugs in the frontend and I think one in the backend. But there were zero bugs in the optimization pipeline, which is where typically the most subtle bucks lurk.

For more details, see https://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf

dbaupp · on Dec 15, 2020

Ah, thanks for the clarification!

abra529 · on Dec 15, 2020

Commenters here as well as the author seem content with asserting that under C semantics, the code after the second optimization exhibits UB when it dereferences (p+1). Is it really UB?

Here's the snippet for reference:

    char p[1], q[1] = {0};
    uintptr_t ip = (uintptr_t)(p+1);
    uintptr_t iq = (uintptr_t)q;
    if (iq == ip) {
      *(p+1) = 10; // <- This line changed
      print(q[0]);
    }

From the article: "LLVM IR (just like C) does not permit memory accesses through one-past-the-end pointers." I think this is a mental shortcut we make that can lead us astray here. Of course, one-past-the-end pointers can't usually be dereferenced legally. But I think it's legal here because when (iq == ip), (p+1) points to a valid location. I don't see (at least in the C99 standard) why this code would be illegal.

I realize that the code is not meant to represent C but rather LLVM IR, but surely, if this snippet is legal C code, then LLVM can't treat it as if it were UB? So that would mean the third optimization is incorrect here.

ralfj · on Dec 15, 2020

> if this snippet is legal C code, then LLVM can't treat it as if it were UB?

It can in principle for the purpose of this example, since this is not the C code that the programmer originally wrote. This just means that if the snippet is legal C code, LLVM needs to do something extra during its translation to make this legal LLVM IR.

But I am also happy to consider a different example, where this is the original C program. Is this legal C code? To my knowledge, basically everyone agrees that the answer is "no, this is UB". That includes both compilers (that will happily "miscompile" this code) and all formal semantics for C that I have seen so far. So I think you are in the minority with your interpretation of the standard. It's a shame that the standard is not precise enough to properly settle this question...

If the standard were amended to explicitly make this code allowed in C, I think C compiler devs would revolt, as all major C compilers have some pretty crucial analysis that are fundamentally relying on this not being legal C code.

abra529 · on Dec 15, 2020

I see, thanks for the explanation. Just read your previous blog post that goes into more detail about this. [1] I don't know much about compilers, but I guess Defect Report #260 supports the interpretation that dereferencing q isn't the same as dereferencing (p+1), even inside an if block where p+1 == q: "[Implementations] may also treat pointers based on different origins as distinct even though they are bitwise identical." [2]

[1] https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html

[2] http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_260.htm

siraben · on Dec 14, 2020

> However, to make the argument that an optimization is correct, the exact semantics of LLVM IR (what the behavior of all possible programs is and when they have UB) needs to be documented.

This is a great point as to why formal semantics of programming languages matters. Even if an optimization seems "obviously" correct, finicky things like introducing the possibility of UB can, as the post outlines, cascade into compiler passes that change the program more and more drastically. The post mentions that one need not go overly formal with the semantics, but I'll demonstrate what could happen when one does.

One possible formulation of semantics is denotational semantics, which describes how to map a program source syntax into values, i.e. eval : Expr -> Val. So when we have an optimization opt : Expr -> Expr, the desired correctness property for that optimization is that

  Definition opt_correct (opt : Expr -> Expr) := forall (e : Expr), eval e = eval (opt e).

When we want to rewrite, say e + 0 ==> e, for any expression e, the correctness can be stated and proved

  Theorem e_plus_0_to_e_correct : opt_correct e_plus_0_to_e.

One claim in the blog post that correct optimization passes, e.g. opt1 and opt2 compose into another correct optimization pass can be stated as:

  Theorem opt_correct_compose (opt1, opt2 : Expr -> Expr) :
       opt_correct opt1 -> opt_correct op2 -> opt_correct (opt1 ∘ opt2).

Which means given any two optimization passes opt1 and opt2 such that they are correct, composing them preserves correctness. The proof is simply;

    eval (opt1 (opt2 e))
  = { since opt1 is correct }
    eval (opt2 e)
  = { since opt2 is correct }
    eval e

Dylan16807 · on Dec 14, 2020

Not sure why you wanted to spend the time to prove "correct optimizations can be combined", but I take issue with that proof. It only works for optimizations that give code exactly the same behavior, which is severely limiting.

thaumasiotes · on Dec 15, 2020

That's not a problem in the proof, it's part of the definition of a "correct optimization", given above.

Dylan16807 · on Dec 15, 2020

I count the definitions given for a proof as part of the proof.

thaumasiotes · on Dec 15, 2020

That is a serious mistake here, where the definition of a correct optimization is of significantly more interest than the result that they can be composed. We want to apply optimizations alone even more than we want to apply them together.

Putting things another way, this definition is clearly not given as part of a proof; it is given for its own sake, and the proof uses it.

Dylan16807 · on Dec 15, 2020

I wouldn't call slightly imprecise wording a "serious mistake".

I object to points at the part of the post. My objections are unchanged by what we call it.

mhh__ · on Dec 15, 2020

Exactly the same as in the as-is rule or optimisations explicitly allowed by the standard?

Dylan16807 · on Dec 15, 2020

There are two big issues.

One is that the as-is rule only says that code has to match a possible execution of the abstract machine. Let's say an optimization changes the address where a variable gets allocated. That's an extremely valid optimization, even though the program can observe the change. But that would make it fail the "eval e = eval (opt e)" rule in siraben's proof. The same for picking a different order to execute the functions in "f() + g()".

The other is optimizing around undefined behavior. The as-is rule only applies for valid inputs. Optimizing a loop by assuming you won't overflow would get rejected by that proof. So would optimizing code so that it won't segfault.

And depending on how exactly that eval test works, it might effectively mark every variable as volatile too.

siraben · on Dec 15, 2020

> But that would make it fail the "eval e = eval (opt e)" rule in siraben's proof. The same for picking a different order to execute the functions in "f() + g()".

This is really a question of how the semantics are formulated. The eval function I gave doesn't take into account an abstract machine so there is no notion of "variable allocation" or "final state" to check, the semantics doesn't account for it.

To scale it to a more realistic model with nondeterminism, heaps and so on, the semantics needs to be changed to a relational one. For instance, eval would now be a relation that relates two states of the machine, and a proof of correctness would be something like[0], which takes into account all possible states of the heap.

Equality would no longer used to relate two "equivalent" programs but rather some other equivalence relation with the properties one cares about, for instance two programs would be heap-equivalent if they have exactly the same effect on the heap, or UB-equivalent if they have possible UB at the "same" (again under another relation) places.

[0] https://softwarefoundations.cis.upenn.edu/plf-current/Equiv....

samatman · on Dec 14, 2020

Excellent article.

It seems to me (and this is somewhat off-the-cuff) that compilers have another option: integers which are cast from pointers have provenance.

That is, provenance is a taint: you can't clean it off through casting. So casting from pointer means the user can do integer-things like addition, but it means the compiler can't do integer-things like constant folding.

gizmo686 · on Dec 14, 2020

I don't think that works. Consider the following (contrived) program:

  char* q[1] = {0};
  int iq = (uintptr_t)q;
  int ip = 0;
  while (iq>0) {
    iq--;
    ip++;
  }
  char* p = (char*) ip;

You can make this more efficient (albeit no less contrived) by iterating through iq bit-wise.

Less contrived would be sending a pointer through some IPC mechanism (allowing it to be used as an opaque handle externally, while the program will blindly dereference it when it is passed back in); or even passing it within a program across compilation units.

If you treat provenance like a taint, you need to track every way it could be propagated, and that does not seem solvable.

saagarjha · on Dec 15, 2020

You can solve it by deoptimizing accesses to q once it has been casted to an integer.

samatman · on Dec 14, 2020

So what are the optimizations here? iq is tainted, you can't hoist the loop out. You've said "I'm doing a weird thing here with an integer which used to be a pointer, you have to assume that it involves memory and treat it more like a pointer than an integer"

tsimionescu · on Dec 15, 2020

Thinking about this quite a bit, I see the second optimization as pretty obviously incorrect - not because of provenance, but because it acts as if pointer-to-integer-to-pointer casting is value-preserving, which is not at all guaranteed by the C abstract machine. In particular, `p != q && ((uintptr_t)p == (uintptr_t)q)` can be true according to the C standard. Instead of trying to track provenance, simply considering that the transformation between pointers and integers is not value preserving seems to be a simpler rule.

If we want to reason about optimizations using the C abstract machine, we can't consider these transformations to be value-preserving, even if they happen to be in the real machines that LLVM targets (and if they start reasoning for the real machine instead of the C abstract machine, a lot of other optimizations are invalid).

magicalhippo · on Dec 15, 2020

For me it is the third optimization which is incorrect. I mean, if you replace the pointers with array indexes, then you'd end up with something like this after the second optimization:

    char data[N];
    uintptr_t p = rand_int(N);
    uintptr_t q = rand_int(N);
    data[q] = 0;
    data[p] = 0;
    uintptr_t ip = (p+1);
    uintptr_t iq = q;
    if (iq == ip) {
      data[p+1] = 10;
      print(data[q]);
    }

Surely no sane compiler would replace the "data[q]" in the print statement with 0?

Or did the caffeine not kick in yet?

tsimionescu · on Dec 15, 2020

You're right, and the optimization wouldn't have kicked in in the original example if p and q had been pointers to elements of the same array.

However, if they are pointers to different local variables, the compiler knows something else: it knows that p and q do not alias each other, so writes to p can't change the value pointed by q and vice versa.

Now, in the original code, there seems to be a write to a pointer casted from an integer, which would be allowed to alias any other pointer - so after the (int)ip = 10 write, both p's and q's value would normally have to be assumed to be potentially changed.

However, step 2 in the optimizations has removed this information - by replacing the integer pointer with a write to the p pointer, it has again enabled the compiler to reason that no (valid) writes to q have been performed, so q's vue can't have changed.

Note that the intermediate program after optimization 2 would have UB if written directly in C, since dereferencing one-past-the-end pointers is UB (creating them, comparing them, or casting them to uintptr_t is valid behavior, however).

Instead, if p and q pointed into the same object, such as an array, p+1 could very well be an alias for q, so there would be no way to prove that q's value hasn't changed.

What the article is getting at is that the compiler is relying on the origin of the pointers to allow it to make optimization 3 (it knows p+1 is a pointer to a local variable, q is a pointer to a different local variable, so they can't be aliased) ; but it previously optimized away a code change that is supposed to destroy this provenance information - a pointer-from-integer is allowed to alias any pointer in the whole program (if we ignore restrict/noalias).

magicalhippo · on Dec 16, 2020

Thanks, that's making it more clear.

abra529 · on Dec 15, 2020

Is it not guaranteed? Looking at this question [1] on StackOverflow and the quotes from the standard included in the question, it seems that the value should be preserved.

(It raises the issue of whether (char※)iq should technically be (char※)(void※)iq if this were C code, but the (void※) cast can be inserted without changing anything about the problems discussed in the article.)

[1] https://stackoverflow.com/q/34291377

tsimionescu · on Dec 15, 2020

Yes, you are absolutely right, so my argument flies out the window. Thanks for finding the relevant C spec!

brundolf · on Dec 14, 2020

Before reading this I thought working on compilers might be a fun career direction

jcranmer · on Dec 15, 2020

To be fair, this kind of complication is relatively rare in compiler work.

On the other hand, when it does come up, it does tend to create several hours-long meetings where there are more opinions on what the "obvious" semantics should be than people present. I've always been a stickler for trying to actually specify the obvious semantics to try to forestall these conversations.

ralfj · on Dec 15, 2020

Seeing how long the story of poison/undef is dragging out for LLVM, I'd say it's more of a years-long process than an hours-long one. ;) But still, the most important part is that there's a discussion at all, and the desire to properly solve this problem.

jcranmer · on Dec 15, 2020

Oh, finishing the meetings may take months or years, but the meetings themselves are only a few hours long.

lambdatronics · on Dec 15, 2020

If more of the literature were written like this, I'd be more inclined to it.

dataflow · on Dec 14, 2020

If you enjoyed learning about pointer provenance (called "safely-derived pointer" in C++ [1]), you might find the recent addition of std::launder() [2] interesting.

[1] https://en.cppreference.com/w/cpp/memory/gc/pointer_safety

[2] https://en.cppreference.com/w/cpp/utility/launder

comex · on Dec 14, 2020

Pointer safety is actually a separate concept from provenance. It was added to C++11 with the intent that it be used by garbage-collected implementations, but I'm not sure if anyone has ever properly implemented it. At least, the big three STL implementations have get_pointer_safety() always return pointer_safety::relaxed ("All pointers are considered valid and may be dereferenced or deallocated"), and there has been a proposal to remove pointer safety from future versions of the standard. [1]

[1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p218...

dataflow · on Dec 14, 2020

That seems like a non-sequitur. How do "the intention is GC support" and "the C++ standard may remove concept" imply "these two are not the same concept"?

To my knowledge safely-derived pointers are entirely about the notion that the derivation of a pointer matters, not just its value. Which is why you can't legally subtract pointers belonging to different arrays, for example. Which is precisely what is the same concept that provenance refers too... right? I don't see any differences pointed out in my readings or in your comment.

andrewflnr · on Dec 15, 2020

For purposes of future language design: why not just ban conversion from integers to pointers? Pointers have metadata that integers can't provide, therefore the conversion is impossible, QED. What's the use case for it outside of, say, binary executable loaders?

adwn · on Dec 15, 2020

> What's the use case for it outside of, say, binary executable loaders?

If you're doing low-level optimizations, and you know that your pointed-to objects are, say, 8-byte-aligned, then you can use the lower 3 bits of your pointer for storing metadata. In certain cases, that can be immensely useful. Some GC implementations use this trick, for example.

ralfj · on Dec 15, 2020

One could imagine a specific primitive operation in the language that lets you adjust the low bits of a pointer without casting it to an integer and back.

fjfaase · on Dec 15, 2020

I think that the main problem is that pointers are a too low-level concept (like assembly language compared to high-level programming languages) that are used to implement various more high-level concepts. I have not come across a language that does have pointers (some functional languages lack pointers all together) and has these high-level concepts. There is a great difference between a pointer that points to a complex value and a pointer, that like a database cursor, points to a certain part in a complex value, either with the purpose to query this value or to make some change to the value.

tsimionescu · on Dec 15, 2020

I wonder what the point of the first optimization in this specific case was. Why is LLVM replacing iq with ip in the body? I'm not questioning why it should be allowed to, but I don't understand why it would do it in this particular case.

I also don't understand why that optimization is necessary to surface the problem.

If we simply had a program that was doing

  char p[1], q[1] = {0};
  uintptr_t ip = (uintptr_t)(p+1);
  *(char*)ip = 10;
  print(q[0]);

Wouldn't applying just optimizations 2 and 3 on this code surface the same bug as discussed in the article?

deathanatos · on Dec 14, 2020

I am not 100% convinced on the third transformation,

> The final optimization notices that q is never written to, so we can replace q[0] by its initial value 0:

Can we? q is at a language level, possibly aliased with the write immediately above to (p+1), and we know it's aliased because of the if statement.

Now, that's a C rule, and the article does note that it is only using C syntax to express LLVM. So, I guess, what are LLVM's aliasing rules?

(Indeed, there was originally a write to q, which we replaced with an aliased write to q, so it seems to me that on the whole, the various optimizations are assuming different things about aliasing.)

lmm · on Dec 15, 2020

> Can we? q is at a language level, possibly aliased with the write immediately above to (p+1)

It's not, because writing to a one-past-the-end pointer is UB.

> Indeed, there was originally a write to q, which we replaced with an aliased write to q, so it seems to me that on the whole, the various optimizations are assuming different things about aliasing.

That optimization pass doesn't know anything about aliasing, it just replaced an integer with another integer that's equal to the first.

Ericson2314 · on Dec 15, 2020

I think the best solution is either `iq == ip` must be rewritten to false, or in the body of the branch the p and q regions are concatenated and then the alias analysis doesn't work.

In general I like an approach of starting with the optomizations we want to do, and other sources non-determinism such as arbitrary layout we want to support, and then working backwards to fine the preconditions are program must abide by in order to for the compilation to not go wrong.

Asooka · on Dec 15, 2020

Personally I just compile my code with "-fwrapv -fno-strict-aliasing". It's not standard C, but every compiler supports an analogue to those class and it saves so many headaches from UB-optimising passes, the tiny loss in performance is worth it. Integer overflow and pointer aliasing are way too easy to accidentally hit in a C program, the two should have never been made UB, but perhaps added as a pragma to annotate definitely safe code that can benefit from the optimisation.

ralfj · on Dec 15, 2020

Note that all of the optimizations I used in the main part (not the warm-up) of my post are still performed even with "-fwrapv -fno-strict-aliasing". So this does not avoid the issues I am talking about.

zokier · on Dec 14, 2020

Intuitively I feel that the first optimization is incorrect as it introduces out of bound write. I'll admit that this is very much subjective thing

einpoklum · on Dec 15, 2020

I remember a CppCon talk which covered how poison and UB work in LLVM IR, probably in 2018 or 2019, but I forget which one it was.

raverbashing · on Dec 14, 2020

Great article

It just seems to me there's too much risk in "overoptimizing" especially in a weakly defined language like C with trigger happy optimizers and even more trigger happy "UB means let's go crazy"

I take it as principle that no compilers should "throw their hands up" when detecting UB. Crash the program, don't just take that part out.

gpderetta · on Dec 14, 2020

Compilers don't detect UB (at least as optimizations are concerned), they assume no UB.

If you want to detect UB, at least dynamically, compile with sanitizers.

pierrebai · on Dec 14, 2020

The bug is clearly the writing past the end of an array. That is:

    *(p+1) = 10;

If you start with a buggy program, that it stays buggy after optimizations is not surprising!

Edit: what;s more, the original form of the program also did not write through the q pointer and could have had optimizer the print zero. In the first form, the author assumes the compiler can track the pointer q through q == qi and avoid the optimization that would print zero, but fail to track pointer q through pi + 1 == qi == q.

mskslal · on Dec 14, 2020

No, the original program writes through iq, which is well defined. Only the optimized program writes through p+1.

gens · on Dec 14, 2020

Pointers are "just" integers (unsigned integers of a size, that is). Most languages treat them differently (C assigns them a type, and a stride with it), but that is usually on top of them being integers.

Pointers "point" to (are addresses to) bytes, as he said. If you want you can pack data to a resolution of a bit, and some languages help you do that as well.

You can also do whatever you want with pointers (in some languages, ofc), like do bitwise operations on them. Aliasing them can be useful as well, in more then one way.

Why not just get rid of pointers ? You know you can.

Also, a byte is 0 .. 255. The hardest problem; naming things, and off by one errors.

steveklabnik · on Dec 14, 2020

They are not always integers. See this comment from the previous discussions on the other posts, for example: https://news.ycombinator.com/item?id=17607595

badsectoracula · on Dec 14, 2020

> What happens when 'a' is allocated to a CPU register?

When the & operator is used on 'a' it should mark it as unsafe to place the 'a' on a CPU register - CPU allocation is an optimization and optimizations should never affect how the program behaves (except making it run faster, of course) and as such they should only be applied when the compiler can be sure that they're safe to do so.

(and yes, the same applies on using & on something like an array or struct element - the use of & should taint any variable)

MaxBarraclough · on Dec 14, 2020

> When the & operator is used on 'a' it should mark it as unsafe to place the 'a' on a CPU register

Not necessarily, as the compiler might be smart enough to adjust the other operations and enregister the a variable anyway. There's no ceiling on how smart the optimiser can be. An extreme case:

    int a = 42;
    &a;
    // do things with 'a'

This uses the & operator but doesn't even save the result of the expression, so the compiler will presumably chop that whole second statement as dead code, and may still be able to enregister a.

> optimizations should never affect how the program behaves (except making it run faster, of course)

Most of the time a C++ compiler's optimiser must preserve observable behaviour, provided undefined behaviour is not invoked, but not always. C++ permits elision of certain copy/move operations even if this changes observable behaviour. [0]

Also, if the program is multithreaded and has race conditions, an optimiser isn't required to ensure that the relative speeds of different threads remains unchanged, which may lead to a change in observed behaviour. Of course, in such a case, except making it run faster contradicts never affect how the program behaves, so you've sort of covered that anyway.

[0] https://stackoverflow.com/a/12953129/

badsectoracula · on Dec 15, 2020

Sure, if a compiler can do that with 100% certainty, it can use a register for it. However that '100%' comes after it treats &a as a pointer until the very moment it is '100%' sure.

Also FWIW i was referring to C, not C++. C pointer optimizations can easily chop off your leg, C++ basically adds poison just in case you survive the blood loss.

steveklabnik · on Dec 14, 2020

That's addressed by the

> If 'b' never escapes and no (undefined) pointer arithmetic is used

part.

gens · on Dec 14, 2020

If 'b' is actually used then the compiler has to put 'a' in memory. Theoretically. He says that b is not changed, and does not "escape" (the scope of the code in question). In that case b is useless, and thus it doesn't really even exist. A compiler can "collapse" a lot of code into a fraction of it. That's a part of the "optimizing" part of "optimizing compiler". A compiler can put a value in a register. A compiler can do whatever it wants as long as the results are always the same as if it didn't do anything other then just translate to machine code.

Pointer aliasing in C can be bad for the compiler as it can't be sure that it can optimize away something. That's why Fortran is faster, and why the restrict keyword was added to the C standard.

And yea, i agree with him completely that a context should be made clear. Personally i'd like less voodoo in programming topics, but, on the other hand, this is still a young science and a lot of terminology is all around the place. ..Actually, scratch that, i found what i was interested in and don't really care about everything else.

klodolph · on Dec 14, 2020

> If 'b' is actually used then the compiler has to put 'a' in memory.

Theoretically, no—the “as if” rule applies. In practice, also no, because real compilers perform escape analysis and there are aggressive optimization passes for eliminating dead loads and dead stores.

Just like the compiler can optimize out b and replace it with a reference to a, the compiler can optimize out the value stored in a but keep the reference to it stored in b.

> In that case b is useless, and thus it doesn't really even exist.

I don’t think this notion of “exist” has legs.

gens · on Dec 14, 2020

That is what i said, that it can be optimized away.

"exist" is the wrong word, "is useless" would be better. It is like.. you have a spanner in a box labeled "a" and you have a box "b" that has a piece of paper that says "look in a". You ask for a spanner and you get told to look in box "a". The box labeled "b" is then completely useless. One day some guy, weirdly named "compiler", decides to trow away box "b", and nobody cares.

But we have to assume that a compiler will compile the program just as we wrote it. Otherwise we are not talking about the language, but about the compiler. This is gone off topic, if there even was one.

klodolph · on Dec 14, 2020

A variable is useless because it is optimized away? I don’t agree with that terminology.

The variable is useful / exists in the source code, and that is enough. Variables / objects only exist conceptually in the original program anyways, we just infer their existence in the compiled program by correspondence with the original code, or educated guesses about the original code.