But this is LLVM IR, not C. It feels like a trick question (hear me out). Either with IR semantics the code is already UB (as in C), or it isn't. If the code isn't UB, then why? The obvious assumption is that it's because LLVM IR treats this as a model of a concrete machine (where provenance isn't a thing), not as a C-like abstract machine with UB. In which case, the optimization #3 is clearly invalid. OTOH, if it is already UB, then the optimizations are fine, right?
The only way I can see that we can implicate optimization #1 or #2 is by claiming LLVM IR has neither C semantics nor machine semantics, but with some kind of in-between wobbly semantics. There's nothing fundamentally wrong with that, but that's not what a reader would assume in the absence of other information. Especially not when the author writes "this program has two possible behaviors"—this clearly tells us there's no UB and that we should assume this is based on concrete machine semantics (where provenance isn't a thing). So, I would also say, given the info in the post prior to the conclusion, #3 has to be the culprit. I just can't implicate #1 or #2 based on the preceding text; to implicate those you'd have to pull assumptions out of thin air.
> Either with IR semantics the code is already UB (as in C), or it isn't. If the code isn't UB, then why?
With C-like semantics the code isn't UB. Why would it be? Taking a one-past-the-end pointer is legit, comparing it is legit, casting a pointer to an integer is legit.
"Pointer arithmetic is only defined for pointers pointing into array objects or one past the last element. Comparing pointers for equality is defined if both pointers are derived from the same (multidimensional) array object. Thus, if two pointers point to different array objects, then these array objects must be subaggregates of the same multidimensional array object in order to compare them. Otherwise this leads to undefined behavior."
For the purpose of comparisons, C++ lets you partially work around this via std::less et al, which are nevertheless guaranteed to implemented in such a way as to give you a total ordering among all pointers. Even then, I don't think you can freely interchange pointers that compare "equal" in that sense; the same object may have pointers with different representations, all of which compare "equal" for ordering purposes but which don't have an identical representation. (See segmented memory models, for instance.) I also don't recall if C has a similar workaround (it might not).
> Comparing pointers derived from entirely different objects (or arrays) isn't legit.
The quote from the standard in your article contradicts the part you quoted:
> Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.
But in any case the program in the article doesn't compare pointers; it casts the pointers to integers and then compares those integers, which is always legitimate.
They don't contradict to my understanding (see below).
First, regarding comparing pointers vs. integers, that goes back to what I said about identical pointer representations as an integer. Just because two pointers point to the same memory location, it does not mean they have an identical representation as an integer. This might be difficult to understand if you think of a linear address space, but it's easier to understand once you're familiar with segmentation. Pointers derived in different ways can ostensibly have different representations due to nonlinear (e.g., segmented) memory models, and yet nevertheless point to the same memory location. Hence it is actually undefined whether they compare equal. There is no canonicalization procedure that ensures they result in the same integer representation.
One pointer (not sure if pun intended) that might help you see this: consider words like "follows" aren't even necessarily well-defined for nonlinear address spaces—IIRC they're defined not with respect to the physical machine, but respect to struct/object layout in the language, implying there is a larger object/array containing both of these adjacent arrays to begin with. That's what they're talking about in that quote: if you start from a pointer to an array and go to another array that follows it, then it's fine to compare the pointer. To do that, you have to be able to first prove (via the language's abstract machine semantics—not by putting your linear DRAM under a microscope!) one follows the other one, and you cannot do that unless you know they're part of the same aggregate structure.
Anyway, all of this is my understanding of what's going on (and apparently others', as you see on that page). The entire page is a longer discussion on the meaning of these quotes. You can't ignore the rest of the entire page and just quote the first part, since this is exactly what they're dissecting. Read the whole thing, and look at the links they point to. What I quoted is the conclusion they come to based on the preceding quotes (including yours). The standard is not 100% explicit on the issue (hence the discussion and requests for clarification), but many of those who've looked into the issue so far have come to the conclusion I quoted. You need to read it extremely carefully, and (again) if you look at nonlinear address spaces, this should make more sense.
> Just because two pointers point to the same memory location, it does not mean they have an identical representation as an integer.
Yes, I know. That's irrelevant here for two reasons: firstly, the pointers might compare equal (which is all that matters for this argument), and secondly, once again, the comparison isn't being made between pointers, the comparison is being made between integers.
> The entire page is a longer discussion on the meaning of these quotes. You can't ignore the rest of the entire page and just quote the first part, since this is exactly what they're dissecting. Read the whole thing, and look at the links they point to.What I quoted is the conclusion they come to based on the preceding quotes (including yours). The standard is not 100% explicit on the issue (hence the discussion and requests for clarification), but many of those who've looked into the issue so far have come to the conclusion I quoted.
Yeah, no, their conclusion is wrong (and frankly I don't consider those guys a reputable source; they've said plenty of false things before, and they're prone to exaggerating the extent of undefined behaviour because they have a vested interest in portraying it as more common than it is). They claim that making equality comparisons is undefined behaviour but their only basis for that is a quote from the section on relational operators which is obviously only talking about relational operators. Read the actual standard quotations in context and there is no confusion.
> Yes, I know. That's irrelevant here for two reasons: firstly, the pointers might compare equal (which is all that matters for this argument), and secondly, once again, the comparison isn't being made between pointers, the comparison is being made between integers.
You know what, I think you're right, at least on the second point. I guess the code isn't UB after all then. Somehow I kept reading it as if it's comparing pointers.
Regarding the other point about following something in memory:
> I don't consider those guys a reputable source [...] they claim that making equality comparisons is undefined behaviour but their only basis for that is a quote from the section on relational operators which is obviously only talking about relational operators.
There are multiple entities here, in agreement with each other. The author links to bug #61502 in GCC, where you'll see the compiler devs agree with the author of that page: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61502#c15
You might disagree with their stances for whatever reasons you might have, but firing reputational shots at them when there are clearly other experts disagreeing with them isn't really helping make your position more convincing.
> Read the actual standard quotations in context and there is no confusion.
There was no "confusion" on this point to begin with (on my end anyway). Again, as I explained: they might not be intuitive if you're assuming a linear address space, but they're quite fine to interpret. You just need to avoid bringing your own extra assumptions into the table. The notion of "following" in memory hinges on your assumption that the address space is linear. This is pretty similar to how to be able to say your house "follows" another house, you need to first make nontrivial assumptions (e.g., that they're on the same street). To my knowledge C does not make such assumptions globally; all it does is lets you satisfy that criterion by putting them in the same aggregate structure ("street") to begin with, but if you can't do that, you can't satisfy that assumption.
> There are multiple entities here, in agreement with each other. The author links to bug #61502 in GCC, where you'll see the compiler devs agree with the author of that page: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61502#c15
No they don't. The previous comment from that compiler dev explicitly says that this comparison is not undefined behaviour. There is nothing in either the standard or that bug report that supports the claim that you originally posted ("Comparing pointers for equality is defined if both pointers are derived from the same (multidimensional) array object. Thus, if two pointers point to different array objects, then these array objects must be subaggregates of the same multidimensional array object in order to compare them. Otherwise this leads to undefined behavior."). Equality comparisons (not relational comparisons) of pointers to (unrelated) arrays are not and never have been undefined behaviour, and anyone claiming they are is at best confused.
Interesting. So I went back to look at the standard. Like you said, for the relational operators <, <=, >=, and >, they do explicitly say comparing unrelated objects is undefined behavior. But as you suggest, for == and != equality operators, they don't say anything about undefined behavior for comparing pointers to unrelated objects; in fact they do say that unrelated pointers can point to adjacent memory locations if the implementation "chooses" to place unrelated objects in that manner.
However, they also throw in this wrinkle: "The == and != operators are analogous to the relational operators except for their lower precedence."
Does the analogy extend to the UB-ness for different objects? (If not, wouldn't that seem to contradict this sentence?)
On the one hand, it would seem not, since they do say "if and only if" in that clause (which lacks any mention of UB).
On the other hand, I'm not sure how else to interpret that sentence... that sentence would seem superfluous if I just ignored it, since everything else seems to be spelled out already.
So I'm kind of at a loss now. Maybe you're right then, I don't know... at least you seem to be right about the fact that it is confusing. Possibly later standards clarify this? I haven't checked.
"Analogous" generally means similar but not the same; if it's a normative description it's underspecified, whereas the "if and only if" description of the semantics of == and != for pointers is definitely normative. I'm sure the standard could be clearer but I do think there's only one defensible interpretation.
It's underspecified, but it is normative. While it's not mathematically inconsistent for the sentence to be redundant/superfluous, it would be rather silly and misleading. It's reasonable to assume it's there for a reason (i.e. it makes a difference whether it's there or not). Given they spell out pretty much everything else anyway, it would seem this point is where it would make a difference.
Moreover, I have a hard time justifying this on first principles. It seems rather arbitrary to make relational operators UB but make equality operators well-defined for unrelated objects. After all, if you can tell if two pointers to unrelated objects are the same, that means you're willing and able to canonicalize them to a common-denominator representation, in which case you already have an ordering based on that same canonicalization too... why forbid that? It would make sense to prevent comparing pointers derived from unrelated objects (due to future performance & extensibility considerations), and it would certainly also make sense to allow it (more convenient/intuitive/etc.)... but to allow one but not the other just doesn't make sense as far as I can see it. You end up with something that's both unintuitive for the user and restrictive for the implementation. So it's a little tough to swallow that that's really what they intended.
> After all, if you can tell if two pointers to unrelated objects are the same, that means you're willing and able to canonicalize them to a common-denominator representation, in which case you already have an ordering based on that same canonicalization too... why forbid that?
UB wasn't meant to forbid anything, it was meant to allow implementers to offer different behaviour. Implementers who want to represent pointers internally as integers and have comparisons just compare the underlying integers can do that. Implementers who have something more complicated e.g. segmented architectures are required to support equality comparison (which they can do easily by just having pointers to different segments always compare nonequal) but are not required to define an ordering between different segments.
To me it seems like the kind of compromise the standard makes all the time (which is not to say I agree with it). Requiring everyone to define an ordering on all kinds of pointers would be too burdensome for implementers (or maybe some implementers wanted to trap on unrelated pointer comparison, because if you're doing pointer arithmetic with unrelated pointers you've probably got a bug). Whereas making equality comparisons UB would be way too burdensome for users; it would mean you couldn't e.g. have a list of pointers and check whether a given pointer was in the list by searching through its elements.
> UB wasn't meant to forbid anything, it was meant to allow implementers to offer different behaviour.
That's unspecified behavior, not undefined behavior. Unspecified behavior can literally do anything or nothing at all, including aborting the program nondeterministically. That effectively forbids invocation of UB for the programmer since you can't reason about the program after it's invoked, unless you've verified your implementation has actually defined the behavior for you (despite not being required to). That's quite different from unspecified behavior where the implementation is required to pick some sane behavior (often among a set of acceptable behaviors) and stick with it in a self-consistent manner.
> Requiring everyone to define an ordering on all kinds of pointers would be too burdensome for implementers
I don't see how it'd be more burdensome, and I don't think burden is the issue here anyway. Implementers already have to implement canonicalization for equality comparisons, and they already have to implement casting to uintptr_t too. Just put the two together you get comparisons. One reason to not want to do this is that it might be too expensive to canonicalize to an integer under the hood (maybe it's complicated arithmetic or whatever), but if they're already willing and able to jump through all the hoops just for the sake of equality comparisons, ordering is hardly any different.
> (or maybe some implementers wanted to trap on unrelated pointer comparison, because if you're doing pointer arithmetic with unrelated pointers you've probably got a bug).
Not at all. You want to linear-search in a list and want equality to work for that? Well I want to binary search in an array/BST and need comparisons to work for that. This was such an obvious thing to want to do that C++ made it work out-of-the-box for std::less and whatnot.
> That's unspecified behavior, not undefined behavior. Unspecified behavior can literally do anything or nothing at all, including aborting the program nondeterministically. That effectively forbids invocation of UB for the programmer since you can't reason about the program after it's invoked, unless you've verified your implementation has actually defined the behavior for you (despite not being required to). That's quite different from unspecified behavior where the implementation is required to pick some sane behavior (often among a set of acceptable behaviors) and stick with it in a self-consistent manner.
It's very hard to make trapping an acceptable implementation for unspecified behaviour while allowing the implementation to do the usual kind of reordering, so the standard doesn't try. That's the original intention behind e.g. null pointer dereference being undefined behaviour - implementations should be permitted to make null pointer dereference trap, but should also be permitted to reorder or optimize out pointer dereferences.
> Implementers already have to implement canonicalization for equality comparisons
No they don't - they can just make comparison return false for pointers of different types or from different segments. Canonicalisation is not the only way to do equality-comparison!
> and they already have to implement casting to uintptr_t too
But the result of that isn't required to have the same comparison semantics as pointers. E.g. if some of your pointers are aligned and you internally represent those without trailing zeroes (like the JVM does) and use that as the integer cast, then some different types of pointer end up casting to the same int, which is fine. But the standard requires those pointers to not compare equal as pointers, so you can't implement pointer comparison like that. (Well, you can implement >= and <= like that, because those comparisons are undefined behaviour. But you'll have cases where a <= b and b <= a but a != b).
> You want to linear-search in a list and want equality to work for that? Well I want to binary search in an array/BST and need comparisons to work for that.
Look, I'm not saying I agree with the standards committee here, I'm saying that it's a plausible compromise position for them to have taken. Not being able to do equality comparisons would be way more limiting for users than not being able to do relational comparisons. Having to implement relational comparisons would have been a bit more work for implementers than only having to implement equality comparisons.
The only way I can see that we can implicate optimization #1 or #2 is by claiming LLVM IR has neither C semantics nor machine semantics, but with some kind of in-between wobbly semantics. There's nothing fundamentally wrong with that, but that's not what a reader would assume in the absence of other information. Especially not when the author writes "this program has two possible behaviors"—this clearly tells us there's no UB and that we should assume this is based on concrete machine semantics (where provenance isn't a thing). So, I would also say, given the info in the post prior to the conclusion, #3 has to be the culprit. I just can't implicate #1 or #2 based on the preceding text; to implicate those you'd have to pull assumptions out of thin air.