It doesn't actively push you there, which is why it's on the bottom half. There is a fairly coherent mental model for C and once you have it, the language rarely surprises you.
But when it does surprise you, it's almost never anything good. Stuff like Duff's device, `3[someArray] = value`, etc. The surprises are always the language's raw machinery showing through in unpleasant ways, and never a delightful bonus feature the designers added for you.
> The surprises are always the language's raw machinery showing through in unpleasant ways, and never a delightful bonus feature the designers added for you.
I was reading "Writing Solid Code" by Steve Maguire (though it should really be called "how to write code in C without shooting yourself in the foot"). One thing that surprised me, but which made sense, was that pointers can overflow. It's unlikely, and ANSI non-compliant, but something possible nonetheless. Hence, Maguire said that this code:
Has a bug. It surprised me when reading it, because it's such a common language idiom.
"What range of memory would memchr search when pv points to the last 72 bytes of memory and size is also 72? If you said 'all of memory, over and over,' you're right. Those versions of memchr go into an infinite loop because they use a risky language idiom—and Risk wins."
So, he said that that code should be replaced with this code:
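(Again a from-memory sketch of the shape of the fix rather than the book's exact text; the idea is to count size down instead of forming an end pointer, so nothing can wrap:)

    void *memchr(void *pv, unsigned char ch, size_t size)
    {
        unsigned char *pch = (unsigned char *)pv;

        while (size-- > 0)
        {
            if (*pch == ch)
                return pch;
            pch++;
        }

        return NULL;
    }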
Gone through a whole range of emotions looking at this. Tried to put together an argument about the high bits of (common) address spaces being zero and therefore it's safe but I don't think that works. It's the `while (i++ < UINT32_MAX)` bug in different clothing. Would make for a cruel take on the interview question of "tell me what bugs you see in this function".
>Tried to put together an argument about the high bits of (common) address spaces being zero and therefore it's safe but I don't think that works.
Yep; on AMD64, bits 48 through 63 must be identical to bit 47, which can be 1 or 0, akin to sign extension. (A quick way to check that is sketched below, after the loop example.)
In practice, I don't think any sane OS would let you reserve the very last n bytes of memory, especially not with an address space as large as that of AMD64, but you can't assume the architecture, and you don't always have an operating system.
And yeah, you could see the same bug with integer array indices, if signed integers wrap.
/* Endless loop if end == INT_MAX */
for (int i = 0; i < end; ++i)
/* code */;
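Tangentially, the canonical-form rule is easy to check in code. A quick sketch of my own (not from any reference), assuming 48-bit virtual addresses (no LA57); the arithmetic right shift on a negative value is implementation-defined, but does the expected sign extension on mainstream compilers:

    #include <stdbool.h>
    #include <stdint.h>

    /* A canonical AMD64 address has bits 48..63 equal to bit 47. */
    static bool is_canonical(uint64_t addr)
    {
        /* Move bit 47 up to the sign bit, sign-extend it back down,
           and check that this reproduces the original address. */
        uint64_t sign_extended = (uint64_t)((int64_t)(addr << 16) >> 16);
        return sign_extended == addr;
    }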
Linux kernel space on x64 uses "negative" pointer values so the high bits are set there. Which is probably the more interesting place to find this bug.
Needs to be unsigned to get this failure mode, any signed loop gets compiled assuming no overflow for a different (though similar!) failure mode.
I'm leaning towards "C is surprising". Didn't have to be but as presently implemented is very full of hazards.
> Needs to be unsigned to get this failure mode, any signed loop gets compiled assuming no overflow for a different (though similar!) failure mode.
In C and C++, compilers will optimize assuming that signed integer overflow doesn't happen, but that doesn't stop it from actually happening. Unless you set it to trap on overflow, signed integers still wrap; it's just that compilers make (incorrect) optimizations assuming that it doesn't happen.
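Concretely, the kind of thing I mean (a minimal example off the top of my head, not from any book): with optimizations on, GCC and Clang will typically fold the comparison below to a constant 1, even though at -O0 the add really does wrap on ordinary two's-complement hardware.

    #include <limits.h>
    #include <stdio.h>

    static int plus_one_is_bigger(int x)
    {
        /* Signed overflow is UB, so the optimizer may assume x + 1 never wraps
           and treat this comparison as always true. */
        return x + 1 > x;
    }

    int main(void)
    {
        /* Typically prints 1 when optimized, 0 at -O0 (INT_MIN > INT_MAX is false). */
        printf("%d\n", plus_one_is_bigger(INT_MAX));
        return 0;
    }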
You know, this makes me wonder whether it's better pointers to be compared with signed or unsigned comparisons. Currently, my compiler emits unsigned comparison instructions for them.
Do you mean unsigned branches? JB/JBE/JA/JAE instead of JL/JLE/JG/JGE? Are there actually code patterns where an ordered comparison is preferable to just JE/JNE? AFAIK, computing a pointer outside of the underlying array's boundaries (except for computing the one-past-the-last-element pointer) is UB, so e.g.
for (SomeStruct * curr = p, * end = &p[N]; curr < end; curr += 2) {
    // processing the pair of curr[0] and curr[1], with care
    // in case curr[1] doesn't exist
}
is an invalid optimization of
for (SomeStruct * curr = p, * end = &p[N]; curr != end; ) {
    // processing the pair of curr[0] and curr[1], with care
    // in case curr[1] doesn't exist
    if (++curr != end) { ++curr; }
}
> Do you mean unsigned branches? JB/JBE/JA/JAE instead of JL/JLE/JG/JGE?
Yeah, that's what I meant. Thanks for putting it more precisely.
> Are there actually code patterns where an ordered comparison is preferable to just JE/JNE?
That's a good point, and sidesteps the issue of pointer signedness.
I think sometimes JE / JNE isn't enough. For example, if you want to process a buffer in reverse order using pointer arithmetic:
/* p starts off one past the end of the buffer */
char *p = buffer + bufsize;
while (--p >= buffer) {
/* ... */
}
I'm not sure if this would technically be undefined behavior, though, as the C standard only explicitly permits computing a pointer one past the end of the array, and other out-of-bounds computations are undefined, IIRC.
In practice, I don't think any compiler would miscompile this.
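For what it's worth, the usual way to sidestep the question is to do the decrement inside the loop body, so the pointer never has to be computed below buffer. Roughly:

    /* p starts at one past the end and never moves below buffer */
    for (char *p = buffer + bufsize; p != buffer; ) {
        --p;
        /* ... use *p ... */
    }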
Actually, this isn't true. The loop ends as soon as i == INT_MAX. It would only be endless if the loop condition were "i <= end" and end were equal to INT_MAX.
I'm not a C language lawyer, but I'd expect C to have a rule that calculating the one-past pointer will not overflow within an array object. So malloc would not be allowed to return such an allocation and this would be a bug in the caller, not in this function.
Yes, indeed it does. It's mostly ignored by implementations, but technically, e.g. on architectures with a 16-bit address space, 0xFFFF isn't allowed to be part of an object (which makes 0x0000 an obvious choice for NULL).
Note that if `pch + size` overflows (unsigned, wrapping from a large address to a small address), then the while loop will be skipped entirely.
*depending on whether you think the compiler or allocator would ever place an object so as to put you in this position; usually the stack is at the top with the heap underneath it, so stack overflow would be your risk, not address overflow.
You're right. I didn't think it through enough. If pchEnd doesn't overflow, then pch always exits the loop equal to pchEnd (assuming no breaks or returns). If it does overflow, the loop never starts. There is no case in which it goes into an infinite loop.
I (and Steve Maguire) had assumed that if pch were 0xFFFFFFFF - size and pchEnd were 0xFFFFFFFF, then it would run into an infinite loop, but it won't; neither pointer will overflow in that case.
It would only run into an infinite loop if you wrote something like this:
pchEnd = pch + size - 1;
while (pch <= pchEnd)
where pch + size is 0 (due to overflow), thus pch + size - 1 is the maximum possible pointer.
> The surprises are always the language's raw machinery showing through in unpleasant ways, and never a delightful bonus feature the designers added for you.
Not always. It's rare, but e.g. `o[objects].up[objects].t[textures]` was definitely a delightful bonus feature (where `objects` and `textures` are global arrays).
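(For anyone puzzled by that trick, and by the `3[someArray]` upthread: `a[i]` is defined as `*(a + i)`, and the addition commutes, so `i[a]` names the same element. A quick sketch of my own:)

    #include <stdio.h>

    int main(void)
    {
        int someArray[5] = {10, 20, 30, 40, 50};

        3[someArray] = 99;   /* exactly the same object as someArray[3] */
        printf("%d %d\n", someArray[3], 3[someArray]);   /* prints "99 99" */
        return 0;
    }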
It's the "unsurprising" part in "unsurprising & horrifying" - if you manipulate raw pointers incorrectly, of course it will crash or be vulnerable, to the surprise of no one by design.
This rings true the more I think about it. Any large code base gets there with size, but with C one gets there pretty reliably after a certain heft of code.
That being said, while C's surprises are certainly fewer than those of other languages, it still has a few surprising corner cases contrary to its usual "portable assembly" character (without compiler optimization): implicit type promotion in expressions with mixed data types, a potential malloc() inside printf(), the possibility of Duff's device, and the like. Peter van der Linden, a Sun engineer, explores these topics in his book Expert C Programming: Deep C Secrets.
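A minimal example of the mixed-type conversion surprise (mine, not from the book): when a signed int and an unsigned int meet in a comparison, the signed operand is converted to unsigned, so -1 suddenly compares as UINT_MAX.

    #include <stdio.h>

    int main(void)
    {
        unsigned int u = 1;
        int i = -1;

        /* i is converted to unsigned int for the comparison (value UINT_MAX),
           so the "obviously true" condition is false. */
        if (i < u)
            printf("-1 < 1, as expected\n");
        else
            printf("-1 >= 1 ?!\n");   /* this branch is taken */
        return 0;
    }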