"memory access violations" are reading beyond allocated storage for a variable. You're not allowed to dereference or even reference locations outside storage allocated for variables, except one past the last item of an array which can be referenced in a comparison (but cannot be dereferenced). Stack variables, memory allocated by malloc and memory allocated by mmap all conceptually behave this way.
... which is why trying the access is undefined behaviour, which is why there are no constraints on what the compiler has to do for that case. The standard says nothing about what the compiler has to do in the case that you try to dereference a pointer that points outside an object, therefore it may assume that you don't do so, therefore it may combine the four byte loads into one 32-bit load.
If you're worried about the actual C standard then you hit undefined behavior the moment you dereference a uint32_t* pointing to data that's actually char.
Actually, this is false. The language specifically allows aliasing char (both signed and unsigned) with other types as an exception to type-based aliasing.
Only in one direction. You can use char to access data that's really uint32_t, but you can't use uint32_t to access data that's really char. ("Really" here is what the C standard calls "effective type"; for normal variables it's equivalent to the declared type, but for malloc'ed memory it has a weird definition where you can change the type by writing to it.)
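A small sketch of the permitted direction, with the forbidden one left as a comment. The helper name `count_nonzero_bytes` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Allowed: look at the bytes of a real uint32_t through unsigned char. */
size_t count_nonzero_bytes(const uint32_t *v)
{
    const unsigned char *bytes = (const unsigned char *)v;
    size_t count = 0;

    for (size_t i = 0; i < sizeof *v; ++i)
        if (bytes[i] != 0)
            ++count;

    return count;
}

/* Not allowed (the other direction):
 *
 *     char buf[4] = {0};
 *     uint32_t x = *(uint32_t *)buf;   // effective type of buf is char
 *
 * The cast compiles, but the access violates the aliasing rules.
 */
```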
Which doesn't prevent the compiler from doing a 32 bit load, does it? Dereferencing a pointer to a location outside any object is also undefined, therefore, if there are no conditionals between subsequent byte reads, the compiler should be perfectly fine with combining them into a 32 bit load!?
The issue is that once you hit undefined behavior, the optimizer is allowed to do whatever it wants - the proverbial launch nukes, crash, raid your fridge, etc.
More likely, it could just not execute that instruction or any instruction that depends on it, because that's the fastest way to preserve undefined behavior.
> The issue is that once you hit undefined behavior, the optimizer is allowed to do whatever it wants - the proverbial launch nukes, crash, raid your fridge, etc.
That's not quite how it works, or at least not what usually happens. Optimizers don't look for undefined behaviour to then do whatever, they look for the set of defined behaviours and then try to generate the cheapest code that behaves correctly in all those cases--and if that means that undefined cases do weird stuff, that's an acceptable side effect.
In particular, that means the mere potential for undefined behaviour at runtime does not make the whole program undefined. If you ask the user to input a number and then use that number as an index into some array without any input validation, the user could enter an out-of-range index, in which case the behaviour of the program would be undefined. But if the user inputs a valid index, the program must behave exactly as the standard defines.
Therefore, yes, if any of the bytes are not inside any valid C object, then that would produce undefined behaviour. Which is precisely why the compiler would be allowed to combine the loads into one 32-bit load: since the code, if it takes that path, is going to load all four bytes, and accessing a byte outside any object would result in undefined behaviour, the compiler may assume that all accesses are within valid objects, and therefore loading them all at once is correctness-preserving.
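For reference, the kind of code under discussion looks roughly like the sketch below (`load_le32` is a name I made up, and little-endian byte order is an assumption). All four reads happen unconditionally, so a compiler is free to merge them into one 32-bit load, though nothing obliges it to.

```c
#include <stdint.h>

/* Reads four consecutive bytes and assembles them in little-endian order.
   Every byte is read unconditionally, so if any of p[0..3] were outside
   an object the behaviour would already be undefined. */
uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Because the accesses go through `unsigned char`, this version makes no alignment or aliasing promises to the compiler.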
> Optimizers don't look for undefined behaviour to then do whatever, they look for the set of defined behaviours and then try to generate the cheapest code that behaves correctly in all those cases--and if that means that undefined cases do weird stuff, that's an acceptable side effect.
> if any of the bytes are not inside any valid C object, then that would produce undefined behaviour
The compiler might also assume that every one of these loads will be exactly 4 bytes apart. Even worse, it might know that int32_t allocations will always be aligned, so it will assume that every load will be 4-byte-aligned. It then might output code that loses track of the bottom two bits, or SSE instructions that only work with certain alignments.
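If you want to see whether a given pointer would even satisfy that alignment assumption, a rough check looks like this. `is_aligned_u32` is a hypothetical helper, and it assumes that converting a pointer to `uintptr_t` preserves the low address bits, which is implementation-defined but true on mainstream platforms.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if p meets the alignment requirement of uint32_t.
   Assumption: the integer value of the pointer reflects its address. */
bool is_aligned_u32(const void *p)
{
    return ((uintptr_t)p % _Alignof(uint32_t)) == 0;
}
```

Passing the check only addresses alignment. It does nothing about the effective-type problem.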
It's true that the buffer overrun issue is the same for either version; I misread what was being replaced. But the alignment issue is specific to this version, and can't be ignored.
Yes, that is all true for the suggested replacement code, but you quoted the statement "Yes, I know it could be unaligned, but this is x86 where it would still be faster than 4 separate reads.", which referred only to what the compiler should do, and the compiler combining char reads into 32 bit loads obviously is not allowed to assume that the resulting loads are aligned ...
The sentence I quoted refers just to the technique of using 32-bit loads. That applies equally to your compiler suggestion and the C code you wrote. So I think it's a perfectly fine sentence to quote when pointing out a flaw of the C code, which is that while unaligned accesses are fine in a vacuum, doing them through a wrong-typed pointer is unsafe.
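The usual way to get a 32-bit load without the wrong-typed pointer is to go through memcpy. This is a sketch, with `load_u32_unaligned` as an invented name. It reads in host byte order and works at any alignment.

```c
#include <stdint.h>
#include <string.h>

/* Copies four bytes from p into a real uint32_t object.
   No alignment requirement on p and no aliasing violation,
   because the only uint32_t access is to the local v. */
uint32_t load_u32_unaligned(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Optimizing compilers targeting x86 commonly turn a fixed-size memcpy like this into a single load instruction, but that is an optimization rather than a guarantee.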