A bug story: data alignment on x86

ogoffart · on Nov 7, 2016

Again someone who relies on undefined behavior. Casting pointer of wrong alignement is not a platform specific behavior, it's an undefined behavior. Relying on it is an error.

The author did not know "What Every C Programmer Should Know About Undefined Behavior": http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

Another good link about that: http://blog.regehr.org/archives/213

loup-vaillant · on Nov 7, 2016

Again someone blaming the victim. I'm kind of sick of this.

While true, your comment doesn't get at the root of the problem. The obvious fact here is that there is a mismatch between the C standard and how the users really use it. The less obvious fact/opinion/fallacy, is that it is not automatically the user's fault.

The standard could be wrong.

Sure, there are reasons why such and such behaviour ended up undefined. Those reason are sometimes weak however: some behaviour ended up undefined on all platforms because some of them couldn't handle it reasonably. The alignment bug here is such an example.

To this day I don't understand why undefined behaviour wasn't specified on a platform-by-platform basis. We already have implementation defined behaviour, after all. I guess this is because it lets compiler writers unify their front-ends and optimizers, but as a result, we cannot use our platforms to their fullest potential.

bertit · on Nov 7, 2016

Well, it is blaming someone who does not know how to use its tools. If I had writen a blog post aboug how I spent one day trying to fix a leak and then discovered that I always had to call free() after malloc(); would it still be victim blaming than blame me for not knowing the basic of C?

> The alignment bug here is such an example.

It is not since it allows vectorisation. The error was to assume that what one writes in C translates directly to assembly.

benchaney · on Nov 8, 2016

You are missing the point. The standard could be wrong, but it is still the standard. You still have an obligation to conform to it if you are writing C code. It is a prescriptive document, not a descriptive one. It can only be wrong to the extent that it can make poor decisions. Regardless of what it says it is authoritative.

loup-vaillant · on Nov 8, 2016

Did I gave the impression we don't need to conform to that standard? That wasn't my intention. I just wanted to point out the standard is sometimes stupid.

Of course we have to conform to the standard, however crazy. The only alternative is forking the language itself.

saynsedit · on Nov 7, 2016

Undefined behavior is one of the most intelligent things the C designers did when designing the language.

It's a fact of life that not all syntactic forms will have meaning. What's sqrt(-1)? Trick question, There's no meaningful answer (in terms of real numbers alone)! Why should anyone specify if it crashes the program, returns 0, throws an exception, etc.? Who cares? garbage in, garbage out.

Another example, "the floor had a pretty day with his melted spaceship" that is a grammatically well formed sentence but what does it mean? Don't answer that!

loup-vaillant · on Nov 8, 2016

> Undefined behavior is one of the most intelligent things the C designers did when designing the language.

I agree, actually. They just went too far. Too many things are undefined for no good reason. Even sqrt(-1) is debatable, by the way: if your platform provides an efficient way to trap, it should probably trap, and the compiler should not assume it will never happen.

And if you want crazy optimizations, consider introducing unsafe assertions into the language. That is, arbitrary boolean expressions the compiler is allowed to assume will always return true.

saynsedit · on Nov 10, 2016

I don't necessarily disagree that they went too far though I assume they probably had good reason at the time in most cases.

I'd prefer it if the decision to trap or not were an option to the compiler.

It would be nice if there were a GUARANTEE() macro so that the programmer could specify conditions that would never happen even in a production build like: GUARANTEE(n >= 0). Also if trapping was enabled, it would trap at runtime. This is a nice post about that idea http://blog.regehr.org/archives/1096

acqq · on Nov 7, 2016

> The alignment bug here is such an example.

I would even claim that the language on all platforms definitely should be able to specify the "inconvenient" alignments, but that the platforms which can produce the fast code should. If you actually need some suboptimal alignment, you'll need it no matter if the '"language lawyers" cry now. I'd agree, the standard is potentially wrong for not having the possibility. It should be visually obvious that it's not optimal, but it shouldn't be too ugly to use it.

It's better to be able to declare something uniformly than to have pages of #ifdefs or to write a lot of ugly code.

This article and the comments here are good example. The article doesn't end there, it hopefully ends here in the comments, as qb45 claims that there is actually a declarative way for gcc, which at least on one version (the one which he tried) did produce the correct code:

https://news.ycombinator.com/item?id=12890429

But also see the other comments where some question if it realy works across the gcc versions.

MaulingMonkey · on Nov 7, 2016

> Again someone blaming the victim. I'm kind of sick of this.

C and C++ blame the victim - as do, in practice, compilers for them - by labeling what they've done "undefined behavior" and telling you to do better. I'm kind of sick of C and C++ too. Plenty better than I have attempted to argue the case for optimizers to hold back and behave saner, to stop their abuse. They have failed. I've done plenty of arguing against choosing C or C++ for projects that do not need it. I have failed. I can educate on undefined behavior, that victims may attempt to avoid and combat their abuse. Maybe someday they will break the cycle, and I will have had a hand in enabling them.

While perhaps ogoffart does not perfectly outline that maybe the language is at fault for some UB, ogoffart's links make the case. TIL about yet another source of undefined behavior:

>>> An unmatched ‘ or ” character is encountered on a logical source line during tokenization.

>> With all due respect to the C standard committee, this is just lazy.

My understanding is that this:

  int main() { return !"

Could launch NetHack. Disgusting.

glandium · on Nov 7, 2016

> by labeling what they've done "undefined behavior" and telling you to do better.

If only they did that. But no, they instead label what they've done "undefined behavior", and silently perform optimizations that may or may not break their program entirely. Double emphasis on silently.

oldmanjay · on Nov 7, 2016

There are no victims in engineering. Such phraseology should be saved for disciplines where emotion trumps reason, like politics.

loup-vaillant · on Nov 8, 2016

I disagree.

Engineering is full of needlessly awkward tools that end up being misused because of their useless warts. Every time that happens, the engineer that uses it is kind of a victim. And of course, there are the end users, who end up irradiated, spied upon, or robbed because of a technical failure allowed by needlessly unsafe tools.

(Big emphasis on "needlessly". Sometimes, the requirements are so stringent that only the unsafe tools do the job. Embedded environments, AAA games, or video encoders come to mind. Most of the time though, safer, less efficient tools are more than enough.)

Sagiri · on Nov 7, 2016

If a particular compiler specified that casting pointers of wrong alignments causes a segfault, it'd be perfectly acceptable to rely on that behavior. The standard would consider it UB, but that compiler has defined that behavior sufficiently.

Note, though, that a compiler simply doing a particular thing now isn't good enough to specify it in the sense that I mean. The compiler writers would have to explain (in a blog post or the like) the behavior and that they plan to keep that behavior in all future versions.

Someone · on Nov 7, 2016

Technically, if a compiler specifies how it handles undefined behavior (as opposed to implementation-defined behavior, which it must define), it stops being a C compiler :-)

I also think it is fairly unlikely that any C compiler will say anything about how it handles undefined behaviour because it would mean it has to generate awfully inefficient code. For example, a compiler could not optimize away most pointer dereferencing code if it promised that dereferencing odd addressses segfaults.

Yes, such checks might add at most a few percent to a normal program's running time, but add in all the other corner cases (int overflow, boundary checks, etc.) that also dat a few percent, amd before you know it your program runs at half the speed it could run at. If you find that acceptable, you shouldn't be writing C in the 21st century.

to3m · on Nov 7, 2016

The standard explicitly permits documented behaviour, even though it doesn't have to, because "undefined" covers that option, along with anything else you care to imagine: http://port70.net/~nsz/c/c11/n1570.html#3.4.3

MaulingMonkey · on Nov 7, 2016

> If a particular compiler specified that casting pointers of wrong alignments causes a segfault, it'd be perfectly acceptable to rely on that behavior.

This is a great way to make your programs "fun" to port to new platforms with new compilers in terrifyingly subtle ways. I prefer not to recommend this approach to solving specific cases of undefined behavior, although if you happen to disable strict aliasing (with e.g. -fno-strict-aliasing) as an additional layer of defensive paranoia, I'm not necessarily against that.

Someone · on Nov 7, 2016

You can also defensively add a quick test to your program's startup code and unit tests. Startup will take a tiny bit longer, but those porting your code will be thankful if they hit the problem, double so if you manage to emit a useful diagnostic.

bertit · on Nov 7, 2016

That would not work. Since the test may not cause any problems. While the compiler might find more optimzation opportuinities in the real program. (Just as the begining of the function did not have problem, but the for loop had.)

Just don't use undefined behavior.

Someone · on Nov 8, 2016

1. You write a few versions of a program.

2. Users report that it started crashing sometimes in version V.

3. After lots of debugging, you discover an input that reliably crashes your program after 30 minutes.

4. A bit later, you discover that your compiler started compiling function f so that it no longer works with unaligned data/buffers of exactly 8 bytes/whatever.

5. At the start of main, you add a dummy call to f with data that reliably crashes if your compiler decides to do that optimization again.

6. The program has become worse: it now always crashes, independent of its input, but you don't have to wait 30 minutes before finding out. That makes it way less likely that you ship a binary again that has the problem. It also makes it easier to tweak source code/compiler flags/whatever until the problem disappears.

Is that perfect? Absolutely not, it is more something of last resort, but depending on the costs of crashing versus those of sometimes crashing half-way through a run, it can be an improvement.

(this technique also can be used when your code hits compiler bugs)

Sagiri · on Nov 7, 2016

Well, yeah. If you have any reason to suspect that you will need to run your code on other platforms or compile with other compilers, don't do compiler-specific things. Just do it cross-platform the first time.

Hell, even if you don't think you'll ever need to do that, you should still avoid doing things that are platform or compiler specific.

GlitchMr · on Nov 7, 2016

The thing is, many compilers just assume that undefined behaviour won't happen, without defining any particular behaviour. And testing for something like pointer alignment on architectures that silently allow pointer misalignment is really, really expensive.

That said, you can use -fsanitize=undefined to verify correctness of a program (as far specification is concerned). Just be prepared for it being a bit slow.

Jweb_Guru · on Nov 7, 2016

It's not possible to determine all possible undefined behavior in a C program. -fsanitize=undefined is best-effort.

scott_s · on Nov 7, 2016

I believe you are confusing undefined behavior and implementation defined behavior. Undefined behavior is illegal under all compilers, and all bets are off if you do it. Implementation defined behavior is always legal, but different compilers are allowed to do different things.

Sagiri · on Nov 8, 2016

I'm not.

Undefined behavior is 'anything goes'. An implementation can choose a particular behavior that you can rely on for a particular case of UB, because, if the only rule is that 'anything goes', it doesn't violate that rule.

I'll admit that compilers don't generally do that - because specifying it could lead to fewer optimizations. But I did say "if", and there's no reason they couldn't do so in principle.

One could imagine a compiler with an extremely strict debug mode that traps on a number of situations that the standard deems undefined behavior via a segfault, in order to help people avoid relying on UB. Again - saying something like "casting misaligned pointers causes a segfault on [system]" would in no way violate a standard that says "casting misaligned pointers can do anything", because segfaulting falls under the umbrella of anything.

I think you're misinterpreting the fact that the results of undefined behavior can be ignored by a compiler for a requirement that it must be ignored by a compiler.

startling · on Nov 7, 2016

Undefined behavior is not illegal. The compiler can do anything with undefined behavior, including exactly what the author expected.

scott_s · on Nov 8, 2016

It is illegal, for any reasonable definition of illegal. See my comment from earlier in the year: https://news.ycombinator.com/item?id=10840497

startling · on Nov 8, 2016

"Illegal" seems stronger to me than "not strictly conforming", I guess (and so does "well-formed", for that matter). But I think we basically agree.

konstmonst · on Nov 7, 2016

"If a particular compiler specified that casting pointers of wrong alignments causes a segfault" Sure it does. It says using undefined behaviour can cause daemons fly out of your nose, among other things (including but not limited to segfaulting).

Sagiri · on Nov 7, 2016

If the semantics of casting pointers of wrong alignments were defined to be demons flying out of your nose, it would be perfectly acceptable to rely on that behavior.

junke · on Nov 7, 2016

If.

(famous reply from Sparta, Laconia, to Philip II of Macedon)

qb45 · on Nov 7, 2016

The correct solution for GCC is specifying 1-byte alignment for this particular array:

  #include <stdlib.h>
  #include <stdint.h>

  typedef uint32_t __attribute__((__aligned__(1))) uint32_t_unaligned;

  uint64_t sum (const uint32_t_unaligned * p, size_t nwords)
  {
      uint64_t res = 0;
      size_t i;
      for (i = 0; i < nwords; i++) res += p [i];
      return res;
  }

Probably works on clang too and IIRC the MS compiler provides similar functionality with different syntax. AFAIK there is no portable solution.

And I'm not sure how exactly this code will fail on architectures which don't support unaligned uint32_t.

stephencanon · on Nov 7, 2016

This is documented as not working[1] in all but the most recent GCC versions. E.g. gcc-5.4 documents:

> "The aligned attribute can only increase the alignment; but you can decrease it by specifying packed as well. See below."

but gcc-6.2 documentation adds:

> "When used as part of a typedef, the aligned attribute can both increase and decrease alignment, and specifying the packed attribute generates a warning."

FWIW, Clang has supported reducing alignment in this fashion for a few years now.

[1] it's possible (likely, even) that it has been working for a while but the documentation was only recently brought up to date; I haven't investigated too carefully.

qb45 · on Nov 7, 2016

Must have been supported for a while because this stuff is even used in Linux:

  #define __packed2__     __attribute__((packed, aligned(2)))
  
  /*
   * SystemV FS comes in two variants:
   * sysv2: System V Release 2 (e.g. Microport), structure elements aligned(2).
   * sysv4: System V Release 4 (e.g. Consensys), structure elements aligned(4).
   */
  
  struct sysv2_super_block {
          __fs16  s_isize;                /* index of first data zone */
          __fs32  s_fsize __packed2__;    /* total number of zones of this fs */

http://lxr.free-electrons.com/source/include/linux/sysv_fs.h...

codys · on Nov 7, 2016

Note the use of `packed`. `packed` causes the alignment to be set as small as possible (1 byte). The aligned(2) that follows then increases the alignment from 1 byte to 2 bytes.

qb45 · on Nov 7, 2016

Yeah, you're right. Add packed then if you want but I swear the code I posted works on 5.4 as it is.

speps · on Nov 7, 2016

In C++11 there is a standard for this : http://en.cppreference.com/w/cpp/language/alignas

qb45 · on Nov 7, 2016

I think it doesn't work in this case because

If the strictest (largest) alignas on a declaration is weaker than the alignment it would have without any alignas specifiers (that is, weaker than its natural alignment or weaker than alignas on another declaration of the same object or type), the program is ill-formed.

Also, using alignas on pointer variable will probably specify alignment of the pointer itself, not the pointed object. I suppose you can create a wrapper class around int with alignas(32) and specify the pointer as pointing to that, but this is going to be nasty given that you can't derive from int and have to write all those operators by hand.

d33 · on Nov 7, 2016

Anybody knows how Rust handles this problem?

alkonaut · on Nov 7, 2016

Not sure if this is what you are asking: Last time I tried alignment in Rust I worked around the lack of explicit alignment support by adding a zero length array of the correct size to the end of the struct. Not sure if alignment support from proper attributes has landed yet.

   [repr(C)]
   struct Something
   {
      pub foo: f32,
      pub _alignment: [EightBytes, 0]
   }

where "EightBytes" is a data type of size 8, to align the whole struct on 8 bytes.

steveklabnik · on Nov 7, 2016

Structs already insert padding to give them alignment:

    struct One {
      foo: u8,
    }
    
    struct Two {
      bar: u16,
    }
    
    struct Three {
      foo: u8,
      bar: u16,
    }
    
    struct Four {
      foo: u16,
      bar: u16,
    }
    
    fn main() {
        assert_eq!(1, std::mem::size_of::<One>());
        assert_eq!(2, std::mem::size_of::<Two>());
        assert_eq!(4, std::mem::size_of::<Three>());
        assert_eq!(4, std::mem::size_of::<Four>());
    }

alkonaut · on Nov 7, 2016

What I needed to do was to place e.g. a N byte struct exactly on an M byte alignment (e.g. 11 byte struct on 32 byte alignment etc).

steveklabnik · on Nov 7, 2016

Ah! Yeah, that's different, and as far as I know your way is the current right way to do it, but I'm not an expert on the subject.

Thiez · on Nov 7, 2016

Well given that Rust leaves struct layout undefined unless #[repr(C)] is specified, std::mem::size_of::<Three> is actually not guaranteed to be 4.

steveklabnik · on Nov 7, 2016

While this is true, it's a weird angle that's not itself super well defined. That is, the definition of repr(packed) implies that the alignment will always be there without it...

wyldfire · on Nov 7, 2016

I doubt you can count on all items being allocated at addresses that are multiples if their size.

It's not optimal but you can always use libc::posix_memalign()

alkonaut · on Nov 7, 2016

The final element of the struct is a zero length array of elements of size N bytes. So that element isn't padding, it has size 0! It's pure hinting. I'm not sure why or how this works under the hood I'm afraid. I used it successfully to call a library with pretty strict alignment requirements (Intel Embree).

Thiez · on Nov 7, 2016

The zero length array still must be aligned correctly. You can create a pointer to it (e.g. by taking &something._alignment[..] to create a slice) and pointers must have correct alignment, so it follows that a zero length array has the same alignment requirements as a longer array. So padding must be inserted in your struct so that the address of the zero-length array is correctly aligned.

alkonaut · on Nov 8, 2016

Hmm, I don't understand (sorry). The pointer to the internal array must be aligned. But how does the element size play into the padding. If we take these two examples?

   [repr(C)]
   struct StructA
   {
      pub foo: f32,
      _alignment: [SixteenBytes, 0]
   }

   [repr(C)]
   struct StructB
   {
      pub foo: f32,
      _alignment: [ThirtytwoBytes, 0]
   }

Are both just padded with 16-4 and 32-4 bytes respectively, so they are equivalent to making a padding like this in the first case?

   [repr(C)]
   struct StructAPadded
   {
      pub foo: f32,
      _padding : TwelveBytes;
   }

hsivonen · on Nov 7, 2016

Types in Rust have alignment as in C, and bad stuff may happen if you use unsafe code to violate alignment.

It seems that at present, if you want to generate an unaligned SSE2 load/store, you have to rely on modern memcpy (spelled copy_nonoverlapping in Rust) optimization. See the first reply to https://internals.rust-lang.org/t/unaligned-simd-sse2-in-par...

saynsedit · on Nov 7, 2016

Nah, correct solution is simply to use memcpy(), works on all compilers, all platforms, all versions, with SSE and with any flags specified:

  #include <stdlib.h>
  #include <stdint.h>

  uint64_t sum (char *p, size_t nwords)
  {
      uint64_t res = 0;
      size_t i;
      for (i = 0; i < nwords; i += 8) {
        uint64_t tmp;
        memcpy(&tmp, &p[i], sizeof(tmp));
        res += tmp;
      }
      return res;
  }

qb45 · on Nov 8, 2016

Nitpick: memcpy is string.h not stdlib.h, the type was uint32_t not uint64_t and you are making some unwarranted assumptions about sizeof(uint64_t), not to mention that the existence of this type is merely implementation defined ;)

Deal breaker: your memcpy invocation requires a sufficiently smart compiler to convert into normal unaligned load on x86 and seems to prevent GCC autovectorization. In this case OP actually didn't want vectorization, but in general it happens that such workarounds confuse compilers and produce worse code.

saynsedit · on Nov 8, 2016

I'm not sure I understand your deal breaker. For the platform he was targeting it produces optimal code, for other platforms it's merely slower (but not specifically slower, since the compiler is likely not a great optimizer across the board).

Vectorization is in general not applicable here since it usually requires aligned memory... not all implementations do, but most. In any case, benchmarking is more appropriate than armchair optimizing.

qb45 · on Nov 8, 2016

You are writing convoluted code and hoping that your compiler will figure it out and convert it internally to the form I posted. Sometimes it does, sometimes it doesn't. In this case it generates reasonable code but doesn't vectorize it for some reason. WTF.

I prefer to just add alignment specification and move on, assuming I don't care about portability. If portability matters, reread my original post ;)

saynsedit · on Nov 9, 2016

It's not convoluted. It's actually clear and well-defined making it easier to reason about.

I'd call compiler specific alignment attributes more arcane, convoluted, and susceptible to future bugs.

Vectorization isn't a panacea. You need to benchmark to be sure, lacking that I expect GCC to be better at optimizing code than you. If you disagree, please manually write a vectorized one that handles non-aligned addition and post your results :)

tropo · on Nov 7, 2016

You also need the may_alias attribute to prevent other problems.

Data written as int may be read as char, but going the other way is usually a standards violation. (an exception being if you had used char to implement a memcpy-like function, but in that case you should expect compiler bugs to bite you)

Thiez · on Nov 7, 2016

Implementing `memcpy` can be a challenge of its own. Gcc has a tendency to replace memcpy implementations with a call to... memcpy. Which is quite smart, just not in that particular context...

[1]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888

qb45 · on Nov 8, 2016

Interesting, but is this a real world problem or just a standard nobody follows? I'm under impression that lots of networking and storage code running in the wild casts back and forth between char arrays and other types without giving it a thought.

And yes, as for other casts, I'm well aware that they cause problems.

MaulingMonkey · on Nov 7, 2016

Specifying alignment might make the crash go away. Like the "disable SSE" solution of the post, it does not make the undefined behavior go away and is not actually a solution.

qb45 · on Nov 7, 2016

If certain compilers provide functionality like __aligned__ or __packed__ it isn't undefined anymore on these particular compilers. I was specific about this being a GCC/clang thing (GCC to be exact, but somebody else confirmed clang).

MaulingMonkey · on Nov 7, 2016

__aligned__/__packed__ allow you to control alignment and packing, useful for e.g. avoiding false sharing in multithreaded code, and potentially useful if you're doing some of the type aliasing allowed by the standard - namely, type punning to/from char.

Unless __aligned__ or __packed__ is documented as also relaxing the strict aliasing rules - if such documentation exists, I haven't found it - there's still undefined behavior from the strict aliasing violation by type punning size_t <-> uint32_t.

Bad alignment is not the only possible "bad" optimization compilers can apply that relies on this being undefined behavior. For example, they may mistakenly assume a size_t value in memory can be cached in a register, and not reloaded from memory when it calls code that uses only uint32_t* pointers, because in a standards compliant program these cannot be used to modify size_t values.

You need __may_alias__ as well.

acqq · on Nov 7, 2016

Have you actually tried and verified that? Reading the docs I've had an impression that it says that the "aligned" is for the alignment inside of the structures, and you here declare the simple type?

qb45 · on Nov 7, 2016

I wouldn't have posted this without checking. And as far as documentation goes, they say this attribute applies to "types" and the example they give is

            struct S { short f[3]; } __attribute__ ((aligned (8)));

where it actually is used outside of a struct so that this whole 6B thing is aligned to 8B instead of the default 2B.

acqq · on Nov 7, 2016

Thanks! Are you able to post how the generated asm of their innermost loop looks like with SSE2 turned on? I mean the equivalent of theirs

    .L13:
         movdqa   (%r8), %xmm2
         ...

qb45 · on Nov 7, 2016

movdqu, obviously. And the code is shorter now, I guess it drops this part responsible for processing the first 1~3 elements of the array to reach 16B alignment.

Sagiri · on Nov 7, 2016

The correct and portable solution is to not use uint32_t, but to use uint8_t, and construct the integers manually.

stephencanon · on Nov 7, 2016

Well, portable except for `uint8_t` not being guaranteed to exist. =)

Thiez · on Nov 7, 2016

Which is a feature! If you use `char` instead of `uint8_t` your program would still compile on a system that doesn't have `uint8_t`, but it is likely to do something entirely unexpected. At least when you use `uint8_t` you are warned at compile-time that your program is broken.

Sagiri · on Nov 7, 2016

It isn't?

At least that would be a failure at compile-time, rather than run-time.

stephencanon · on Nov 7, 2016

`uint8_t` cannot exist on any platform with `CHAR_BIT > 8`. Such platforms are non-existent in the mainstream CPU world, but surprisingly common in the DSP world.

And yes, it's a compile-time failure, which is great. My comment should not be read as criticism at all (though I would likely use `uint16_t`, as the OP says the code is intended to work with [presumably aligned] 16-bit words).

rwmj · on Nov 7, 2016

These SSE instructions that operate only on aligned data are a pain. It's not well known that Linux/x86 stack frames must always be 16 byte aligned. GCC uses this knowledge to use the SSE aligned instructions when accessing certain fields on the stack.

Unfortunately a while back the OCaml compiler generated non-aligned stack frames. Which is no problem for pure OCaml code and even saves a little bit of memory. However if the code called out to C, then sometimes and unpredictably (think different call stacks, ASLR) the C code would crash. That was a horrible bug to track down:

https://caml.inria.fr/mantis/view.php?id=5700#c10779

kps · on Nov 7, 2016

  > It's not well known that Linux/x86 stack frames must always be 16 byte aligned.

Always wasn't always always; that sad story is the source of your OCaml problems, among many others. Linux on x86 originally used 4-byte alignment, and 4-byte alignment is what you see if you RTFM¹. Later, gcc decided that they were in control, and unilaterally switched to 16-byte alignment. Backwards compatibility? Screw you. Other tools? Screw you.²

¹ https://refspecs.linuxfoundation.org/

² https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38496

gpderetta · on Nov 7, 2016

The worst part is that today 16 bytes alignment is no longer necessary as x86 can do unaligned vector load with little to no penalty while keeping the stack aligned all the time still has a cost.

readittwice · on Nov 7, 2016

I had the same problem with my jit, which also generated stack frames not aligned to 16-byte. My test program crashed on an SSE instruction in the Rust standard library (I dont' recall if this bug only occured in release mode, may have been already compiled code). I was pretty proud when I fixed this. Although I have to admit that after finding out that the accessed address was actually valid, I was already supposing that alignment was a problem. Fixing it was then straightforward since it was my own toy compiler.

userbinator · on Nov 7, 2016

Agreed, I've always found them unusual and perhaps a bit of a shortsighted decision --- they've been making processors seamlessly handle any alignment with perhaps an extra cycle, even for the MMX instructions, yet somehow felt the need to restrict much of the SSE ones into aligned and only provide one unaligned move.

The stack alignment restriction is also annoying when handwriting Asm, although fortunately it's only when calling into other C libraries that it needs to be minded.

jjoonathan · on Nov 7, 2016

> seamlessly handle any alignment with perhaps an extra cycle

I'm not up to date on the latest mitigation strategies, but the hairball of cache implications caused by unaligned access make me suspicious of that claim. If you (or your compiler) signal that you want performance by using vector instructions, I think it's completely fair for Intel to demand that you pay attention to alignment.

qb45 · on Nov 8, 2016

Case in point: a simple example of code which copies 1MB of data with SSE and slows down on misalignment:

https://news.ycombinator.com/item?id=12718625

Presumably due to the hairball of cache implications, as you put it.

But it also is true that the choice of aligned/unaligned instructions makes no difference if the array is aligned.

pcwalton · on Nov 7, 2016

> yet somehow felt the need to restrict much of the SSE ones into aligned

Because it wasn't worth spending die space on that as opposed to other things that matter a lot more for performance, presumably.

yuhong · on Nov 7, 2016

When I wrote the code in https://bugzilla.mozilla.org/show_bug.cgi?id=1283585, I had to spend most of the time dealing with alignment.

pjc50 · on Nov 7, 2016

Well, that's pretty horrendous. Note that the naive code which just casts the input to uint16_t would work fine. I can't help but wonder if the solution to this might have been better expressed as naive implementation + platform-specific assembly implementation.

After all, if you have to understand the underlying instructions executed in order to fix the problem, why not stop trying to make the compiler emit the "right" instructions and just write them yourself?

(Language lawyers: is casting a char* to a uint32_t* actually defined behavior? For unaligned data?)

Kubuxu · on Nov 7, 2016

No it is not, the author even cited the rule from C99 himself.

> 6.3.2.3 A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the pointed-to type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer. When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.

The fist sentence is what is in play over here. If you have undefined behaviour in your program anything can happen.

userbinator · on Nov 7, 2016

After all, if you have to understand the underlying instructions executed in order to fix the problem, why not stop trying to make the compiler emit the "right" instructions and just write them yourself?

My thoughts exactly. It's especially true for something like this tiny sum-loop, where amusingly enough the "portable C" and C++ versions are longer than the sequence of Asm instructions itself! All the other typical arguments about maintainability etc. don't apply here either --- IPv4 checksum calculation has been defined and implemented in billions of other devices, and is never going to change.

Although what I think Intel could've done is added an optimised REP ADDSW ;-)

pjmlp · on Nov 7, 2016

People people still try hard to believe ANSI C is 1:1 to Asm, which is only kind of true when using language extensions.

So then better drop down to Assembly, or if there is no need for compiler portability, intrinsics.

shakna · on Nov 7, 2016

Going off a few answers similar to this one [0], I'd say that it isn't defined behaviour. It'll be specific to your compiler, and its optimisations.

The memcpy solution is what I've seen throughout the industry when you are trying to cast between non-trivial types.

[0] http://stackoverflow.com/a/4318446

Dylan16807 · on Nov 7, 2016

> if you have to understand the underlying instructions executed in order to fix the problem

You don't. You can simply write the code without casting pointers. Sticking to char* will give you straightforward, working, annoying-looking code.

C is surprisingly poor-suited for doing input and output with data structures.

GlitchMr · on Nov 7, 2016

Compiler is allowed to assume alignment of pointers (what are you doing is creating a pointer to a value with invalid alignment, hence undefined behaviour (just creating a pointer is undefined behaviour)). The correct solution would be to read values indirectly. For example, a function like that could be used to replace every access to "q" variable.

    static uint32_t read(const char *p, size_t index) {
      uint32_t out;
      memcpy(&out, &p[index * sizeof out], sizeof out);
      return out;
    }

A compiler can recognize this pattern, and continue to use unaligned accesses that would work.

This has a cost of unaligned accesses on non-x86 platforms (a quite big at that), but considering the original code didn't work on these at all, it's an improvement.

amadvance · on Nov 7, 2016

Is this equivalent ? At least to me this seems a more common pattern than using memcpy(), and I expect to work better on old compilers.

  static uint32 read(const void *ptr) {
    const uint8 *b = (const uint8 *)ptr;
    return (b[3] << 24) | (b[2] << 16) | (b[1] << 8) | (b[0]);
  }

qb45 · on Nov 8, 2016

Functionally it seems equivalent, except that it doesn't work on big endian machines (assuming the ints are in native endian).

On dumb compilers it may be faster if a few bitops happen to be faster than a call to memcpy. Which sounds plausible.

On modern compilers it boils down to whether your compiler can optimize either one or both of these patterns into a single unaligned load. GCC for example certainly optimizes 4 byte memcpy, probably already at -O2, but whether it recognizes your pattern I don't know. Compile it and check.

asveikau · on Nov 7, 2016

The article's final solution (sum3) is exactly this. It does a memcpy into a temporary.

slededit · on Nov 7, 2016

At some point you might as well byte the bullet and just write the code in assembly.

krylon · on Nov 7, 2016

> byte the bullet

Is that a clever pun or Freudian slip?

slededit · on Nov 7, 2016

Accidental but I'll keep it.

ambrop7 · on Nov 7, 2016

Note that even if you try to manually correct the pointer to work on aligned data (read any initial bytes via char pointer and read the rest via uint32_t pointer), you still generally have undefined behavior: strict-aliasing violation. And the worst thing here is that whether you do have a violation depends on how other code accesses the same data / how the object is initially declared. E.g., you're fine if the original declaration is char[] or uint32_t[], but not if it's uint16_t[]. Because that would entail access to the same data via both uint16_t and uint32_t, a violation of strict-aliasing.

Actually two out of three inet checksum implementations in lwIP have this bug [1].

And like the problem discovered in the article, this is NOT theoretical. I have personally seen code "miscompiled" due to strict aliasing violations (in that case, packed structures were involved).

I think the only way to do this "manual alignment handling" is to use assembly, either by writing the entire thing in assembly, or using inline asm sections for doing the individual 32-bit memory reads/writes.

Funny story... When I was looking for a fast inet checksum implementation to use for an embedded ARM project, I took the one from RTEMS, which is written in C with much inline asm, and like the lwIP code, it has strict aliasing violations (and also problems compiling correctly with clang). What I did was, compiled it to assembly with gcc once, then included this compiled assembly in the source code. Assuming that this was compiled correctly, I don't need to be afraid of future compiler change breaking it.

[1] http://git.savannah.gnu.org/cgit/lwip.git/tree/src/core/inet...

dmm · on Nov 7, 2016

Strict aliasing can worked around by casting through a union, essentially introducing a new type which can alias any of its component types. Mike Acton has a good description of it with lots of assembly examples here:

http://cellperformance.beyond3d.com/articles/2006/06/underst...

nikanj · on Nov 7, 2016

For what it's worth, people like Linus seem to despise the strict aliasing optimization. https://lkml.org/lkml/2003/2/26/158

lukego · on Nov 7, 2016

Related Snabb experiments with IP checksum in C with automatic vectorization, C with vector intrinsics, and AVX2 assembler: https://github.com/snabbco/snabb/pull/899

novia · on Nov 7, 2016

I'm taking assembly right now, and we're working on our first RISC project after spending all semester working with the x86. Why does RISC crash if the bytes are not aligned?

brassic · on Nov 7, 2016

Early RISC processors read data from memory 32 bits (4 bytes) at a time, and these reads had to be aligned on 4-byte boundaries. This was a feature of the memory architecture, not just the processor.

Thus an aligned read of a 32-bit integer took one memory access; an unaligned read took two, which took twice as long. This killed performance.

Rather than quietly performing badly, the processor threw an exception to encourage you to fix your code.

The ARM2 worked a bit differently. It ignored the bottom two bits of the address when it read a value from memory. When the read was complete the value was rotated by the value of the bottom 2 bits multiplied by 8. This had the effect of putting the byte referenced by the full address in the bottom 8 bits of the 32-bit register. A flag in the instruction let you optionally mask off the top 24 bits to simulate a byte read.

monocasa · on Nov 7, 2016

It costs a non trivial amount of transistors (when compared with an ascetic RISC core) to load words from arbitrarily aligned addresses. 'RISC' used to be about removing those kinds of features, although the term has been a bit bastardized since then).

kabdib · on Nov 7, 2016

Depends on the processor. Some RISC systems won't generate a fault, but others will. Some of the non-faulting ones will load unaligned data correctly (even across page boundaries), while others will helpfully swizzle bytes around in the result for you, or ignore the low bits.

"It depends."

Hello71 · on Nov 7, 2016

You can probably also just pass -fno-strict-aliasing to gcc.

ot · on Nov 7, 2016

If you're willing to use compiler extensions, you can avoid the memcpy by using packed structs. This can generate better code.

Folly has a generic `loadUnaligned()` that uses this trick: https://github.com/facebook/folly/blob/5d52fb8c30e567403b8cc...

phkahler · on Nov 7, 2016

What if you put the array in a struct and made a union of both uint32_t and uint8_t? Would the union with the larger size force the compiler to generate a 4-byte aligned array for the bytes?

I suggest this because it would be portable without any compiler specific stuff.

mzs · on Nov 7, 2016

It's already too late, the data is read from a file so the base of the array can be say at address 0xYYYYY2.

koverstreet · on Nov 7, 2016

Attribute((aligned)) might be useful here.

amelius · on Nov 7, 2016

TL;DR: Even though most instructions of your processor (x86) allow data to be aligned on any byte, your compiler might not.

pklausler · on Nov 7, 2016

So much HTML to complain about C working the way C is defined rather than the way the OP wants it to work! It's not that hard to write a fast ones'-complement checksum that's portable and compliant, but whining's always easier than coding.