OK thanks, do you know why it's 16-byte aligned? The Clang code is exactly 16 bytes but the GCC code is 20 bytes and it pads it out to 32. For i-cache maybe? (I asked this on reddit too)
It's definitely an optimization thing, but I'm not sure off the top of my head exactly why. According to https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Opt... -falign-functions is enabled at -O2 and -O3, but not at -Os. So the padding you're seeing is just the gap before the next function's aligned start: a 20-byte function ends up occupying 32 bytes. Presumably instruction fetch/decode can be faster when a function's entry point is aligned.
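For what it's worth, the 16 vs. 32 numbers fall straight out of rounding up to the alignment. Here's a minimal sketch of that arithmetic, assuming the 16-byte alignment observed in the question (the actual default depends on the target) and using a made-up helper name, `round_up_to_alignment`. This is not how GCC implements it, just the same math:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a GCC API): round `size` up to the next
   multiple of `align`. `align` must be a nonzero power of two. */
size_t round_up_to_alignment(size_t size, size_t align)
{
    /* A power of two has exactly one bit set. */
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Add align - 1, then clear the low bits to snap down to a multiple. */
    return (size + align - 1) & ~(align - 1);
}
```

With a 16-byte alignment, `round_up_to_alignment(16, 16)` is 16 and `round_up_to_alignment(20, 16)` is 32, which matches the Clang and GCC sizes you saw.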