
His point is that instead of byte swapping input, we should always use single-byte load operations because "it works for him."

But Plan 9 is not a system known for its graphics, and I think performance would seriously suffer if everyone had to program like that. Being able to load a pixel as a single int is the reason 32-bit RGB is used more often as a pixel format than 24-bit.
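To make that claim concrete, here is a sketch of the two access patterns; the helper names and framebuffer layout are hypothetical, not from the original post. With 32-bit pixels each access is one aligned load, while packed 24-bit pixels need three byte loads plus shifts:

```c
#include <stdint.h>

/* Hypothetical 32-bit framebuffer (XRGB8888): one aligned load per pixel.
 * stride is in pixels. */
uint32_t get_pixel32(const uint32_t *fb, int stride, int x, int y) {
    return fb[y * stride + x];
}

/* Hypothetical packed 24-bit framebuffer: three byte loads plus shifts
 * per pixel, and rows are rarely word-aligned. stride is in bytes. */
uint32_t get_pixel24(const uint8_t *fb, int stride, int x, int y) {
    const uint8_t *p = fb + y * stride + x * 3;
    return ((uint32_t)p[0] << 16) | ((uint32_t)p[1] << 8) | p[2];
}
```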

Of course it might not matter as much these days: GCC and LLVM can optimize his code sequences into bswap instructions automatically. And SIMD/shader code doesn't have any endian portability problems that I know of, if only because SIMD is already non-portable.



> I think performance would seriously suffer

Evidence please.


===============

#include <stdint.h>

uint32_t load_uint32_be(const uint8_t* p) { return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) | ((uint32_t)p[2] << 8) | p[3]; }

uint32_t load_uint32_le(const uint8_t* p) { return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) | ((uint32_t)p[1] << 8) | p[0]; }

===============

gcc -O3 -fomit-frame-pointer -S bo.c

===============

load_uint32_be:
        movl    4(%esp), %edx
        movzbl  (%edx), %eax
        movzbl  1(%edx), %ecx
        sall    $24, %eax
        sall    $16, %ecx
        orl     %ecx, %eax
        movzbl  3(%edx), %ecx
        movzbl  2(%edx), %edx
        orl     %ecx, %eax
        sall    $8, %edx
        orl     %edx, %eax
        ret
load_uint32_le:
        movl    4(%esp), %edx
        movzbl  3(%edx), %eax
        movzbl  2(%edx), %ecx
        sall    $24, %eax
        sall    $16, %ecx
        orl     %ecx, %eax
        movzbl  (%edx), %ecx
        movzbl  1(%edx), %edx
        orl     %ecx, %eax
        sall    $8, %edx
        orl     %edx, %eax
        ret
GCC doesn't merge the four byte-level reads into one 32-bit read, so there is some performance penalty. The true impact is probably quite low, but it does exist on x86.

It is true, however, that GCC will take a series of bit ops and produce a bswap instruction on x86, but that requires starting from a full 32-bit word.
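For illustration, this is the kind of pattern meant: a shift-and-mask swap of a value that is already a full 32-bit word, which GCC and Clang at -O2 recognize and compile to a single bswap instruction (the function name here is my own, not from the thread):

```c
#include <stdint.h>

/* Plain shift-and-mask byte swap of a 32-bit word; modern GCC/Clang
 * pattern-match this into one bswap instruction on x86. */
uint32_t swap32(uint32_t x) {
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}
```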


On modern CPUs, a byte load/store is really an integer (i.e. 32-bit/64-bit, depending on arch) load/store that is rigged to affect only the target byte. On IA64 and PPC, it would just SIGBUS out (as it probably should on x86/amd64 too, but they kept it for compat reasons).


Actually, a modern CPU loads a whole L1 cache line at once. Which is usually 64 bytes nowadays.


This can also result in 2 reads if the memory isn't aligned.


Desktop PPC CPUs (when there were such things) allowed misaligned memory operations with some performance penalty.[1]

x86 practically offers it for free in newer architectures (Sandy Bridge, Ivy Bridge and Bulldozer).[2]

[1] https://developer.apple.com/hardwaredrivers/ve/g5.html

[2] http://agner.org/optimize/instruction_tables.pdf (check MOVDQU timings)


AFAIK ARM processors don't support misaligned word access, and AFAIK a misaligned word access is about twice as slow as an aligned one (it requires two reads). So I don't understand "offers it for free". But even that is still twice as fast as the example code. Note that endianness and word alignment are two distinct problems.

The point made by the author addresses this issue from a different angle.

As the author says, programmers should always write endianness-neutral code except where it is impossible, which is generally at the interfaces where data is read and written (I/O) by the program. If the code is correctly and intelligently organized so that marshaling is done once, then byte swapping can generally be expected to be a low-frequency operation. In that case the simplest and most portable code should be favored.

Trying to optimize this operation with a word read and byte swapping provides an insignificant gain at a higher cost to code portability and maintainability. The author is right on this.

Though it is also true that in some cases the operation frequency is very high (e.g. reading millions of pixel values from an image). For those use cases, the programming overhead of highly optimized code is perfectly justified. But then don't use half-baked optimizations: align data on word boundaries (twice as fast), read by word (four times as fast), and use the byte-swapping machine instruction available on the target CPU instead of the proposed shifts and bit masks.
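A sketch of that fast path, assuming GCC/Clang (the function name is mine; memcpy and __builtin_bswap32 are real, the compiler folds the memcpy into a single word load and emits bswap for the builtin):

```c
#include <stdint.h>
#include <string.h>

/* Load a big-endian 32-bit value with one word read plus a byte-swap
 * instruction. memcpy sidesteps alignment and strict-aliasing problems
 * and compiles down to a plain load; __builtin_bswap32 becomes a single
 * bswap on x86. Only swap when the host is little-endian. */
uint32_t load_be32_fast(const uint8_t *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    v = __builtin_bswap32(v);
#endif
    return v;
}
```

The result matches the portable byte-by-byte version; only the generated code differs.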

My opinion is that good languages should provide optimized data-marshaling functions in their standard library, so that code can be optimal and portable at the same time.


ARM supports unaligned memory accesses since v6. In most modern implementations, unaligned accesses falling entirely within a 16-byte aligned block have no penalty at all, while crossing 16-byte boundaries does impose a cost. If the locations of unaligned accesses are randomly distributed, this cost is still cheaper on average than accessing a byte at a time.


So it's not quite a desktop... but the standalone server theoretically could be one I guess: http://www.nasi.com/ibm-power-720-express.php


> Being able to load a pixel as an int is the reason 32-bit RGB is used more often as a pixel format than 24-bit.

No: RGBA.


Nope. Even 24-bit RGB with no alpha is stored with a wasted byte in video memory these days (outside of legacy video modes). The reason is alignment.



