His point is that instead of byte swapping input, we should always use single-byte load operations because "it works for him."
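For reference, the byte-at-a-time style being discussed might look roughly like the sketch below. The function names `load_le32` and `load_be32` are made up for illustration and are not taken from the article; the idea is only that the result depends on the byte order of the data, never on the byte order of the host.

```c
#include <stdint.h>

/* Hypothetical helper: read a 32-bit little-endian value using only
 * single-byte loads. Works the same on any host byte order and has
 * no alignment requirement. */
uint32_t load_le32(const unsigned char *p)
{
    return ((uint32_t)p[0])       |
           ((uint32_t)p[1] << 8)  |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: same idea for big-endian data. */
uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
           ((uint32_t)p[3]);
}
```

Whether a given compiler turns these four loads into a single word load is a separate question and depends on the compiler version and target.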
But Plan9 is not a system known for its graphics, and I think performance would seriously suffer if everyone had to program like that. Being able to load a pixel as an int is the reason 32-bit RGB is used more often as a pixel format than 24-bit.
Of course it might not matter as much these days: GCC and LLVM can optimize his code sequences into bswap instructions automatically. And SIMD/shader code doesn't have endian portability problems that I know of, if only because SIMD is already not portable.
GCC doesn't merge the 4 byte-level reads into one 32-bit read. Thus, it does cause some performance penalty. The true impact is probably quite low, but it does exist on x86.
It is true however, that GCC will take a series of bit ops and produce a 'bswap' instruction on x86, but that requires a full 32-bit word to start with.
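As a rough illustration of that second case, this is the kind of shift-and-mask sequence on a full 32-bit word that optimizers typically recognize as a byte swap. The name `swap32` is hypothetical, and whether a particular compiler actually emits `bswap` for it depends on the version and optimization level.

```c
#include <stdint.h>

/* Hypothetical helper: reverse the byte order of a 32-bit word using
 * plain shifts and masks. The input is already a full word, which is
 * the situation described above. */
uint32_t swap32(uint32_t x)
{
    return (x >> 24)                 |
           ((x >> 8) & 0x0000FF00u)  |
           ((x << 8) & 0x00FF0000u)  |
           (x << 24);
}
```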
On modern CPUs, a byte load/store is really an integer (i.e. 32-bit/64-bit depending on arch) load/store that is rigged to only affect the target byte. On IA64 and PPC, it would just SIGBUS out (as it probably should on x86/amd64 too, but they kept it for compat reasons).
AFAIK ARM processors don't support misaligned word access.
AFAIK misaligned word access is twice as slow as aligned word access (it requires 2 reads), so I don't understand "offers it for free". But it is still twice as fast as the example code. Note that endianness and word alignment are two distinct problems.
The point made by the author addresses this issue from a different angle.
As the author says, programmers should always write endianness-neutral code unless it is impossible, which is generally at the interfaces where the program reads and writes data (I/O). If the code is correctly and intelligently optimized so that marshaling is done once, then byte swapping can generally be expected to be a low-frequency operation. In that case the simplest and most portable code should be favored.
Trying to optimize this operation by word reads and byte swapping provides an insignificant optimization at a higher cost in code portability and maintainability. The author is right on this.
It is also true, though, that in some cases the operation frequency is very high (e.g. reading millions of pixel values from an image). For these use cases, the programming overhead of using highly optimized code is perfectly justified. But then don't use half-baked optimizations. Try to align data on words (twice as fast), read by word (four times as fast) and use the byte-swapping machine instruction available on the target CPU instead of the proposed shifts and bit masks.
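A minimal sketch of the "read by word, then swap with the machine instruction" approach, assuming GCC or Clang: it relies on the `__builtin_bswap32` builtin and the predefined `__BYTE_ORDER__` macro, neither of which is standard C. The function name `load_be32_word` is hypothetical. `memcpy` is used for the word read so that the code stays valid even when the pointer is not aligned; compilers commonly lower it to a single load.

```c
#include <stdint.h>
#include <string.h>

#if !defined(__BYTE_ORDER__)
#error "this sketch assumes the GCC/Clang __BYTE_ORDER__ macro"
#endif

/* Hypothetical helper: read a 32-bit big-endian value with one word
 * read, swapping bytes only when the host is little-endian. */
uint32_t load_be32_word(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);      /* word-sized read, no alignment assumption */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    v = __builtin_bswap32(v);     /* GCC/Clang builtin, not standard C */
#elif __BYTE_ORDER__ != __ORDER_BIG_ENDIAN__
#error "unsupported byte order"
#endif
    return v;
}
```

This only pays off in hot loops such as the pixel-reading case; for one-off I/O marshaling the byte-at-a-time version is simpler and fully portable.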
My opinion is that good languages should provide optimized data marshaling functions in their library so that the code can be optimal and portable at the same time.
ARM supports unaligned memory accesses since v6. In most modern implementations, unaligned accesses falling entirely within a 16-byte aligned block have no penalty at all, while crossing 16-byte boundaries does impose a cost. If the locations of unaligned accesses are randomly distributed, this cost is still cheaper on average than accessing a byte at a time.