The article mentions two: unpacking 24 bit values and matrix multiplication, bot...

vinkelhake · on Nov 30, 2014

Where did you get 24 from? It's about vectorizing loads of 3 elements. The elements don't have to be 8-bit. The second example uses floats.

cwp · on Nov 30, 2014

Yeah, but 24-bit values (RBG) are what you'd find in video and gaming applications.

vardump · on Nov 30, 2014

Games, at least what I've seen, usually just don't use 3x(8|16|32) bits per pixel, if it's not needed as alpha. It's pretty rare to pack 24/48/96-bit pixels in memory. Traditionally, unaligned loads are just not worth it.

Sometimes packed formats were worth it. I remember doing RLE schemes for fast "alpha blending" (anti-aliased blitting really!) back in 2000 or so, before MMX or GPUs were common. The scheme I used had three different types of run lengths. Completely transparent, pre-multiplied alpha blended and completely opaque. 16-bit pre-multiplied values were in memory in 32-bit format like this: URGBU00A. Groups of 5 bits, except U denotes unused single bit. Frame buffer in RAM was in 15 bit format, 0RRRRRGGGGGBBBBB. With this scheme, alpha-blending needed totally just one multiply per pixel [1], compared to three with usual pre-multiplied alpha or six with normal alpha without pre-multiplication.

[1]: Frame buffer pixels were masked and shifted into 000000GGGGG000000RRRRR00000BBBBB, multiplied by A and shifted back in place. Because all values were 5 bit entities, no overflow (overlap) could occur in multiply, thus one multiply yielded 3 multiplied frame buffer pixel values, for R, G and B!

kayamon · on Dec 1, 2014

Presumably loading a 3D vector would be a good use-case?