This is a wonderful thread. Which tricks in there are worth playing around with ...

xoranth · on July 7, 2023

> I guess the compiler's unrolling heuristics generally aren't as good as that blocking "mod then div" alternative to Duff's device? Obviously taking `s` out of the loop condition is part of the magic.

The magic in this case is the compiler autovectorizer. Making the length of the loop a loop invariant allows the autovectorizer to kick in.

The reason "blocking" by accumulating on uint8_t helps further is that it allows the compiler to accumulate on 8 bit SIMD lanes, instead 32 bit SIMD lanes. The same operation on 8 bit SIMD lanes will, to a first approximation, do x4 the work per cycle.

zokier · on July 7, 2023

> Is the uint8_t just "no point in using something bigger" or does it likely help the compiler? Does/can the signedness matter as well as the size?

In a good world you could use just uint_fast8_t and compiler would optimize this question for you. In real world I don't think compilers are smart enough, or there are too many other constraints limiting them :(