Hacker News

Fabulous comment, please keep posting more and sharing your knowledge/experience. And thanks for the reminder about 'ispc'. A couple tiny additions:

While loading each element with an individual 32-bit load usually negates the advantage of vectorization, if you are working with 8-bit data, doing a single vector load and shuffling bytes within a 128-bit vector is really fast with AVX. And shuffling double- and quad-words within a 256-bit vector with AVX2 can be a useful alternative, even if you have to do it twice and combine the results. But better still is to rearrange your data layout to match the operations you are going to be doing.

AVX2 doesn't actually offer scatter support, although AVX-512 will. And AVX2 gather on Haswell normally isn't any faster than multiple individual loads, although it's no worse, and the expectation is that future generations will improve on this significantly. Also worth pointing out is that 256-bit YMM vectors are conceptually split in two, so apart from the slightly slower 32/64/128-bit permute instructions, it's not possible to move data from one 128-bit 'lane' to the other.




Sorry, the desktop Intel SIMD ISAs are generally not something I use on a day-to-day basis, so I get them mixed up a lot. (LRBni was the last one I looked at for any serious length of time.)



