Honestly I would have said popcnt as well. Lookup table or bit shifts when I can...

gcp · on Oct 13, 2016

Popcnt isn't particularly well optimized in most micro-architectural implementations.

mitchty · on Oct 13, 2016

Looks that way with a quick test. But it looks like there may be a better way with SSSE3 PSHUFB: http://wm.ite.pl/articles/sse-popcount.html
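For anyone who hasn't seen the trick: as I understand it, the idea is to keep a 16-entry table of nibble bit counts in a register and let PSHUFB do 16 lookups at once, once for the low nibbles and once for the high nibbles. A rough scalar sketch of that lookup, not the vectorized code from the article (`nibble_bits` and `popcount_nibbles` are names I made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit counts for every 4-bit value. A PSHUFB version would hold these
   16 bytes in one vector register and use it as the shuffle table. */
static const uint8_t nibble_bits[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Hypothetical helper: scalar model of the nibble lookup, one byte at a
   time. The SIMD version does the same two lookups for 16 bytes at once. */
static size_t popcount_nibbles(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t total = 0;
    for (size_t i = 0; i < len; i++) {
        total += nibble_bits[buf[i] & 0x0F]; /* low nibble */
        total += nibble_bits[buf[i] >> 4];   /* high nibble */
    }
    return total;
}
```

This only shows the table lookup itself; the speedup in the SIMD version comes from doing it across a whole register per instruction.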

rayiner · on Oct 13, 2016

Is it? It looks like on most recent Intel CPUs, it's 3-cycle latency, 1-cycle throughput on a 64-bit register. An 8-bit LUT solution is going to manage less than 16 bits per cycle on any recent Intel/AMD CPU (maximum of two load ports).
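To make the comparison concrete, here is a minimal sketch of the two approaches. It assumes GCC or Clang, since `__builtin_popcountll` is a compiler builtin and only becomes a single POPCNT instruction when built with something like `-mpopcnt`; `byte_bits`, `init_byte_bits`, `popcount_lut8` and `popcount_hw` are illustrative names. The LUT version needs eight table loads per 64-bit word, which is where the load-port limit comes from.

```c
#include <assert.h>
#include <stdint.h>

/* 256-entry table: byte_bits[b] is the number of set bits in byte b. */
static uint8_t byte_bits[256];

/* Fill the table once; entry 0 is already 0 from static initialisation. */
static void init_byte_bits(void)
{
    for (int i = 1; i < 256; i++)
        byte_bits[i] = (uint8_t)((i & 1) + byte_bits[i >> 1]);
}

/* 8-bit LUT popcount: one table load per byte, eight per 64-bit word. */
static unsigned popcount_lut8(uint64_t x)
{
    assert(byte_bits[255] == 8); /* init_byte_bits() must have run */
    unsigned total = 0;
    for (int i = 0; i < 8; i++) {
        total += byte_bits[x & 0xFF];
        x >>= 8;
    }
    return total;
}

/* Compiler builtin; whether it maps to POPCNT depends on target flags. */
static unsigned popcount_hw(uint64_t x)
{
    return (unsigned)__builtin_popcountll(x);
}
```

With at most two loads per cycle, the LUT loop covers 16 bits per cycle at best, against a full 64-bit word per cycle for the instruction at the quoted throughput.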

gcp · on Oct 13, 2016

Hmm, much better than I remember. I guess this goes a long way to explain why this wasn't always seen in practice: http://danluu.com/assembly-intrinsics/