Ok, I was able to figure out a few more things. First of all, the benchmark runn...

pclmulqdq · 2024-04-20T23:44:39 1713656679

The important question to benchmark for small buffer sizes isn't actually cycles per byte, it's micro-ops per byte. You are sort of getting there by adding threads, but if you want to measure micro-ops per byte by measuring cycles per byte, your benchmarking code should run a number of parallel implementations of the generator on each thread (and stick to 1 thread per physical core, too). Each generator will have a "roof" where more generators per thread doesn't cost any more in terms of cycles/byte. I am assuming that for PCG, that roof is around 4-ish on an old CPU, and may be as high as 6 or 8 on your macbook, while the roof for ChaCha will be close to 1.

ChaCha exploits instruction-level parallelism to get speed. PCG doesn't - it has a chain of instructions that must be executed sequentially. That means that when the PCG generator executes, it leaves gaps in the instruction stream that can be filled with other instructions for the game. That means a slowdown in the game that is more significant than what your benchmark suggests.

I'm going to do this and write blog about it (although I don't have a macbook), so we may be able to compare results.

kbolino · 2024-04-21T00:36:38 1713659798

Ok this makes sense. There are only 4 FP/SIMD units in the M1 and I'm guessing just one in the Celeron.

I look forward to reading your blog about this.