OK, I was able to figure out a few more things. First, the benchmark runner does not necessarily exploit the parallelism allowed by GOMAXPROCS out of the box. Second, GOMAXPROCS seems to default to the number of logical cores, and I last ran the benchmark on a machine where logical cores = 2x physical cores. I adjusted the benchmark code to use RunParallel and varied the parallelism of each run.
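For anyone who wants to reproduce this, here is a minimal sketch of that kind of RunParallel benchmark. It is not the exact code behind the numbers in this thread. It assumes the PCG and ChaCha8 sources from Go's math/rand/v2 (Go 1.22+) as stand-ins for the generators under test, and `benchSource`, `newPCG`, `newChaCha8` and the seed values are hypothetical names picked for illustration. It sets GOMAXPROCS directly to mimic the cpu=1,2,4,8 runs.

```go
package main

import (
	"fmt"
	"math/rand/v2"
	"runtime"
	"sync/atomic"
	"testing"
)

// sink keeps the compiler from discarding the generated values.
var sink atomic.Uint64

// Hypothetical factories with arbitrary fixed seeds. Each goroutine calls
// the factory once, so no generator state is shared between goroutines.
func newPCG() rand.Source     { return rand.NewPCG(1, 2) }
func newChaCha8() rand.Source { return rand.NewChaCha8([32]byte{1}) }

// benchSource benchmarks one generator with GOMAXPROCS set to procs and
// perProc goroutines per proc, then restores the previous GOMAXPROCS.
func benchSource(newSrc func() rand.Source, procs, perProc int) testing.BenchmarkResult {
	prev := runtime.GOMAXPROCS(procs)
	defer runtime.GOMAXPROCS(prev)
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(8)             // one uint64 per iteration
		b.SetParallelism(perProc) // goroutines = perProc * GOMAXPROCS
		b.RunParallel(func(pb *testing.PB) {
			src := newSrc()
			var acc uint64
			for pb.Next() {
				acc ^= src.Uint64()
			}
			sink.Add(acc)
		})
	})
}

func main() {
	for _, procs := range []int{1, 2, 4, 8} {
		fmt.Printf("cpu=%d pcg:     %v\n", procs, benchSource(newPCG, procs, 1))
		fmt.Printf("cpu=%d chacha8: %v\n", procs, benchSource(newChaCha8, procs, 1))
	}
}
```

Note that the `pb.Next()` bookkeeping is included in the measured time, so very cheap generators will look a little slower than they are.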
Testing on a Celeron J3455 @ 1.5 GHz (4 physical and logical cores) gave me PCG at 1.2 cpb and ChaCha8 at 2.6 cpb with cpu=1. PCG stayed relatively constant across cpu=1,2,4,8 (worst was 1.8 cpb), while ChaCha8 slowed to 6.5 cpb at cpu=2 and 7.5 cpb at cpu=4 and cpu=8.
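In case the unit is unclear: cycles per byte is just time per byte multiplied by the clock rate. A small sketch of the conversion from a Go benchmark's ns/op figure follows. `cyclesPerByte` is a hypothetical helper, the 6.4 ns/op input is a made-up example rather than a measurement from this thread, and it assumes the core actually runs at the nominal clock for the whole run.

```go
package main

import "fmt"

// cyclesPerByte converts a benchmark's ns/op into cycles per byte.
// ns * GHz gives cycles, because 1 GHz is one cycle per nanosecond.
func cyclesPerByte(nsPerOp, bytesPerOp, clockGHz float64) float64 {
	return nsPerOp * clockGHz / bytesPerOp
}

func main() {
	// Hypothetical input: 6.4 ns per 8-byte output on a 1.5 GHz core.
	fmt.Printf("%.2f cpb\n", cyclesPerByte(6.4, 8, 1.5))
}
```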
Back on my M1 Mac (8 logical cores, and I think 8 physical), both ChaCha8 and PCG generally got better with more cores. ChaCha8 got down to 0.76 cpb at cpu=4 (then regressed a bit at cpu=8), while PCG got down to 0.26 cpb at cpu=8.
I don't think any of these results rule ChaCha8 out completely, though again I'm looking from the perspective of video games, which generally monopolize a machine while running.
The important question to benchmark for small buffer sizes isn't actually cycles per byte, it's micro-ops per byte. You are sort of getting there by adding threads, but if you want to estimate micro-ops per byte by measuring cycles per byte, your benchmarking code should run a number of independent instances of the generator side by side on each thread (and stick to 1 thread per physical core, too). Each generator type will have a "roof": the number of instances per thread beyond which adding more no longer improves cycles/byte. I am assuming that for PCG that roof is around 4-ish on an old CPU, and may be as high as 6 or 8 on your MacBook, while the roof for ChaCha will be close to 1.
ChaCha exploits instruction-level parallelism to get speed. PCG doesn't - it has a chain of instructions that must be executed sequentially. That means that when the PCG generator executes, it leaves gaps in the instruction stream that can be filled with other instructions from the game. ChaCha leaves far fewer such gaps, so in a real game its slowdown relative to PCG will be more significant than what your benchmark suggests.
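A rough sketch of the "several generators per thread" measurement described above, assuming math/rand/v2's PCG (Go 1.22+) as the generator. `interleaved`, the seeds and the draw count are hypothetical choices. It runs on a single goroutine and keeps the total number of draws constant, so if time per byte stops dropping as k grows, that k is roughly the roof. The Go compiler does not unroll the inner loop, so treat the numbers as indicative only.

```go
package main

import (
	"fmt"
	"math/rand/v2"
	"time"
)

// interleaved draws rounds values from each of k independent PCG generators
// in turn on the calling goroutine. It returns the XOR of all outputs (so
// the work cannot be optimized away) and the elapsed time.
func interleaved(k, rounds int) (uint64, time.Duration) {
	gens := make([]*rand.PCG, k)
	for i := range gens {
		gens[i] = rand.NewPCG(uint64(i)+1, 42) // arbitrary distinct seeds
	}
	var acc uint64
	start := time.Now()
	for r := 0; r < rounds; r++ {
		// The k dependency chains are independent of each other, so an
		// out-of-order core is free to overlap them.
		for _, g := range gens {
			acc ^= g.Uint64()
		}
	}
	return acc, time.Since(start)
}

func main() {
	const total = 1 << 24 // values drawn per measurement, constant across k
	for _, k := range []int{1, 2, 4, 8} {
		_, d := interleaved(k, total/k)
		nsPerByte := float64(d.Nanoseconds()) / float64(total*8)
		fmt.Printf("k=%d: %.3f ns/byte\n", k, nsPerByte)
	}
}
```

The same loop with `*rand.ChaCha8` in place of `*rand.PCG` would give the other half of the comparison.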
I'm going to do this and write a blog post about it (although I don't have a MacBook), so we may be able to compare results.