This is a bug in how Minecraft does things, not a bug in the generator itself (w...

marshray · 2024-04-19T02:01:44 1713492104

If "random" implies "contains no information", then it is indeed a bug in anything calling itself a "random number generator".

But that's just my opinion. The world is free to use the word however it wants.

Filligree · 2024-04-19T09:19:06 1713518346

“Random” means something closer to “contains maximal information”…

armb · 2024-04-19T15:25:22 1713540322

One way of thinking about it is that each bit has maximal information because knowing the entire previous history should give you no information about what the next bit will be ahead of it being revealed.

pclmulqdq · 2024-04-18T23:51:08 1713484268

Yeah, there is a big class of "RNG bugs" where someone uses a non-cryptographic RNG for secure things, not realizing that those things are supposed to be secure.

The classic example of these is a password manager that gave out recovery codes using a PRNG. This is in that class.

kbolino · 2024-04-19T13:25:16 1713533116

While a CSPRNG would have solved this problem, it also would've created a new one: much slower chunk loading and random item placement, which would have greatly slowed down the game simulation, and thus tanked framerate and playability. As it turns out, the right solution is to use multiple, isolated non-cryptographic random number generators with distinct state. That way, even though you can guess the state of one of them, it doesn't give you any insight into the others.

Retr0id · 2024-04-19T14:24:50 1713536690

Modern CSPRNGs can generate numbers at GB/s, I find it hard to believe it would slow the game down in a measurable way.

The "right" solution you describe sounds overcomplicated and error-prone (now you need to think carefully about which domains are separated) compared to just using a CSPRNG.

kibwen · 2024-04-19T14:55:46 1713538546

> The "right" solution you describe sounds overcomplicated and error-prone

It's not particularly. At program start-up, you seed the original PRNG. Then, you generate N numbers from the original PRNG and use those to seed N other PRNGs, then throw away the original PRNG. You don't need to think carefully about domains, you just create a new PRNG for everything you need a random number for. This makes your game easier to debug in a deterministic way, because now reproducing the behavior of one specific action that involves randomness no longer depends on every other action that involves randomness.

kbolino · 2024-04-19T15:24:14 1713540254

We may be at the point where CSPRNGs are viable for video game randomness, but that wasn't the case 10+ years ago especially when factoring in compatibility with 20-year-old hardware (high-end PCs from ca 2004 could play Minecraft).

Even so, a non-crypto PRNG can generally compute a new random number in 2-4 ALU ops. With SIMD optimization, that can amortize to under 1 cycle per byte, which means it takes under a nanosecond to generate a new 32-bit number. I'm not sure even the best hardware-accelerated CSPRNG on modern hardware can quite say the same just yet.

Retr0id · 2024-04-19T16:26:35 1713543995

chacha12, a construction that's been around in one form or another since the '00s, runs at well under 1 cycle per byte on good hardware and still plenty fast on "bad" hardware https://bench.cr.yp.to/results-stream.html (iiuc just using SIMD, no special acceleration)

But it doesn't really matter what things were like when the code was first written, it's about how it could be fixed in the present.

kbolino · 2024-04-19T18:38:58 1713551938

Video games do not generate large streams of data, they generate individual values on demand. Your link says, to generate 8 bytes, chacha12 on modern hardware needs 24-45 cpb. That's 192-360 cycles to generate enough bits for a pseudorandom double-precision floating-point number. Xoshiro256+ [1], a relatively high-quality generator for this purpose, can do it with 11 single-cycle ALU ops. So unoptimized xoshiro256+ should be 17 times faster than optimized chacha12 on the best hardware. This is a classic latency vs. throughput issue.

Now, maybe you could optimize the use of a CSPRNG here by filling large buffer(s) and sampling values from them. Some warm-up time could go a long way. However, I fear that you would run into one or more of the following problems:

- stop-the-world pause to refill the buffer (e.g. single buffer, no threading)

- synchronization delays from mutex locks (e.g. ring buffer refilled from a background thread)

- high memory usage (e.g. rotating pool of buffers, atomically swapped)

Needless to say, none of these solutions is anywhere near as simple to implement as a non-cryptographic PRNG.

Now let's consider determinism. Video games generally use a lot of differently seeded instances of the same PRNG algorithm to provide random numbers in different parts of the simulation. Since each part may demand random numbers at different rates, it's hard to replace several independent PRNGs with a single PRNG without compromising determinism. In the 4096 bytes necessary to run one instance of chacha12 at its maximum efficiency, you can fit 128 instances of xoshiro256+ or 512 instances of splitmix64 [2].

[1] = https://prng.di.unimi.it/xoshiro256plus.c

[2] = https://github.com/svaarala/duktape/blob/master/misc/splitmi...

Retr0id · 2024-04-19T21:47:04 1713563224

Generating random numbers into a refillable buffer is the standard for high-performance RNG, whether you're doing CSPRNG or not (it's how V8 uses xorshift, for example). "stop the world" is not a problem when the time to refill the buffer is still measured in nanoseconds.

There's nothing stopping you from having multiple CSPRNG instances, with deterministic seeds, if that's a design requirement.

kbolino · 2024-04-19T23:50:59 1713570659

I ran a microbenchmark with Go 1.22, which ships with both ChaCha8 (a little faster than ChaCha12) and PCG (a little slower than xoshiro256+). On my M1 Mac, I'm seeing 2.3 ns/uint64 for PCG and 4.5 ns/uint64 for ChaCha8. Assuming 3.2 GHz clock speed, that puts them at about 1 and 2 cpb respectively. With a floating-point conversion in the mix, they ran at 4.0 and 4.5 (!) ns/float64 respectively. That's far better than I would have thought.

So yes, a good, slightly buffered (internal state seems to be 300 bytes) implementation of a (quasi-?)CSPRNG is pretty damn close to a decent quality non-cryptographic PRNG and likely fast enough for most video games on most hardware. Though, very few people write games in Go.

pclmulqdq · 2024-04-20T05:37:59 1713591479

That's not quite a correct benchmark if I'm understanding what you benchmarked. Xoshiro and PCG need a chain of serial ops to generate a number, but not a lot of parallelism, letting the core do other stuff while it is working. ChaCha needs a lot of parallelism, and you are essentially saturating that core running it.

kbolino · 2024-04-20T12:54:43 1713617683

The builtin benchmarking tool in Go runs with parallelism set to 2x the number of cores. That should make evident any code that scales poorly. It is a tunable parameter though; I can try 4x cores etc.

I also ran it on x86 (AMD Ryzen 5600X) and got similar results; everything ran faster on a 4.6 GHz chip, but ChaCha8 stayed around 2 cpb while PCG improved a little to about 0.7 cpb.

I do have a ~10-year-old Celeron NUC laying around that I could use to test older/lower-end hardware. I can also publish the (admittedly very simple) program I'm using.

It's also worth noting that cpb is still a measure of average throughput not variance or latency and while Go's buffered ChaCha8 implementation amortizes to 2 cpb, when the 32-element buffer exhausts, generating a new number is much more expensive than the previous 31 numbers. I wouldn't say the performance is good enough for every use case.

kbolino · 2024-04-20T14:42:15 1713624135

Ok, I was able to figure out a few more things. First of all, the benchmark runner does not necessarily exploit the parallelism of GOMAXPROCS out of the box. Second, GOMAXPROCS seems to default to 1x logical cores; I last ran it on a machine where logical cores = 2x physical cores. I adjusted the benchmark code to use RunParallel and adjusted the parallelism of each run.

Testing on Celeron J3455 @ 1.5 GHz (4 physical and logical cores) gave me PCG at 1.2 cpb and ChaCha8 at 2.6 cpb with cpu=1, but PCG stayed relatively constant across cpu=1,2,4,8 (worst was 1.8 cpb) while ChaCha8 slowed to 6.5 cpb at cpu=2 and 7.5 cpb at cpu=4 and cpu=8.

Back on my M1 Mac (8 logical and physical? cores), both ChaCha8 and PCG generally got better with more cores. ChaCha8 got down to 0.76 cpb at cpu=4 (then regressed a bit at cpu=8) while PCG got down to 0.26 cpb at cpu=8.

I don't think any of these results rule ChaCha8 out completely, though again I'm looking from the perspective of video games, which generally monopolize a machine while running.

pclmulqdq · 2024-04-20T23:44:39 1713656679

The important question to benchmark for small buffer sizes isn't actually cycles per byte, it's micro-ops per byte. You are sort of getting there by adding threads, but if you want to measure micro-ops per byte by measuring cycles per byte, your benchmarking code should run a number of parallel implementations of the generator on each thread (and stick to 1 thread per physical core, too). Each generator will have a "roof" where more generators per thread doesn't cost any more in terms of cycles/byte. I am assuming that for PCG, that roof is around 4-ish on an old CPU, and may be as high as 6 or 8 on your macbook, while the roof for ChaCha will be close to 1.

ChaCha exploits instruction-level parallelism to get speed. PCG doesn't - it has a chain of instructions that must be executed sequentially. That means that when the PCG generator executes, it leaves gaps in the instruction stream that can be filled with other instructions for the game. That means a slowdown in the game that is more significant than what your benchmark suggests.

I'm going to do this and write blog about it (although I don't have a macbook), so we may be able to compare results.

kbolino · 2024-04-21T00:36:38 1713659798

Ok this makes sense. There are only 4 FP/SIMD units in the M1 and I'm guessing just one in the Celeron.

I look forward to reading your blog about this.

pclmulqdq · 2024-04-19T19:21:56 1713554516

There are CSPRNG constructions based on AES that would probably be appropriate, but they are still a huge performance hit. PCG, Xorshift, Xoshiro, and the like are all about 50x faster than the fastest CSPRNGs in sustained throughput, and also have smaller state and much less spinup time.

pclmulqdq · 2024-04-19T15:13:41 1713539621

The problem of dealing with randomness is about figuring out how to use CSPRNGs (and TRNGs) as little as possible, but not too little. CSPRNGs can get to a couple of GB/sec on a core, so they're not all that bad, but non-secure PRNGs are over an order of magnitude faster. I would assume that most of those things (eg chunk generation/loading) don't need secure RNG at all.

Sharding your RNGs by player is also an option for some games, and can be easier.