For Monte Carlo simulations, CSRNGs are too slow. Xorshift is incredibly fast. You wouldn't want to use a CSRNG for path tracing rendering, for example; you're calling the RNG many times per ray and shooting dozens or hundreds of rays per pixel.
We've dealt with this in Rust—people love to write Monte Carlo simulations as language benchmarks, and Rust used to get dinged for preferring CSRNGs.
Xorshift is like 10 cycles. AES-NI can take hundreds, depending on the number of rounds.
In the case of brute force Monte Carlo path tracing, the speed of the RNG is important, because you have dozens or hundreds of rays per pixel, each with multiple bounces, and every bounce samples from your RNG. When your render times are 6 hours per frame to begin with, you really don't want to mess around.
Ok, let's say we're rendering at 8K (32 megapixels), 200 rays per pixel, 20 bounces per ray. And rendering that frame takes 6 hours. How many (mega)bytes of random data do we need per second?
Six whopping megabytes per second, for the entire job, across all cores! Well, that assumed you only need one byte per bounce. Maybe you need a few bytes?
And here I can pull an order of magnitude more and then some (180 megabytes/s) via /dev/urandom on OpenBSD (i.e. ChaCha20, no hardware accel), using a single core of a few-years-old fanless mini PC (some Intel Celeron with a "U" suffix).
I generally don't buy the argument that your PRNG needs to be super duper fast, because usually the simulation is doing much more than just generating random data, and all that work is so much more expensive than even a relatively expensive CSPRNG. At least that's true of my ray tracers. Generating rays is trivial compared to traversing, intersecting, and shading the scene... goodness gracious, if I need to do actual texture & normal map lookups at high resolution, so many cache misses :(
That said, a simple and fast PRNG with little internal state can make life easier if you want reproducibility... and it might be easier to build a reproducible simulation if your RNG gives you exactly one word of the size you need, instead of requiring you to generate a big block (and then extract things out of it somehow). This is especially the case with threaded sims. So yeah.
But in this case, I wouldn't be looking at something like XorShift; I'd be looking at a random map that lets me use (e.g.) the pixel's coordinates and a frame & bounce counter as the input and get a matching (constant) output. Then it no longer matters which thread renders which pixel and when; the result will be the same. It's not hard to build such functions, though they will be slower than XorShift if you want comparable quality. Speed should be no problem either way.
It's amazing how much effort people are willing to invest to contradict others' experience on the Internet.
The GP is giving a real-life example (people having complained about something being too slow compared to xorshift) and yet you try to do the math to prove it couldn't happen…
The math was just the prelude to my real world experience. Please try to read and comprehend the rest of my post. If you can't comprehend it, then feel free to ask for clarification or just leave it be without adding any snide remarks.
People having complained is not a real-life example of an actual performance issue. Until proven otherwise, it's a real-life example of someone complaining about some microbenchmark that, based on my real-life experience, would be completely irrelevant if they actually developed their real-world problem-solving program, which totally shifts the bottleneck to something that actually matters.
Alternatively, it's a real-life example of poor implementation or usage, and not necessarily poor performance of the underlying primitive. This also happens a lot. "Foo is slow!" is true for a lot of things if you use them wrong. Maybe it's simply the user's ignorance, or maybe the API isn't well designed to support their case. Maybe it's poor documentation.
Also, this is a technical forum. There's absolutely nothing wrong with people doing back-of-the-envelope calculations to analyze the scale of a problem. If more people did that while pointing at real examples with concrete numbers, maybe we'd actually know things instead of just relying on hearsay and religion. Whatever, it's a cargo-cult industry.
Here's the good part: if someone who knows better comes along, they find those numbers and can tell us why we're wrong. Otherwise, let's just take everything on faith?
It's amazing how often people (like you, who haven't contributed anything except noise to the discussion) just don't want to understand a problem and if someone comes along trying to understand it and do the math, they're called a lunatic.
And this is why some of the code I get to work on is so terrible, people are wasting time and complicating code to micro-optimize things that are completely irrelevant (a conclusion one could come to either by a ballpark-estimate or a quick profile), while massive bottlenecks sit unattended.
AES-128-CTR on my laptop is 4.8GB/sec per core. I'm sure something producing random enough output for simulation can be built that is as fast as memcpy, but would anything real be affected by the difference?
Yes, it is possible, but it's difficult to do right. I ran into at least 3 or 4 "roadblocks" while designing this algorithm; fortunately I managed to get through all of them.
I've got an unrolled dual-stream version going at 37.1 GB/s on a Threadripper 1950X (which has 2x AES pipelines). ~29.2 GB/s for more "typical" usage (but still impractical).
The issue with this is that random bits at this speed are almost useless. A division or modulus operation takes ~80 cycles (~20 nanoseconds), so as soon as you do "rand() % 6" (say you wanted a 6-sided dice roll), you've completely wrecked the bit generator's throughput. A branch misprediction is like 15 cycles, so the singular if-statement in your code that uses the RNG number would be 5x slower than the generator itself.
It's fast, but I don't know how practical it is. There's no practical way to use all of these random bits that I'm aware of.
I did get a bit-compatible POWER9 version running, by the way. So it's portable across systems (at least systems with an AES encoder: ARM, POWER9, and 2010-era x86). The Raspberry Pi doesn't have a CPU AES engine however (it's outside the core), so the Raspberry Pi can't use this methodology.
--------------
If I were to actually push this project further, I'd need to study whole-program optimization, so that the (rarely used) XMM registers could be used pragmatically across a lot of code. The entirety of this RNG fits inside a single XMM register, so it'd benefit greatly from whole-program optimization.
The XMM data should also be fed into SIMD somehow, some kind of bridge into ispc or OpenMP SIMD code. A bunch of raw bits coming in through assembly instructions is somewhat impractical, no matter how you look at it.
Daniel Lemire informed me of an amazing algorithm (which does work with AES XMM registers) that creates uniform random numbers out of a few multiply instructions. So I can get XMM registers full of bounded integers (e.g. numbers between 0 and 100) with very low bias out of the bitstream.
I also managed to convert the numbers into single- and double-precision floats uniformly distributed between 0 and 1.0.
Still, this only really seems practical if you have SIMD code that actually consumes all of these bits. Moving data from XMM registers into general-purpose registers like RAX incurs quite a bit of slowdown.
I don't think the typical simulation code is actually limited by the speed of RNGs at all.
Counter (0, 1, 2, 3, 4...) -> Multiply Const1 -> Bit-reverse -> XOR Const2 -> Multiply Const3 has been my PRNG for GPUs.
Const1 and Const3 need to be odd (bottom bit == 1). Const2 can be fully random. All three values can come from /dev/urandom, with pretty good results in some preliminary testing. I haven't figured out a good methodology for selecting Const1 or Const3 aside from requiring the bottom bit == 1 and picking randomly (if I had more time, I'd probably work on it...).
---------
The "counter" is 64 bits and allocated per thread. Reserve the top 14 bits as a thread ID for 16384 GPU threads; each GPU thread then has 50 bits' worth of counter values.
64-bit integer multiplies are slow, however; 32-bit multiplies would be faster on most GPUs. But in my experience GPUs are almost always memory-constrained, not compute-constrained, so you can probably do fine with 64-bit integer multiplies.
If you really need to stick with 32-bit integers, then make Const1 and Const3 per-thread. That gives a unique sequence of 32-bit numbers per thread, but with only 32 bits of state a cycle every 4 billion numbers is inevitable.
It's not only Monte Carlo simulations. You may want to add a bit of random noise to a high-resolution video sequence, for example. That's impractical with a slow PRNG.