Yes, it is possible but its difficult to do it right. I ran into at least 3 or 4 "roadblocks" while designing this algorithm, fortunately I managed to get through all of them.
I've got an unrolled dual-stream going at 37.1 GBps on Threadripper 1950x (which has 2x AES pipelines). ~29.2 GBps for more "typical" usage (but still impractical).
The issue with this is that random-bits of this speed are almost useless. A division or modulus operation takes 80-cycles (~20 nanoseconds) so as soon as you do "rand() % 6" (assuming you wanted a 6-sided dice roll), you've completely wreaked the bit-generator. A branch-misprediction is like 15-cycles, so that singular if-statement in your code that uses the RNG-number would be 5x slower than the generator itself.
Its fast, but I don't know how practical it is. There's no practical way to use all of these random bits that I'm aware of.
I did get a bit-compatible POWER9 version running by the way. So its portable across systems (at least, systems with an AES-encoder. ARM, Power9, and 2010-era x86.). Rasp. Pi doesn't have a CPU-AES engine however (its outside of the core), so Rasp. Pi can't use this methodology.
--------------
If I were to actually push forward with this project more, I'd need to study on whole-program optimization, so that the (rarely used) XMM registers can be pragmatically used in a lot of code. The entirety of this RNG fits inside a single XMM register, so it'd benefit grossly from whole-program optimization.
The XMM-data should also be fed into SIMD somehow, some kind of engine into ispc or OpenMP SIMD code. A bunch of raw-bits coming in through assembly commands is somewhat impractical, no matter how you look at it.
Dan Lemire informed me of an amazing algorithm (that does work with AES XMM registers) that creates uniform random numbers out of a few multiply instructions. So I can get XMM-registers full of arbitrary integers with very low bias. (ex: numbers between 0 and 100) out of the bitstream.
I also managed to convert the numbers into single-precision floats and double-precision floats between 0 and 1.0 uniformly.
Still, this only really seems practical if you have SIMD-code that actually consumes all of these bits. Converting from XMM-registers into RAX registers has quite a bit of slowdown.
I don't think the typical simulation code is actually limited by the speed of RNGs at all.
Yes, it is possible but its difficult to do it right. I ran into at least 3 or 4 "roadblocks" while designing this algorithm, fortunately I managed to get through all of them.
I've got an unrolled dual-stream going at 37.1 GBps on Threadripper 1950x (which has 2x AES pipelines). ~29.2 GBps for more "typical" usage (but still impractical).
The issue with this is that random-bits of this speed are almost useless. A division or modulus operation takes 80-cycles (~20 nanoseconds) so as soon as you do "rand() % 6" (assuming you wanted a 6-sided dice roll), you've completely wreaked the bit-generator. A branch-misprediction is like 15-cycles, so that singular if-statement in your code that uses the RNG-number would be 5x slower than the generator itself.
Its fast, but I don't know how practical it is. There's no practical way to use all of these random bits that I'm aware of.
I did get a bit-compatible POWER9 version running by the way. So its portable across systems (at least, systems with an AES-encoder. ARM, Power9, and 2010-era x86.). Rasp. Pi doesn't have a CPU-AES engine however (its outside of the core), so Rasp. Pi can't use this methodology.
--------------
If I were to actually push forward with this project more, I'd need to study on whole-program optimization, so that the (rarely used) XMM registers can be pragmatically used in a lot of code. The entirety of this RNG fits inside a single XMM register, so it'd benefit grossly from whole-program optimization.
The XMM-data should also be fed into SIMD somehow, some kind of engine into ispc or OpenMP SIMD code. A bunch of raw-bits coming in through assembly commands is somewhat impractical, no matter how you look at it.
Dan Lemire informed me of an amazing algorithm (that does work with AES XMM registers) that creates uniform random numbers out of a few multiply instructions. So I can get XMM-registers full of arbitrary integers with very low bias. (ex: numbers between 0 and 100) out of the bitstream.
I also managed to convert the numbers into single-precision floats and double-precision floats between 0 and 1.0 uniformly.
Still, this only really seems practical if you have SIMD-code that actually consumes all of these bits. Converting from XMM-registers into RAX registers has quite a bit of slowdown.
I don't think the typical simulation code is actually limited by the speed of RNGs at all.