It's essentially the same, except for the API. However, you linked to the reference implementation. My implementation also contains a SIMD implementation, if your CPU supports it. I assume that libsodium has a similar implementation in its codebase somewhere too.
My SIMD implementation can be made to run a bit faster, but this would mean computing more blocks in parallel, and thus increase the state size of the RNG even more. I chose to only compute one block at a time. An optimized stream cipher implementation will have all the implementations with different parellel block sizes embedded, to achieve optimal speed for long messages.
---
ChaCha20 is a stronger more modern variation on Salsa20. But the differences are minor.
But, to be precise, libsodium prefers XSalsa20 to ChaCha20. The core feature about XSalsa20 is that it supports a 192-bit nonce instead of the 64-bit nonce of ChaCha20. This means that it is safe to use a randomly generated nonce, while randomly generating a 64-bit nonce would result in collisions eventually. I believe this is the reason they chose XSalsa20.
XSalsa20 is a simple variation on Salsa20 that does a bit more initialization to turn the normal 64-bit nonce into 192-bit. I assume the same construction can be applied to ChaCha20, but since djb (the author of both Salsa, XSalsa and ChaCha) hasn't done it, I think that libsodium went with the safe approach and just used XSalsa20.
'A bit faster' is an understatement. On the benchmarking machine used by the PCG author, you would get a ~2.5x speedup over that implementation, which would put Chacha20 roughly at (amortized) 5 cycles per 32-bit word. If you go Chacha8, which is still cryptographically secure, that becomes 2-3 amortized cycles per word. The flagship PCG generator from the paper, 'PCG XSL RR RR 128', runs at 1.81 nanoseconds per word on the same machine. This converts to ~6 cycles per word. So I am not inclined to believe the claim, right on the frontpage, that Chacha20 is 'fairly slow' .
Unless you're hurting for space---and in a Haswell machine you probably aren't---the case for using a cryptographic generator everywhere is strong. If you're hurting for space, or are on a less desktop-oriented architecture, you probably are not going to like 64-bit integer multiplications and variable rotation counts either.
By the way, you're using the aligned _mm_load_si128 and _mm_store_si128 intrinsics to load and store the block: https://gist.github.com/orlp/32f5d1b631ab092608b1#file-chach.... But there's no guarantee that block is aligned to 16 bytes, so that code may crash in some compiler/platform combination.
The fastest ChaCha20 implementation available (from http://bench.cr.yp.to/results-stream.html) on a modern Intel processor is 1.2 cycles / byte (cpb) for ChaCha20 and 0.6 cpb for ChaCha8. ChaCha is very fast.
The drawback however, is that these implementation only get that fast because ChaCha is embarrassingly parallel. The fastest implementations can be computing 16-24 blocks in parallel efficiently with AVX2. That's 1.5kB of random data generated at once. And it's only this fast in a closed loop, where all code and memory is hot.
For the above reasons and simplicity I've chosen to not do a parallel SIMD implementation in my gist.
But I agree that 'fairly slow' is not really fair to say, at least not in combination with 'Good' statistical quality (rather than Excellent, because the author found faults in ChaCha2). ChaCha8 is blazing fast, and so far no cryptographer has been able to do a successful attack against it. However, before, ChaCha was mentioned as being 'slow', and I attempted to fix that by emailing the author with my implementation.
-
Interestingly, I was (sadly I never got to finish it before the http://competitions.cr.yp.to/caesar.html deadline) working on an AEAD design similar to ChaCha, that was doing 1.5 cpb authenticated encryption on AVX2: http://www.liacs.nl/~opeters/orlein.pdf. SIMD plus embarrassingly parallel ciphers is a strong combination.
-
Thanks for the warning about the alignment. Because I tacked on the SIMD implementation later, I forgot to add the alignment requirement, I did that now.
> The drawback however, is that these implementation only get that fast because ChaCha is embarrassingly parallel.
That is my point exactly. Why would you design, in this day and age, a generator that doesn't vectorize well? It will not take full advantage of the CPU. Even with parallel streams, large integer multiplication and variable-length rotation are SIMD-killers. Regarding the 1.5KB of data, I suspect you can get away with less than that if you specialize to this application, but note that this is still around half the state size of mt19937.
My SIMD implementation can be made to run a bit faster, but this would mean computing more blocks in parallel, and thus increase the state size of the RNG even more. I chose to only compute one block at a time. An optimized stream cipher implementation will have all the implementations with different parellel block sizes embedded, to achieve optimal speed for long messages.
---
ChaCha20 is a stronger more modern variation on Salsa20. But the differences are minor.
But, to be precise, libsodium prefers XSalsa20 to ChaCha20. The core feature about XSalsa20 is that it supports a 192-bit nonce instead of the 64-bit nonce of ChaCha20. This means that it is safe to use a randomly generated nonce, while randomly generating a 64-bit nonce would result in collisions eventually. I believe this is the reason they chose XSalsa20.
XSalsa20 is a simple variation on Salsa20 that does a bit more initialization to turn the normal 64-bit nonce into 192-bit. I assume the same construction can be applied to ChaCha20, but since djb (the author of both Salsa, XSalsa and ChaCha) hasn't done it, I think that libsodium went with the safe approach and just used XSalsa20.