It's essentially the same, except for the API. However, you linked to the refere...

pbsd · on July 15, 2015

'A bit faster' is an understatement. On the benchmarking machine used by the PCG author, you would get a ~2.5x speedup over that implementation, which would put Chacha20 roughly at (amortized) 5 cycles per 32-bit word. If you go Chacha8, which is still cryptographically secure, that becomes 2-3 amortized cycles per word. The flagship PCG generator from the paper, 'PCG XSL RR RR 128', runs at 1.81 nanoseconds per word on the same machine. This converts to ~6 cycles per word. So I am not inclined to believe the claim, right on the frontpage, that Chacha20 is 'fairly slow' .

Unless you're hurting for space---and in a Haswell machine you probably aren't---the case for using a cryptographic generator everywhere is strong. If you're hurting for space, or are on a less desktop-oriented architecture, you probably are not going to like 64-bit integer multiplications and variable rotation counts either.

By the way, you're using the aligned _mm_load_si128 and _mm_store_si128 intrinsics to load and store the block: https://gist.github.com/orlp/32f5d1b631ab092608b1#file-chach.... But there's no guarantee that block is aligned to 16 bytes, so that code may crash in some compiler/platform combination.

orlp · on July 15, 2015

The fastest ChaCha20 implementation available (from http://bench.cr.yp.to/results-stream.html) on a modern Intel processor is 1.2 cycles / byte (cpb) for ChaCha20 and 0.6 cpb for ChaCha8. ChaCha is very fast.

The drawback however, is that these implementation only get that fast because ChaCha is embarrassingly parallel. The fastest implementations can be computing 16-24 blocks in parallel efficiently with AVX2. That's 1.5kB of random data generated at once. And it's only this fast in a closed loop, where all code and memory is hot.

For the above reasons and simplicity I've chosen to not do a parallel SIMD implementation in my gist.

But I agree that 'fairly slow' is not really fair to say, at least not in combination with 'Good' statistical quality (rather than Excellent, because the author found faults in ChaCha2). ChaCha8 is blazing fast, and so far no cryptographer has been able to do a successful attack against it. However, before, ChaCha was mentioned as being 'slow', and I attempted to fix that by emailing the author with my implementation.

-

Interestingly, I was (sadly I never got to finish it before the http://competitions.cr.yp.to/caesar.html deadline) working on an AEAD design similar to ChaCha, that was doing 1.5 cpb authenticated encryption on AVX2: http://www.liacs.nl/~opeters/orlein.pdf. SIMD plus embarrassingly parallel ciphers is a strong combination.

-

Thanks for the warning about the alignment. Because I tacked on the SIMD implementation later, I forgot to add the alignment requirement, I did that now.

pbsd · on July 15, 2015

> The drawback however, is that these implementation only get that fast because ChaCha is embarrassingly parallel.

That is my point exactly. Why would you design, in this day and age, a generator that doesn't vectorize well? It will not take full advantage of the CPU. Even with parallel streams, large integer multiplication and variable-length rotation are SIMD-killers. Regarding the 1.5KB of data, I suspect you can get away with less than that if you specialize to this application, but note that this is still around half the state size of mt19937.