Salsa20 design [pdf]

oofabz · on Aug 6, 2016

For those who don't know, ChaCha20, a descendant of Salsa20, has become one of the most common OpenSSL ciphers over the last few years. When I ssh to my servers, it uses ChaCha20 by default.

And most of you probably know this, but just in case you don't: the author of this paper, djb, is one of the most competent and trusted cryptographers ever.

chjj · on Aug 6, 2016

> When I ssh to my servers, it uses ChaCha20 by default.

OpenSSH, when using ChaCha20, also uses Poly1305 (also designed by djb) as its MAC for the AEAD.

Also, some good news: ChaCha20+Poly1305 will soon be the default cipher+mac for the bitcoin p2p network.
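For anyone who hasn't looked at Poly1305 itself, here is a minimal Python sketch of the one-time MAC as described in RFC 7539, section 2.5. The name `poly1305_mac` is just for illustration, and Python's big integers are not constant-time, so treat this as a way to see the arithmetic rather than something to deploy.

```python
# Illustrative Poly1305 (RFC 7539, section 2.5). Not constant-time; do not deploy.
P = (1 << 130) - 5  # the prime 2^130 - 5 that gives Poly1305 its name
R_CLAMP = 0x0FFFFFFC0FFFFFFC0FFFFFFC0FFFFFFF  # clears the bits the spec requires to be zero


def poly1305_mac(msg: bytes, key: bytes) -> bytes:
    """Return the 16-byte tag of msg under a 32-byte one-time key (r || s)."""
    if len(key) != 32:
        raise ValueError("Poly1305 needs a 32-byte one-time key")
    r = int.from_bytes(key[:16], "little") & R_CLAMP
    s = int.from_bytes(key[16:], "little")
    acc = 0
    for i in range(0, len(msg), 16):
        # Each chunk is read as a little-endian number with a 0x01 byte appended.
        n = int.from_bytes(msg[i:i + 16] + b"\x01", "little")
        acc = ((acc + n) * r) % P
    return ((acc + s) % (1 << 128)).to_bytes(16, "little")


# Test vector from RFC 7539, section 2.5.2.
key = bytes.fromhex(
    "85d6be7857556d337f4452fe42d506a8"
    "0103808afb0db2fd4abff6af4149f51b"
)
tag = poly1305_mac(b"Cryptographic Forum Research Group", key)
print(tag.hex())
```

The key must never be reused for a second message. In the RFC 7539 AEAD that is handled by deriving a fresh Poly1305 key from the ChaCha20 keystream for every nonce.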

userbinator · on Aug 6, 2016

Do you mean OpenSSH? I think AES is still most common for SSL/TLS.

SpikeGronim · on Aug 6, 2016

ChaCha20 is implemented in hardware on many mobile platforms. It's often a preferred TLS cipher on Android. AES is common in hardware as well.

aseipp · on Aug 6, 2016

What mobile platforms implement ChaCha20? Can you point to any? I'm not aware of any widely available handset that claims to do this.

In fact, the whole reason ChaCha20/Poly1305 was even added to the TLS profile in the first place is because Google originally added it to their own OpenSSL fork, BoringSSL, as well as Android, and it was later proposed for inclusion in the standard. Google wanted a cipher that performed better in software than AES did - because the vast majority of all mobile platforms and handsets do not support AES acceleration either (ARMv8 does introduce cryptographic extensions for the SHA family and AES family, but 99% of handsets aren't those. Also I'm not sure if ARMv8 has a PCLMUL-equivalent for fast GCM computation, which is also a critical component of that scheme.)

That costs energy and battery life, because AES is very difficult to implement efficiently and securely in software, and even the fastest, most secure implementations are relatively slow. In contrast, ChaCha20 is incredibly simple to implement securely in software; even an efficient version is well within the grasp of mortals (I've managed to do it myself).

That's why your Android phone uses ChaCha20 - not because it has hardware acceleration, but because it's fast in spite of not having it.

I'd be interested to know if any actual hardware implements this in the wild. Generally, a combination of AES-256 with GCM for systems with hardware acceleration, coupled with ChaCha20/Poly1305 as a fallback software method, seems to be the way people are going. And ChaCha20/Poly1305, with enough effort, can get very close to rivaling AES performance in hardware on a contemporary x86 machine (ignoring actual ASICs and endpoint devices with hardware offload). For non-hardware AES impls, ChaCha should absolutely crush it in terms of performance.

wahern · on Aug 6, 2016

It's actually the opposite AFAIK. One of the selection criteria for the Advanced Encryption Standard (AES) was cheap hardware implementations, and it's one reason why Rijndael was chosen over some of the stronger ciphers.

DJB has criticized the selection criteria for both AES and SHA3 as being too focused on hardware efficiency. In his opinion it was much more important for software implementations to be simple and efficient. His algorithms tend to be elegant in software but complex in hardware, pretty much guaranteeing his candidates would never be chosen.

I'm not an EE so feel free to correct me, but I closely followed the standards process both times and that's my recollection of things.

aseipp · on Aug 6, 2016

> It's actually the opposite AFAIK. One of the selection criteria for the Advanced Encryption Standard (AES) was cheap hardware implementations, and it's one reason why Rijndael was chosen over some of the stronger ciphers.

Oh, I was aware of that bit (vaguely; to be fair I was a child during the AES competition, so I only remember a small bit of the history), I just meant AES is a bit slow in software relative to ChaCha today, is all, which I could have clarified.

EDIT: I think I realized now what you meant. When I said ChaCha20/Poly1305 could, with effort, rival AES-256 in hardware in the last paragraph of my post, what I meant was: a software version of ChaCha20 can get very close to a hardware version of AES, providing you put in a lot of effort.

I can see how that sentence is a mis-parse, sorry about that.

> DJB has criticized the selection criteria for both AES and SHA3 as being too focused on hardware efficiency. In his opinion it was much more important for software implementations to be simple and efficient. His algorithms tend to be elegant in software but complex in hardware, pretty much guaranteeing his candidates would never be chosen.

Yes, this is the basic impression I've gotten as well from all his work - to be fair, software implementations are much more agile and easy to deploy, so I think putting some focus on this is a good thing.

I am also not an EE, but I've heard similar things of this nature before (e.g. that ChaCha/Poly would be much more expensive in hardware compared to AES, which is truly a con, not a pro). I'd be interested if any actual EEs would chime in here.

But yes, given all that, I think AES-GCM + ChaCha/Poly1305 is a good pair that should cover most of your bases for an AEAD, for fast hardware and software implementations.

Gibbon1 · on Aug 6, 2016

Not sure about ChaCha, but I implemented Salsa20 on a microcontroller. It looked to me like you could generate a mechanical proof that it's 'secure', i.e., doesn't have a hole in the design, and also that the microprocessor isn't going to expose you to an oddball timing attack. The adds, xors and rotations ought to be single cycle, and the code paths never change based on any of the results.
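To make the "adds, xors and rotations" point concrete, here is a small Python sketch of the Salsa20 quarterround as given in the Salsa20 specification. The helper names are mine. Python itself gives no timing guarantees, and whether a rotation is single-cycle depends on the core, so this only shows the structure: no table lookups and no branches on data.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits (n is a fixed constant, never secret)."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def salsa20_quarterround(y0: int, y1: int, y2: int, y3: int):
    """Salsa20 quarterround: only 32-bit addition, xor and fixed-distance rotation."""
    z1 = y1 ^ rotl32((y0 + y3) & MASK32, 7)
    z2 = y2 ^ rotl32((z1 + y0) & MASK32, 9)
    z3 = y3 ^ rotl32((z2 + z1) & MASK32, 13)
    z0 = y0 ^ rotl32((z3 + z2) & MASK32, 18)
    return z0, z1, z2, z3


# Example input from the Salsa20 specification.
result = salsa20_quarterround(0x00000001, 0x00000000, 0x00000000, 0x00000000)
print([f"{w:08x}" for w in result])
```

The same sequence of operations runs whatever the inputs are, which is what makes the timing argument plausible on a core where those three operations take a fixed number of cycles.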

conradev · on Aug 6, 2016

Is ChaCha20 actually implemented in hardware on any platforms? I was under the impression that the algorithm itself is just really really fast in software (especially so with SIMD).

I implemented ChaCha20 in AArch64 assembly, and it was possible to encrypt/decrypt 6 blocks at once.

JoachimS · on Aug 6, 2016

The Cryptech project uses ChaCha as CSPRNG in our TRNG. We decided on ChaCha because of its performance and good security margin. I know of at least one more project that uses our ChaCha core.

https://cryptech.is/

ChaCha can efficiently be implemented in HW, esp in FPGAs that support carry chains, which basically means most FPGAs.

It is somewhat hard to compare size and speed since both ChaCha and AES are so scalable. In ChaCha there are many places where you can trade operator reuse against performance. But the fundamental operator size is 64-bits.

AES in comparison works on bytes and you can go from a single S-box (implemented as a table, as logic, as part of a T-box etc) that is reused in the datapath as well as key expansion all the way to a fully pipelined (10-14 rounds) humongous implementation. Very flexible and easy to adapt to the system requirements. One additional thing to note with AES is that for many cipher modes, the decryption functionality can be removed.

But with all this said: if I compare my implementation of AES (that includes decryption) with my implementation of ChaCha20, I get about 4x better performance with ChaCha with fairly close to the same number of resources.

https://github.com/secworks/chacha https://github.com/secworks/aes

The ChaCha core requires more registers, esp for the API. This is due to the bigger block size (512 vs 128 bits).

I like ChaCha in HW and think it's a good choice. I'm currently working on a ChaCha20-Poly1305 core compatible with RFC7539 to make it easier for HW projects to use good AEAD ciphers.

https://tools.ietf.org/html/rfc7539

joveian · on Aug 6, 2016

Thanks for the perspective. One small correction/clarification: ChaCha operates on pairs of 32-bit words at a time, not 64 bits, which makes it nice for 32-bit only systems in software. I really wish ChaCha20/Poly1305 was included in benchmarks for the CAESAR AEAD contest since my understanding is that it would do a little better than NORX (at least in software and it would be interesting to see how it compares in hardware), which is generally the fastest of the secure non-AES options (e.g. disqualifying MORUS due to the BRUTUS-identified adaptive chosen plaintext issue).

For those wondering why this came up now, the third round CAESAR candidates will be announced any day now. DJB's choices in Salsa20/ChaCha are still looking very good.

The ability to do relatively efficient masking/blinding in LRX algorithms is a major advantage at least, but with NORX you need 64-bit operations to get a 256-bit key, which is frustrating. I wonder if NORX32-f could be used to make a Salsa20/ChaCha style stream cipher where you operate on block size data (say use the pseudo-addition to incorporate the start state).

JoachimS · on Aug 7, 2016

Agreed that having ChaCha20-Poly1305 in the benchmarks would be good. RFC 7539 has been published and there are already several applications using this combination (as has been mentioned).

Any winning algorithm(s) from CAESAR will compete with ChaCha20-Poly1305 and should be chosen to provide some clear advantage: better performance, agility, scalability, or security (including side-channel leakage and other attacks on implementations), for example.

Really looking forward to seeing the round three announcement.

JoachimS · on Aug 7, 2016

Sorry, the brain mistyped 32 as 64. Thanks for pointing it out.

mankash666 · on Aug 6, 2016

This is inaccurate. AES is a block cipher, whereas salsa/cha-cha are stream ciphers. Block ciphers are easy to accelerate in hardware, as they act on "blocks" of data at a time, whereas streams almost go byte by byte

dfox · on Aug 6, 2016

A typical stream cipher produces a stream of bits (not even bytes) and is often described in a manner that can be readily converted into hardware; most such ciphers are also non-trivial to implement efficiently in software.

Stream ciphers done in software that are actually somewhat widely used are either based on iterating some block-cipher-like primitive (which may be purpose-designed, as in Salsa/ChaCha) or are related to RC4.

IIRC the fact that you can derive a stream cipher by iterating essentially any cryptographic primitive (e.g. a hash function) was one of the arguments used by DJB in his court case against the US.
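A rough Python sketch of that idea, using SHA-256 from the standard library in a counter construction. This is a toy to show the principle, with names (`hash_keystream`, `xor_stream`) and input layout I made up; it is not a vetted cipher and not the construction from the court case.

```python
import hashlib


def hash_keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Keystream block i is SHA-256(key || nonce || i); blocks are concatenated."""
    if len(key) != 32 or len(nonce) != 12:
        # Fixed lengths keep the key || nonce || counter encoding unambiguous.
        raise ValueError("this sketch assumes a 32-byte key and a 12-byte nonce")
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hashlib.sha256(key + nonce + counter.to_bytes(8, "little")).digest())
        counter += 1
    return bytes(out[:length])


def xor_stream(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt (the same operation) by XOR with the keystream."""
    keystream = hash_keystream(key, nonce, len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))


key = bytes(range(32))
nonce = bytes(12)
ciphertext = xor_stream(key, nonce, b"attack at dawn")
recovered = xor_stream(key, nonce, ciphertext)
print(ciphertext.hex(), recovered)
```

As with any stream cipher, reusing a (key, nonce) pair for two messages leaks the XOR of the plaintexts, and there is no authentication here at all.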

JoachimS · on Aug 7, 2016

What do you mean by typical stream ciphers? AFAIK the most common stream ciphers are A5/1 (and A5/2) used in GSM, Snow3G used in 3G and LTE, E0 used in Bluetooth and RC4 for WPA in Wifi.

Of these A5/1 generates bursts of 114 bits, E0 generates two bits at a time, Snow3G generates 32-bit words and RC4 generates bytes.

Implementing A5/1 in SW is not easy, but Snow3G can be efficiently implemented in SW. For RC4 there are many high performance implementations in SW.

I do agree, though, that A5/1, E0 and Snow3G are designed to be efficiently implemented in HW.

Besides these algorithms, block ciphers in stream cipher modes (esp CTR) are used a lot. KASUMI in 3G, LTE and AES in IEEE 802.15.4 (CCM mode) and WPA2 for example.

dfox · on Aug 7, 2016

A5/1 is probably a perfect example of what I had in mind, as it generates output one bit at a time and its output has quite a large period; the fact that in GSM it's used to generate a pair of 114-bit keystreams is somewhat irrelevant to that. All three ciphers in the eSTREAM hardware profile are specified in the same way (although all of them are designed in a way that allows for more output bits to be computed in parallel).
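To illustrate what "one bit at a time" looks like, here is a toy 16-bit Fibonacci LFSR in Python, using the common textbook taps (16, 14, 13, 11) and seed 0xACE1. It is not A5/1 (which combines three irregularly clocked LFSRs), and a single LFSR is linear and trivially breakable; the function names are only for this sketch.

```python
def lfsr16_step(state: int) -> int:
    """Clock the register once: xor the tapped bits, shift right, feed back at the top."""
    feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (feedback << 15)


def lfsr16_bits(state: int, count: int):
    """Yield `count` keystream bits, one per clock."""
    if not 0 < state < (1 << 16):
        raise ValueError("state must be a non-zero 16-bit value")
    for _ in range(count):
        yield state & 1
        state = lfsr16_step(state)


def lfsr16_period(start: int) -> int:
    """Count clocks until the register returns to its starting state."""
    state, steps = lfsr16_step(start), 1
    while state != start:
        state = lfsr16_step(state)
        steps += 1
    return steps


first_bits = list(lfsr16_bits(0xACE1, 16))
print(first_bits, lfsr16_period(0xACE1))
```

In hardware the step is a handful of XOR gates and a shift register, which is why designs specified this way map so directly onto silicon, and why software has to work harder to produce many bits per instruction.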

JoachimS · on Aug 6, 2016

ChaCha generates keystream blocks of 512 bits. So a comparison to block ciphers in CTR mode is fairly correct.
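A minimal Python sketch of that, following the block function in RFC 7539 (section 2.3): sixteen 32-bit words in, twenty rounds, one 512-bit keystream block out per counter value. The function names and the `chacha20_xor` wrapper are mine, and pure Python is neither fast nor constant-time, so this is only meant to show the counter-mode shape.

```python
import struct

MASK32 = 0xFFFFFFFF
SIGMA = struct.unpack("<4I", b"expand 32-byte k")  # the four constant words


def rotl32(x: int, n: int) -> int:
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(s: list, a: int, b: int, c: int, d: int) -> None:
    """ChaCha quarter round on four 32-bit words of the state, in place."""
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 7)


def chacha20_block(key: bytes, counter: int, nonce: bytes) -> bytes:
    """Return one 64-byte (512-bit) keystream block."""
    if len(key) != 32 or len(nonce) != 12:
        raise ValueError("expected a 32-byte key and a 12-byte nonce")
    init = [*SIGMA, *struct.unpack("<8I", key), counter & MASK32,
            *struct.unpack("<3I", nonce)]
    s = init[:]
    for _ in range(10):  # 10 double rounds = 20 rounds
        quarter_round(s, 0, 4, 8, 12)   # column rounds
        quarter_round(s, 1, 5, 9, 13)
        quarter_round(s, 2, 6, 10, 14)
        quarter_round(s, 3, 7, 11, 15)
        quarter_round(s, 0, 5, 10, 15)  # diagonal rounds
        quarter_round(s, 1, 6, 11, 12)
        quarter_round(s, 2, 7, 8, 13)
        quarter_round(s, 3, 4, 9, 14)
    return struct.pack("<16I", *((x + y) & MASK32 for x, y in zip(s, init)))


def chacha20_xor(key: bytes, nonce: bytes, data: bytes, initial_counter: int = 1) -> bytes:
    """XOR data with successive keystream blocks, like a block cipher in CTR mode."""
    out = bytearray()
    for offset in range(0, len(data), 64):
        block = chacha20_block(key, initial_counter + offset // 64, nonce)
        out.extend(d ^ k for d, k in zip(data[offset:offset + 64], block))
    return bytes(out)


# Inputs from the RFC 7539 block-function example (section 2.3.2).
key = bytes(range(32))
nonce = bytes.fromhex("000000090000004a00000000")
block = chacha20_block(key, 1, nonce)
print(len(block) * 8, block[:16].hex())
```

Because each block depends only on (key, counter, nonce), blocks can be computed independently and in parallel, which is the same property that makes CTR mode attractive.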

mankash666 · on Aug 6, 2016

In fact, I'd go as far as saying that stream ciphers can't be accelerated as much as block ciphers can, even with GPGPU techniques.

JoachimS · on Aug 7, 2016

I respectfully claim that statement is wrong. Some of the most commonly used stream ciphers, A5/1, E0, Snow3G are explicitly designed to be efficiently implemented in HW.

Further, if you look at eSTREAM, you have the profile two algorithms, which can be very efficiently implemented in HW. And to be honest, the profile one algorithms can also be efficiently implemented in HW. I have implemented them all in HW and get good performance.

The stream cipher HC-128/256 for example is very fast in SW. But in HW I can parallelize the state read and updates in ways you can't do in SW due to lack of multiple read and write ports. Doing this you get multiple Gbps performance in HW even with low clock frequency.

https://en.wikipedia.org/wiki/ESTREAM

If you look at the stream cipher RC4, it was not designed for HW implementation. But in HW I can implement RC4 to do three reads and two updates in parallel and reach 1 cycle/byte. In a low cost FPGA I reach 500 Mbps performance, which is pretty ok. Not that I'm promoting the use of RC4. My implementation was just an experiment to see if it was possible to do such a parallel implementation. Oh, and it is not debugged so don't use it anyway. ;-)

https://github.com/secworks/rc4
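For reference, the "three reads and two updates" per output byte are visible in a plain software version of RC4. This is a straightforward Python sketch of the textbook algorithm (not the HW core linked above), shown only to make the memory-access pattern visible; RC4 is considered broken and should not be used.

```python
def rc4_keystream(key: bytes, length: int) -> bytes:
    """Generate `length` RC4 keystream bytes for the given key."""
    if not 1 <= len(key) <= 256:
        raise ValueError("RC4 key must be 1 to 256 bytes")

    # Key scheduling: permute the 256-entry state table under the key.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) & 0xFF
        S[i], S[j] = S[j], S[i]

    # Output generation: per byte, read S[i] and S[j], write both back swapped,
    # then read S[(S[i] + S[j]) mod 256] as the output byte.
    out = bytearray()
    i = j = 0
    for _ in range(length):
        i = (i + 1) & 0xFF
        j = (j + S[i]) & 0xFF
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) & 0xFF])
    return bytes(out)


plaintext = b"Plaintext"
keystream = rc4_keystream(b"Key", len(plaintext))
ciphertext = bytes(p ^ k for p, k in zip(plaintext, keystream))
print(ciphertext.hex())
```

In software those accesses are serialized through one memory port, and each depends on the previous one; with a multi-ported register file in hardware they can overlap, which is where the 1 cycle/byte figure comes from.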

whorleater · on Aug 6, 2016

It's sorta strange seeing this here today, since I'm not aware of anything happening with Salsa20 lately. But for people that are interested, the Petya ransomware that was running around a couple months ago used a borked version of Salsa20[1], returning 16-bit values instead of 32-bit values, which led people to break it [2][3]. Moral of the story: don't roll your own crypto, even when you're a malware author.

[1]: http://blog.checkpoint.com/2016/04/11/decrypting-the-petya-r...

[2]: https://github.com/leo-stone/hack-petya

[3]: https://0xec.blogspot.com/2016/04/reversing-petya-ransomware...

lucb1e · on Aug 17, 2016

I submitted this because I found it an interesting read. The article is not written very formally and confirms many thoughts I had myself but which I've not seen in writing before.

I'm slightly (but happily) surprised this many others found it a nice read too.