In older times (around the GCC 2.7 series, I think), "-O9" was documented to be the "future-proof" setting that selects the maximum optimization level, whatever it may be. In practice, GCC never defined anything beyond -O3, but I still use -O9 when I want to make GCC do its best (or worst). Call it force of habit from long experience. (With Clang I use -O3, because if I write -O9 it screams at me.)
(Note that -O3 is usually not that good an idea: aggressive loop unrolling can sometimes decrease performance because of cache issues, for instance. To optimize code properly, you have to run benchmarks and adjust both the code and the compiler flags accordingly, in a well-thought-out feedback cycle. The aggressive optimizations in the blog post are there to make the examples trip UB more clearly.)
Measuring the total runtime is only one of many kinds of timing attacks. The most effective timing attacks do not do that; instead, they measure time taken to access some specific memory areas _after_ the execution of the target code, in order to work out which cache lines were loaded by that code (this applies to both data and code caches, btw). These are "timing attacks" since they ultimately come from a measure of elapsed time, but not necessarily the time taken by the target code itself.
Also, when a thread is sleeping, the OS will schedule other threads, which opens many additional ways for attackers to notice that sleeping started.
A part of constant-time coding is to avoid branches based on secret data, thus branch-free programming techniques apply. However, there is more to it; for instance, you must also avoid any memory access whose address depends on secret data (otherwise, the secret could leak through observation of the cache status).
On the other hand, branches that do not depend on secret data are OK in constant-time code. Typically, when you process a chunk of data, the chunk length is not secret, and there will be a loop whose exit condition really is a conditional jump that depends on the length.
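To make that concrete, here is a minimal sketch of the branch-free idiom (my own illustration, not code from BearSSL): selection between two values is done with a mask derived from the secret bit, so there is no conditional jump for the CPU to predict or for an attacker to time.

    #include <stdint.h>

    /* bit must be 0 or 1; returns x if bit == 1, y if bit == 0.
       The mask is all-ones or all-zeros, so no branch is taken. */
    static inline uint32_t
    ct_select(uint32_t bit, uint32_t x, uint32_t y)
    {
        uint32_t mask = (uint32_t)0 - bit;
        return (x & mask) | (y & ~mask);
    }

    /* Branch-free test: returns 1 if v == 0, 0 otherwise.
       (v | -v) has its top bit set exactly when v != 0. */
    static inline uint32_t
    ct_is_zero(uint32_t v)
    {
        return ((v | ((uint32_t)0 - v)) >> 31) ^ 1;
    }

Note that a compiler is, in principle, free to turn such expressions back into branches; careful constant-time code is always written with an eye on the generated assembly.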
One somewhat ugly truth is that some algorithms cannot be implemented securely in software while retaining good, or even acceptable, performance. As we've seen previously, performance is critical for crypto, because people will choose something faster over the proper, secure alternative, especially when, as with side channels, the insecurity is not an obvious, hard failure.
The modern approach to this issue is to design algorithms specifically for software implementation and avoid entire classes of side channels already in the design of the algorithm. This is one of the noticeable differences between older primitives (NIST/SECG ECC, DSA, RSA, a whole bunch of ciphers) and newer primitives designed for software (EdDSA over sensible curves, X25519, Chacha20 and so on).
Making the overall computation take a fixed amount of time does not plug indirect leaks. In particular, if the computation makes memory accesses at addresses that depend on secret values, then this will populate the memory caches for these addresses, and evict from the caches whatever data was using the same slots. The attacker may work out, after the computation, which previous data elements were evicted from cache, thereby learning secret information. Crucially, the cryptographic computation itself takes a fixed amount of time, yet the leak occurs nonetheless.
This kind of leak still counts as a "timing attack" because the test for cache eviction is based on the time it takes to next access the relevant element (and this occurs _after_ the cryptographic computation, from other code). Notably, this can still be done remotely, possibly from a large distance, thanks to the efficiency of modern networking (this is what makes timing attacks special among side-channel leaks: power analysis, electro-magnetic emissions, sound... require the attacker to be in the physical vicinity of the target system, while timing attacks may be performed from hundreds of miles away). Demonstrations have been made with network access (over ethernet or optic fibre), and, in another kind of setup, when the attacker can run his own code on the same hardware, even with logical isolation (another process, or even in another VM that happens to run on some other cores of the same machine).
Thus, "true" constant-time code will make no memory access at an address that depends on secret data (but the data contents, of course, may be secret). Similarly, it won't make conditional jumps that depend on secret data (because of the cache accesses for loading the code, and also because of the jump prediction cache, both having been successfully exploited in lab demonstrations).
I indeed learned a lot, and still learn a lot, by doing implementations. Doing a proper implementation forces me to consider all aspects; when the code runs properly, I know that I have, by definition, been exposed to all the parts. You cannot get that kind of exhaustiveness from simply reading an article.
However, doing implementations is not at all the same thing as publishing implementations! The first one or two attempts are always flawed in some way; only the third one can hope to be reasonably good. I took care to properly kill and dispose of the corpses of all my learning code.
The trick (and it's a difficult one) is to decide in advance that the code you write to learn will have to be deleted -- and stick to it. Developers have trouble letting go of their creations, in general. If you can maintain that discipline, then there is no problem in "writing your own crypto". But that is a big "if".
I've found that a good motivation for writing learning-only "throwaway" crypto code is to use it as a model for writing attack code; then you don't even have to throw the code away, just publish it with the exploit.
But then, I'm a believer that everyone should learn crypto by breaking it, and clearly not everyone agrees with me.
Thanks for the input! This reminds me of what Amélie Nothomb says: "unlike a lot of writers, I have the decency to toss most of what I write". I don't think it's necessarily bad to publish crypto code; marketing it as secure is another thing. I've marked most of my implementations as "readable" implementations, meaning they are only there for educational purposes.
I wanted a clear situation with regards to laws on cryptographic software distribution and export. With my own server, I can keep everything in Canada, which makes things simpler.
It's hard to get by without all the fancy features in GitLab, but a self-hosted Ruby on Rails stack is overkill for this, so good ol' reliable gitweb it is.
> Honestly, all my internal git commit messages are "...".
Very insightful about how security experts write security code, thanks! All that documentation which you hope to write some time - half of it should have been in your commit messages (that's what they teach on StackOverflow, no?)
I have absolutely nothing to say when it comes to crypto, but as a C dork I found it ... quirky that the encoding/decoding functions in inner.h (https://bearssl.org/gitweb/?p=BearSSL;a=blob;f=src/inner.h;h...) seem to use e.g. uint32_t to get a 32-bit type, while assuming that char is 8 bits (rather than using uint8_t). This seems strange.
I don't think it assumes char is 8 bits; rather, it assumes the values won't be outside the range of an 8-bit number, which should be fine given that it's an octet-oriented protocol?
Using char pointers presumably is to get correct aliasing analysis?
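For the curious, the usual shape of such a decoder (a sketch in the spirit of the inner.h functions, not a verbatim copy) takes a void pointer, reads through unsigned char, and assembles the value with shifts, which sidesteps alignment, endianness, and strict-aliasing issues at once:

    #include <stdint.h>

    /* Decode a 32-bit big-endian integer from a byte buffer.
       Accessing the bytes through unsigned char is always allowed
       by the aliasing rules, and the shifts make the host byte
       order irrelevant. */
    static inline uint32_t
    dec32be(const void *src)
    {
        const unsigned char *buf = src;
        return ((uint32_t)buf[0] << 24)
             | ((uint32_t)buf[1] << 16)
             | ((uint32_t)buf[2] << 8)
             | (uint32_t)buf[3];
    }

Even if char were wider than 8 bits on some exotic platform, an octet-oriented protocol only ever puts values in the 0..255 range in those buffers.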
Actually, C is a horrible language in many respects, but it is also the only language that can achieve any kind of decent compatibility in embedded systems, which is why I used it.
Also, when I say that BearSSL is written in C, that is partly a lie: some of it (especially X.509 certificate decoding and the handling of handshake messages) is done in T0, a Forth-like language that I invented for the task, specifically to have coroutines (the compiler is provided and produces C code). Incidentally, this means that most of that code reads things byte by byte, with relatively few accesses to buffers, hence less potential for buffer overflows.
do you really need coroutines for cert decoding and handling of handshake messages? It's not like you can parallelize these things and do something else in the meantime...
Certificates and handshake messages are nested structures, and I want to process them in a streamed fashion (e.g. I don't buffer a complete certificate; I don't have the RAM for that; BearSSL can decode certificates and messages that are larger than the total RAM it uses). Without coroutines, I would still need some sort of decoder with a stack of states, so I just accepted that and pushed the idea to its logical conclusion.
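To illustrate what the coroutine-less alternative looks like, here is a toy sketch (my own example, not T0 output) of a streamed decoder for a trivial length-prefixed record; even for this tiny format, the parser's position must live in explicit state that survives between input bytes, and real TLS structures nest many levels deep:

    #include <stddef.h>

    /* Toy streamed parser: one length byte, then that many value bytes. */
    typedef struct {
        int state;         /* 0 = expecting length, 1 = reading value */
        size_t remaining;  /* value bytes still to consume */
    } decoder;

    /* Feed one input byte; returns 1 when a full record has been read. */
    static int
    decoder_push(decoder *d, unsigned char byte)
    {
        if (d->state == 0) {
            d->remaining = byte;
            if (d->remaining == 0) {
                return 1;          /* empty record */
            }
            d->state = 1;
            return 0;
        }
        /* here we could process the value byte, e.g. hash it */
        if (--d->remaining == 0) {
            d->state = 0;
            return 1;
        }
        return 0;
    }

With coroutines, the same logic is a plain loop that simply blocks on "give me the next byte"; the compiler (T0, here) keeps track of the position for you.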
Despite the horrors implied by a Forth syntax, this actually made writing the code easier. As a bonus, it turns out that T0-generated code is very compact, so I could pack more behaviour into fewer bytes.
One of the points of the exercise is to provide constant-time implementations that provably do not leak such information. The hash functions, two of the AES implementations, one DES/3DES implementation, and the RSA and elliptic curve implementations in BearSSL have been written to be constant-time. I still have to write a document that explains how such things are done.
It is a combination of "many eyes" and "good documentation". What is needed is a good text that describes the design choices, the rationale, and all the tricky details; and then people who read it and think about it. I'll write and publish such a text within the next few months.
Generally, open source software benefits from having more users. But having a huge number of users makes it more difficult to improve and clean up, because you can't easily deprecate stuff (like SSL 2/3).
Also, having 100% of the internet using OpenSSL makes the impact of a vulnerability in that library huge. Some diversity is probably a good thing.
I appreciate the time and effort that you are putting into it, good luck.