This is so, so cool. Basically the holy grail for a distributed systems engineer. Like the author, I've avidly consumed every Jepsen report, but the effort of actually implementing Jepsen tests for my systems always seemed too high.
Very excited to see this technology democratized and made available to more companies!
This is quickly becoming my favorite technical blog. Congrats Richie and Ryan. I didn't fully understand Antithesis the first time I ran into it; now it makes sense.
Question from another field that does a lot of simulation - why is deterministic simulation testing, rather than something stochastic, asserted to be the gold standard?
Concurrent/distributed system bugs can be really finicky because they may depend on subtle timing conditions to manifest. So you might see a bug once, then try to re-run the test using the "same" inputs, and the bug doesn't appear a second time. This might be because e.g. threads aren't scheduled the same way as before, so some 1-microsecond-wide window of vulnerability for a race condition was missed. If you can't reliably reproduce the bug, it's much harder to study and fix.
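To make that concrete, here's a toy sketch of my own (nothing to do with Antithesis's internals): even a two-thread counter can behave this way. Whether the lost update manifests depends entirely on where the scheduler happens to preempt, so running the same program twice can give different answers.

    import threading

    counter = 0

    def increment():
        global counter
        for _ in range(100_000):
            # read-modify-write is not atomic: another thread can be
            # scheduled between the read and the write, losing an update,
            # but only if preemption lands in that tiny window
            counter += 1

    t1 = threading.Thread(target=increment)
    t2 = threading.Thread(target=increment)
    t1.start(); t2.start()
    t1.join(); t2.join()

    # often prints 200000, sometimes less, depending on the interpreter
    # and scheduler -- and rerunning may never show the loss again
    print(counter)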
Determinism lets you perfectly reproduce the bug as many times as you want. Perfectly as in: exactly the same thread+process scheduling, exact same memory and disk access times, exact same network packet transit times and orderings... exact same everything. Then once you've gotten back to the bug, you can rewind time to do things like explore counterfactual scenarios by varying the random seed from that moment on.
We do have randomness of course, otherwise it wouldn't be a very good fuzzer. But we save all the seeds, so it's a controlled, reproducible randomness.
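A minimal sketch of that idea (my own illustration in Python, not Antithesis's actual API): funnel every random choice through a recorded seed, and any failure becomes trivially replayable.

    import random

    def run_test(seed):
        rng = random.Random(seed)    # every random choice flows from the seed
        # ... drive the system under test via rng.choice / rng.random ...
        return rng.random() < 0.001  # stand-in for "bug found"

    # fuzz with many fresh seeds, recording each one we try
    for seed in range(10_000):
        if run_test(seed):
            print("bug found with seed", seed)
            # replay is now trivial: the same seed takes the
            # exact same branches every time
            assert run_test(seed)
            break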
From yet another field where deterministic simulation is often a goal (robotics), the ideal is a simulation test system that is deterministic for a given initialization (e.g. a random seed) so that for an initialization that causes some error to occur, you can reliably reproduce and resolve the error. Of course, you then need to run that system with a range of initializations to have confidence that you didn't just get lucky with the initialization.
In practice, this can be quite hard to do in the presence of uncontrolled non-determinism (e.g. thread/process/GPU scheduling)* and it is often more pragmatic to invest the time in better stochastic testing and logging than in deterministic reproduction.
* Yes, these can be made closer to deterministic. But doing so often comes with reduced performance, such that the system you are testing would no longer match the system being deployed, defeating much of the purpose of the test in the first place.
> Antithesis has created the holy grail for testing distributed systems: a bespoke hypervisor that deterministically simulates an entire set of Docker containers and injects faults, created by the same people who made FoundationDB.
I remember the Antithesis founder was having a hard time explaining what exactly they did.
One of the cool tricks we can use is that since the testing is all fully deterministic, once we find an interesting point in a test run - even if it is “deep” into the run time-wise - our system can start many new branches of test runs from that moment, or from moments just prior. So it is much more efficient than having to re-do the work to get to that rare interesting moment for each new branch.
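Roughly the idea, as a toy sketch (my own model in Python; Antithesis does this at the hypervisor level with VM snapshots, not in-process state copies): pay for the long prefix once, snapshot, then fan out branches with fresh seeds from that moment.

    import copy
    import random

    def step(state, rng):
        # one deterministic step; all nondeterminism is funneled through rng
        state["x"] += rng.choice([-1, 1])

    def run_until_interesting(seed, max_steps=1_000_000):
        state, rng = {"x": 0}, random.Random(seed)
        for _ in range(max_steps):
            step(state, rng)
            if abs(state["x"]) > 50:   # some rare, "interesting" condition
                return state           # reached deep into the run
        return None

    # reach the rare moment exactly once...
    snapshot = run_until_interesting(seed=42)
    assert snapshot is not None

    # ...then branch many runs from that moment, each with fresh
    # randomness from there on, never re-running the long prefix
    for branch_seed in range(100):
        state, rng = copy.deepcopy(snapshot), random.Random(branch_seed)
        for _ in range(1_000):
            step(state, rng)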
This article and previous Antithesis ones mention testing distributed systems and, as someone who works at a company specialized in exactly this, I am excited. However, I wonder if Antithesis could help with the nondeterministic failures I encounter in the unit and integration tests of my Jasmine and TestCafe suites. Most of the time, these are quite hard to reproduce - if at all possible - and a significant portion of failures is caused by genuine application bugs. I wish there was a tool that helped with these.
I think I just followed the official recommendations I found (which are probably stale now). I'll update it to r5, but it doesn't really matter. The price difference between the two is like 5%, but hardware only ends up representing a tiny fraction of Kafka's cost at scale (the real cost comes from EBS and inter-zone networking).
I could make the hardware free for Kafka in the comparison, and WarpStream would still come out significantly more cost effective. Cloud networking is really expensive.
https://news.ycombinator.com/item?id=39356920