There is a bunch of good advice here, but it misses the most useful principle in my experience, probably because the motivating example is too small in scope:
The way to build reliable software systems is to have multiple independent paths to success.
This is the Erlang "let it crash" strategy restated, but I've also found it embodied in things like the architecture of Google Search, Tandem Computers, Ethereum, RAID 5, the Space Shuttle, etc. Basically, you achieve reliability through redundancy. For any given task, compute the answer multiple times in parallel, ideally in multiple independent ways. If the answers agree, great, you're done. If not, have some consensus mechanism to determine the true answer. If you can't compute the answer in parallel, or you still don't get one back, retry.
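A rough sketch of that loop in Python, just to make the shape concrete (the implementations are hypothetical placeholders supplied by the caller, and a simple majority vote stands in for whatever consensus mechanism you actually use):

    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def compute_redundantly(task, implementations, max_retries=3):
        """Run every implementation in parallel; accept an answer only on majority agreement."""
        for _ in range(max_retries):
            answers = []
            with ThreadPoolExecutor(max_workers=len(implementations)) as pool:
                futures = [pool.submit(impl, task) for impl in implementations]
                for future in futures:
                    try:
                        answers.append(future.result(timeout=5))
                    except Exception:
                        pass  # one failed path is fine; the others may still succeed
            if answers:
                answer, votes = Counter(answers).most_common(1)[0]
                if votes > len(implementations) // 2:  # simple majority as the consensus rule
                    return answer
        raise RuntimeError("no consensus after retries")

Call it with, say, three independently written parsers for the same input and you get back whatever at least two of them agree on, or an error once the retries are exhausted.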
The reason for this is simply math. If you have n different events that must all go right to achieve success, the chance of this happening is x1 * x2 * ... * xn. This product goes to zero very quickly: if you have 20 components connected in series that are all 98% reliable, the chance of success is only about 2/3. If instead you have n different events where any one going right achieves success, the chance of success is 1 - (1 - y1) * (1 - y2) * ... * (1 - yn). This quantity climbs toward 1 quickly as the number of alternate pathways to success goes up. If you have 3 alternatives, each with just an 80% chance of success, but any of the 3 will work, then doing them all in parallel gives better than a 99% chance of success (1 - 0.2^3 ≈ 0.992).
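Spelled out as a back-of-the-envelope check:

    series = 0.98 ** 20                 # 20 components in series, all must work
    parallel = 1 - (1 - 0.80) ** 3      # 3 alternatives in parallel, any one suffices
    print(f"series:   {series:.3f}")    # ~0.667 -- roughly a 2/3 chance of success
    print(f"parallel: {parallel:.3f}")  # ~0.992 -- better than 99%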
This is why complex software systems that must stay up are built with redundancy, replicas, failover, retries, and other similar mechanisms in place. And the presence of those mechanisms usually trumps anything you can do to increase the reliability of individual components, simply because you get diminishing returns to carefulness. You might spend 100x more resources to go from 90% reliability to 99% reliability, but if you can identify a system boundary and correctness check, you can get that 99% reliability simply by having 2 teams each build a subsystem that is 90% reliable and checking that their answers agree.
I disagree somewhat, influenced by the teachings of Nancy Leveson.
In the 1930s, yes, component redundancy was the way to reliability. This worked at the time because components were flaky and technical systems were simple aggregations of components. Today, components themselves are more reliable, but even when they are not, redundancy adds only a little reliability, because there's a new, large source of failure: interactive complexity.
Today's systems are so complicated that many failures stem from insufficient, misunderstood, or ambiguous specifications. These errors happen not because a component failed (all components work exactly as they were intended to); rather, the components' intended interactions produce an unintended result. Failure is an emergent property.
This is why Erlang's OTP focuses on supervisor trees. At each level of the component hierarchy, you have redundancy. Subcomponents themselves may have interactive complexity, but a failure or misspecification in any of the interactions making up that subcomponent simply makes that subcomponent fail. This failure is handled at a higher level by doing something simpler.
And "do something simpler" is actually a core part of this strategy. You're right that "today's systems are so complicated that many failures stem from insufficient, misunderstood, or ambiguous specifications". In most cases, yesterday's system worked just fine, you just can't sell it as a competitive advantage. So build simple, well-understood subsystems as fallbacks to the complex bleeding-edge systems, or even just take the software that's been working for a decade.
In the limit, there is a hard tradeoff between efficiency and reliability.
Failovers, redundancies, and backups are all important for building systems that are resilient in the face of problems, for reasons you've pointed out.
However, failovers, redundancies and backups are inefficient. Solving a problem with 1 thing is always going to be more efficient than solving the same problem with 10 things.
It's interesting to see this tradeoff play out in real life. We see people coalescing around one or two services because that's the most efficient path, and then we see them diversifying across multiple services once bad things happen to the centralised services.
This is a very important point, and often misunderstood at both a business and a societal level. Reliability has a cost. If you optimize all redundancy out of a system, you find that the system becomes brittle, unreliable, and prone to failure. Companies like 3M and Boeing have found that in the pursuit of higher profits, they've lost their focus on quality and suffered the resulting loss of trust and brand damage. The developed world discovered this with COVID: our just-in-time efficiency meant that any hiccup anywhere in the supply chain produced mass shortages of goods.
> In the limit, there is a hard tradeoff between efficiency and reliability.
Yes, but notice that most things in the GP's comment have an exponential impact on reliability (well, on 1 - reliability), so they are often no-brainers as long as they follow that simple model (which stops being true at some point).
Imho, the problem is that it is hard to estimate the trade-offs. Optimizations (not just in computer systems, but in general) are often seen as risk-free, when in reality they are not. More often than not, one will be celebrated for optimization, and rarely for resilience (which gets dismissed as duplicate, useless work).
As always, life is not that simple, and redundant components can interact in harmful ways, correctness checks can create incorrectness, process managers or consensus algorithms can amplify small problems...
Just as every technique in the article can also turn out to reduce your reliability.
> For any given task, compute the answer multiple times in parallel, ideally in multiple independent ways.
Just to be clear, while this particular technique is valid and used in space software, it isn't common at all in Erlang and not part of the "let it crash" principle.
It's the simple, basic reality of statistics: a binomial distribution.
5 independent systems, each with a 90% chance of success, are together mathematically as reliable as one system that is 99.999% reliable.
100 such 90% systems would get you to 100 "9s" of reliability, i.e. 99.99…% with a hundred nines in a row.
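Taking the independence assumption at face value, the arithmetic checks out:

    p_fail = 0.1                      # each system works 90% of the time
    print(f"{1 - p_fail ** 5:.5f}")   # 0.99999 -- five systems give "five nines"
    # With 100 such systems, the chance that all of them fail is 0.1**100,
    # i.e. the combined success probability is a run of one hundred 9s
    # (a float just rounds it to 1.0).
    print(p_fail ** 100)              # ~1e-100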
Except the binomial assumption obviously does not hold because
(a) failures are correlated, not independent, and
(b) many failures happen not at the component level but at the plane where components interact, and regardless of how much redundancy there is at the component level, there is ultimately just one plane at which they finally interact to produce a result.
Actually making 5 completely independent systems would be exceptionally hard. No shared code or team members, no shared hardware... For example, what 5 computing platforms would you use? x86, ARM, RISC-V and...?
Math rarely applies so easily to real life. Talking about "independent" systems is cheap.
That's assuming it's possible at all. How would you transport yourself to work using two independent systems?
It's relatively simple at the organizational level, just expensive (but linearly expensive, while often increasing subcomponent reliability is exponentially expensive!). Just give the same problem statement to two independent teams with two different managers, have a clear output format and success criteria, and let them make all their technical decisions independently.
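In code terms the recipe boils down to a cross-check at the agreed boundary; the two "team" functions below are hypothetical examples of the same spec implemented two different ways:

    import statistics

    def team_a_median(prices):
        return sorted(prices)[len(prices) // 2]  # team A: sort and index

    def team_b_median(prices):
        return statistics.median_high(prices)    # team B: lean on the stdlib

    def median_with_cross_check(prices):
        a, b = team_a_median(prices), team_b_median(prices)
        if a != b:
            raise ValueError(f"implementations disagree: {a!r} vs {b!r}")
        return a

A disagreement doesn't tell you which team is right, but it does tell you not to trust the answer, which is exactly the correctness check at the system boundary.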
Your example of "how do you transport yourself to work using two independent systems" is actually very apropos, because I and many other commuters do exactly that. If the highway is backed up, I bypass it with local roads. If everything is gridlocked, I take public transportation. If public transportation isn't functioning (and it generally takes a natural disaster to knock out all the roads and public transportation, but natural disasters have happened), I stay home and telecommute. Each of these options is less favored than the one before it, but it'll get me to work.
While these are reasonable approaches, I do not think they live up to the mathematical meaning of "independent", and so invalidate the chances calculation.
Your two teams might well both use the same hardware or software component somewhere in the system. This makes the failure probabilities of the two systems not completely independent, for all that you paid two teams and they worked separately. You'll have spent a lot of money, and the results will not be as independent as expected. If they both use x86 Intel and a Meltdown kind of thing happens, your "independent" systems will both fail from the same cause.
The transport analogy works great if you somehow imagine the transportation to be instantaneous, and only the decision to matter. But if you are already on a train and the train is delayed, you are not walking back home and taking the car. You have multiple options for transport, but you do not have a system built of independent components. You are not using the train and the car and the highway and the local roads all simultaneously.
I don't think you understand the requirements for the formula you wrote to be valid. Your examples do not fit it, for all that they are reasonable and useful approaches. Your actual reliability with these approaches falls well below the multiple nines you'd calculate.
> The way to build reliable software systems is to have multiple independent paths to success.
That's a heuristic that might work sometimes.
If you really want to build reliable software systems, then at least prove them correct. There are some tools and methodologies that can help you with this. And of course even a proof isn't everything since your assumptions can still be wrong (but in more subtle ways).
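Full proofs usually mean tools like TLA+, Coq, or Lean. A much lighter cousin of the same idea is property-based testing, which at least states the property explicitly and searches for counterexamples; a sketch using the third-party hypothesis library (the function under test is made up):

    from hypothesis import given, strategies as st

    def dedupe_keep_order(xs):
        seen, out = set(), []
        for x in xs:
            if x not in seen:
                seen.add(x)
                out.append(x)
        return out

    @given(st.lists(st.integers()))
    def test_dedupe_is_idempotent(xs):
        once = dedupe_keep_order(xs)
        assert dedupe_keep_order(once) == once  # applying it twice changes nothing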