There is a bunch of good advice here, but it misses the most useful principle in my experience, probably because the motivating example is too small in scope:
The way to build reliable software systems is to have multiple independent paths to success.
This is the Erlang "let it crash" strategy restated, but I've also found it embodied in things like the architecture of Google Search, Tandem Computers, Ethereum, RAID 5, the Space Shuttle, etc. Basically, you achieve reliability through redundancy. For any given task, compute the answer multiple times in parallel, ideally in multiple independent ways. If the answers agree, great, you're done. If not, have some consensus mechanism to determine the true answer. If you can't compute the answer in parallel, or you still don't get one back, retry.
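A rough sketch of that loop in Python, just to make the shape concrete (the implementations are hypothetical placeholders supplied by the caller, and a simple majority vote stands in for whatever consensus mechanism you actually use):

    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def compute_redundantly(task, implementations, max_retries=3):
        """Run every implementation in parallel; accept an answer only on majority agreement."""
        for _ in range(max_retries):
            answers = []
            with ThreadPoolExecutor(max_workers=len(implementations)) as pool:
                futures = [pool.submit(impl, task) for impl in implementations]
                for future in futures:
                    try:
                        answers.append(future.result(timeout=5))
                    except Exception:
                        pass  # one failed path is fine; the others may still succeed
            if answers:
                answer, votes = Counter(answers).most_common(1)[0]
                if votes > len(implementations) // 2:  # simple majority as the consensus rule
                    return answer
        raise RuntimeError("no consensus after retries")

Call it with, say, three independently written parsers for the same input and you get back whatever at least two of them agree on, or an error once the retries are exhausted.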
The reason for this is simply math. If you have n different events that must all go right to achieve success, the chance of this happening is x1 * x2 * ... * xn. This product goes to zero very quickly: if you have 20 components connected in series that are all 98% reliable, the chance of success is only about 2/3. If instead you have n different events where any one going right achieves success, the chance of success is 1 - (1 - y1) * (1 - y2) * ... * (1 - yn). This quantity climbs toward 1 quickly as the number of alternate pathways to success goes up. If you have 3 alternatives, each with just an 80% chance of success, but any of the 3 will work, then doing them all in parallel gives better than a 99% chance of success (1 - 0.2^3 ≈ 0.992).
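Spelled out as a back-of-the-envelope check:

    series = 0.98 ** 20                 # 20 components in series, all must work
    parallel = 1 - (1 - 0.80) ** 3      # 3 alternatives in parallel, any one suffices
    print(f"series:   {series:.3f}")    # ~0.667 -- roughly a 2/3 chance of success
    print(f"parallel: {parallel:.3f}")  # ~0.992 -- better than 99%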
This is why complex software systems that must stay up are built with redundancy, replicas, failover, retries, and other similar mechanisms in place. And the presence of those mechanisms usually trumps anything you can do to increase the reliability of individual components, simply because you get diminishing returns to carefulness. You might spend 100x more resources to go from 90% reliability to 99% reliability, but if you can identify a system boundary and correctness check, you can get that 99% reliability simply by having 2 teams each build a subsystem that is 90% reliable and checking that their answers agree.
I disagree somewhat, influenced by the teachings of Nancy Leveson.
In the 1930s, yes, component redundancy was the way to reliability. This worked at the time because components were flaky and technical systems were simple aggregations of components. Today, components themselves are more reliable, but even when they are not, redundancy adds only a little reliability, because there's a new, large source of failure: interactive complexity.
Today's systems are so complicated that many failures stem from insufficient, misunderstood, or ambiguous specifications. These errors happen not because a component failed (all components work exactly as they were intended to); rather, the components' intended interactions produce an unintended result. Failure is an emergent property.
This is why Erlang's OTP focuses on supervisor trees. At each level of the component hierarchy, you have redundancy. Subcomponents themselves may have interactive complexity, but a failure or misspecification in any of the interactions making up that subcomponent simply makes that subcomponent fail. This failure is handled at a higher level by doing something simpler.
And "do something simpler" is actually a core part of this strategy. You're right that "today's systems are so complicated that many failures stem from insufficient, misunderstood, or ambiguous specifications". In most cases, yesterday's system worked just fine, you just can't sell it as a competitive advantage. So build simple, well-understood subsystems as fallbacks to the complex bleeding-edge systems, or even just take the software that's been working for a decade.
In the limit, there is a hard tradeoff between efficiency and reliability.
Failovers, redundancies, and backups are all important for building systems that are resilient in the face of problems, for reasons you've pointed out.
However, failovers, redundancies and backups are inefficient. Solving a problem with 1 thing is always going to be more efficient than solving the same problem with 10 things.
It's interesting to see this tradeoff play out in real life. We see people coalescing around one or two services because that's the most efficient path, and then we see them diversifying across multiple services once bad things happen to the centralised services.
This is a very important point, and often misunderstood at both a business and a societal level. Reliability has a cost. If you optimize all redundancy out of a system, you find that the system becomes brittle, unreliable, and prone to failure. Companies like 3M and Boeing have found that in the pursuit of higher profits, they've lost their focus on quality and suffered the resulting loss of trust and brand damage. The developed world discovered this with COVID: our just-in-time efficiency meant that any hiccup anywhere in the supply chain produced mass shortages of goods.
> In the limit, there is a hard tradeoff between efficiency and reliability.
Yes, but notice that most things in the GP's comment have an exponential impact on reliability (well, on 1 - reliability), so they are often no-brainers as long as they follow that simple model (which stops being true at some point).
Imho, the problem is that it is hard to estimate the trade-offs. Optimizations (not just in computer systems, but in general) are often seen as risk-free, when in reality they are not. More often than not, one will be celebrated for optimization, and rarely for resilience (which gets dismissed as duplicate, useless work).
As always, life is not that simple, and redundant components can interact in harmful ways, correctness checks can create incorrectness, process managers or consensus algorithms can amplify small problems...
Just as every technique in the article can also turn out to reduce your reliability.
> For any given task, compute the answer multiple times in parallel, ideally in multiple independent ways.
Just to be clear, while this particular technique is valid and used in space software, it isn't common at all in Erlang and not part of the "let it crash" principle.
It's the simple, basic reality of statistics: a binomial distribution.
5 independent systems, each with a 90% chance of success, are together mathematically as reliable as one system that is 99.999% reliable.
100 such 90% systems would get you to 100 "9s" of reliability, i.e. 99.99…% with a hundred nines in a row.
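Taking the independence assumption at face value, the arithmetic checks out:

    p_fail = 0.1                      # each system works 90% of the time
    print(f"{1 - p_fail ** 5:.5f}")   # 0.99999 -- five systems give "five nines"
    # With 100 such systems, the chance that all of them fail is 0.1**100,
    # i.e. the combined success probability is a run of one hundred 9s
    # (a float just rounds it to 1.0).
    print(p_fail ** 100)              # ~1e-100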
Except the binomial assumption obviously does not hold because
(a) failures are correlated, not independent, and
(b) many failures happen not at the component level but at the plane where components interact, and regardless of how much redundancy there is at the component level, there is ultimately just one plane at which they finally interact to produce a result.
Actually making 5 completely independent systems would be exceptionally hard. No shared code or team members, no shared hardware... For example, what 5 computing platforms would you use? x86, ARM, RISC-V and...?
Math rarely applies so easily to real life. Talking about "independent" systems is cheap.
That's assuming it's possible at all. How would you transport yourself to work using two independent systems?
It's relatively simple at the organizational level, just expensive (but linearly expensive, while often increasing subcomponent reliability is exponentially expensive!). Just give the same problem statement to two independent teams with two different managers, have a clear output format and success criteria, and let them make all their technical decisions independently.
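In code terms the recipe boils down to a cross-check at the agreed boundary; the two "team" functions below are hypothetical examples of the same spec implemented two different ways:

    import statistics

    def team_a_median(prices):
        return sorted(prices)[len(prices) // 2]  # team A: sort and index

    def team_b_median(prices):
        return statistics.median_high(prices)    # team B: lean on the stdlib

    def median_with_cross_check(prices):
        a, b = team_a_median(prices), team_b_median(prices)
        if a != b:
            raise ValueError(f"implementations disagree: {a!r} vs {b!r}")
        return a

A disagreement doesn't tell you which team is right, but it does tell you not to trust the answer, which is exactly the correctness check at the system boundary.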
Your example of "how do you transport yourself to work using two independent systems" is actually very apropos, because I and many other commuters do exactly that. If the highway is backed up, I bypass it with local roads. If everything is gridlocked, I take public transportation. If public transportation isn't functioning (and it generally takes a natural disaster to knock out all the roads and public transportation, but natural disasters have happened), I stay home and telecommute. Each of these options is less favored than the one before it, but it'll get me to work.
While these are reasonable approaches, I do not think they live up to the mathematical meaning of "independent", and so invalidate the chances calculation.
Your two teams might well both use the same hardware or software component somewhere in the system. This makes the failure probabilities of the two systems not completely independent, for all that you paid two teams and they worked separately. You'll have spent a lot of money, and the results will not be as independent as expected. If they both use x86 Intel and a Meltdown kind of thing happens, your "independent" systems will both fail from the same cause.
The transport analogy works great if you somehow imagine the transportation to be instantaneous, and only the decision to matter. But if you are already on a train and the train is delayed, you are not walking back home and taking the car. You have multiple options for transport, but you do not have a system built of independent components. You are not using the train and the car and the highway and the local roads all simultaneously.
I don't think you understand the requirements for the formula you wrote to be valid. Your examples do not fit it, for all that they are reasonable and useful approaches. Your actual reliability with these approaches falls well below the multiple nines you'd calculate.
> The way to build reliable software systems is to have multiple independent paths to success.
That's a heuristic that might work sometimes.
If you really want to build reliable software systems, then at least prove them correct. There are some tools and methodologies that can help you with this. And of course even a proof isn't everything since your assumptions can still be wrong (but in more subtle ways).
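Full proofs usually mean tools like TLA+, Coq, or Lean. A much lighter cousin of the same idea is property-based testing, which at least states the property explicitly and searches for counterexamples; a sketch using the third-party hypothesis library (the function under test is made up):

    from hypothesis import given, strategies as st

    def dedupe_keep_order(xs):
        seen, out = set(), []
        for x in xs:
            if x not in seen:
                seen.add(x)
                out.append(x)
        return out

    @given(st.lists(st.integers()))
    def test_dedupe_is_idempotent(xs):
        once = dedupe_keep_order(xs)
        assert dedupe_keep_order(once) == once  # applying it twice changes nothing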