
The more fundamental thing to think about is:

- what is the total number of failures that can happen?
- what is the probability of each of those failures occurring?
- how many sets of those can combine into a catastrophic failure?

And then from those numbers you can derive the probability of a catastrophic failure occurring.
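To make that derivation concrete, here's a rough Python sketch. It assumes the individual failures are independent, which real systems often violate, and the component names and probabilities are made up purely for illustration. A "cut set" here just means a set of failures that together cause a catastrophe.

```python
from itertools import product

def p_catastrophe(p_fail, cut_sets):
    """Probability that at least one cut set has every member failed.

    Assumes independent failures. Enumerates every failed/working state,
    so it is only practical for a handful of components.
    """
    names = list(p_fail)
    cut_sets = [frozenset(cs) for cs in cut_sets]
    total = 0.0
    for state in product([False, True], repeat=len(names)):
        failed = {n for n, is_failed in zip(names, state) if is_failed}
        # Catastrophic if some cut set is entirely contained in the failed set
        if any(cs <= failed for cs in cut_sets):
            prob = 1.0
            for n, is_failed in zip(names, state):
                prob *= p_fail[n] if is_failed else 1.0 - p_fail[n]
            total += prob
    return total

# Hypothetical numbers: "c" alone is fatal, "a" and "b" are only fatal together
p_fail = {"a": 0.1, "b": 0.2, "c": 0.5}
cut_sets = [{"a", "b"}, {"c"}]
print(p_catastrophe(p_fail, cut_sets))  # about 0.51
```

Brute-force enumeration avoids double-counting states where more than one cut set has failed, which is the easy mistake if you just add up the per-set probabilities.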




Yes, but the thing that the Fundamental Failure Mode Theorem tries to draw attention to is "what is the probability of those failures occurring".

i.e. it's possible that some of those failures have already occurred, and you just haven't noticed because the redundant systems are being redundant and preventing the overall system from failing catastrophically. Or you have noticed, but think that the redundant systems are sufficient, not realising how much closer they bring you to a single point of failure. So the probability of your whole system failing is higher than you'd expect, because you already have failures which you think are p < 1 (possibly p << 1) but are actually p = 1.
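A quick numerical sketch of that point, in Python. The three layers and their 1% failure probabilities are invented for illustration, and independence is assumed; the only thing being shown is what happens when two of the "p << 1" entries are really p = 1.

```python
from math import prod

def p_all_fail(p_fail):
    """Probability that every layer fails, assuming independent failures."""
    return prod(p_fail.values())

# What you believe: three independent layers, each failing 1% of the time
# (hypothetical values)
assumed = {"primary": 0.01, "backup": 0.01, "manual_check": 0.01}

# What is actually true: primary and backup have already failed, unnoticed
# or accepted, so their failure probability is 1
actual = {**assumed, "primary": 1.0, "backup": 1.0}

print(p_all_fail(assumed))  # roughly one in a million
print(p_all_fail(actual))   # 0.01, i.e. one in a hundred
```

With these made-up numbers the real risk is about ten thousand times the estimated one, and the system is effectively running on a single point of failure.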

In the case of the Gimli glider, they had two independent FQIS systems and a floatstick in case of a single failure, and a rule that the airplane was non-serviceable in case of both failing.

On the flight in question, one FQIS was non-serviceable. The second was serviceable but had been switched off, and due to a miscommunication it was thought that the no-fly rule had been overridden and the plane was OK with a floatstick measurement. Further, if the fuel calculation from the floatstick measurement had been correct, they would have refueled the plane and no-one would ever have heard about Air Canada Flight 143 because everything would have been fine.

The problem was that they were knowingly operating in a failure mode, without either FQIS and disregarding the no-fly rule, and thinking that the floatstick measurement was sufficient. Therefore it only needed one further failure - miscalculating the amount of fuel from the floatstick - to bring about disaster.


Well, we can observe that the probability of a catastrophic failure in an airliner is in fact extremely low, since they happen extremely infrequently. So even if systems are operating in a failure mode, there appears to still be enough redundancy left to lower the odds enormously.

I sort of take the opposite approach with comments that the odds of such a failure are astronomical. Given how safe airliners are these days, anything that causes a bad emergency with one must be an extremely unlikely event. If it weren't, more airliners would crash than actually do. I've seen this going around with MH370, for example. People will dismiss an idea for what caused the disappearance with a comment that such an event is extremely unlikely. Well sure, pretty much by definition, whatever caused it has to have odds of something like a billion to one.

I'd also like to bring up a minor quibble, in that Air Canada 143 did not end in disaster. It certainly came close, and should be regarded as a serious incident with lessons to be learned, but ultimately everybody survived and the airplane was returned to service, precisely because there weren't quite enough failures to cause a disaster. Various things went wrong, but the pilots managed to stop the chain of events by successfully responding to the in-flight emergency. The ability of the airplane to continue flying and somewhat functioning after fuel exhaustion is a type of redundancy, and it ultimately saved it.



