Complex systems fail, but they don't all fail in the same way, and analyzing how they fail can help in engineering new and hopefully more robust complex systems. I'm a huge fan of Risk Digest, and there isn't a disaster small enough that we can't learn from it.
Obviously, the larger the disaster, the more complex the failure and the harder it is to analyze the root cause. But one interesting takeaway for me from this list is that all of them were preventable, and that in all but a few of the cases the root cause may have been the trigger, but the setup of the environment is what allowed the fault to escalate the way it did. In a resilient system faults happen as well, but they do not propagate.
And that's the big secret to designing reliable systems.
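To make "faults do not propagate" concrete in software terms: one common way to keep a local fault from cascading is to wrap calls to an unreliable dependency in a circuit breaker, so that when it starts failing the caller degrades gracefully instead of passing the error on. A minimal sketch, assuming a hypothetical `fetch_recommendations` dependency that is nice-to-have rather than critical:

```python
import time

class CircuitBreaker:
    """Isolate a flaky dependency so its failures don't cascade upstream."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, fallback=None, **kwargs):
        # While the breaker is open, short-circuit to the fallback instead of
        # hammering the broken dependency and propagating its errors upstream.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback
            # Half-open: the cool-down has passed, allow one trial call.
            self.opened_at = None
            self.failures = 0
        try:
            result = func(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback


# Hypothetical usage: a nice-to-have dependency degrades gracefully instead
# of taking the whole request down with it.
# breaker = CircuitBreaker()
# recs = breaker.call(fetch_recommendations, user_id, fallback=[])
```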
> ...one interesting takeaway for me from this list is that all of them were preventable...
Every disaster is preventable. Everything on the list happened in human-engineered environments - as do most things that affect humans. The human race has been the master of its own destiny since the 1900s. The questions are how far before the disaster we need to look to find somewhere to act and what needed to be given up to change the flow of events.
But that doesn't have any implications for software engineering. Studying a software failure post mortem will be a lot more useful than studying 9/11.
No, there is such a thing as residual risk, and there are always disasters that you can't prevent, such as natural disasters. But even then you can have risk mitigation and strategies for dealing with the aftermath of an incident to limit the effects.
> Everything on the list happened in human-engineered environments - as do most things that affect humans.
That is precisely why they were picked and make for good examples.
> The human race has been the master of its own destiny since the 1900s.
That isn't true and it likely never will be. We are tied 1:1 to the fate of our star and may well go down with it. There is a small but non-zero chance that we can change our destiny, but I wouldn't bet on it. And even then, on a still longer timescale it won't matter. We are passengers; the best we can do is be good stewards of the ship we've inherited.
> The questions are how far before the disaster we need to look to find somewhere to act and what needed to be given up to change the flow of events.
Indeed. So for each of the items listed, the root cause analysis gives a point in time where the accident, given the situation as it existed, was no longer a theoretical possibility but an event in progress. Situation and responses determined how far it got, and for each of the cases outlined you can come up with a whole slew of ways in which the risk could have been reduced, and possibly how the whole thing might have been averted once the root cause had triggered. But that doesn't mean the root cause doesn't matter; it matters a lot. It just isn't always a major thing. An O-ring, a horseshoe...
> But that doesn't have any implications for software engineering.
If that is your takeaway then for you it indeed probably does not. But I see such things in software engineering every other week or so, and I think there are many lessons from these events that apply to software engineering. So do the people who design reliable systems, which is why many of us are arguing for liability for software. Because once producers of software are held liable for their product, a large number of the bad practices and avoidable incidents (not just security ones) would become subject to Darwinian selection: bad producers would go out of business.
> Studying a software failure post mortem will be a lot more useful than studying 9/11.
You can learn lots of things from other fields, if you are open to learning in general. Myopically focusing on your own field is useful and can get you places, but it will only ever produce 'deep' approaches, never 'wide' ones, and for a really important system both approaches are valid and complementary.
To make your life easier, the author has listed in the right-hand column which items from the non-software disasters carry over into the software world, which I think is a valuable service. A middlebrow dismissal of that effort throws away an opportunity to learn, for free, from incidents that have all made the history books. And if you don't learn from your own and others' mistakes, you are bound to repeat that history.
Software isn't special in this sense. Not at all. What is special is the arrogance of some software people who believe that their field is so special that they can ignore the lessons from the world around them. And as a corollary: that they can ignore all the lessons already learned in software systems in the past. We are in an eternal cycle of repeating past mistakes with newer and shinier tools and we urgently need to break out of it.
It is 2023. The damage of natural disasters can be mitigated. When the San Andreas fault goes it'll probably get an entry on that list with a "why did we build so much infrastructure on this thing? Why didn't we prepare more for the inevitable?".
And this article is throwing out generic all-weather good sounding platitudes which are tangential to the disasters listed. He drew a comparison between the Challenger disaster and bitrot! Anyone who thinks that is a profound connection should avoid the role of software architect. The link is spurious. Challenger was about catastrophic management and safety practices. Bitrot is neither of those things.
I mean, if we want to learn from Douglas Adams he suggested that we can deduce the nature of all things by studying cupcakes. That is a few steps down the path from this article, but the direction is similar. It is not useful to connect random things in other fields to random things in software. Although I do appreciate the effort the gentleman went to, it is a nice site and the disasters are interesting. Just not relevantly linked to software in a meaningful way.
> We are tied 1:1 to the fate of our star and may well go down with it
I'm just going to claim that is false and live in the smug comfort that when circumstances someday prove you right neither of us will be around to argue about it. And if you can draw lessons from that which apply to practical software development then that is quite impressive.
An Ariane 5 failed because of bitrot, so the headline comparison of rocket failures makes sense. Not testing software with new performance parameters before launch sounds like catastrophic management to me.
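For reference, the maiden Ariane 5 flight failed because inertial-reference software reused from Ariane 4 converted a 64-bit floating point value to a 16-bit signed integer; the new rocket's faster trajectory pushed that value out of range and the resulting overflow went unhandled. A rough Python sketch of the pattern (the real code was Ada, and the numbers below are purely illustrative):

```python
INT16_MIN, INT16_MAX = -32768, 32767

def to_int16(value: float) -> int:
    """Mimic a narrowing 64-bit float -> 16-bit signed int conversion."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        # In the reused code the equivalent trap went unhandled and
        # shut down both inertial reference units.
        raise OverflowError(f"{value} does not fit in a 16-bit integer")
    return result

# Hypothetical horizontal-velocity values, illustrative numbers only.
OLD_ENVELOPE_PEAK = 20_000.0   # within range on the old vehicle
NEW_ENVELOPE_PEAK = 64_000.0   # the faster trajectory overflows

print(to_int16(OLD_ENVELOPE_PEAK))    # fine under the old flight profile
# print(to_int16(NEW_ENVELOPE_PEAK))  # raises OverflowError

# The lesson: re-run the old tests (and re-check the old assumptions) against
# the new performance parameters before reusing "proven" code on a new vehicle.
```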
> It is 2023. The damage of natural disasters can be mitigated.
That is a comforting belief, but it is probably not true. We have no plan for a near-Earth supernova explosion. Not even in theory.
Then there are asteroid impacts. In theory we could have plowed all of our resources into planetary defences, but in practice, in 2023, we can very easily get sucker-punched by a bolide and go the way of the dinosaurs.
So? Mistakes are still being made, every day. Nothing has changed since the stone age except for our ability - and hopefully willingness - to learn from previous mistakes. If we want to.
> The damage of natural disasters can be mitigated.
You wish.
> When the San Andreas fault goes it'll probably get an entry on that list with a "why did we build so much infrastructure on this thing? Why didn't we prepare more for the inevitable?".
Excellent questions. And in fairness to the people living on the San Andreas fault - and near volcanoes, in hurricane alley and in countries below sea level - we have an uncanny ability to ignore history.
> And this article is throwing out generic all-weather good sounding platitudes which are tangential to the disasters listed.
I see these errors all the time in the software world. I don't care what hook he uses to bring them to attention again, but they are probably responsible for a very large fraction of all software problems.
> He drew a comparison between the Challenger disaster and bitrot!
So let's see your article on this subject then, the one that will obviously do a much better job.
> Anyone who thinks that is a profound connection should avoid the role of software architect.
Do you care? It would be better to say that those who are unwilling to learn from the mistakes of others should avoid the role of software architect, because on balance that's where the problems come from. You seem to have a very narrow viewpoint here: because you don't like the precision of the links being made, you can't appreciate the intent and the subject matter. Of course a better article could have been written, and of course you are able to dismiss it entirely because of its perceived shortcomings. But that is exactly the attitude that leads to a lot of software problems: the inability to ingest information when it isn't presented in the recipient's preferred form. This throws out the baby with the bath water. The author's intent is to educate you and others on the ways in which software systems break, and he uses something called a narrative hook as a framework. That these won't match 100% is a given. Spurious connection or not, documentation and actual fact creeping out of spec - the normalization of deviance in disguise - is exactly the lesson from the Challenger disaster, and if you don't like the wording I'm looking forward to your improved version.
> Challenger was about catastrophic management and safety practices.
That was a small but critical part of the whole. I highly recommend reading the entire report on the subject; it makes for fascinating reading and there are a great many lessons to be learned from it.
And many useful and interesting supporting documents.
> I mean, if we want to learn from Douglas Adams he suggested that we can deduce the nature of all things by studying cupcakes.
That's a completely nonsensical statement. Have you considered that your initial response to the article precludes you from getting any value from it?
> It is not useful to connect random things in other fields to random things in software.
But they are not random things. The normalization of deviance, in whatever guise it comes, is the root cause of many, many real-world incidents, both in software and outside of it. You could argue with the wording, but not with the intent or the connection.
> Although I do appreciate the effort the gentleman went to, it is a nice site and the disasters are interesting. Just not relevantly linked to software in a meaningful way.
To you. But they are.
> > We are tied 1:1 to the fate of our star and may well go down with it
> I'm just going to claim that is false and live in the smug comfort that when circumstances someday prove you right neither of us will be around to argue about it.
So, you are effectively saying that you persist in being wrong simply because the timescale works to your advantage?
> And if you can draw lessons from that which apply to practical software development then that is quite impressive.
Well, for starters I would argue that many software developers indeed produce work that only needs to hold together until they've left the company, and that shedding that attitude is an excellent thing and a valuable lesson to draw from this discussion.
So the article had a list of disasters and some useful lessons learned in its left and center columns. It also had lists of truisms about software engineering in the right column. They had nothing fundamental to do with each other.
For instance, it tries to draw an equivalence between "Titanic's Captain Edward Smith had shown an 'indifference to danger [that] was one of the direct and contributing causes of this unnecessary tragedy'" and "Leading during the time of a software crisis (think production database dropped, security vulnerability found, system-wide failures etc.) requires a leader who can stay calm and composed, yet think quickly and ACT." These are completely unrelated: one is a statement about needing to evaluate risks to avoid incidents, the other is about the type of leadership needed once an incident has already happened. Similarly, the discussion of Chernobyl is confused: the primary lessons there are about operational hygiene, but the article draws "conclusions" about software testing, which is in a completely different lifecycle phase.
There are certainly lessons to be learned from past incidents both software and not, but the article linked is a poor place to do so.
So let's take those disasters and list the lessons that you would have learned from them. That's the way to approach an article like this constructively; out-of-hand dismissal is just dumb and unproductive.
FWIW, I've seen the leaders of software teams all the way up to the CTO run around like headless chickens during an (often self-inflicted) crisis. I think the biggest lesson from the Titanic is that you're never invulnerable, even when you have been designed to be invulnerable.
None of these are exhaustive and all of them are open to interpretation. Good, so let's improve on them.
One general takeaway: managing risk is hard, especially when working with a limited budget (which is almost always the case). Just the exercise of assessing and estimating likelihood and impact is already very valuable, but plenty of organizations have never done any of that. They are simply utterly blind to the risks their org is exposed to.
Case in point: a company that made in-car boxes that could be upgraded OTA. And nobody thought to verify that the vehicle wasn't in motion...
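A minimal sketch of the missing precondition gate, assuming a hypothetical telemetry snapshot whose field names are invented for the example; the point is simply that the installer re-checks the vehicle's state at the moment of flashing, not just when the update was queued:

```python
from dataclasses import dataclass

@dataclass
class VehicleState:
    # Hypothetical telemetry snapshot; field names are made up for this sketch.
    speed_kph: float
    ignition_on: bool
    battery_level: float  # 0.0 .. 1.0

def safe_to_update(state: VehicleState) -> bool:
    """Gate the OTA install on the conditions nobody thought to check."""
    return (
        state.speed_kph == 0.0          # vehicle must be stationary
        and not state.ignition_on       # and not about to drive off
        and state.battery_level > 0.5   # enough charge to finish flashing
    )

def apply_ota_update(read_state, install):
    # Re-check right before installing: the state can change between the
    # moment the update is downloaded and the moment it is applied.
    if not safe_to_update(read_state()):
        return "deferred"
    install()
    return "installed"
```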
There are two useful lessons from the Titanic that can apply to software:
1) Marketing that you are super duper and special is meaningless if you've actually built something terrible (the Titanic was not even remotely as unsinkable as claimed, with "water tight" compartments that weren't actually watertight)
2) When the people below you tell you "hey, we are in danger", listen to them. Don't do things that are obviously dangerous while making zero effort to mitigate the danger. The danger of Atlantic icebergs was well understood, and the Titanic was warned multiple times! Yet the captain still had inadequate monitoring and did not slow down to give the ship more time to react to any threat.
The one hangup with "listen to people warning you" is that they produce enough false positives to create a boy-who-cried-wolf effect for some managers.
Yes, that's true. So the hard part is knowing who is alarmist and who actually has a point. In the case of NASA the ignoring seemed to be pretty wilful. By the time multiple engineers warn you that something is not a good idea and you push on anyway, I think you are out of excuses. Single warnings not backed up by data can probably be ignored.