This has nothing to do with excuses - I’m challenging the assertion that “because something bad happened, they must not have any mitigations in place at all”.
This seems like a bad case of binary thinking, and my point was that the occurrence of an incident like this is not sufficient to support that claim. It’s just as likely that an ancient process that wasn’t accounted for somewhere in the architecture broke down, and this is how it manifested.
Clearly improvements are needed, as is always the case after an outage. That doesn’t justify wild speculation.
Anecdote time: I once worked for a large financial institution that makes money when people swipe their credit cards. The system that authorizes purchases is ancient, battle tested, and undergoes minimal change, because the cost of an outage could be measured in millions of dollars per minute.
Every change was scrutinized, reviewed by multiple groups, discussed with executives, and tested thoroughly. The same system underwent regular DR testing with heavy participation from all of the related teams.
So the day it went down, it was obviously a big deal, and raised all of the natural questions about how such a thing could occur.
Turns out it had an unknown transitive dependency on an internal server - a server that had not been rebooted in literally a decade. When that server was finally rebooted (I think a security group insisted it needed patches, despite some strong architectural reasons to avoid doing so), some of the services never came back up, and everyone quickly learned that a very old change, one that predated almost everyone there, had established this unknown dependency.
The point of this story is really about the unknowability of sufficiently complex legacy enterprise systems.
All of the right processes and procedures won’t necessarily account for that seemingly inconsequential RPC call to an internal system implemented by a grizzled dev shortly before his retirement.
And then you find an obscure service doesn’t come back up on the 10,000th or 100,000th reboot because of <any number of reasons>. And now you have multiple states to reason about, because you also have to handle failover. It’s turtles all the way down.
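To make the hidden-dependency failure mode concrete, here’s a minimal, hypothetical sketch (in Python, nothing like the actual AIX-era stack): a service whose startup path quietly blocks on an internal host that nobody remembers it needing. The host name, port, and timeout are invented for illustration.

    import socket

    # Hypothetical internal box wired in decades ago and long since forgotten.
    LEGACY_CONFIG_HOST = "legacy-config.internal.example"
    LEGACY_CONFIG_PORT = 7001

    def fetch_startup_config(timeout_seconds=30):
        # Blocking call to the forgotten internal system. If that box never
        # comes back after its once-a-decade reboot, this raises and the
        # authorization service never finishes starting.
        with socket.create_connection(
            (LEGACY_CONFIG_HOST, LEGACY_CONFIG_PORT), timeout=timeout_seconds
        ) as conn:
            conn.sendall(b"GET CONFIG\n")
            return conn.recv(4096)

    def start_authorization_service():
        config = fetch_startup_config()  # the transitive dependency hides here
        print("authorizer started with", len(config), "bytes of config")

    if __name__ == "__main__":
        start_authorization_service()

Nothing in a change review of the caller would flag this, because the dependency was introduced long before anyone on the current team arrived.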
It’s always easy to say that in hindsight. But keep in mind this is an environment with many core components built in the 80s. Regular reboots on old AIX systems weren’t a common practice - the sheer uptime capability of these systems was a big selling point in an environment that looks nothing like a modern cloud architecture.
But none of that is really the point. The point is that even with every correct procedure in place, you’ll still encounter failures.
Modern dev teams in companies that build software have more checks and balances in place from the get-go, which helps head off some categories of failure.
But when an organization is built on core tech born of the 80s/90s, there will always be dragons, regardless of the current active policies and procedures.
The problem is that the cost to replace some of these systems is inestimable.