And then you find an obscure service doesn’t come back up on the 10,000th or 100,000th reboot because of <any number of reasons>. And now you have multiple states, because you have to handle failover. It’s turtles all the way down.
It’s always easy to say that in hindsight. But keep in mind this is an environment with many core components built in the 80s. Regular reboots on old AIX systems weren’t a common practice - the sheer uptime capability of these systems was a big selling point in an environment that looks nothing like a modern cloud architecture.
But none of that is really the point. The point is that even with every correct procedure in place, you’ll still encounter failures.
Modern dev teams in companies that build software have more checks and balances in place from the get-go, which help head off some categories of failure.
But when an organization is built on core tech born in the 80s and 90s, there will always be dragons, regardless of the policies and procedures currently in place.
The problem is that the cost to replace some of these systems was inestimable.