Human error is unacceptable. Online services should never have a single point of failure which can take their entire global presence offline, not only because avoiding such failure points minimizes the risk of catastrophic accidents, but also because hardware does occasionally break.
This is why every piece of kit should be mirrored with a redundant backup, and why many businesses even have entire duplicated standby systems for such disasters. Even if that gear cannot support the entire infrastructure, it's usually enough to at least publish an official status page. Having to use Twitter to update users is just amateurish in my opinion.
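To put that in perspective: a status page needs almost nothing to run, so it can live with a completely different provider, on a different network, under a different DNS zone to the rest of the stack. As a rough sketch (Python standard library only; the message and port are made up purely for illustration), something this small would do the job:

    # Minimal standby status page. Small enough to run on a cheap box at a
    # *different* provider, under a *different* DNS zone, so it stays
    # reachable when the main stack is down. Port and wording are placeholders.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    STATUS_HTML = b"""<html><body>
    <h1>Service status</h1>
    <p>We are investigating an outage affecting our main infrastructure.</p>
    </body></html>"""

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Serve the same static page for every request.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(STATUS_HTML)))
            self.end_headers()
            self.wfile.write(STATUS_HTML)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), StatusHandler).serve_forever()

Point a hypothetical status.example.com at that box from a DNS zone that isn't hosted on your own infrastructure and it keeps answering even while everything else is on fire.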
No systems are immune to failure. No matter how much redundancy you have, chances are you have interdependencies you did not anticipate, and sooner or later run into failure scenarios that violate your expectations.
It's very well possible that Cloudflare messed up here, but to claim so categorically that "human error is unacceptable" is a bit of a joke. We build systems to withstand the risks we know about, and guess at some we don't.
But the number of possible failure scenarios we don't understand properly is pretty much infinite.
Losing DNS was fairly unforgivable. DNS as a protocol is designed to make it easy to deal with server and network outages (even to the point of surviving the loss of entire netblocks from the global routing tables). They added anycast DNS, which is great, but didn't split their DNS across multiple anycast netblocks.
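To make the "multiple netblocks" point concrete, here's a rough sketch of the kind of sanity check that would flag the problem: do all of a domain's nameservers sit inside one address block? It's Python using the third-party dnspython package, the domain is a placeholder, and grouping by /24 is only a crude stand-in for real routed prefixes, which you'd take from BGP announcements rather than guesswork:

    # Rough check of nameserver address diversity for a domain.
    # Requires the third-party "dnspython" package (pip install dnspython).
    import ipaddress
    import dns.resolver

    def ns_netblocks(domain):
        """Collect the /24 blocks covering the domain's nameserver addresses."""
        blocks = set()
        for ns in dns.resolver.resolve(domain, "NS"):
            for a in dns.resolver.resolve(str(ns.target), "A"):
                blocks.add(ipaddress.ip_network(a.address + "/24", strict=False))
        return blocks

    if __name__ == "__main__":
        blocks = ns_netblocks("example.com")  # stand-in domain
        print("nameservers span %d distinct /24 blocks" % len(blocks))
        if len(blocks) < 2:
            print("warning: one routing or anycast problem could take them all out")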
I run a number of DNS servers myself. And yes, it was "fairly unforgivable" to mess that part up.
But what I was addressing was the blanket claim that human error is unacceptable. Anyone who runs a setup much larger than a calculator will deal with human errors - whether actual operational errors or human inability to engineer for resilience against all possible but unlikely scenarios - on a regular basis.
Some things should be harder to break than others, and DNS is amongst them. Cloudflare no doubt have plenty of lessons to learn, but so does everyone else.
> I run a number of DNS servers myself. And yes, it was "fairly unforgivable" to mess that part up.
So then you basically agree with my point <_<
> But what I was addressing was the blanket claim that human error is unacceptable. Anyone who runs a setup much larger than a calculator will deal with human errors - whether actual operational errors or human inability to engineer for resilience against all possible but unlikely scenarios - on a regular basis.
You're twisting my words and taking them out of context. I was saying that human error is an unacceptable excuse for the entire stack of a company the size of Cloudflare going offline. And I was saying that because redundant systems should act as a "safety net" so that administrators can make human errors. I've lost count of the number of dumb mistakes I've made over the years, but each time I've been able to switch to a backup system while I worked on undoing my cock-up. And you said yourself that a complete DNS outage was unacceptable, so clearly you and I are more or less on the same page regarding this.
This response indicates lack of responsibility to me.
Human failure is unacceptable when we know it happens, and we should do everything possible to guard against it. You say "guess at some we don't". Well, we do know all about human failure.
There was all that sympathy for that bloke who "fired himself", because everyone here agreed that his potential human error should have been anticipated, but not so in this case? Seems to me we are applying different standards.
Most of all, what I don't like is universal get-outs. It reminds me of the worst lie of all: "Sorry Sir, it's a computer error".
> You say "guess at some we don't". Well, we do know all about human failure.
You miss the point. We can continue to merely enumerate possible error scenarios until the heat death of the universe, and we will still miss some.
It is "human error" for an operations team to not put in place methods for ensuring their systems stay up within agreed parameters.
But the reality is that it is not even theoretically possible to engineer a system that can guarantee no downtime. Furthermore, no organization is willing to pay the bill to address even a relatively small fraction of the problems we can easily predict, because many failure modes, even relatively likely ones, are more expensive to protect against than they are worth.
So to begin with, we can't prevent failure. And even if we could, what looks like human error from the outside is often, internally, the result of either deliberate budgetary constraints or the unintended consequences of a lack of resources.
It is not about lack of responsibility. It is about dispelling the fantasy that behind every failure there is someone who is guilty of not doing their job correctly.
That is not to say that there might not have been unacceptable human errors in this specific case. But that is entirely beside the point.
> but not so in this case?
I thought it was pretty clear that my comment applied to the general statement that "human error is unacceptable", but perhaps not. I explicitly wrote "It's very well possible that Cloudflare messed up here" because I didn't want to speculate on the specific problems in this case before the causes were even known.
> Most of all, what I don't like is universal get-outs.
They are not "universal get-outs". Nobody is going to say it is acceptable if the error is caused by someone bringing coffee into the ops room and spilling it all over the single server, for example. But there is a vast range between someone who is grossly negligent and/or incompetent and who should bear the blame, and someone who is doing their job as best as is reasonable given the resources available to them, but who still makes mistakes, overlooks things, or simply doesn't have the time or resources to address some reasonably unlikely issue that eventually happens to cause downtime.
> This response indicates lack of experience to me.
I could say the same in return. I've worked in a few data centres and they've all put redundancy at the forefront of their design (even to the point of having multiple physical fibre paths for failover).
> No systems are immune to failure. No matter how much redundancy you have
You have that backwards. No systems are immune to failure, which is why you have redundancy.
> chances are you have interdependencies you did not anticipate, and sooner or later run into failure scenarios that violate your expectations.
If the dependencies haven't been anticipated then someone isn't doing their job right. There's a reason why incident response / disaster recovery / business continuity plans are written. It should be someone's job to think up every "what if" scenario, ranging from each and every bit of kit dying, to all your staff winning the lottery and walking out the next day, to terrorist attacks. I've even had to account for what would happen if nukes were dropped on the city where our main data centre was housed (though the answer to that was a simple one: nobody would care that our site went offline). It might sound clichéd, but people get paid to expect the unexpected and work out how to maintain business continuity.
> It's very well possible that Cloudflare messed up here, but to claim so categorically that "human error is unacceptable" is a bit of a joke. We build systems to withstand the risks we know about, and guess at some we don't.
This was their infrastructure failing. If you own and maintain the infrastructure then you have no excuse not to work out what might happen if each and every part of it failed (trust me, I have had to do this in my last two jobs, despite your accusations of my "lack of experience" ;) ).
> But the number of possible failure scenarios we don't understand properly is pretty much infinite.
You're confusing cause and effect. The number of different causes of failure is effectively infinite, but the number of effects is finite. For example: a server could crash for any number of reasons (hardware, software, user error, and all the different ways within those categories), but the end result is the same: the server has crashed. So what you do is plan for the situations where different services fail (staff don't turn up for work, your domain name services stop responding, etc.) and build some kind of redundancy around that, giving engineers a little more breathing room to fix the issue with the minimum possible disruption to your users. As Cloudflare had to resort to Twitter to update their users, they completely failed every possible aspect of such planning. And given the high-profile sites that depend on Cloudflare, they have no excuses.
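As a sketch of what "plan for the effect, not the cause" can look like in practice, here's a minimal watchdog loop in Python. The health-check URL and the failover action are placeholders I've made up, since the real thing would repoint DNS, a load balancer, or a floating IP, but the point is that every cause of failure collapses into the same observed effect and the same response:

    # Watchdog that doesn't care *why* the primary stopped answering,
    # only *that* it did, and reacts the same way in every case.
    import time
    import urllib.request
    import urllib.error

    PRIMARY_HEALTH_URL = "https://primary.example.com/health"  # placeholder

    def primary_is_up(timeout=5):
        try:
            with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            # Hardware fault, software crash, fat-fingered config: from out
            # here they all look identical - the health check failed.
            return False

    def fail_over():
        # Stub: a real version would switch traffic to the standby
        # environment and update the independently hosted status page.
        print("primary unreachable, switching traffic to standby")

    if __name__ == "__main__":
        while True:
            if not primary_is_up():
                fail_over()
            time.sleep(30)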
If this happened in any of the other companies I worked for, I'd genuinely be fearful for my job, as a crash of that magnitude would mean that I hadn't done my job properly.