
This response indicates a lack of responsibility to me.

Human failure is unacceptable when we know it happens, and we should do everything possible to guard against it. You say "guess at some we dont". Well, we do know all about human failure.

All that sympathy for that bloke who "fired himself", because everyone here agreed that his potential human error should have been anticipated, but not so in this case? Seems to me we are applying different standards.

Most of all, what I dont like is universal get outs. It reminds me of the worst lie of all: "Sorry Sir, it's a computer error".




> You say "guess at some we dont". Well, we do know all about human failure.

You miss the point. We can continue to merely enumerate possible error scenarios until the heat death of the universe, and we will still miss some.

It is "human error" for an operations team to not put in place methods for ensuring their systems stay up within agreed parameters.

But the reality is that it is not even theoretically possible for us to engineer a system which can guarantee no downtime. Furthermore, no organization is willing to pay the bill to address even a relatively small fraction of the problems we can easily predict, because many failures, even relatively likely ones, are more expensive to protect against than the protection is worth.

So to begin with, we can't prevent failure. And even if we could, what from the outside looks like human error is often internally a result of either intentional budgetary constraints or the unintended consequences of a lack of resources.

It is not about lack of responsibility. It is about dispelling the fantasy that there is someone who is guilty of not doing their job correctly behind every failure.

That is not to say that there might not have been unacceptable human errors in this specific case. But that is entirely beside the point.

> but not so in this case?

I thought it was pretty clear that my comment applied to the general statement that "human error is unacceptable", but perhaps not. I explicitly wrote "It's very well possible that Cloudflare messed up here" because I didn't want to speculate on the specific problems in this case before the causes were even known.

> Most of all, what I dont like is universal get outs.

They are not "universal get outs". Nobody is going to say it is acceptable if the error is caused by someone bringing coffee into the ops room and spilling it all over the single server, for example. But there is a vast range between someone who is grossly negligent and/or incompetent and who should bear the blame, and someone who is doing their job as well as is reasonable given the resources available to them, but who still makes mistakes or oversights, or simply doesn't have the time or resources to address some reasonably unlikely issue that eventually happens to cause downtime.



