
The worst thing for any service provider is to delay the initial confirmation that yes, something seems to be going wrong.

> http://blog.dotcloud.com/working-around-the-ec2-outage

"After one hour, Amazon's Health Dashboard was still pretending that everything was right."

There really is no negative side for someone as large as Amazon to immediately put up a quick notice that "we are receiving complaints about x-y-z and looking into it." I get pinged like crazy within minutes of one of my clients' servers slowing or going down - I can't imagine that Amazon doesn't know something is wrong within seconds. If it turns out that it wasn't AWS but something else (Level 3 pipe issues or Comcast DNS errors), then just clarify that later. Why make every single AWS customer panic for an hour, fearing that the fault lies within their code/services?



> There really is no negative side for someone as large as Amazon to immediately put up a quick notice that "we are receiving complaints about x-y-z

Do you have any idea how many such complaints Amazon receives on a normal day? Per hour?

> Why make every single AWS customer panic for an hour

Diagnosing problems in a big system is not that easy.

A turnaround time of an hour is not too bad for a behemoth the size of Amazon, especially when you consider that this was a worst-case scenario.


Even with a large service, you can immediately identify a fluctuation in the number of complaints (in addition to signals from monitoring tools). Speaking from experience. In fact, the larger the service is, the easier it is to statistically identify an uptick in the number of complaints per smaller time interval.
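To make the statistical point concrete, here is a minimal sketch. It assumes complaints arrive roughly as a Poisson process with a known baseline rate per interval, which is my simplification and not something anyone in this thread has confirmed about Amazon. The function names and the z-score threshold of 3 are hypothetical, chosen only for illustration.

```python
import math


def z_score(observed: int, baseline: float) -> float:
    """Standard deviations by which `observed` exceeds the baseline count.

    Assumes Poisson arrivals, so the standard deviation is sqrt(baseline).
    """
    return (observed - baseline) / math.sqrt(baseline)


def is_uptick(observed: int, baseline: float, threshold: float = 3.0) -> bool:
    """Flag an interval whose complaint count is unusually high."""
    return z_score(observed, baseline) >= threshold


def smallest_detectable_increase(baseline: float, threshold: float = 3.0) -> float:
    """Fractional increase over baseline needed to reach the threshold.

    It shrinks as 1/sqrt(baseline): a busier service can detect a
    proportionally smaller uptick in the same interval.
    """
    return threshold / math.sqrt(baseline)


if __name__ == "__main__":
    for baseline in (100, 10_000):
        pct = 100 * smallest_detectable_increase(baseline)
        print(f"baseline {baseline}/interval: about {pct:.0f}% increase needed")
```

Under these assumptions, a service seeing 100 complaints per interval needs roughly a 30% jump before it stands out, while one seeing 10,000 needs only about 3%. The same jump therefore shows up in a shorter interval on the larger service.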


> Even with a large service, you can immediately identify a fluctuation in the number of complaints

"Immediately" is a relative term. I would say "1 hour" is pretty much as immediate as it gets at this scale.

I wouldn't be surprised if a significant complaint fluctuation only manifested long after Amazon discovered the problem in their own monitoring.

> statistically identify an uptick in the number of complaints per smaller time interval

Yes, but volume is not everything. You also have to qualify (triage) the input, get engineers on the case, confirm the issue, and perhaps get clearance for a public announcement. All the while, many of the key people are busy either trying to figure out what is going on, or trying to dispatch information to the right people, or just running around waving their arms furiously.



