"After one hour, Amazon's Health Dashboard was still pretending that everything was right."
There really is no downside to someone as large as Amazon immediately putting up a quick notice that "we are receiving complaints about x-y-z and looking into it." I get pinged like crazy within minutes of one of my client's servers slowing/going down - I can't imagine that Amazon doesn't know something is wrong within seconds. If it turns out that it wasn't AWS but something else (Level 3 pipe issues or Comcast DNS errors) then just clarify that later. Why make every single AWS customer panic for an hour fearing that the fault lies within their code/services?
Even with a large service, you can immediately identify a fluctuation in the number of complaints (in addition to signals from monitoring tools). Speaking from experience. In fact, the larger the service is, the easier it is to statistically identify an uptick in the number of complaints per smaller time interval.
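To put a rough number on that last point, here is a minimal sketch, assuming complaints per interval are roughly Poisson-distributed (my simplification, not anything Amazon has described). The function names and the threshold of 4 standard deviations are hypothetical, picked only for illustration.

```python
import math


def is_uptick(count, baseline_rate, z_threshold=4.0):
    """Return True if `count` is far above the usual complaints per interval.

    Assumes a Poisson baseline, so the standard deviation of the count
    is sqrt(baseline_rate).
    """
    z = (count - baseline_rate) / math.sqrt(baseline_rate)
    return z > z_threshold


def min_detectable_increase(baseline_rate, z_threshold=4.0):
    """Smallest relative increase over baseline that reaches the threshold."""
    return z_threshold / math.sqrt(baseline_rate)
```

Under these assumptions, a service that normally sees 4 complaints per interval needs a 200% jump before the signal stands out, while one that sees 10,000 per interval can flag a 4% jump. That is the sense in which a larger service makes the uptick easier to spot.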
"Even with a large service, you can immediately identify a fluctuation in the number of complaints"
"Immediately" is a relative term. I would say "1 hour" is pretty much as immediate as it gets on these scales.
I wouldn't be surprised if a significant complaint fluctuation only manifested long after Amazon discovered the problem in their own monitoring.
"statistically identify an uptick in the number of complaints per smaller time interval"
Yes, but volume is not everything. You also have to qualify (triage) the input, get engineers on the case, confirm the issue, perhaps get clearance for a public announcement. All the while many of the key people are busy either trying to figure out what is going on, or trying to dispatch information to the right people, or just running around waving their arms furiously.
> http://blog.dotcloud.com/working-around-the-ec2-outage
"After one hour, Amazon's Health Dashboard was still pretending that everything was right."
There really is no negative side for someone as large as Amazon to immediately put up a quick notice that "we are receiving complaints about x-y-z and looking into it." I get pinged like crazy within minutes of one my client's servers slowing/going down - I can't imagine that Amazon doesn't know something is wrong within seconds. If it turns out that it wasn't AWS but something else (Level 3 pipe issues or Comcast DNS errors) then just clarify that later. Why make every single AWS customer panic for an hour fearing that the fault lies within their code/services?