
Most of what they actually said via the manual human-language status updates was "Service X is seeing elevated error rates".

There are still decisions to be made about how you monitor errors and what sorts of elevated rates merit an alert, but I would bet that AWS has internal-facing systems that display service health this way, based on automated monitoring of error rates (as well as other things), because they know it means something.
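To make "status derived from automated error-rate monitoring" concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `ErrorRateMonitor` name, the window size, and the 1% / 10% thresholds are made up, and it says nothing about how AWS actually does this internally.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical sliding-window monitor that maps an error rate to a status label."""

    def __init__(self, window=1000, degraded_at=0.01, outage_at=0.10, min_samples=50):
        # Keep only the most recent `window` request outcomes.
        self.results = deque(maxlen=window)
        self.degraded_at = degraded_at  # assumed threshold for "elevated error rates"
        self.outage_at = outage_at      # assumed threshold for "outage"
        self.min_samples = min_samples  # avoid judging health from too little traffic

    def record(self, ok):
        """Record one request outcome: True for success, False for an error."""
        self.results.append(bool(ok))

    def error_rate(self):
        """Fraction of failed requests in the current window."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def status(self):
        """Status string a public page could show without a human in the loop."""
        if len(self.results) < self.min_samples:
            return "unknown"
        rate = self.error_rate()
        if rate >= self.outage_at:
            return "outage"
        if rate >= self.degraded_at:
            return "elevated error rates"
        return "ok"
```

The hard part in practice is picking thresholds and windows that don't flap, but the output is the same kind of statement the manual updates ended up making anyway.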

They apparently choose to make their public-facing service health page only show alerts via a manual process that often results in an update only several hours after lots of customers have noticed problems. This seems like a choice.

What's the point of a status page? To me, the point is that when I encounter a problem (perhaps noticed because of my own automated monitoring), one of the first things I want to do is distinguish between a problem on the platform that's out of my control and a problem that is under my control and that I can fix.

A status page that does not support me in doing that is not fulfilling its purpose. The AWS status page fails to help customers do that, by regularly showing all green with no alerts hours after widespread problems occurred.



As mentioned in the article, internal metrics were fubar most of the day.



