
It is kinda amazing how consistently status pages show everything fine during a total outage. It's not that hard to connect a status page to end-to-end monitoring statistics...
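For the skeptics: "connect the status page to monitoring" can be as little as deriving the displayed state from a rolling window of end-to-end check results. A minimal sketch in Python; the window size, thresholds, and status names are all invented for illustration:

    from collections import deque

    # Derive the displayed status from a rolling window of end-to-end
    # synthetic check results instead of waiting for a human to file an
    # incident. Thresholds and status names are made up for illustration.
    WINDOW = deque(maxlen=10)  # last 10 check results, True = pass

    def record(check_passed: bool) -> str:
        WINDOW.append(check_passed)
        pass_rate = sum(WINDOW) / len(WINDOW)
        if pass_rate >= 0.9:
            return "operational"
        if pass_rate >= 0.5:
            return "degraded_performance"
        return "major_outage"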



In my experience, this requires a few steps to happen first:

- an incident is declared internally at GitHub

- the support / incident team submits a new status page entry with details on which service(s) are impacted (there's a sketch of automating this step after the list)

- incident is worked on internally

- incident fixed

- page updated

- retro posted
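If you wanted to automate that second step, the hosted statuspage.io product does expose a REST API for it. A rough sketch; the page ID, API key, and component ID are placeholders from your own account, and you should verify the payload shape against Atlassian's current docs:

    import os
    import requests

    # PAGE_ID, API_KEY, and the component ID are placeholders you'd pull
    # from your own statuspage.io account.
    PAGE_ID = os.environ["STATUSPAGE_PAGE_ID"]
    API_KEY = os.environ["STATUSPAGE_API_KEY"]

    def open_incident(name: str, component_id: str) -> dict:
        """File a new 'investigating' incident and mark one component degraded."""
        resp = requests.post(
            f"https://api.statuspage.io/v1/pages/{PAGE_ID}/incidents",
            headers={"Authorization": f"OAuth {API_KEY}"},
            json={
                "incident": {
                    "name": name,
                    "status": "investigating",
                    "component_ids": [component_id],
                    "components": {component_id: "degraded_performance"},
                }
            },
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()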

Even AWS now seems to have some automation for their various services per region. But it doesn't automatically show issues, because the problem could be at the level of a single customer or a subset of customers, say customers in region foo, AZ bar, on service version zed vs. zed - 1. So they chose not to display issues for subsets.
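To make the "subsets" point concrete, here's a toy illustration (all numbers invented) of why a threshold on the aggregate error rate never trips when only one slice of customers is down:

    from dataclasses import dataclass

    @dataclass
    class Slice:
        name: str       # e.g. "region foo / AZ bar / version zed"
        requests: int
        errors: int

    PAGE_ERROR_THRESHOLD = 0.05  # flip the page only above 5% global errors

    def page_shows_outage(slices: list[Slice]) -> bool:
        total = sum(s.requests for s in slices)
        errors = sum(s.errors for s in slices)
        return errors / total > PAGE_ERROR_THRESHOLD

    slices = [
        Slice("region foo / AZ bar / version zed", requests=1_000, errors=1_000),
        Slice("everything else", requests=99_000, errors=200),
    ]
    print(page_shows_outage(slices))  # False: 1.2% globally, page stays green

One slice is 100% down, but globally that's only a 1.2% error rate, so the page stays green.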

I do agree it would be nice to have logins for the status page and then get detailed metrics based on customer ID or user ID. Someone should start a company to compete with Statuspage.
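A toy sketch of what that could look like: the same incident feed, filtered down to the slices a given customer actually runs on. Every name and ID here is hypothetical:

    from dataclasses import dataclass

    @dataclass
    class Incident:
        title: str
        affected_slices: set[str]  # e.g. {"region-foo/az-bar/v-zed"}

    INCIDENTS = [Incident("Elevated API errors", {"region-foo/az-bar/v-zed"})]
    CUSTOMER_SLICES = {
        "cust-123": {"region-foo/az-bar/v-zed"},
        "cust-456": {"region-baz/az-qux/v-zed"},
    }

    def status_for(customer_id: str) -> list[Incident]:
        mine = CUSTOMER_SLICES.get(customer_id, set())
        return [i for i in INCIDENTS if i.affected_slices & mine]

    print([i.title for i in status_for("cust-123")])  # ['Elevated API errors']
    print([i.title for i in status_for("cust-456")])  # []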


There is always going to be SOME delay between the outage and the status page, although 5 minutes is probably enough time that it should have been updated.


> after several minutes the status page is still showing all is fine.

For a service like GH, anything more than 30 seconds is unacceptable.


That is very unrealistic. Infrastructure monitoring at that scale won't even be collecting metrics at that interval.

And simple HTTP monitoring would be too flappy for a public status page.
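Flappiness is usually handled by requiring N consecutive failures before declaring an outage, which is exactly where the extra latency comes from. A bare-bones sketch; the URL and thresholds are placeholders:

    import time
    import urllib.request

    PROBE_URL = "https://example.com/healthz"  # placeholder endpoint
    INTERVAL_S = 30
    FAILURES_TO_ALERT = 3  # ~90s of added latency in exchange for stability

    def probe_ok(url: str) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except OSError:  # URLError and timeouts both subclass OSError
            return False

    failures = 0
    while True:
        failures = 0 if probe_ok(PROBE_URL) else failures + 1
        if failures == FAILURES_TO_ALERT:
            print("declare outage")  # here you'd flip the status page
        time.sleep(INTERVAL_S)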


What monitoring tools are you using? I know a ton that can do 30 seconds or less at scale. In fact, I'm pretty sure all the big players can do that.


My guess is it's simply too soon for the status page to report the anomaly. It's been down for 4 minutes.


4 minutes is a long time for something that could have been an automated check.

For the record, the status page eventually got updated - around 7 minutes after this submission was created.


Once in the past I did actually have an incident where the site went down so hard that the tool we used to update the status page didn't work. We moved it to a totally external and independent service after that. The first service we used was flakier than our actual site, so it kept showing the site down when it wasn't. So then we moved to another one, etc. Job security. :)


They say you shouldn't host a status page on the same infrastructure it monitors, but in a way that makes it much more accurate and responsive during outages!





