It is kinda amazing how consistently status pages show everything fine during a total outage. It's not that hard to connect a status page to end-to-end monitoring statistics...
From my experience this requires a few steps happen first:
- an incident be declared internally to github
- support / incident team submits a new status page entry (with details on service(s) impact(ed))
- incident is worked on internally
- incident fixed
- page updated
- retro posted
Even aws now seems to have some automation for their various services per region. But it doesn't automatically show issues because it could be at the customer level or subset of customers, or subset of customers if they are in region foo in AZ bar, on service version zed vs zed - 1. So they chose not to display issues for subsets.
I do agree it would be nice to have logins for the status page and then get detailed metrics based on customerid or userid. Someone start a company to compete with statuspage.
Once in the past I did actually have an incident where the site went down so hard that the tool that we used to update the status page didn't work. We did move it to a totally external and independent service after that. The first service we used was more flaky than our actual site was, so it kept showing the site down when it wasn't. So then we moved to another one, etc. Job security. :)
They say you shouldn't host status pages on the same infrastructure that it is monitoring, but in a way that makes it much more accurate and responsive in outages!