
It is kinda amazing how consistently status pages show everything fine during a total outage. It's not that hard to connect a status page to end-to-end monitoring statistics...
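For the skeptics: "connect the status page to monitoring" can be as little as deriving the displayed state from a rolling window of end-to-end check results. A minimal sketch in Python; the window size, thresholds, and status names are all invented for illustration:

    from collections import deque

    # Derive the displayed status from a rolling window of end-to-end
    # synthetic check results instead of waiting for a human to file an
    # incident. Thresholds and status names are made up for illustration.
    WINDOW = deque(maxlen=10)  # last 10 check results, True = pass

    def record(check_passed: bool) -> str:
        WINDOW.append(check_passed)
        pass_rate = sum(WINDOW) / len(WINDOW)
        if pass_rate >= 0.9:
            return "operational"
        if pass_rate >= 0.5:
            return "degraded_performance"
        return "major_outage"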



In my experience, this requires a few steps to happen first:

- an incident is declared internally at GitHub

- the support / incident team submits a new status page entry with details on which service(s) are impacted (there's a sketch of automating this step after the list)

- incident is worked on internally

- incident fixed

- page updated

- retro posted
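If you wanted to automate that second step, the hosted statuspage.io product does expose a REST API for it. A rough sketch; the page ID, API key, and component ID are placeholders from your own account, and you should verify the payload shape against Atlassian's current docs:

    import os
    import requests

    # PAGE_ID, API_KEY, and the component ID are placeholders you'd pull
    # from your own statuspage.io account.
    PAGE_ID = os.environ["STATUSPAGE_PAGE_ID"]
    API_KEY = os.environ["STATUSPAGE_API_KEY"]

    def open_incident(name: str, component_id: str) -> dict:
        """File a new 'investigating' incident and mark one component degraded."""
        resp = requests.post(
            f"https://api.statuspage.io/v1/pages/{PAGE_ID}/incidents",
            headers={"Authorization": f"OAuth {API_KEY}"},
            json={
                "incident": {
                    "name": name,
                    "status": "investigating",
                    "component_ids": [component_id],
                    "components": {component_id: "degraded_performance"},
                }
            },
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()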

Even AWS now seems to have some automation for their various services per region. But it doesn't automatically show issues, because the problem could be at the level of a single customer or a subset of customers, say customers in region foo, AZ bar, on service version zed vs. zed - 1. So they chose not to display issues for subsets.
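To make the "subsets" point concrete, here's a toy illustration (all numbers invented) of why a threshold on the aggregate error rate never trips when only one slice of customers is down:

    from dataclasses import dataclass

    @dataclass
    class Slice:
        name: str       # e.g. "region foo / AZ bar / version zed"
        requests: int
        errors: int

    PAGE_ERROR_THRESHOLD = 0.05  # flip the page only above 5% global errors

    def page_shows_outage(slices: list[Slice]) -> bool:
        total = sum(s.requests for s in slices)
        errors = sum(s.errors for s in slices)
        return errors / total > PAGE_ERROR_THRESHOLD

    slices = [
        Slice("region foo / AZ bar / version zed", requests=1_000, errors=1_000),
        Slice("everything else", requests=99_000, errors=200),
    ]
    print(page_shows_outage(slices))  # False: 1.2% globally, page stays green

One slice is 100% down, but globally that's only a 1.2% error rate, so the page stays green.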

I do agree it would be nice to have logins for the status page and then get detailed metrics based on customer ID or user ID. Someone should start a company to compete with Statuspage.
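A toy sketch of what that could look like: the same incident feed, filtered down to the slices a given customer actually runs on. Every name and ID here is hypothetical:

    from dataclasses import dataclass

    @dataclass
    class Incident:
        title: str
        affected_slices: set[str]  # e.g. {"region-foo/az-bar/v-zed"}

    INCIDENTS = [Incident("Elevated API errors", {"region-foo/az-bar/v-zed"})]
    CUSTOMER_SLICES = {
        "cust-123": {"region-foo/az-bar/v-zed"},
        "cust-456": {"region-baz/az-qux/v-zed"},
    }

    def status_for(customer_id: str) -> list[Incident]:
        mine = CUSTOMER_SLICES.get(customer_id, set())
        return [i for i in INCIDENTS if i.affected_slices & mine]

    print([i.title for i in status_for("cust-123")])  # ['Elevated API errors']
    print([i.title for i in status_for("cust-456")])  # []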


There is always going to be SOME delay between the outage and the status page, although 5 minutes is probably enough time that it should have been updated.


> after several minutes the status page is still showing all is fine.

For a service like GH, anything more than 30 seconds is unacceptable.


That is very unrealistic. Infrastructure monitoring at that scale won't even be collecting metrics at that interval.

And simple HTTP monitoring would be too flappy for a public status page.
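Flappiness is usually handled by requiring N consecutive failures before declaring an outage, which is exactly where the extra latency comes from. A bare-bones sketch; the URL and thresholds are placeholders:

    import time
    import urllib.request

    PROBE_URL = "https://example.com/healthz"  # placeholder endpoint
    INTERVAL_S = 30
    FAILURES_TO_ALERT = 3  # ~90s of added latency in exchange for stability

    def probe_ok(url: str) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except OSError:  # URLError and timeouts both subclass OSError
            return False

    failures = 0
    while True:
        failures = 0 if probe_ok(PROBE_URL) else failures + 1
        if failures == FAILURES_TO_ALERT:
            print("declare outage")  # here you'd flip the status page
        time.sleep(INTERVAL_S)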


What monitoring tools are you using? I know a ton that can do 30 seconds or less at scale. In fact, I'm pretty sure all the big players can do that.


My guess is it's simply too soon for the status page to report the anomaly. It's been down for 4 minutes.


4 minutes is a long time for something that could have been an automated check.

For the record, the status page eventually got updated - around 7 minutes after this submission was created.


Once in the past I did actually have an incident where the site went down so hard that the tool we used to update the status page didn't work. We moved it to a totally external and independent service after that. The first service we used was flakier than our actual site, so it kept showing the site down when it wasn't. So then we moved to another one, etc. Job security. :)


They say you shouldn't host a status page on the same infrastructure it monitors, but in a way that makes it much more accurate and responsive during outages!





