
You're still missing the point: by calling the outages avoidable, you're assuming they are code-related. Distributed systems at scale are terrifyingly hard to control.

When I said "probably a fairly rare one", I didn't mean that the outages are rare. I meant that new code is probably a rare cause of the outages that happen. They have other causes unrelated to new code.

(I'm also skeptical that GitHub major outages happen "nearly weekly", but I don't have data.)




I'm not "missing" anything. I worked at Google for 7 years, much of which was spent working on, you guessed it, distributed systems infrastructure. You guard against this by carefully canarying things and putting robust testing, monitoring, and deployment procedures in place. A release might take a few days, but you can be reasonably certain your users won't be your guinea pigs, and if shit does hit the fan, rollback is easy, and you can reroute traffic elsewhere while you roll back. Most of the time no rollback is needed: you just flip a flag and do a rolling restart on the job in Borg. For some types of outages (most of which users never even see), Google has bots that calculate monetary loss. The figures can be quite staggering and motivating, so people do postmortems and try their best to make sure the outages don't happen again.
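
To make the canarying idea concrete, here is a rough sketch in Python. It is not how Borg or any Google tooling actually works. `canary_rollout`, `error_rate_at`, the stage fractions, and the error threshold are all hypothetical names and numbers made up for illustration. The point is only the shape of the loop: widen exposure in stages, check health at each one, and bail out early.

```python
from typing import Callable, Sequence, Tuple


def canary_rollout(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float = 0.01,
) -> Tuple[bool, float]:
    """Widen a release stage by stage; roll back at the first unhealthy stage.

    `stages` are traffic fractions in increasing order (hypothetical values).
    `error_rate_at` stands in for whatever monitoring you have: given a
    traffic fraction, it returns the observed error rate at that exposure.
    Returns (succeeded, final_traffic_fraction_on_new_version).
    """
    reached = 0.0
    for fraction in stages:
        # Expose `fraction` of traffic to the new version and observe it.
        observed = error_rate_at(fraction)
        if observed > max_error_rate:
            # Unhealthy: send all traffic back to the old version.
            return False, 0.0
        reached = fraction
    return True, reached


# A healthy release goes all the way out.
print(canary_rollout([0.01, 0.1, 1.0], lambda f: 0.001))

# A release that only misbehaves under wider load is caught at 10%.
print(canary_rollout([0.01, 0.1, 1.0], lambda f: 0.05 if f >= 0.1 else 0.001))
```

In the second call, only the 1% and 10% slices ever see the bad version, which is the whole argument for taking a few days over a release.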


So no Google service has ever experienced an outage? I distinctly remember Gmail being down on several occasions.


Gmail is several orders of magnitude larger than GitHub will ever be, and in recent memory I can only recall it being down once, and for a very small subset of users.


You're too advanced for the typical reader.

It's a startup site with half the people not having a test environment.


Everyone has a test environment; some people happen to send prod traffic to it.


Why are you continuing to assume that this outage was caused by a release of some kind?


Every change is a release if you squint right.


I'm not even assuming this outage was caused by a "change". There are DoS attacks, infrastructure/network outages, storage pool problems. Immediately assuming that someone pushed some code and it broke things seems like an extremely short-sighted view of how production systems fail.



