
How many more outages can we expect before they develop an appreciation for testing _before_ they push to prod?



Newly introduced errors in site code are only one of many sources of failure for a site like GitHub, and probably a fairly rare one.


It's a nearly weekly occurrence, and an avoidable one, I might add. When was the last time a major Google service had a major outage?


You're still missing the point, by assuming that the outages are code related when you call them avoidable. Distributed systems at scale are terrifyingly hard to control.

When I said "probably a fairly rare one", I didn't mean that the outages are rare. I meant that new code is probably a rare cause of the outages that happen. They have other causes unrelated to new code.

(I'm also skeptical that GitHub major outages happen "nearly weekly", but I don't have data.)


I'm not "missing" anything. I worked at Google for 7 years, much of which was spent working on, you guessed it, distributed systems infrastructure. You guard against this by carefully canarying things and putting robust testing, monitoring, and deployment procedures in place. A release might take a few days, but you can be reasonably certain your users won't be your guinea pigs, and if shit does hit the fan, rollback is easy, and you can reroute traffic elsewhere while you roll back. Most of the time no rollback is needed: you just flip a flag and do a rolling restart on the job in Borg.

For some types of outages (most of which users never even see) Google has bots that calculate monetary loss. The figures can be quite staggering and motivating, so people do postmortems and try their best to make sure the outages don't happen again.
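
Something like this toy Python sketch of a flag-gated path (not Google's actual tooling; the flag store and function names are all made up): the new code ships guarded by a runtime flag, so the "rollback" is flipping the flag and restarting, not redeploying old binaries.

    # Toy sketch only; flag file location and renderer names are hypothetical.
    import json

    FLAG_FILE = "/etc/myservice/flags.json"  # hypothetical flag store

    def load_flags(path=FLAG_FILE):
        """Load runtime flags; fall back to the old behavior if unreadable."""
        try:
            with open(path) as f:
                return json.load(f)
        except (OSError, ValueError):
            return {}

    FLAGS = load_flags()

    def render_v1(request):
        return "old renderer: %s" % request

    def render_v2(request):
        return "new renderer: %s" % request

    def handle_request(request):
        # The new path ships dark. Enabling it is a config change; disabling
        # it again is the "rollback": flip the flag, rolling-restart the job.
        if FLAGS.get("use_new_renderer", False):
            return render_v2(request)
        return render_v1(request)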


So no Google service has ever experienced an outage? I distinctly remember Gmail being down on several occasions.


Gmail is several orders of magnitude larger than GitHub will ever be, and in recent memory I can only recall it being down once, and for a very small subset of users.


You're too advanced for the typical reader.

It's a startup site where half the people don't have a test environment.


Everyone has a test environment; some people happen to send prod traffic to it.


Why are you continuing to assume that this outage was caused by a release of some kind?


Every change is a release if you squint right.


I'm not even assuming this outage was caused by a "change". There are DoS attacks, infrastructure/network outages, storage pool problems... Immediately assuming that someone pushed some code and it broke things seems like an extremely short-sighted view of how production systems fail.


They don't do any testing before pushing to prod?


It's called continuous delivery, man. Merge to master and it goes out the door automatically, even if it's busted to hell. All the cool kids are doing it.


I know what it's called. So do they have tests or not?


How was it merged to master if it did not pass the tests?


Shitty tests? Lack of integration tests? Lack of test coverage for a particular scenario? A test environment that doesn't represent the config that's deployed in production? There are literally hundreds of reasons why things like that could happen.

Depending on the system in question, you could have an isolated environment on which you replay prod traffic before cutting a new release, and then investigate failures. All new features are engineered so that they can be easily turned off using flags. Once that's done you could canary your release to, say, 5% of your user base (at Google it'd be in the least loaded Borg cell), and if something goes wrong, you quickly disable busted features in response to monitoring and alerts. You let that stew for a while, then deploy your release worldwide, and start working on the next one.
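
To make the replay part concrete, here's a toy Python sketch of sending captured prod requests to both prod and a candidate build and diffing the responses. The hostnames, ports, and capture format are invented for illustration; this isn't anyone's actual pipeline.

    # Toy replay/diff sketch; endpoints and capture format are hypothetical.
    import json
    import urllib.request

    PROD_BASE = "http://prod.internal:8080"            # hypothetical
    CANDIDATE_BASE = "http://candidate.internal:8080"  # hypothetical

    def fetch(base, path):
        with urllib.request.urlopen(base + path, timeout=5) as resp:
            return resp.status, resp.read()

    def replay(paths):
        """Send the same GET requests to prod and the candidate; collect diffs."""
        mismatches = []
        for path in paths:
            prod = fetch(PROD_BASE, path)
            candidate = fetch(CANDIDATE_BASE, path)
            if prod != candidate:
                mismatches.append((path, prod[0], candidate[0]))
        return mismatches

    if __name__ == "__main__":
        # captured_requests.json: a JSON list of request paths (hypothetical).
        with open("captured_requests.json") as f:
            captured = json.load(f)
        for path, prod_status, cand_status in replay(captured):
            print("DIFF %s: prod=%s candidate=%s" % (path, prod_status, cand_status))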


Let me get this straight... So the advantage of CI/CD is not automated testing + rollouts, but rather the removal of the testing requirement? Why waste resources on CI/CD in that case? Just remove the test requirements and deploy. In fact, remove the canary as well - the more traffic hits the broken release, the faster it will become obvious that the release is broken.

The above was sarcasm.

If your org has CD but does not have a CI validation step that provides 100% test coverage of your code, then you do not have CD - you have MSP: a Merge->Ship->Pray system.


I'm not sure how you got that out of what I wrote. If anything, it's a recognition that unit tests alone are never sufficient, and a _drastic increase_ in testing effort, _in addition_ to CI. Humans are in the loop. Testing is near exhaustive, because it happens in an environment nearly identical to prod, with a copy of prod traffic. Users just don't see the results. Compare that to the "push and pray" approach of a typical CD setup.


I'm sorry, but if you are adding a new feature, then no old feature can break without unit and integration tests indicating the breakage.

What one should do is push the new feature dark, i.e. not enabled for anyone except the automated test suite user. That's the user that should exercise all the old paths and validate that no old path is broken when the feature is enabled for that user but unused. After that is validated in production, one can enable the feature for either a percentage of users or 100% of users, depending on how one is planning to live test and live QA the feature.
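
A rough Python sketch of that gating (the test-suite user ID, the rollout knob, and the bucketing scheme are invented for illustration, not any particular vendor's API):

    # Toy dark-launch sketch: off for everyone except the test-suite user,
    # then ramped by percentage. All names and values are hypothetical.
    import hashlib

    TEST_SUITE_USER = "automation@example.com"  # hypothetical test-suite account
    ROLLOUT_PERCENT = 0  # 0 = dark; later bump to 5, 50, 100 via config, not code

    def bucket(user_id, buckets=100):
        """Stable hash so a given user always lands in the same rollout bucket."""
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % buckets

    def feature_enabled(user_id):
        if user_id == TEST_SUITE_USER:
            return True  # the test suite exercises the new path in prod, dark
        return bucket(user_id) < ROLLOUT_PERCENT

While ROLLOUT_PERCENT is 0, only the test-suite user sees the feature; ramping it up is a config change, not a new release.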

The important part is that no new release can break existing usage patterns.

That's CI/CD. Everything else is magic, unicorns and rainbows.


The crucial difference is that CD postulates that if a change passes your automated test suite, it's good enough to immediately go live. I've dealt with many complex systems, some with pretty exhaustive test coverage, and this wasn't true in any of them. Unless your service is completely brain-dead simple (which none of them are once you have to scale), you always need a human looking at the release and driving it, and turning things off if tests missed bugs.


That's the exact argument that was used by sysadmins to explain why their jobs could not be automated.


And last I checked their jobs aren't, in fact, automated. They just moved to the various cloud providers and were renamed to "SRE", with a major boost in pay.


Pushing changes on Monday morning... way to ruin everyone's week!



