
How many more outages can we expect before they develop an appreciation for testing _before_ they push to prod?



Newly introduced errors in site code are only one of many sources of failure for a site like GitHub, and probably a fairly rare one.


It's a nearly weekly occurrence, and an avoidable one, I might add. When was the last time a major Google service had a major outage?


You're still missing the point, by assuming that the outages are code related when you call them avoidable. Distributed systems at scale are terrifyingly hard to control.

When I said "probably a fairly rare one", I didn't mean that the outages are rare. I meant that new code is probably a rare cause of the outages that happen. They have other causes unrelated to new code.

(I'm also skeptical that GitHub major outages happen "nearly weekly", but I don't have data.)


I'm not "missing" anything. I worked at Google for 7 years, much of which was spent working on, you guessed it, distributed systems infrastructure. You guard against this by carefully canarying things and putting robust testing, monitoring, and deployment procedures in place. A release might take a few days, but you can be reasonably certain your users won't be your guinea pigs, and if shit does hit the fan, rollback is easy, and you can reroute traffic elsewhere while you roll back. Most of the time no rollback is needed: you just flip a flag and do a rolling restart on the job in Borg.

For some types of outages (most of which users never even see) Google has bots that calculate monetary loss. The figures can be quite staggering and motivating, so people do postmortems and try their best to make sure the outages don't happen again.
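
Something like this toy Python sketch of a flag-gated path (not Google's actual tooling; the flag store and function names are all made up): the new code ships guarded by a runtime flag, so the "rollback" is flipping the flag and restarting, not redeploying old binaries.

    # Toy sketch only; flag file location and renderer names are hypothetical.
    import json

    FLAG_FILE = "/etc/myservice/flags.json"  # hypothetical flag store

    def load_flags(path=FLAG_FILE):
        """Load runtime flags; fall back to the old behavior if unreadable."""
        try:
            with open(path) as f:
                return json.load(f)
        except (OSError, ValueError):
            return {}

    FLAGS = load_flags()

    def render_v1(request):
        return "old renderer: %s" % request

    def render_v2(request):
        return "new renderer: %s" % request

    def handle_request(request):
        # The new path ships dark. Enabling it is a config change; disabling
        # it again is the "rollback": flip the flag, rolling-restart the job.
        if FLAGS.get("use_new_renderer", False):
            return render_v2(request)
        return render_v1(request)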


So no Google service has ever experienced an outage? I distinctly remember Gmail being down on several occasions.


Gmail is several orders of magnitude larger than GitHub will ever be, and in recent memory I can only recall it being down once, and for a very small subset of users.


You're too advanced for the typical reader.

It's a startup site where half the people don't have a test environment.


Everyone has a test environment; some people happen to send prod traffic to it.


Why are you continuing to assume that this outage was caused by a release of some kind?


Every change is a release if you squint right.


I'm not even assuming this outage was caused by a "change". There are DoS attacks, infrastructure/network outages, storage pool problems... Immediately assuming that someone pushed some code and it broke things seems like an extremely short-sighted view of how production systems fail.


They don't do any testing before pushing to prod?


It's called continuous delivery, man. Merge to master and it goes out the door automatically, even if it's busted to hell. All the cool kids are doing it.


I know what it's called. So do they have tests or not?


How was it merged to master if it did not pass the tests?


Shitty tests? Lack of integration tests? Lack of test coverage for a particular scenario? A test environment that doesn't represent the config that's deployed in production? There are literally hundreds of reasons why things like that could happen.

Depending on the system in question, you could have an isolated environment on which you replay prod traffic before cutting a new release, and then investigate failures. All new features are engineered so that they can be easily turned off using flags. Once that's done you could canary your release to, say, 5% of your user base (at Google it'd be in the least loaded Borg cell), and if something goes wrong, you quickly disable busted features in response to monitoring and alerts. You let that stew for a while, then deploy your release worldwide, and start working on the next one.
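
To make the replay part concrete, here's a toy Python sketch of sending captured prod requests to both prod and a candidate build and diffing the responses. The hostnames, ports, and capture format are invented for illustration; this isn't anyone's actual pipeline.

    # Toy replay/diff sketch; endpoints and capture format are hypothetical.
    import json
    import urllib.request

    PROD_BASE = "http://prod.internal:8080"            # hypothetical
    CANDIDATE_BASE = "http://candidate.internal:8080"  # hypothetical

    def fetch(base, path):
        with urllib.request.urlopen(base + path, timeout=5) as resp:
            return resp.status, resp.read()

    def replay(paths):
        """Send the same GET requests to prod and the candidate; collect diffs."""
        mismatches = []
        for path in paths:
            prod = fetch(PROD_BASE, path)
            candidate = fetch(CANDIDATE_BASE, path)
            if prod != candidate:
                mismatches.append((path, prod[0], candidate[0]))
        return mismatches

    if __name__ == "__main__":
        # captured_requests.json: a JSON list of request paths (hypothetical).
        with open("captured_requests.json") as f:
            captured = json.load(f)
        for path, prod_status, cand_status in replay(captured):
            print("DIFF %s: prod=%s candidate=%s" % (path, prod_status, cand_status))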


Let me get this straight... So the advantage of CI/CD is not automated testing + rollouts, but rather the removal of the testing requirement? Why waste resources on CI/CD in that case? Just remove the test requirements and deploy. In fact, remove the canary as well - the more traffic hits the broken release, the faster it will become obvious that the release is broken.

The above was sarcasm.

If your org has CD but does not have a CI validation step that provides 100% test coverage of your code, then you do not have CD - you have MSP: a Merge->Ship->Pray system.


I'm not sure how you got that out of what I wrote. If anything, it's a recognition that unit tests alone are never sufficient, and a _drastic increase_ in testing effort, _in addition_ to CI. Humans are in the loop. Testing is near exhaustive, because it happens in an environment nearly identical to prod, with a copy of prod traffic. Users just don't see the results. Compare that to the "push and pray" approach of a typical CD setup.


I'm sorry, but if you are adding a new feature, then no old feature can break without unit and integration tests indicating the breakage.

What one should do is push the new feature dark, i.e. not enabled for anyone except the automated test suite user. That's the user that should exercise all the old paths and validate that no old path is broken when the feature is enabled for that user but unused. After that is validated in production, one can enable the feature for either a percentage of users or 100% of users, depending on how one is planning to live test and live QA the feature.
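
A rough Python sketch of that gating (the test-suite user ID, the rollout knob, and the bucketing scheme are invented for illustration, not any particular vendor's API):

    # Toy dark-launch sketch: off for everyone except the test-suite user,
    # then ramped by percentage. All names and values are hypothetical.
    import hashlib

    TEST_SUITE_USER = "automation@example.com"  # hypothetical test-suite account
    ROLLOUT_PERCENT = 0  # 0 = dark; later bump to 5, 50, 100 via config, not code

    def bucket(user_id, buckets=100):
        """Stable hash so a given user always lands in the same rollout bucket."""
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % buckets

    def feature_enabled(user_id):
        if user_id == TEST_SUITE_USER:
            return True  # the test suite exercises the new path in prod, dark
        return bucket(user_id) < ROLLOUT_PERCENT

While ROLLOUT_PERCENT is 0, only the test-suite user sees the feature; ramping it up is a config change, not a new release.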

The important part is that no new release can break existing usage patterns.

That's CI/CD. Everything else is magic, unicorns and rainbows.


The crucial difference is that CD postulates that if a change passes your automated test suite, it's good enough to immediately go live. I've dealt with many complex systems, some with pretty exhaustive test coverage, and this wasn't true in any of them. Unless your service is completely brain-dead simple (which none of them are once you have to scale), you always need a human looking at the release and driving it, and turning things off if tests missed bugs.


That's the exact argument that was used by sysadmins to explain why their jobs could not be automated.


And last I checked their jobs aren't, in fact, automated. They just moved to the various cloud providers and were renamed to "SRE", with a major boost in pay.


Pushing changes on Monday morning... way to ruin everyone's week!



