I don't know the specifics of what happened here, but in my experience with automatic configuration generation you must have a way to validate the config, and that validator can have bugs, like any other piece of software.
Then either the software loading the configuration detects the problem, or the monitoring system notices something's not right; either way, the last working configuration is automatically reapplied and the non-working one is discarded.
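A minimal sketch of that last-known-good fallback, with invented names and toy checks standing in for a real validator and health probe:

```python
def apply_config(new_config, current_config, validate, healthy):
    """Try new_config; keep current_config (last known good) if anything fails."""
    if not validate(new_config):   # static checks: syntax, schema, invariants
        return current_config      # discard the non-working config
    if not healthy(new_config):    # runtime check after loading it somewhere safe
        return current_config      # monitoring caught it: roll back
    return new_config              # promote to the new "last known good"

# Toy stand-ins for a real validator and health probe
validate = lambda cfg: "backend" in cfg
healthy = lambda cfg: cfg.get("backend") != "unreachable"

good = {"backend": "pool-a"}
assert apply_config({"backend": "pool-b"}, good, validate, healthy) == {"backend": "pool-b"}
assert apply_config({"backend": "unreachable"}, good, validate, healthy) == good
```

The point of the pattern is that the fallback path never depends on the new config being sane; the old config is kept around untouched until the new one proves itself.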
By the looks of it, I would say their monitoring detected the problem, but the reliability team needed some minutes to realise it was a configuration problem. A classic scenario is a network appliance that is misbehaving (e.g. a firewall or switch), but nobody knows it's because of the configuration, so it gets replaced by a fallback appliance that... oh, has the same problem (same configuration).
Altogether, 25 minutes seems like a lot, but when you're troubleshooting and you know an important part of your infrastructure is down, time flies!
Sound familiar? The outage pattern is identical to the one that hit GitHub on January 8 [1]; see my comment in that thread discussing the root cause [2]. Systems generate config files and then push them out to services across the infrastructure without proper checking and linting. In Google's case, I just can't believe their systems are so fragilely integrated and such a critical component is so botched up.
Can you be more specific? They don't sound very similar at all, though to be fair, Google didn't provide many details here. They both involve generated configuration files?
> Systems generating config files and then pushing them out to services within the infrastructure without proper checking and linting.
It's not really that easy, as "proper checking and linting" might as well be phrased as "sufficiently smart checking and linting". You can have amazing checking and linting and still let a bug pass through.
I once experienced a similar issue (on a much smaller scale, obviously). Our DNS zone files were generated by a program and, thanks to a subtle bug, we went offline. The catch was that the generated files were valid, and no amount of linting would have saved us from syntactically perfect but wrong data.
The checks need to be quite sophisticated. We patched the generator, but never got around to implementing a test-deploy mechanism to ensure the config made sense before it reached a real production environment.
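To illustrate the "valid but wrong" failure mode with a made-up example (this is not our actual generator or zone data): every record below passes a pure syntax lint, but the generator dropped the one A record that mattered, so only a semantic check catches the problem.

```python
import re

# Syntax-only check: every line must merely look like a DNS record.
RECORD_RE = re.compile(r"^\S+\s+IN\s+(A|MX|NS)\s+\S+$")

def lint(zone_lines):
    return all(RECORD_RE.match(line) for line in zone_lines)

def semantic_check(zone_lines, required_names):
    """The names we actually serve must have A records."""
    a_records = {line.split()[0] for line in zone_lines if " IN A " in line}
    return required_names <= a_records

zone = [
    "example.com. IN NS ns1.example.com.",
    "example.com. IN MX mail.example.com.",
    # the generator's bug dropped: "www.example.com. IN A 192.0.2.10"
]

assert lint(zone)                                      # the lint is perfectly happy
assert not semantic_check(zone, {"www.example.com."})  # the semantic check is not
```

A real semantic check would be closer to a test deploy: load the zone into a throwaway resolver and query the names you care about before anything goes to production.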
In that instance the rule was written by a human, but deployed automatically to their entire infrastructure before they could notice the problem it caused.
The main thing that comes to mind is: why don't they deploy these kinds of changes to a small slice and smoke-test that slice before deploying to all users? This seems to be a pretty common routine for services at scale nowadays...?
I'm sure they use Canary Deployments, Gradual Rollouts and what have you to update their services.
I suppose this is a hard problem to solve on a configuration change level though. Imagine the configuration change that triggered the bug was something like "hey load balancers, stop sending traffic to the cluster with that new version of service X which seems to cause elevated error rates." You don't really want that kind of change to take too long to propagate.
Seems like the bug was in the config generator/deployer and not the service itself, so it's quite possible that things behaved normally during their dogfooding/smoke-testing phase. But you're right: the config should probably have been rolled out region by region with a short baking period in between, which would have turned a global outage into a regional one.
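The region-by-region idea could be sketched like this (region names, the push mechanism, and the health check are all invented for illustration):

```python
import time

def staged_rollout(regions, push, is_healthy, bake_seconds=0):
    """Push config one region at a time; stop at the first unhealthy region."""
    done = []
    for region in regions:
        push(region)                 # deploy the new config to this region only
        time.sleep(bake_seconds)     # let the change bake before judging it
        if not is_healthy(region):
            return done, region      # abort: remaining regions never get the config
        done.append(region)
    return done, None

pushed = []
is_healthy = lambda r: r != "europe-west1"   # pretend this region breaks

done, failed = staged_rollout(
    ["us-east1", "us-west1", "europe-west1", "asia-east1"],
    pushed.append, is_healthy)

assert done == ["us-east1", "us-west1"]
assert failed == "europe-west1"
assert "asia-east1" not in pushed            # blast radius contained
```

The trade-off mentioned above still applies: the longer the bake time, the slower an urgent "stop sending traffic there" change propagates.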
It's almost certain that they do this in general, and the fact that it didn't happen here is part of the issue. Any blog post describing something like this is going to leave out details like that.
Depending on what these configuration files are used for, it might not be ideal to update only part of the clusters. That might leave the system in an inconsistent state.
I gotta admit, even though I already knew this in the back of my head, the most surprising thing about this is that Google still uses Blogspot for stuff.
You may visit a site hosted on blogspot more often than you think. Blogspot is blocked where I live (China) so I notice immediately when I click on a blogspot link on HN. If I didn't have to deliberately turn on VPN, I might just read the post without bothering to notice where it's hosted.