Today’s outage for several Google services (googleblog.blogspot.com)
80 points by panarky on Jan 25, 2014 | 24 comments



Cliff notes: google's configuration service broke itself then fixed itself today. Engineers were alerted. Skynet is self-aware.


yeah I'm very curious as to what sort of bug deploys a bad configuration and then magically deploys the fix 30 minutes later...


I don't know the specifics of what happened here, but in my experience with automatic configuration generation you need a way to validate the config, and that validator can have bugs (like any other piece of software).

Then either the software loading the configuration detects the problem or the monitoring system notices something's not right, and the last working configuration is automatically applied while the non-working one is discarded.
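
To make that concrete, here's a rough sketch of such a validate-then-fall-back step for a JSON config (the file paths, the validate() rule and the "backends" key are made up for illustration, not anything from Google's post):

    import json
    import shutil

    class ConfigError(Exception):
        pass

    def validate(config):
        # Toy check; a real validator is much richer (and can itself be buggy).
        if not config.get("backends"):
            raise ConfigError("no backends defined")

    def deploy(candidate_path, active_path, last_good_path):
        """Apply a freshly generated config, falling back to the last working one."""
        try:
            with open(candidate_path) as f:
                config = json.load(f)
            validate(config)
        except (ValueError, ConfigError):
            # Bad candidate: keep serving the last known-good config, discard the new one.
            shutil.copyfile(last_good_path, active_path)
            return False
        # Candidate passed: make it active and remember it as the new known-good copy.
        shutil.copyfile(candidate_path, active_path)
        shutil.copyfile(candidate_path, last_good_path)
        return True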

By the looks of it, I would say their monitoring detected the problem but the reliability team needed some minutes to realise it was a configuration problem. A classic case is a network appliance that is misbehaving (e.g. a firewall or switch), but nobody knows it's because of the configuration, so it gets replaced by a fallback appliance that... oh, has the same problem (configuration).

Altogether 25 minutes seems like a lot, but when you're troubleshooting and you know an important part of your infrastructure is down, time flies!


Probably a race condition that happened in the first deployment and the system ran just fine the second time.


the bug was probably neatly tucked inside a conditional that was time and/or state related

break_google() if state == "x" else fix_google()


Sound familiar? The outage pattern is identical to the one that hit GitHub on January 8 [1]; my comment in that thread discusses the root cause [2]: systems generating config files and then pushing them out to services across the infrastructure without proper checking and linting. In Google's case, I just can't believe their systems are so delicately integrated and such a critical component is so botched up.

[1] https://github.com/blog/1759-dns-outage-post-mortem

[2] https://news.ycombinator.com/item?id=7081913


Can you be more specific? They don't sound very similar at all, though to be fair, Google didn't provide that many details here. They both involve generated configuration files?

> Systems generating config files and then pushing them out to services within the infrastructure without proper checking and linting.

It's not really that easy, as "proper checking and linting" might as well be phrased as "sufficiently smart checking and linting". You can have amazing checking and linting and still let a bug pass through.


Modern DevOps - the ability to deploy catastrophic mistakes globally and instantaneously!

(what could possibly go wrong?)


I once experienced a similar issue (on a much smaller scale, obviously). Our DNS zone files were generated by a program and, thanks to a subtle bug, we went offline. The catch was that the generated files were valid, and no amount of linting would have saved us from syntactically perfect wrong data.

The checks need to be quite sophisticated. We patched the generator, but never got around to implementing a test-deploy mechanism to ensure the config made sense before deploying it to the real production environment.
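
For what it's worth, the kind of check that would have caught it is semantic rather than syntactic: something like this sketch, which compares the generated zone against the previous one (the 20% threshold and the must_exist names are arbitrary examples):

    def sanity_check_zone(new_zone_path, old_zone_path, must_exist=("www", "mail")):
        """Checks a syntax linter can't do: compare the generated zone to the last one."""
        def records(path):
            with open(path) as f:
                return [line.split() for line in f
                        if line.strip() and not line.startswith((";", "$"))]

        old, new = records(old_zone_path), records(new_zone_path)

        # A big drop in record count usually means the generator lost data,
        # not that half the hosts were legitimately decommissioned at once.
        if len(new) < 0.8 * len(old):
            raise ValueError("record count dropped from %d to %d" % (len(old), len(new)))

        # Names that must never disappear, however syntactically valid the file is.
        names = set(r[0] for r in new)
        missing = [n for n in must_exist if n not in names]
        if missing:
            raise ValueError("critical names missing from generated zone: %s" % missing)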


Also reminded me of this CloudFlare outage from last year, where a bad configuration rule caused their routers to crash: http://blog.cloudflare.com/todays-outage-post-mortem-82515

In that instance the rule was written by a human, but deployed automatically to their entire infrastructure before they could notice the problem it caused.


Sounds like the problem Dropbox experienced recently too: https://tech.dropbox.com/2014/01/outage-post-mortem/


I wonder if it was gen_aggregator.py again, 6 years later.


The main thing that comes to mind is: why don't they deploy this kind of change to a small slice and smoke test that slice before deploying to all users? This seems to be a pretty common routine for services at scale nowadays...?


I'm sure they use Canary Deployments, Gradual Rollouts and what have you to update their services.

I suppose this is a hard problem to solve on a configuration change level though. Imagine the configuration change that triggered the bug was something like "hey load balancers, stop sending traffic to the cluster with that new version of service X which seems to cause elevated error rates." You don't really want that kind of change to take too long to propagate.


Seems like the bug was in the config generator/deployer and not the service itself, so it's quite possible that things behaved normally during their dogfooding/smoke-testing phase. But you're right, the config should probably have been rolled out region by region with a short baking period in between instead of globally, which is what made the outage worldwide.
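
Something along these lines, say (push, rollback and healthy are stand-ins for whatever the real deployment tooling exposes, and the ten-minute bake time is an arbitrary number):

    import time

    def staged_rollout(regions, push, rollback, healthy, bake_seconds=600):
        """Push a config change one region at a time with a baking period,
        undoing it everywhere it reached if health checks regress."""
        pushed = []
        for region in regions:
            push(region)
            pushed.append(region)
            time.sleep(bake_seconds)   # let elevated error rates surface before moving on
            if not healthy(region):
                for r in pushed:       # roll the change back in every region it reached
                    rollback(r)
                return False
        return True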


It's almost certain that they generally do this, and the fact that it's not what happened here is part of the issue. Any blog post describing something like this is going to leave out details like that.


Depending on what these configuration files are used for, it might not be ideal to update only part of the clusters. That might leave the system in an inconsistent state.


I gotta admit, even though I already knew this in the back of my head, the most surprising thing about this is that Google still uses Blogspot for stuff.


How is that surprising? Tumblr uses Tumblr to post announcements, and Google owns Blogspot, so of course it would use Blogspot to make announcements.


Honestly, I keep forgetting that Google owns Blogspot, mostly because I keep forgetting that Blogspot still exists.


You may visit a site hosted on blogspot more often than you think. Blogspot is blocked where I live (China) so I notice immediately when I click on a blogspot link on HN. If I didn't have to deliberately turn on VPN, I might just read the post without bothering to notice where it's hosted.


True. Matt Green for example uses blogspot too. http://blog.cryptographyengineering.com/

The PyCon blog is also on Blogspot. http://pycon.blogspot.com/


Back in the day, we used to TEST configurations on a few machines before blowing them out to, you know, EVERYTHING!


Having a system that sends out config files sounds like the part of the body that sends out hormones, etc. Both can go wrong.



