I don't know the specifics of what happened here, but in my experience with automatic configuration generation you must have a way to validate the config, and that validator can have bugs, like any other piece of software.
Then either the software loading the configuration detects the problem, or the monitoring system notices something's not right; either way, the last working configuration is automatically reapplied and the non-working one is discarded.
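A minimal sketch of that last-known-good fallback, with invented names and toy checks standing in for a real validator and health probe:

```python
def apply_config(new_config, current_config, validate, healthy):
    """Try new_config; keep current_config (last known good) if anything fails."""
    if not validate(new_config):   # static checks: syntax, schema, invariants
        return current_config      # discard the non-working config
    if not healthy(new_config):    # runtime check after loading it somewhere safe
        return current_config      # monitoring caught it: roll back
    return new_config              # promote to the new "last known good"

# Toy stand-ins for a real validator and health probe
validate = lambda cfg: "backend" in cfg
healthy = lambda cfg: cfg.get("backend") != "unreachable"

good = {"backend": "pool-a"}
assert apply_config({"backend": "pool-b"}, good, validate, healthy) == {"backend": "pool-b"}
assert apply_config({"backend": "unreachable"}, good, validate, healthy) == good
```

The point of the pattern is that the fallback path never depends on the new config being sane; the old config is kept around untouched until the new one proves itself.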
By the looks of it, I would say their monitoring detected the problem, but the reliability team needed some minutes to realise it was a configuration problem. A classic scenario is a network appliance that is misbehaving (e.g. a firewall or switch), but nobody knows it's because of the configuration, so it gets replaced by a fallback appliance that... oh, has the same problem (same configuration).
Altogether, 25 minutes seems like a lot, but when you're troubleshooting and you know an important part of your infrastructure is down, time flies!
Sound familiar? The outage pattern is identical to the one that hit GitHub on January 8 [1]; see my comment in that thread discussing the root cause [2]. Systems generate config files and then push them out to services across the infrastructure without proper checking and linting. In Google's case, I just can't believe their systems are so fragilely integrated and such a critical component is so botched up.
Can you be more specific? They don't sound very similar at all, though to be fair, Google didn't provide many details here. They both involve generated configuration files?
> Systems generating config files and then pushing them out to services within the infrastructure without proper checking and linting.
It's not really that easy, as "proper checking and linting" might as well be phrased as "sufficiently smart checking and linting". You can have amazing checking and linting and still let a bug pass through.
I once experienced a similar issue (on a much smaller scale, obviously). Our DNS zone files were generated by a program and, thanks to a subtle bug, we went offline. The catch was that the generated files were valid, and no amount of linting would have saved us from syntactically perfect but wrong data.
The checks need to be quite sophisticated. We patched the generator, but never got around to implementing a test-deploy mechanism to ensure the config made sense before it reached a real production environment.
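To illustrate the "valid but wrong" failure mode with a made-up example (this is not our actual generator or zone data): every record below passes a pure syntax lint, but the generator dropped the one A record that mattered, so only a semantic check catches the problem.

```python
import re

# Syntax-only check: every line must merely look like a DNS record.
RECORD_RE = re.compile(r"^\S+\s+IN\s+(A|MX|NS)\s+\S+$")

def lint(zone_lines):
    return all(RECORD_RE.match(line) for line in zone_lines)

def semantic_check(zone_lines, required_names):
    """The names we actually serve must have A records."""
    a_records = {line.split()[0] for line in zone_lines if " IN A " in line}
    return required_names <= a_records

zone = [
    "example.com. IN NS ns1.example.com.",
    "example.com. IN MX mail.example.com.",
    # the generator's bug dropped: "www.example.com. IN A 192.0.2.10"
]

assert lint(zone)                                      # the lint is perfectly happy
assert not semantic_check(zone, {"www.example.com."})  # the semantic check is not
```

A real semantic check would be closer to a test deploy: load the zone into a throwaway resolver and query the names you care about before anything goes to production.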
In that instance the rule was written by a human, but deployed automatically to their entire infrastructure before they could notice the problem it caused.
The main thing that comes to mind is: why don't they deploy these kinds of changes to a small slice and smoke-test that slice before deploying to all users? This seems to be a pretty common routine for services at scale nowadays...?
I'm sure they use Canary Deployments, Gradual Rollouts and what have you to update their services.
I suppose this is a hard problem to solve on a configuration change level though. Imagine the configuration change that triggered the bug was something like "hey load balancers, stop sending traffic to the cluster with that new version of service X which seems to cause elevated error rates." You don't really want that kind of change to take too long to propagate.
Seems like the bug was in the config generator/deployer and not the service itself, so it's quite possible that things behaved normally during their dogfooding/smoke-testing phase. But you're right: the config should probably have been rolled out region by region with a short baking period in between, which would have turned a global outage into a regional one.
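The region-by-region idea could be sketched like this (region names, the push mechanism, and the health check are all invented for illustration):

```python
import time

def staged_rollout(regions, push, is_healthy, bake_seconds=0):
    """Push config one region at a time; stop at the first unhealthy region."""
    done = []
    for region in regions:
        push(region)                 # deploy the new config to this region only
        time.sleep(bake_seconds)     # let the change bake before judging it
        if not is_healthy(region):
            return done, region      # abort: remaining regions never get the config
        done.append(region)
    return done, None

pushed = []
is_healthy = lambda r: r != "europe-west1"   # pretend this region breaks

done, failed = staged_rollout(
    ["us-east1", "us-west1", "europe-west1", "asia-east1"],
    pushed.append, is_healthy)

assert done == ["us-east1", "us-west1"]
assert failed == "europe-west1"
assert "asia-east1" not in pushed            # blast radius contained
```

The trade-off mentioned above still applies: the longer the bake time, the slower an urgent "stop sending traffic there" change propagates.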
It's almost certain that they do this in general, and the fact that it didn't happen here is part of the issue. Any blog post describing something like this is going to leave out details like that.
Depending on what these configuration files are used for, it might not be ideal to update only part of the clusters. That might leave the system in an inconsistent state.
I gotta admit, even though I already knew this in the back of my head, the most surprising thing about this is that Google still uses Blogspot for stuff.
You may visit a site hosted on blogspot more often than you think. Blogspot is blocked where I live (China) so I notice immediately when I click on a blogspot link on HN. If I didn't have to deliberately turn on VPN, I might just read the post without bothering to notice where it's hosted.