Google Compute Engine Incident #16015 (cloud.google.com)
138 points by louis-paul on Aug 9, 2016 | 80 comments



It seems like most of the work done to make distributed systems reliable is aimed at handling machines or groups of machines going down (e.g. the leader node in one region going down at the same time as an entire other region). That half of the problem seems largely solved.

The incidents described in postmortems published by Google, Amazon, and Azure (as well as postmortems internal to the company I work for) are nearly always caused by some type of change (code or configuration) being rolled out. It seems to me that we need some help from computers to make these systems reliable: something like static type checking in a programming language, but applied to a distributed system. Perhaps your architecture wouldn't "compile" if network traffic would go to the wrong place, or if a rate limit is set above the capacity something is expected to handle, or if a change would impact too many servers at once.
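To make that concrete, here's a minimal sketch (in Python) of what such a "compile" step could look like; RolloutPlan, the field names, and the thresholds are all hypothetical, purely for illustration:

    from dataclasses import dataclass

    @dataclass
    class RolloutPlan:
        service: str
        rate_limit_qps: int     # rate limit the change would configure
        capacity_qps: int       # measured capacity of the target service
        servers_affected: int   # how many servers this change touches
        fleet_size: int

    def check(plan):
        """Return a list of violations; empty means the plan 'compiles'."""
        errors = []
        if plan.rate_limit_qps > plan.capacity_qps:
            errors.append("rate limit %d qps exceeds capacity %d qps"
                          % (plan.rate_limit_qps, plan.capacity_qps))
        if plan.servers_affected > 0.25 * plan.fleet_size:
            errors.append("change touches %d of %d servers at once (>25%%)"
                          % (plan.servers_affected, plan.fleet_size))
        return errors

    violations = check(RolloutPlan("frontend", 12000, 10000, 600, 1000))
    if violations:
        raise SystemExit("refusing to 'compile':\n" + "\n".join(violations))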


(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an SRE at Google. My team is oncall for this service and I know exactly what happened here; I probably can't answer most questions you might have.)

> Perhaps your architecture wouldn't "compile" if network traffic would go to the wrong place, or if a rate limit is set above the capacity something is expected to handle, or if a change would impact too many servers at once.

So in the first instance, I tend to like this sort of idea. However: we are already substantially ahead of the sort of things that you're thinking of.

Full static simulation of a system as complicated as all the components involved here is... well, I can sort of see how it could be done, but it would be a herculean effort; I don't think it would ever be good enough to catch cases like this the first time they happen. There are systems where this sort of thing can be done, but all the ones I can think of are much smaller in scope.


What does "static simulation" mean ? Probably not static analysis.


To fully answer questions like "how much traffic will go in this direction?" you need your analysis to include a simulation of what the entire internet is doing. That's hard.

I can't talk about the details, but you can assume that "static analysis" of the form being talked about here is something we've already done, and it's not enough to handle cases this complicated.


>The incidents described in postmortems published by Google, Amazon, and Azure (as well as postmortems internal to the company I work for) are nearly always caused by some type of change (code or configuration) being rolled out

When you have a really large distributed system that's primarily running on metal, smaller-scale copies of the whole system per developer or even a single companywide staging environment that mirrors production are really hard, and they don't always exist.

Developers work on their components in isolation by mocking out the rest of the system, hopefully there's a thorough code review, and then "integration testing" happens by flipping a feature flag and watching the logs/metrics in production. You might design the feature flagging so it initially only hits test accounts, but some things (like service communication layers, Puppet configs, router configs) don't work that way.

The cost of outages resulting from the lack of a staging environment may well be less than the cost of re-architecting production (100% automation, no snowflakes, and probably some kind of IaaS) to allow for disposable dev/test environments that would exhibit the same bugs.


But I think what you're saying is actually an argument for @piinbinary's idea of "static analysis".

If you can't really test your change, then the other thing you can do is try to rationally analyse its consequences. After all, that's why "hopefully there's a thorough code review".

Wouldn't it be nice to automate some of that analysis? I have no idea how to go about such a thing, or whether it is even possible -- but the harder testing gets, the more attractive this option is.


> Perhaps your architecture wouldn't "compile" if network traffic would go to the wrong place, or if a rate limit is set above the capacity something is expected to handle, or if a change would impact too many servers at once.

This is an interesting idea. I'm always terrified when I have to deploy a minor configuration change into a production system that gets distributed out; if there were an easy way to apply some sort of check against all of it (without having to build something explicitly to do this), that would be awesome.

I wonder if the future of something like this might be using containers. Each piece that gets deployed is in its own container, networked together; you stand it up in a staging environment, ensure the pieces are talking to each other correctly through some set of integration tests, then push it into the production environment.


I wish the law could be statically checked. In the US at least you are, at any given moment, likely breaking a law because for each and every law, there is one that contradicts it (at least partially).



Reminds me of Gödel's incompleteness theorem, which says that you can't have a (sufficiently powerful) set of axioms that is both complete and consistent. I don't understand law or this theorem (and its implications) well enough, but I felt this has some relation here.


Quite likely you'd first have to invent strong AI, which would give you a chance of defining a coherent theory of law, which is really a tangled mess. Pretty sure no human has the capacity to do it on their own in the foreseeable future.


I think I'd approach the problem from the other end, find a way to write laws such that they are checkable with assertions, properties, and constraints.


This would be amazing. If we had this, the first line of judges could just be automated. Feed evidence in, computer finds you guilty or innocent, done. Then if you appeal, you can seek a human judge in case judgement / exceptions need to be made (so if you technically broke a law but it was actually necessary / a good thing, you have recourse). First appeals could be very quick, there'd be less traffic to the first human judge, etc.

Hell, even without automating the justice system, just doing as you suggested would be amazing. I've often wondered about putting laws in GIT or similar.


I'd be happy if the laws were simply self-consistent, and checkably so. Enforcement/judgement I would leave to people.

And by self-consistent I mean that it doesn't contradict itself.


> putting laws in GIT or similar.

Tiny nitpick, it's git, not GIT :)


You realise that humans are very, very good at figuring out what just doesn't count as breaking an extremely strictly defined law, right?


This can be solved by proportional punishments. Then just barely not breaking the law and just barely breaking the law give similar outcomes (nothing vs. a slap on the wrist).


> Feed evidence in, computer finds you guilty or innocent, done.

Would you be willing to trust a program with your sentence? I wouldn't in the slightest.


Yes, but only with the provisions I added that you didn't include in your quote :) (e.g. a quick, efficient appeals process where I can talk with a human judge). Plus the laws would have to be written in this fashion, so it would be far, far less murky than it is today.

That's a lot of ifs and provisions... so I wouldn't expect this to ever happen.


There is a one-to-one correspondence between types and propositions, so I don't think this would change much in theory. I wonder whether a type system expressive enough to encode laws would even be decidable in the first place (at least practically, at the scale of a whole constitution).


In many cases you can do staged or partial rollouts, observe error rates or whatever suitable metrics you have, and then continue with the next batch of servers.
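As a rough sketch of that loop (deploy_to, error_rate, and rollback are hypothetical stand-ins for whatever deployment and monitoring tooling you have):

    import random
    import time

    STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of the fleet per stage
    ERROR_BUDGET = 0.001               # maximum acceptable error rate
    SOAK_SECONDS = 600                 # observe each stage before continuing

    # Placeholder hooks: in real life these would call your deployment
    # system and metrics pipeline.
    def deploy_to(version, fraction):
        print("deploying %s to %.0f%% of the fleet" % (version, fraction * 100))

    def error_rate(version):
        return random.random() * 0.0005   # pretend metric

    def rollback(version):
        print("rolling back %s" % version)

    def staged_rollout(version):
        for fraction in STAGES:
            deploy_to(version, fraction)
            time.sleep(SOAK_SECONDS)       # let metrics accumulate
            if error_rate(version) > ERROR_BUDGET:
                rollback(version)          # stop early; most of the fleet untouched
                return False
        return True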


I'd be happy if there were just a decent linter for all configuration files. My test process right now is to spin up a VM and test something like an nginx reload or an ansible-playbook run; if that passes, apply it to a distributed testing environment; and if that passes, consider applying it to production.

This seems crazy to me when IntelliSense for many languages can seemingly read my mind, but configuration is in many ways still a trial-and-error process.

Does anyone know of ways to parse many of the common configuration file formats? From linting to code completion to best-practice (or worst-practice) checks. If not, and somebody wants my money yesterday, there's your startup.
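In the meantime, just parsing the files already catches a surprising number of mistakes. A minimal sketch of a syntax-only check (PyYAML is a third-party dependency, and real best-practice checks would need per-format schemas on top of this):

    import configparser
    import json
    import sys

    import yaml  # third-party: pip install pyyaml

    def syntax_check(path):
        with open(path) as f:
            text = f.read()
        if path.endswith(".json"):
            json.loads(text)
        elif path.endswith((".yaml", ".yml")):
            yaml.safe_load(text)
        elif path.endswith((".ini", ".cfg")):
            configparser.ConfigParser().read_string(text)
        else:
            sys.exit("don't know how to parse %s" % path)

    for path in sys.argv[1:]:
        syntax_check(path)   # raises with a line/column on syntax errors
        print("%s: OK" % path)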


If you have nginx installed, you can test nginx configurations like so: $ nginx -t -c <configuration file>


I still have to spin up the vm though to make sure changes to things like proxy settings won't affect me further down the stack. But thanks for the tip!


Ansible also has --syntax-check and even --check, which "tries to predict some of the changes that may occur".


I work on a service with high OLA requirements. I'm terrified every time I roll out a new build or config change, in spite of having tested them thoroughly. At the scale at which these things operate, it's very difficult to manually imagine all the possible ways things can go wrong, especially when you have multiple services that are all interdependent and written/maintained by different teams in different geographic locations.

I'm sure there is lots of research happening in this area within Microsoft and Google. I hope it arrives sooner rather than later.


They switched to OpenFlow on internal networks for this reason (it's deterministic): https://www.youtube.com/watch?v=FaAZAII2x0w

The external network is BGP though, and it sounds like they didn't detect the issue until user reports came in. They can't predict the problem, and their detection isn't working well either.


Good idea in theory, but not sure how it pans out in practice. Defining "the wrong place" would require some configuration, which could just as easily be misconfigured.

Abstracting the problem to another layer isn't always the answer.


At the platform-wide incident level it's more often configuration than code.


According to network checks we have running against VMs in each GCE region, this event resulted in about 1.6 hours of concurrent ICMP timeouts for every region (except us-west1, which was affected for only 10 minutes). We use Panopta for monitoring, which verifies outages from multiple external network paths. When outages trigger, we also use RIPE Atlas to confirm them using hundreds of last-mile network paths, of which 85-95% resulted in timeouts. This is the second global GCE networking outage this year. These outages are particularly problematic because even multi-region load balancing will not avert downtime.

https://cloudharmony.com/status-for-google

Prior global outage - April 11: https://status.cloud.google.com/incident/compute/16007

Disclaimer: I am the founder of CloudHarmony. Edit: outages were triggered due to ICMP timeouts.


Not accusing you of being intentionally misleading, but it would be nice if you put a clear disclaimer pointing out that you work for CloudHarmony.


I founded CloudHarmony and now work for Gartner, which acquired CloudHarmony in 2015.


I think it would still be nice to mention that you work there.


Their post-mortem states that only traffic for protocols other than TCP and UDP was dropped. Does your monitoring take this into account?

It states "1.67 hours" of downtime, which is not the same as increased latency due to sub-optimal routing.


Our outages were triggered as a result of ICMP timeouts.


In this case it would be fair to note that this outage did not affect all connectivity (web traffic would be unaffected while ICMP was blocked).


Is there a reason for the downvotes?


Most probably because you didn't edit your post to add a disclaimer. It just looks like someone trying to profiteer off of GCE's incident.


I've added a disclaimer. We do not profit in any way from these incidents.


While it's true that an auto insurance company doesn't profit off car crashes... if people _never_ had car crashes, we might not need auto insurance companies.


Presumably because your post feels like an ad.


The most disturbing part of this incident, to me, is that Google Search/YouTube/GMail/etc never went down. Not even a blip.

That means that Google is not dogfooding GCE to the degree that I would hope and expect for me to risk my business on using the service. Disappointing, to say the least.

Say what you will about AWS (and I've said many critical things, and will continue to do so, publicly and privately), but when AWS has a major outage it also affects Amazon digital products. They have major skin in the game, while it appears that GCE is a special snowflake service completely separate from the important, money-making services at Google.


This always comes up, but Google has been pretty explicit in saying that they share some internal infrastructure but that lots of existing stuff isn't running on GCP. They have decades of customized software/hardware running already.

GCP is pretty new and there are plenty of big customers running on GCP that you can reference like Spotify and Snapchat. Also it is a rather important and money-making service, on track to potentially eclipse their entire ad business.


Just because they say it doesn't mean it's good, or even acceptable.

"[O]n track to potentially eclipse..." sounds like the worst form of quarterly-report spin.


What? Good/acceptable in what sense?

This is just the size of the market, cloud computing is already a major industry and has just started. The potential upside for a major cloud player dwarfs the entirety of digital advertising.


At a certain point, I think Google could probably defend this even to outsiders with "we're too important".

It might be good if Youtube or something else big-but-tangential was on the line when GCE changes went out. Search, though? It dropped globally for 2 minutes in 2013, and took out 40% of internet traffic in the process. There's a real argument that it's critical infrastructure like little else, since Google and Bing are the way most users have of reaching sites other than the core Facebook-Amazon-Yahoo locations.

More dogfooding wouldn't be a bad change overall, but people counting on Search and Gmail would probably prefer a minimal-downtime solution no matter how it's implemented.


Oh, I totally get why they do it. When you're an exec in charge of a ~$75B revenue stream[0] and the GCE people come to you, I totally get saying "Nah, let's leave it alone for now".

But like Steve Yegge said in his classic rant[1], when you're behind in the market you have to launch your cloud product (which is really just a subset of the "platform" problem Yegge was actually writing about) "...all at once, for real, no cheating".

If you don't have the organizational courage to take big risks, well I guess you deserve your 2.5% market share[2].

0. http://marketingland.com/google-revenues-beat-expectations-w...

1. https://gist.github.com/chitchcock/1281611

2. http://www.datacenterknowledge.com/wp-content/uploads/2016/0...


There was internal loss, but it was the combo of these:

> resulted in announcing some Google Cloud Platform IP addresses from a single point of presence in the southwestern US.

> not yet configured to handle Cloud Platform traffic and applied an overly-restrictive packet filter.

that caused the real pain. Other traffic that ended up in the Southwest would have routed crazily, but wouldn't have been stopped (the filter). If it had just been a routing thing for GCE (and related services), we would have just seen slowdowns, etc., like any other routing mistake.

Disclosure (and Disclaimer): I work on Compute Engine, but I'm not on our networking teams.


Search/YouTube/GMail/etc predate GCE by many years. In the case of Search, almost two decades. You can't expect them to rewrite everything every time a new tool comes out.


Networking is so fragile. I feel like DNS failover to another AS in another datacenter is the only solution for web services to actually have resilience against such failures.


I feel like DNS is one of the best ways to do service discovery, but its complexity is always baffling to me. I mean, I understand roughly how it all works, but if you asked me to stand up DNS internally for various servers to use, it might take me a while.

I've wondered if it's simply because I don't know it well enough or if we need an easier, more simple solution to replace DNS in internal networks. There are a ton of ways to do service discovery.


> I've wondered if it's simply because I don't know it well enough

IMO, and emphatically not ripping on you: it's this one. DNS is a pretty simple protocol, and at "you don't have ten thousand developers" scale managing it is pretty easy. It might take you a while the first time, for sure, but it won't the second, third, or twelfth time. =)

In particular, even setting aside explicit failover cases (which do require some outside monitoring and staging to make useful, i.e. when do you fail over?), round-robin DNS actually does a lot to provide redundancy, internal to a single data center or deployment, when coupled with SRV records.
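For illustration, a rough sketch of a client-side SRV lookup using the third-party dnspython package (the service name here is made up):

    import random

    import dns.resolver  # third-party: pip install dnspython

    answers = dns.resolver.resolve("_myapi._tcp.internal.example.com", "SRV")
    # Lowest priority wins; among equal priorities, pick by weight.
    records = sorted(answers, key=lambda r: r.priority)
    candidates = [r for r in records if r.priority == records[0].priority]
    choice = random.choices(candidates,
                            weights=[r.weight or 1 for r in candidates])[0]
    print("connect to %s:%d" % (choice.target.to_text(), choice.port))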


Fair enough. Any suggested reading material regarding this subject?


I honestly recommend just setting up a toy split-view DNS service with a hidden master using BIND9. Get zone transfers working and maybe play with dynamic updates, and you'll basically know all you need to know. I set up an internal DNS infra at a previous job and was pretty surprised at how simple it really ended up being.


Thanks! I'll take a look.


Ops engineer here. You are correct.


I'm a little confused about the mechanics of how DNS failover works. Would you care to explain a bit more?


There are a few things possible here. You can do global load balancing, which will serve a DNS entry for a hostname based on business rules. Those business rules can include things like geo-targeting, and/or they can also include health checks. So you could direct traffic via DNS based on whether a region is online at all, or on more granular things like whether it is meeting performance criteria, etc.

Even simpler is DNS round robin (just putting more than one entry in the reply to a DNS request). In terms of failover, if one of the servers in the reply stops working, generally another one is still up, and browsers and other devices generally understand that they can try all addresses listed for a given host.
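In code, that fallback looks roughly like this (a sketch only; the hostname is made up):

    import socket

    def connect_any(host, port, timeout=3.0):
        last_error = None
        # getaddrinfo returns every A/AAAA record in the DNS reply
        for family, socktype, proto, _, addr in socket.getaddrinfo(
                host, port, type=socket.SOCK_STREAM):
            try:
                return socket.create_connection(addr[:2], timeout=timeout)
            except OSError as exc:
                last_error = exc   # that address is down; try the next record
        raise last_error or OSError("no addresses resolved")

    sock = connect_any("api.example.com", 443)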


Isn't DNS cached at some level on the user's machine, such that clients which cached a record for a host that went down would still be directed to it?

FWIW I'm making it a thing today to read up on DNS as this post made me woefully aware of how little I understand it. So feel free to skip this question and I'll probably answer it myself tonight.


Typically, with round-robin DNS, the browser will try the other records if one fails. It's not perfect, but it's good enough.

In a perfect world, you use short TTLs (30-60 seconds), and withdraw a record from being advertised as soon as your health checks are failing for it.


That makes me think of something like an API-hungry mobile application that might make ~5 requests a second. If it had a 60-second cached DNS record for a hostname that corresponded to a failed host, that would lead to 300 failed requests and a really shitty user experience :(.


You're correct, but in practice this is usually combatted through UX: doing work in the background, retrying with decay, bubbling up things that didn't go well to the user at the right time, etc. For a client to just spam 300 requests the moment there is a hiccup is generally not good; at scale it's indistinguishable from a DDoS (been there, hah!).

Also, for it to be that jarring we are talking about a failure at the exact moment you are using the data-heavy app, and not one minute before. Every time you go to use the app, it's likely the DNS is re-queried.

So, at the end of the day, if you have a mission-critical thing running where you need to try very hard and can't tolerate 60 seconds of failure, you would at some point bake multiple paths to the backend into the client configuration. Then the client would know that if backend A is down, it can do client-side failover to backend B.

Sometimes this technique is used in conjunction with highly fault-tolerant server-side infrastructure, and not as a replacement for it.
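A rough sketch of what that looks like in a client, with retry-with-decay plus a baked-in fallback backend (the URLs are hypothetical):

    import time
    import urllib.request

    BACKENDS = ["https://a.api.example.com", "https://b.api.example.com"]

    def fetch(path, attempts_per_backend=3):
        for backend in BACKENDS:                  # client-side failover: A then B
            delay = 0.5
            for _ in range(attempts_per_backend):
                try:
                    with urllib.request.urlopen(backend + path, timeout=5) as resp:
                        return resp.read()
                except OSError:
                    time.sleep(delay)             # retry with decay
                    delay *= 2
        raise RuntimeError("all backends failed")

    data = fetch("/v1/status")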


Thanks for the follow-up; it's always insightful to hear from people who have seen these things in the trenches, on larger systems than I've been privy to.


If only DNS failover were as atomic as would be required to do good failover. Sadly, so many clients and caching mechanisms ignore TTL values that an actual failover (not to mention fail back) could take days.


It's "a" solution but not perfect since now you depend on the clients to use the new records (caches, respecting TTL), there's also propagation times etc.


What are the alternatives? I don't think you can prevent problems affecting the whole AS or problems affecting the whole company with anything but failing over to another AS or another company. And the only place you can do that for a web browser is DNS.

And there is some beauty in it. It's fairly simple, and with more servers in more places you get fewer affected clients when one of the places fails, and fewer problems from TTL-disrespecting clients as well. It scales very well and cheaply.


There's a refreshing humility here that I admire: admitting how it went wrong and what they've done to fix it, without naming any names. No attempting to lay the blame at someone's feet, just admitting that a process failed, and that they've amended it so it doesn't happen again.

Kudos.


Google's culture of blameless post-mortems is commendable.

In internal post-mortems, names are named when necessary for disambiguation, without consequences for the persons named. This results in all participants being candid, and the real source of the problem being more easily uncovered. (I wrote two post-mortems for small incidents I triggered.)


> without consequences to the persons named

I would hope so (everyone makes mistakes, some of which may lead to HUGE issues), and not giving them a chance to learn from it is a bit unrealistic. At the same time, however, I have been part of organizations where something has gone wrong and us engineers basically all had to "take the blame", because otherwise the single person responsible would likely have been fired. I'm confident Google is not like that, but it's always something I think about.


We live in a culture where we invoke a draconian system of punishment as a "justice" system. The idea, which does not translate to software, is that fear will keep the local systems in line. Fear of this battle station. So the fact that certain companies recognize a huge societal flaw and actively operate against it is rare (imo).


Yes, you do expect transparency from such service providers.

It sucks when somebody starts to point you at their SLA when you ask them about outages. I'm looking at you, AWS.


I've seen Google ask for an NDA before acknowledging a problem existed and was fixed. They are at best selectively transparent.


Attention startups: THIS is what your public postmortems should look like.

This level of detail and honesty, this level of specific steps that are being taken to prevent recurrence.

75% of startups try to sweep downtime under the rug and don't even acknowledge problems happened at all -- or only do so privately in a personal support email after hours of interrogation. Another 20% just write "we had some minor issues but everything is fine now."

Be like Google.


I'm constantly amazed by the competence of these 'big cloud' teams, I can't imagine the amount of preparatory work they put in. It does, however, make it clear how badly the Internet is starting to creak.


Shit happens for everyone, and this kind of error could have happened for other clouds as well. It's the responsibility towards customers and lessons learned that matter.


Some detailed info, which is great; however, I am more interested in knowing 1) why, or what, led to the wrong decision being made, and 2) why it took so long to notice/revert.


These configurations are absurdly finicky. If you remember, a few years ago Pakistan took down YouTube due to a similar Border Gateway Protocol misconfiguration. These things are pretty quick to be caught, but the configuration changes take time to propagate (in this case 30 minutes).

http://www.cnet.com/news/how-pakistan-knocked-youtube-offlin...


So in instances like this where everything goes wrong, does Google have the equivalent of a revert button to undo whatever infrastructure changes were made?


(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an SRE at Google. My team is oncall for this service and I know exactly what happened here; I probably can't answer most questions you might have.)

Let's go with "yes", as the most accurate answer. As soon as I or whoever is oncall has figured out what change was responsible, we can usually revert it quickly and easily. Usually, if I'm oncall and I have reason to even suspect a recent change might be the cause, I'll revert it and see if the problem goes away.

The difficulty becomes more apparent when you realise the sheer number of infrastructure changes being made every hour, some of which will be fixes to other outages, and some of which will be things you can't revert because they are of the form "that location has fallen offline; probably lost networking" or "we are now at peak time and there are more users online". So if your question is "can we just roll the whole world back one day" - no, too much has changed in that time.


I know it's late, but thanks for this. It kind of reinforces the amazing size of the system and the number of people making changes to it.



