AWS Service Disruption Post Mortem (amazon.com)
385 points by teoruiz on April 29, 2011 | 101 comments



tl;dr: "The trigger for this event was a network configuration change. We will audit our change process and increase the automation to prevent this mistake from happening in the future."

AMZN has gotten a lot of flak over this outage, and rightly so. But I do want to dissuade anyone from thinking anybody else could do much better. I worked there 10 years ago, when they were closer to 200 engineers, and the caliber of people there at that point was insane. By far the smartest bunch I've ever worked with, and a place where I learned habits that serve me well to this day.

I know the guys that started the AWS group and they were the best of that already insanely selective group. It is easy to be an armchair coach and scream that the network changes should have been automated in the first place, or that they should have predicted this storm, but that ignores just how fantastically hard what they are doing is and how fantastically well it works 99(how many 9's now?)% of the time.

In short, take my word for it, the people working on this are smarter than you and me, by an order of magnitude. There is no way you could do better, and it is unlikely that if you are building anything that needs more than a handful of servers you could build anything more reliable.


Given a choice between hosting servers on AWS, and trying to build my own reliable infrastructure with a single sysadmin, I'll take AWS in a heartbeat. But I do want to quibble with one of your points:

> It is easy to be an armchair coach and scream that... they should have predicted this storm

I'm not as smart as the AWS developers, and I have a lot less experience with large-scale distributed systems.

But thanks to my own cluelessness, I've blown up smaller distributed systems, and I've learned one important lesson: Almost nobody is smart enough to understand automatic error-recovery code. Features like automated volume remirroring or multi-AZ failover increase the load on an already stressed system, and they often cause this kind of "storm."

So I've learned to distrust intelligence in these matters. If you want to understand how your system reacts when things start going wrong, you have to find a way to simulate (or cause) large-scale failures:

This is something that Google does really, really well, by the way. I've watched them turn off 25 core routers simultaneously carrying hundreds of gigabits worth of data, just to verify that what they think will happen does happen. http://news.ycombinator.com/item?id=2475112

You also need to pay particular attention to components with substantial, ongoing problems, and make sure you don't let known issues linger:

I work at Amazon EC2 and I can tell you what's going on (thanks to this handy throwaway account). What's happening is the EBS team gets inundated with support tickets due to their half-assed product. Here's the hilarious part: whenever we've asked them why they don't fix the main issue, they keep telling us that they're too busy with tickets. What they don't seem to realize is that if they fixed the core issue the tickets would go away. http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...

Now, I'm not saying I could have done any better than Amazon (evidence suggests otherwise). But I do know that I'm not smart enough to understand these systems without testing them to destruction, and aggressively fixing the root causes of known problems.
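
To make the "simulate (or cause) large-scale failures" point concrete, here's a minimal game-day-style sketch. Everything in it is hypothetical: list_nodes, disable_node, restore_node and check_invariants stand in for whatever tooling your own system exposes. The point is only that you remove real capacity on purpose and verify the predictions you wrote down beforehand.

```python
# Hypothetical game-day drill: deliberately disable a fraction of nodes and
# check that the system behaves the way you predicted it would.
import random
import time

def list_nodes():
    # Stand-in: return the node IDs participating in the drill.
    return [f"node-{i}" for i in range(20)]

def disable_node(node_id):
    # Stand-in: drop the node from the network / cut its power.
    print(f"disabling {node_id}")

def restore_node(node_id):
    print(f"restoring {node_id}")

def check_invariants():
    # Stand-in: verify the predictions written down *before* the drill,
    # e.g. "requests still succeed", "re-mirroring stays under X per minute".
    return True

def game_day(failure_fraction=0.25, observe_seconds=30):
    nodes = list_nodes()
    victims = random.sample(nodes, int(len(nodes) * failure_fraction))
    for n in victims:
        disable_node(n)
    deadline = time.time() + observe_seconds
    try:
        while time.time() < deadline:
            assert check_invariants(), "prediction did not hold; stop and investigate"
            time.sleep(5)
    finally:
        for n in victims:
            restore_node(n)

if __name__ == "__main__":
    game_day()
```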


I would be surprised if Amazon didn't do testing similar to the Google quote above. AWS is insanely complicated, and the problem with an insanely complicated system is there are infinite failure modes (or such a big number it is close enough to infinite for human purposes). As such, it is impossible to test all of them.



> But thanks to my own cluelessness, I've blown up smaller distributed systems, and I've learned one important lesson: Almost nobody is smart enough to understand automatic error-recovery code. Features like automated volume remirroring or multi-AZ failover increase the load on an already stressed system, and they often cause this kind of "storm."

It's basically Test-Driven Development: if you cannot test it, don't write it.


It is hard to test emergent behavior in large distributed systems; you pretty much have to actually run the tests live to see what is going to happen and see if it aligns with your predictions.


> There is no way you could do better, and it is unlikely that if you are building anything that needs more than a handful of servers you could build anything more reliable.

I can't disagree, but there is one key benefit to not using the cloud for some services.

When your company is working on an important deadline, your sysadmins could choose not to implement that pending network configuration change during that crucial period. You can control your own at-risk times, which you can't generally do with IaaS.

As with everything, it's a trade off.


See, this makes me uneasy. The best and brightest work very hard to avoid memory errors in their C codebases then, when errors occur anyway, console themselves that no-one could have done any better. The less brilliant look into more automatic forms of memory management. AT&T hired the best and brightest to make heroic efforts so that the old circuit-switched Bell system could achieve mediocre reliability. I certainly hope that the best and brightest at AWS aren't intending to tackle their enormous control-backplane SPOF by assiduously patching every bug that turns up in its behaviour.


If it's too hard for the best then the concept is dead. But I don't believe it. They made some avoidable mistakes.

Just look at the pattern emerging from these kinds of incidents. There's an automatic cluster recovery mechanism that works for individual node failures but makes matters worse once a larger number of nodes fail.

I wonder whether they did extensive testing or simulation of that scenario. The initial root cause is probably unpredictable because there may be many, but what follows is not unpredictable.

I'm not ready to concede that, because they are such an insanely smart elite group of people, we just have to live with week-long outages.


The problem with all this is that you never hear about the many many failures that occur (daily, hourly at the scale they operate?) that don't cause an outage.


I think it's more of a learning experience.

In the grand scheme of things, it's still day one for computing services like AWS.


I think you're drawing the wrong conclusions from the grandparent post. It's not "These guys are the best, so we will just have to live with these failures." It's "These guys are smart people sailing in uncharted waters, so they're going to get a little lost now and then." The second conclusion implies that these waters will eventually become charted, so everyone will be able to avoid these problems in the future.


To me, the parent as well as your comment sounds a little bit too apologetic. The waters are not totally uncharted and the outage was disproportionate. AWS isn't a research project, it's a commercial offering, so they have to take some blame. That re-mirroring storm was not completely unforeseeable. It's exactly the thing you have to consider when designing these kinds of systems.


I've been working at Google, and had a similar experience. But I do think in many cases you can actually do better. It's not that you're smarter than Amazon/Google/... engineers, it's that you're solving a problem that is orders of magnitude easier.

Your start-up is probably not dealing with setting up and maintaining 200K servers like Amazon is, so up to a point you can actually do better going it alone.


Yep, that was my point with the 'handful of servers' line.

But even that is pretty darn hard. You still have to deal with split-brain situations, do all the right things in those cases, etc. If you are a tiny startup that isn't going to hire a dedicated sysadmin (and very few sysadmins can build that type of thing without buying some expensive hardware), then EC2 is probably a better choice.

Even the hardware costs alone make it a pretty easy call. We just shut down the dedicated three-system setup I built because it didn't make sense financially, and despite me spending a month learning the intricacies of Heartbeat, it still had odd failure scenarios that we don't experience with EC2. Again, I'm just not that smart. :)


Agreed with the expertise thing, but to add to your summary of the post-mortem, it looks like human error compounded by: 1) an architecture bug (the EBS "control plane" cuts across Availability Zones and EBS clusters, leading to a single point of failure; this is what broke the "service contract"), 2) a spec/programming bug (no aggressive back-off on retry attempts of EBS ops), and 3) two separate logic bugs (the race condition in the EBS nodes and the problems with MySQL replication).

I think that's everything. It just goes to show, most disasters in very well-engineered systems are generally the result of a series of things all going wrong at once, not individual failures...


Agree completely. We are still in the early days of AWS-like technology. Providers and users both will experience some serious issues in the years to come.

Heck, users of municipal water systems still experience outages and that technology is arguably mature.

The really encouraging thing is that the Amazon post-mortem writeup indicates they're taking it very seriously.


My tl;dr reads differently: "we lacked appropriate congestion control in our EBS recovery algorithms."

The trigger was a bad and unexpected network configuration change, but the error was that the attempted recovery by the stuck volumes was uncontrolled.

I don't think that anyone is knocking the intelligence of the AWS engineers, or saying that anyone else could do it better. Just like NASA engineers and scientists are incredibly intelligent and good at what they do, systems can become complicated enough that unexpected errors creep into the system, and not any particular component.


The best are human beings too! Even the best work in two ways: they build on the best principles already known, and they make their own best principles. The first depends on how much you know and how much time you have taken to learn, implement, practice, and perfect. The second depends on how far you can see.

In either case you can always overlook or fail to predict even the easily foreseeable. That happens for many reasons: plain human error, or even overconfidence, which is sometimes the case with the best.

The best way out of this problem is what Jeff Atwood blogged about some time back: keep failing, and keep failing in different ways. Each failure needs to be translated into lessons of some sort, and its solution into a best practice. Even the Netflix model of failing on purpose will do.

There is no way the best can be flawless. Nothing is flawless, as long as it is done by a human.


Agreed. It has actually given a very nice excuse to developers like me. Whenever your manager comes by to ask why something you are working on is not up, you can simply reply with something like this:

If AWS, with some of the smartest engineers, can be down for that long, do you think our crappy service will be up 100% of the time?


A single manual configuration mistake should never be able to cause this type of complete system failure. For AWS to be engineered this way, the engineers you met must not have been as smart as they seemed. EBS may be extremely sophisticated in other respects, but if this type of thing was even a possibility, the AWS team is very fallible, and probably far from the best in their field.

>In short, take my word for it, the people working on this are smarter than you and me, by an order of magnitude. There is no way you could do better, and it is unlikely that if you are building anything that needs more than a handful of servers you could build anything more reliable.

Ever since the AWS outage, I've seen a number of these "the AWS guys are so smart, I've met them"-type comments, and then, paraphrased: "They sort of can't be that much to blame because of how smart they are, and they are so smart anyway, who could do better?" That's not a valid argument. Not everyone is equally impressed by an individual's intelligence; perhaps your assessment is wrong. And even if someone is insanely smart, they can still commit practical errors, which indicates they are smart but still flawed in their understanding of engineering in significant ways. Perhaps AWS simply does need a higher caliber of engineer who wouldn't miss these dead-simple safeguards that would have prevented this outage.


It's not so much about thinking that somebody else could do better.

By doing what they do they create the expectation that they _are_ doing better than everybody else.


And let's face it, despite bugs and human error that should be avoidable, they are.


At that scale, MAYBE.

In the last two years my workplace has gotten pretty good at handling SAN outages (due to terrible Oracle equipment).

Put simply, this set of scenarios can't happen at my work-place. We don't have that level of automation, there's only a pair of SAN systems in the mirrors, there's no "hunting for capacity".

I'd suggest most businesses are closer to this than AWS.

There's a lot of sugar coating going around saying "You couldn't build a better space shuttle", and that's probably true. But if I only need an extremely reliable bicycle, that's a false argument. The Simplest Thing That Could Possibly Work doesn't apply just to programming.


I've been noticing a trend recently when reading about large scale failures of any system: it's never just one thing.

AWS EBS outage, Fukushima, Chernobyl, even the great Chicago Fire (forgive me for comparing AWS to those events).

Sure, there's always a "root" cause, but more importantly, it's the related events that keep adding up to make the failure even worse. I can only imagine how many minor failures happen worldwide on a daily basis where there's only a root cause and no further chain of events.

Once a system is sufficiently complex, I'm not sure it's possible to make it completely fault-tolerant. I'm starting to believe that there's always some chain of events which would lead to a massive failure. And the more complex a system is, the more "chains of failure" exist. It would also become increasingly difficult to plan around failures.

edit: The Logic of Failure is recommended to anyone wanting to know more about this subject: http://www.amazon.com/Logic-Failure-Recognizing-Avoiding-Sit...


A similar point is made in Gene Weingarten's "Fatal Distraction" (http://www.pulitzer.org/works/2010-Feature-Writing), which was about parents who forget a child in the car. Excerpt: "[British psychologist James Reason] likens the layers to slices of Swiss cheese, piled upon each other, five or six deep. The holes represent small, potentially insignificant weaknesses. Things will totally collapse only rarely, he says, but when they do, it is by coincidence -- when all the holes happen to align so that there is a breach through the entire system."


This is an interesting point you hit on and something that is stressed in scuba diving. Basically whenever you go out on a dive you only want to change one thing at a time. Only one thing can be "new" or "untested" or "new to you". Otherwise you run the risk of being task overloaded which leads to cascading failure - potentially catastrophic and/or nonrecoverable.


That's why Sergei Korolev, the chief engineer behind the successful Russian space program that led to Gagarin's victory, said:

"The genius of a construction lies in its simplicity. Everybody can build complicated things."


I've been an AWS (S3, EC2, SQS) user for over 3 years now and this article detailing their systems at a mid-level is kind of scaring me off of their platform. It just sounds so complicated and I'm not sure I want to rely on it for anything critical until I can really understand it myself.

Also, a couple of other complex systems for your trend are financial markets and commercial jets.


A related book that examines some of the same themes is Charles Perrow's Normal Accidents http://press.princeton.edu/titles/6596.html

The examples he draws from are nuclear power plant failures (TMI in particular), civil aviation and oil transport. But the basics will be recognizable to anyone who has dealt with large computing installations: interactive complexity, tight coupling and cascading failures.

It is not a reassuring book; you won't be able to look at any complex system without asking yourself what sequence of simple, predictable failures of widely separated parts could tip it into a catastrophic failure mode.


Here's another good resource for understanding these types of problems:

http://www.amazon.com/Normal-Accidents-Living-High-Risk-Tech...


> The nodes in an EBS cluster are connected to each other via two networks. The primary network is a high bandwidth network... The secondary network, the replication network, is a lower capacity network used as a back-up network... This network is not designed to handle all traffic from the primary network but rather provide highly-reliable connectivity between EBS nodes inside of an EBS cluster.

During maintenance, instead of being shifted to one of the redundant routers on the primary network, the traffic was routed onto the lower-capacity secondary network. There was human error involved, but the network issue only provoked latent bugs in the system that should have been picked out during disaster recovery testing.

Automatic recovery that isn't properly tested is a dangerous beast; it can cause problems faster and more broadly than any team of humans is capable of handling.
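
Purely as an illustration (not anything AWS actually runs), the kind of pre-change guard that might have caught that traffic shift could look like the sketch below; the Link type and the numbers are invented for the example.

```python
# Hypothetical pre-change guard: refuse a traffic shift whose target link
# doesn't have the headroom to absorb the shifted load.
from dataclasses import dataclass

@dataclass
class Link:
    name: str
    capacity_gbps: float
    current_load_gbps: float

def can_absorb(target: Link, shifted_gbps: float, headroom: float = 0.8) -> bool:
    # Allow the shift only if the target stays under `headroom` of its capacity.
    return target.current_load_gbps + shifted_gbps <= headroom * target.capacity_gbps

def shift_traffic(source: Link, target: Link) -> None:
    if not can_absorb(target, source.current_load_gbps):
        raise RuntimeError(f"refusing shift: {target.name} cannot absorb "
                           f"{source.current_load_gbps} Gbps")
    target.current_load_gbps += source.current_load_gbps
    source.current_load_gbps = 0.0

# The failure mode from the post-mortem, in miniature: shifting primary-network
# traffic onto the low-capacity replication network gets rejected.
primary = Link("primary-router-a", capacity_gbps=100.0, current_load_gbps=60.0)
replication = Link("replication-net", capacity_gbps=10.0, current_load_gbps=2.0)
try:
    shift_traffic(primary, replication)
except RuntimeError as e:
    print(e)
```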


One of the main causes of the "re-mirroring storm" is nodes not backing off when searching for a replica.

Here's Twitter's back-off decider implementation (Java):

https://github.com/twitter/commons/blob/master/src/java/com/...

When I last looked at it I was a little clueless about it. Now I see its use.
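
For anyone who doesn't want to dig through the Java, here's a minimal Python sketch of the same idea (capped exponential backoff with jitter), so thousands of nodes hunting for re-mirror capacity don't all retry in lockstep. The flaky operation below is just a stand-in.

```python
# Sketch of capped exponential backoff with full jitter around a flaky call.
import random
import time

def backoff_retry(operation, max_attempts=8, base=0.5, cap=60.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep a random amount up to the capped exponential delay.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

if __name__ == "__main__":
    calls = {"n": 0}
    def flaky():
        # Stand-in for "find me a re-mirror target": fails twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise IOError("no free capacity yet")
        return "replica-target-42"
    print(backoff_retry(flaky, base=0.01))
```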


Actual URL with Libya dependency removed: https://github.com/twitter/commons/blob/master/src/java/com/...

HN doesn't have a 140 character limit, so there's no need to post an obfuscated shortened link.


My fault. Edited the link. Thanks.


>...one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly...

This supports the theory that between 50% and 80% of outages are caused by human error, regardless of the resilience of the underlying infrastructure.


> This supports the theory that between 50% and 80% of outages are caused by human error

Not quite - in this case, a single human error then triggered a series of latent and undiscovered bugs in the system itself. It's a confluence of small events that makes for a large-scale problem like this.


Unfortunately, the humans in question here are supposed to be the best of the lot. This brings up the eternal question: during times of crisis, how are the best any different from the mediocre or the average?

If they aren't, then with a little hard work and smart work here and there anybody can beat these 'best' during non-crisis times. And during a crisis everyone is the same anyway.

That's probably why there are a lot of successful companies even with only averagely talented people.


The very best make fewer errors.


Which leaves a question: why not engineer around humans, such that they are never needed in the day-to-day running of the systems?


That "engineering" you speak of? That's the source of the human error.


Any high availability system inherently aims for that, but "never" is a strong word. Sooner or later, you need to upgrade or replace physical hardware.


"The trigger for this event was a network configuration change. We will audit our change process and increase the automation to prevent this mistake from happening in the future."


Sure now, but I wanted to know why they didn't do that from the beginning.


Why they didn't do what? "Increase the automation"? I suspect that they have been doing that from the beginning. It's an ongoing process.

I hate to quote Rumsfeld, but there are known unknowns, and unknown unknowns. Of course you want to eliminate the latter-- but there's (necessarily) no way you can ever know that you've done so.


I highly recommend that anyone who was surprised by this outage, or the description of the chain reaction of failures that lead to it, read Systemantics. It is a dry but amusing exploration of the seemingly universal fact that every complex system is always operating in a state of failure, but the complexity, failovers and multiple layers can hide this, until the last link in the chain finally breaks, usually with catastrophic results.


> read Systemantics

Oh yes. It's a classic that deserves to be much better known. Anybody engaged with complex systems - such as software or software projects - will find all kinds of suggestive things in there. As for "dry"... come now, it's hilarious and has cartoons.

Basically, just get it. Here, I'll help:

http://www.amazon.com/Systems-Bible-Beginners-Guide-Large/dp...

(They ruined the title but it's the same book.)


Before the incident, AWS was numero uno in terms of customer visibility and had the image of a pathbreaking cloud service.

The lack of transparency in reaching out to customers is the biggest mistake AWS made. They will learn from their mistakes, and their servers and networks will be more reliable than ever.

This incident has given people a reason to look at multi-cloud operation, for disaster recovery and backup reasons. AWS's monopoly will be gone, and many new standards will be proposed to bring in interoperability and allow migrations between clouds.


Seems there is always this issue. System fails. Systems try to repair themselves. Systems saturate something which stops them from repairing. Systems all loop aggressively bringing it all down.


There's a quote I found interesting that hasn't been noted here yet:

"This required the time-consuming process of physically relocating excess server capacity from across the US East Region and installing that capacity into the degraded EBS cluster."

And if I read this description of the re-mirror storm correctly, I think that implies Amazon had to increase the size of its EBS cluster in the affected zone by 13%, which considering the timeline seems fairly impressive.


I find it surprising that they did not and do not plan to employ any sort of interlocks/padded walls. What I mean is, if the system is exhibiting some very abnormal state (e.g. #remirror_events above a fixed threshold, or more than x standard deviations above average), then automated repair actions should probably stop and the issue should be escalated to a human.
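
Something like the sketch below is what I have in mind; the thresholds and the paging hook are invented, it's just the shape of the interlock.

```python
# Hypothetical interlock: halt automated repair and page a human when the
# re-mirror rate exceeds a hard limit or sits far above the recent mean.
import statistics

class RepairInterlock:
    def __init__(self, hard_limit=500, sigma_limit=4.0, window=60):
        self.hard_limit = hard_limit    # absolute re-mirror events/minute cap
        self.sigma_limit = sigma_limit  # how many std devs above mean is "abnormal"
        self.window = window
        self.history = []

    def allow_automated_repair(self, events_per_minute):
        abnormal = events_per_minute > self.hard_limit
        if len(self.history) >= 10:
            mean = statistics.mean(self.history)
            stdev = statistics.pstdev(self.history) or 1.0
            abnormal = abnormal or events_per_minute > mean + self.sigma_limit * stdev
        self.history = (self.history + [events_per_minute])[-self.window:]
        if abnormal:
            self.escalate(events_per_minute)
        return not abnormal

    def escalate(self, rate):
        # Stand-in for paging: in real life, stop the automation and wake someone up.
        print(f"re-mirror rate {rate}/min looks abnormal; halting automation")

interlock = RepairInterlock()
for rate in [20, 25, 22, 24, 21, 23, 26, 22, 25, 24, 900]:
    if not interlock.allow_automated_repair(rate):
        break  # stop issuing automated re-mirror requests
```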


They will probably do that now. They will probably also make sure they have a powerful SOP for network upgrades as well.


Reminds me again that distributed systems are hard, and of the first fallacy of distributed computing: "The network is reliable."


An automatic 100% credit for 10 days' usage; that's pretty good IMO.


Well yes, except that that is usually peanuts compared to the lost income from your service being down.

Really the only purpose of a SLA penalty is to incentivize the provider to keep the network reliable.


I totally agree with your comment about the SLA penalty as an incentive to the provider to take reasonable measures to ensure service.

But that's just in general.

When negotiating bespoke SLA penalty clauses, it can be very illuminating for both sides to discuss lost profit + lost confidence + additional costs to the customer and suggest that these be factored in to the penalty clause.

My experience: both the customer and supplier tend to take a deep breath to evaluate whether this deal is a good one for either of them and begin to reassess their level of risk.

In an off-the-shelf service like Amazon's, you as a customer are welcome to suggest a change of penalty to your Amazon account manager, and unless you're something like the US government, you will probably be directed to other cloud providers or your own internal IT organisation!


> In an off-the-shelf service like Amazon's, you as a customer are welcome to suggest a change of penalty to your Amazon account manager, and unless you're something like the US government, you will probably be directed to other cloud providers or your own internal IT organisation!

What that suggests to me is that the time has arrived for an external organization, one that sells loss-of-business protection against such failures, to become involved. Such an organization, should enough cloud customers subscribe to it, would become an influence upon services like AWS. I'm not sure I 'like' this idea, but the premise that a customer uses the cloud service at the whim of whatever the provider decides is best practice needs to be revisited.


I still don't know why people keep throwing this out there. Yes, if you were affected and it caused you to lose a lot of income because your business could not operate when EC2 was on the fritz, then it's your fault, not Amazon's. You can't keep blaming Amazon because you didn't build a fault-tolerant application. It's like blaming your electric company because your home is lit by one super huge flood light which burnt out, and as a result you couldn't see or get any work done because you kept no backup bulbs in the house.


I think it's more like getting upset with your cable provider when their service goes down, leaving you stranded with no internet. Arguably, it's your fault that you don't have any internet: you could have had redundant internet access between cable and DSL, and you have no one else to blame for your decision to pay for only one.


I agree, the credit is nice, although for me it amounts to about $6. The lost opportunity for me is huge... I have a QR code app, and missed the chance to have my QR code in a major print magazine ad. That's missed traffic, and missed exposure to the company who was going to use it - that was worth way more than $6 to me. Unfortunately, as a hobby site, it's tough to justify multiple servers for redundancy.


You seem to be saying it was worth a lot to you and not worth much to you at the same time - that makes no sense.


I still don't see a good justification for keeping the EBS control plane exposed to failure across multiple availability zones in a region. Until that is fixed, I would not depend on AZs for real fault tolerance.


Now that's what I call a post mortem. Kudos to the author.


"We will look to provide customers with better tools to create multi-AZ applications that can support the loss of an entire Availability Zone without impacting application availability. We know we need to help customers design their application logic using common design patterns. In this event, some customers were seriously impacted, and yet others had resources that were impacted but saw nearly no impact on their applications."


They should also allow one-time moves of reserved instances between availability zones.


What would the purpose of that be?


They want to help people make better use of multiple availability zones. People may have reserved a bunch of instances, but would be better off distributing those more effectively across zones.


That makes sense. I wonder if something like the free realm transfers in World of Warcraft would make sense. Maybe the AZ mapping randomization keeps things balanced but encouraging multi-az deployments seems like a good idea.


Did I read this correctly in paragraph 2: " For two periods during the first day of the issue, the degraded EBS cluster affected the EBS APIs and caused high error rates and latencies for EBS calls to these APIs across the entire US East Region."

Their "control plane" network for the EBS clusters spans availability zones in a region? If so, this would be the fatal flaw.


I may have read it incorrectly myself, but I interpreted this as meaning the control plane was balanced across availability zones in order to provide durability in the face of a failure of one of the zones. In other words, Amazon ensures their control plane is operational at all times.

The API failures were ultimately tied to the network problems that occurred, not to a failure of the control plane.

EDIT: I should finish reading before I reply. :) It would appear that the network issue in the one availability zone was so severe that the control plane ran out of threads to service API requests to any of the availability zones.

So while it's true the underlying problem was a network issue, the fact that the control plane is spread across availability zones was responsible for part of the outage that occurred across the whole region.

My totally unqualified assessment of this aspect of the outage is that, while it might make sense to have a control plane spread across availability zones, they presumably need to have isolated control planes for each zone, instead of a shared plane as they seemingly have now.
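
Purely as a sketch of that kind of isolation (a bulkhead, not Amazon's actual design): a shared API layer could give each Availability Zone its own small, bounded worker pool, so requests stuck against one degraded zone can only exhaust that zone's threads. Names and sizes below are made up.

```python
# Bulkhead sketch: one bounded worker pool per Availability Zone, plus a
# timeout, so a stuck zone can't starve the threads serving the other zones.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

class ZonedControlPlane:
    def __init__(self, zones, workers_per_zone=4, timeout_s=1.0):
        self.pools = {z: ThreadPoolExecutor(max_workers=workers_per_zone) for z in zones}
        self.timeout_s = timeout_s

    def submit(self, zone, fn):
        future = self.pools[zone].submit(fn)
        try:
            return future.result(timeout=self.timeout_s)
        except TimeoutError:
            # The degraded zone's request times out; other zones are untouched.
            return f"{zone}: timed out, shedding request"

cp = ZonedControlPlane(["us-east-1a", "us-east-1b"])
print(cp.submit("us-east-1a", lambda: time.sleep(3) or "ok"))  # simulated stuck cluster
print(cp.submit("us-east-1b", lambda: "volume attached"))      # unaffected zone
```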


They seem to have settled on a halfway house, pushing more of the control plane functionality down into the EBS clusters and making the remaining shared control plane more robust to the sort of failures that arose this time.


And a breach of the "share nothing" tenet, which is quite important here.


I don't see how transparently replicated storage could be implemented without sharing something.


The transparent replication is inside the availability zone; as I understand it, Amazon doesn't provide any sort of user visible direct sharing between multiple AZs, e.g. to copy or move an EBS you have to snapshot it first ... which was of course a control API they blocked during much of this mess.


In which case they shouldn't replicate across availability zones.

'shared nothing' is the only way to islandize failures.


And that's why they have multiple fully-isolated regions. Availability zones are a purposeful tradeoff that provides easier to use service with higher inter-zone communication performance and lower cost.


No, that's not what Amazon says.

The following is from the AWS web site [1]:

> Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location.

No mention of tradeoffs.

[1] http://aws.amazon.com/ec2/


The tradeoff is in that AZs are "engineered to be insulated" as opposed to being actually, or naturally isolated. Prior to their downtime, I've had plenty of conversations with folks that I work with about AWS and we've always assumed that AZs are not 100% isolated. I can see how someone can read "engineered to be insulated" the other way, but I generally read these kinds of materials as guaranteeing nothing beyond the most limited possible reading, and probably not even that.

The quoted statement doesn't say that isolation is 100% or that multiple AZs can't ever ever fail at the same time. It says that if only one AZ goes down and you have servers in another, then those servers will still be up, which should be obvious. Insulated doesn't even mean the same thing as isolated.


Sure, I don't disagree with what you are saying. However I think that the way Amazon presents the concept of an AZ is that it IS isolated from other AZ's in the same region.

Even the name, 'Availability Zone' implies that it is isolated from other 'Availability Zones' in the same region. And that text I quoted does nothing but substantiate that inference.

I just think that Amazon are misleading here. Maybe they shouldn't call it an Availability Zone.


Isn't this "flaw" a balance of features vs. reliability? In most cases it's OK that the API spans the entire region: it makes it easy to address one API endpoint per region, allows Amazon to offer everyone different availability zones within the region, etc.


The issue is that availability zones don't get throttled when they start to overload the API. The control system should have automatically throttled the misbehaving AZ.

Which is probably far more difficult to do properly than I can imagine.
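
For what it's worth, the simplest version is just a per-zone token bucket at the shared API layer, so a zone flooding the control plane gets 429s instead of starving everyone. The rates below are invented.

```python
# Sketch of per-Availability-Zone throttling with token buckets.
import time

class TokenBucket:
    def __init__(self, rate_per_s, burst):
        self.rate, self.capacity = rate_per_s, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {"us-east-1a": TokenBucket(rate_per_s=50, burst=100),
           "us-east-1b": TokenBucket(rate_per_s=50, burst=100)}

def handle_api_call(zone):
    if not buckets[zone].allow():
        return "429: zone throttled"  # shed load from the misbehaving zone
    return "200: ok"

# A storm of calls from one zone gets throttled; the other zone is unaffected.
print(sum(handle_api_call("us-east-1a") == "200: ok" for _ in range(1000)))  # roughly 100
print(handle_api_call("us-east-1b"))
```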


To me, it sounds like a large single point of failure, and the post-mortem doesn't seem to acknowledge that or discuss remedying it.

They set up a separate instance of it to help with API calls in the affected region, but it still sounds like it functions across AZs and is still vulnerable overall.


Uhh, did you even read it? Under "Impact to Multiple Availability Zones", last paragraph:

"There are three things we will do to prevent a single Availability Zone from impacting the EBS control plane across multiple Availability Zones. The first is that we will immediately improve our timeout logic to prevent thread exhaustion when a single Availability Zone cluster is taking too long to process requests. … To address the cause of the second API impact, we will also add the ability for our EBS control plane to be more Availability Zone aware and shed load intelligently when it is over capacity. … Additionally, we also see an opportunity to push more of our EBS control plane into per-EBS cluster services. By moving more functionality out of the EBS control plane and creating per-EBS cluster deployments of these services (which run in the same Availability Zone as the EBS cluster they are supporting), we can provide even better Availability Zone isolation for the EBS control plane"


Compare to the 2008 post-mortem: http://status.aws.amazon.com/s3-20080720.html Messaging infrastructure as single point of failure? Check. http://news.ycombinator.com/item?id=2472227


I doubt this is the last time we'll hear of a "re-mirroring storm" in an oversaturated cloud.


Oversaturated? How do you figure? 13% were unable to re-mirror, which means 87% were able to. In short, nearly 40% of the 'cloud' was free space.


The 're-mirroring storm' occurred when all free space was exhausted. At that point, the EBS storage resources were oversaturated.

Then the 'EBS control plane' started to fail because 'slow API calls began to back up and resulted in thread starvation'. At that point, the EBS processing resources were oversaturated.

Then other nearby systems got wet.


I, for one, just want to say: claps to them for figuring this out, nailing it down, and fixing it in just a few days. After reading this, it feels like an issue at such a massive scale could take a very long time to fix.


It's strange we haven't heard more from users of the 0.07% of EBS volumes that were corrupted and unrecoverable during the outage. I just assumed there was no data loss as a result of the outage.


I found this rather entertaining: http://intraspirit.net/images/aws-explained.png


I'm getting a 404; is this the same? (Amazon service disruption "explained" by an employee)

http://lgv.s3.amazonaws.com/AmazonFail_explainedByEmployee.j...


The whole thing is just too complicated to be highly-available. There will be more problems, but I wish them luck.


That's a bit defeatist. In my first year of uni, one of our lecturers drummed something into us:

Q. How do you eat an elephant?

A. One bite at a time


Interesting: several weeks ago someone (reddit?) had already hit problems with EBS availability. Did Amazon pay attention and analyze the problem back then, or just let it pass?


tl;dr version?


An EBS node in an EBS cluster is connected to two networks. One is used for the traffic to and from the EBS volumes (the primary network); the other is used to replicate the EBS volumes on an EBS node to a different EBS node (the secondary network). Amazon wanted to upgrade the capacity of the primary network. One of their standard steps in doing this is to shift traffic to a redundant router. This step was executed incorrectly, which resulted in the traffic being routed not to the primary network but instead to the secondary network, which has less capacity. All this traffic saturated the secondary network and resulted in the EBS volumes becoming "stuck". When the traffic was routed the right way again, all the affected EBS volumes tried to re-mirror. Part of the re-mirroring process is that the EBS volumes search the cluster for free space to re-mirror to. The EBS cluster couldn't handle this load, and new capacity had to be added to the cluster.
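
As a toy model of that re-mirroring dynamic, with invented numbers loosely echoing the ~13% figure from the post-mortem: once the free slots run out, the volumes that can't find a target keep searching, so the load on the cluster never drains.

```python
# Toy re-mirroring storm: stuck volumes hunt for free slots; whatever can't be
# placed keeps generating search load every round.
def simulate(stuck_volumes=1000, free_slots=870, rounds=5):
    waiting = stuck_volumes
    for r in range(1, rounds + 1):
        placed = min(waiting, free_slots)
        free_slots -= placed
        waiting -= placed
        print(f"round {r}: placed={placed}, still stuck={waiting}, "
              f"search load this round={waiting}")

simulate()
```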

Amazon offers a 10 day credit equal to 100% of their usage of EBS Volumes, EC2 Instances and RDS database instances. This credit will be automatically applied to the next bill.


The muppets in TechOps borked a switch.


> tl;dr version?

Only a fool would try to run their business out of "the cloud".


This article reads like nonsense - but this is not a criticism of AWS.

The real problem is there is no good mathematical model of distributed behaviour, from which statistical guarantees can be made.

I think we're at the limit of what the smartest people can achieve with hand crafted code.

Most likely new math will give rise to new tools and languages, in which the next generation of reliable distributed systems will be written.

Without this advance we will have storage networks that aren't reliable, an internet that can be taken down by one organization, botnets that are unkillable and patchy network security.


I'm curious - what about this post-mortem "reads like nonsense"?


All of it. The general approach is wrong, so the nonsense is the part where you believe your current set of abstractions about distributed networks is adequate.

Also the part where you reapply those same abstractions to fix the hole, without realizing that the problem is you simply don't yet have tools that are capable of writing a robust system - despite the evidence to the contrary.

If a day long outage of this scale is not enough to make us rethink distributed systems, what is?



