AWS is down, but here's why the sky is falling (justinsb.posterous.com)
347 points by justinsb on April 21, 2011 | 80 comments



Amazon makes it pretty clear that Availability Zones within the same region can fail simultaneously. In fact, a Region being down is defined, according to the SLA, as multiple AZs within that Region being down. And since that 99.95% promise applies to Regions and not AZs, multiple AZs within the same region being down will be fairly common.

Edit: One more point. In the SLA, you'll find the following: “Region Unavailable” and “Region Unavailability” means that more than one Availability Zone in which you are running an instance, within the same Region, is “Unavailable” to you. The implication is that if you do not spread across multiple Availability Zones, the 99.95% promise doesn't apply to you at all, so you may well see less than 99.95% uptime. Spreading across AZs should still reduce your downtime, just not beyond that 99.95%.

http://aws.amazon.com/ec2-sla/


I have to disagree with you. The SLA is just a legal agreement that really serves to limit AWS's liability. Here's what the main EC2 page says:

"Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location."

http://aws.amazon.com/ec2/

That's the spec that everyone was building to, but that isn't what is happening. Of course you're right, multiple AZs can fail at the same time, but I read the above as saying that they should fail independently/coincidentally (until the entire Region fails).


We always, always use the SLA offered by a vendor as the basis for our information. We trust it more than any marketing page, sales pitch, tech support FAQ, or anything else. That's what they'll hide behind, so that's what I'll have in mind when I design my setup.


I think it's great to check the SLA. However, there's enough wiggle room in the AWS SLA that I think this outage could continue for the rest of the month, and Amazon would still not owe a penny. I don't even know that the SLA covers this outage, because network connectivity isn't affected.

Even if Amazon breach their SLA, I think they only have to refund 10% of one month's bill per year - i.e. a 1% discount. I suspect they'd make a good profit even if they paid out a full 10% refund every month.
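To put rough numbers on that (the bill below is made up; only the 10% credit cap comes from the SLA):

    # Hypothetical monthly bill, just to illustrate the 10% service credit cap.
    monthly_bill = 10000.0                 # assumed spend in dollars
    annual_bill = monthly_bill * 12

    max_credit = 0.10 * monthly_bill       # SLA credit is capped at 10% of one month's bill
    print(max_credit)                      # 1000.0
    print(100 * max_credit / annual_bill)  # ~0.83, i.e. under 1% of a year's spend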

Unless an SLA is accelerated - i.e. >100% refund - I don't think it's worth taking particularly seriously.

Of course if an SLA only guarantees 95% uptime, that's probably a big hint to design for failure!


Yeah but I don't care about getting my money back as much as I care about how much they claim to be down.

It's like the hard disk maker that gives you a 1 year warranty vs a 5 year warranty... which one believes in their product more? :)


It's a good analogy and I certainly accept your point. It could just be a marketing thing though:

Suppose it's the same hard disk with a black sticker instead of a blue sticker. Drive with 1 yr warranty @ $100, 5 yr warranty @ $150, 20% additional failure rate over the extra 4 years, 50% redemption rate on failed drives. Expected replacement cost per drive sold = 20% * 50% * ($100 + $30 processing costs) = $13, so the extra $50 of warranty revenue still nets $37 of profit.
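Or, spelled out as a quick sketch (same made-up numbers):

    # Fictitious warranty economics from the comment above.
    price_1yr, price_5yr = 100.0, 150.0
    extra_failure_rate = 0.20        # extra failures over the extra 4 years
    redemption_rate = 0.50           # fraction of failed drives actually returned
    replacement_cost = 100.0 + 30.0  # drive cost plus processing

    expected_cost = extra_failure_rate * redemption_rate * replacement_cost
    extra_profit = (price_5yr - price_1yr) - expected_cost
    print(expected_cost)  # 13.0
    print(extra_profit)   # 37.0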

Totally fictitious numbers to try to prove my point, of course :-) But as the SLA becomes increasingly low in value, the signalling value decreases in my book.

(Edit - fixed my math!)


It tells you very little.

One of them may be planning to be out of business, sell the HD business unit in 2 years, shove off the risk via financial wizardry, etc.

My guess is the great majority of users will not RMA a dead hard drive after 4.5 years regardless of the stated warranty. Even if they did, it would only represent replacement with a future smallest-possible-capacity drive.


> there's enough wiggle room in the AWS SLA that I think this outage could continue for the rest of the month, and Amazon would still not owe a penny.

While I agree it's not about the money (it's about my site being up), I was nevertheless pretty shocked by this statement.


"Of course you're right, multiple AZs can fail at the same time, but I read the above as saying that they should fail independently/coincidentally."

As far as I know we've heard nothing to the contrary from Amazon - it's totally possible that multiple AZs happened to fail independently/coincidentally. Perhaps it was simultaneous equipment failure? Or maybe one AZ failed and a sufficient number of people attempted to "fail over" to another AZ causing a chain reaction of failure?


It is possible. I think it's exceptionally unlikely.

The one bit of information we have suggests that the root cause was a networking issue, which suggests SPOF.


If I had to, I'd guess that AWS's messaging/monitoring/control infrastructure is likely to be the SPOF, as in the 2008 outage: http://status.aws.amazon.com/s3-20080720.html It's an obvious weak spot in the independence/isolation of AWS's nodes, and it would seem to be the one most likely to cause failure to reach across more than one AZ. (Apart, perhaps, from a stampede from one affected AZ to the others.)


I'm sorry, designing your service without taking the SLA into account is just stupid. See how Netflix survived the failure, for example.

Now if you understand the SLA and still choose not to do cross-region deployments, then you've made a cost/complexity vs. uptime trade-off, which may well be right for you. quora.com is probably OK - who cares if it's down for a day?


The SLA uses great legal weasel words: "AWS will use commercially reasonable efforts to make Amazon EC2 available"

So anything that is beyond commercially reasonable is outside the SLA.

In truth, as with all businesses, the reputation for uptime weighs more heavily than the written contract. It will be interesting to see how the AWS people attempt to make amends.


"Commercially reasonable" is a standard legal term used to define efforts short of "best efforts". It allows for the party also look out for its own commercial interest in a way that's consistent with industry practice. So, for example, if Amazon had to choose between fulfilling the SLA and keeping it's own retail site up, it could be held liable under a "best efforts" standard but not under a "commercially reasonable" standard.

It's kind of unfair to describe these as "weasel words" when it's unlikely that any decent lawyer would let them sign up to something that exposes them to more liability than this. Customers who are using any cloud service provider have to expect reasonable steps to maintain availability, not an absolute promise.


Amazon has probably correctly designed core infrastructure so that these things shouldn't happen if you're in multiple Availability Zones. I'm guessing that means different power sources, backup generators, network hookups, etc. for the different Availability Zones. However, there's also the issue of Amazon's management software. In this case, it seems that some network issues triggered a huge reorganization of their EBS storage, which would involve transferring all that stored data over the network, a lot more EBS hosts coming online, and a stampede problem.

I've written vigorously (in previous comments) for using cloud servers like EC2 over dedicated hosting like SoftLayer. I'm less sure about that now. The issue is that EC2 is still beholden to the traditional points of failure (power, cooling, network issues). However, EC2 has the additional problem of Amazon's management software. I don't want to sound too down on Amazon's ability to make good software. However, Amazon's status site shows that EBS and EC2 also had issues on March 17th for about 2.5 hours each (at different times). Reddit has also just been experiencing trouble on EC2/EBS. I don't want this to sound like "Amazon is unreliable", but it does seem more hiccup-y.

The question I'm left with is what one is gaining from the management software Amazon is introducing. Well, one can launch a new box in minutes rather than a couple hours; one can dynamically expand a storage volume rather than dealing with the size of physical discs; one can template a server so that you don't have to set it up from scratch when you want a new one. But if you're a site with 5 boxes, would that give you much help? SoftLayer's pricing is competitive against EC2's 1-year reserved instances and SoftLayer throws in several TB of bandwidth and persistent storage. Even if you have to over-buy on storage because you can't just dynamically expand volumes, it's still competitively priced. If you're only running 5 boxes, the server templates aren't of that much help - and virtually none given that you're maybe running 3 app servers, and a replicated database over two boxes.

I'm still a huge fan of S3. Building a replicated storage system is a pain once you need to store huge volumes of assets. Likewise, if you need 50 boxes for 24 hours at a time, EC2 is awesome. I'm less smitten with it for general-purpose web app hosting, where the fancy footwork done to make it possible to launch 100 boxes for a short time doesn't really help you if you're looking to just have 5 instances keep running all the time.

Maybe it's just bad timing that I suggested we look at Amazon's new live streaming and a day later EC2 is suffering a half-day outage.


I'm responsible for a relatively large site ( http://www.foreignpolicy.com ) that was down for 12+ hours over this failure today.

One mistake that I think many people make in the whole cloud debate is the idea that a given cloud provider is any more or less failure-prone than a given dedicated server host.

We have assets on Amazon, Slicehost, and Linode. Sometimes these go down; whether it's our fault, the software's fault, the hardware's fault, or a construction crew hitting a fiber drop, things happen. If you're not backed up in a fully tested way on not just another server or availability zone, but a whole different hosting infrastructure (preferably in a different time zone), then you're not really backed up. Being on a host like Amazon, or even a fully managed host like a Cadillac Rackspace plan, doesn't remove the need for good BCP.

What these cloud services allow you to do, in theory, is have that backup infrastructure ready to go on relatively short notice _without_ keeping it running all the time. We can't reasonably afford to replicate all of our servers and hot data to the Western Region or the Rackspace cloud 24/7. We can, however, afford to set up the infrastructure and spin it up on the fly within an hour, with slightly stale data, once a month to test it, and when things break. Requisitioning that kind of hardware and then dumping it for only a few tens of dollars a month is difficult if not impossible on a virtual host.

The big question is not 'Is the cloud more reliable?', but 'Do I need what only the cloud can offer?'. If your current infrastructure can handle getting drudged or reddited fine, and you're only on a few servers, you're probably better off just paying to keep a hot spare up at SoftLayer.

On the other hand if you have 1) Occasional traffic bursting that you don't want to pay to handle most days and 2) Can accept a few minutes of downtime, then the solutions offered by cloud hosts blow the competition out of the water. I guess what you're gaining is not the management software, it's the ability to turn off & on quickly when something goes wrong (or, in the case of a redditing, right).

Part of figuring out the right hosting solution involves asking the right questions.

(..and for reference, we were all ready to go with a backup... and then we learned that our hosting company was storing our nightlies on S3 and couldn't retrieve them, and that our offsite DB solution was having an unrelated issue). Had we run proper tests (I'm brand new to the job), we would've been ready for this one. I also worry big time about DNS and load balancing being a big SPOF, but that's a plan for another day.


What about hardware failure? On AWS you just commission a new instance and your downtime is minutes rather than hours, plus you don't have to keep extra hardware on hand just to avoid downtime of days. There are also smaller more localized issues like network switch failure and other things that you probably never even notice on Amazon, but might be more likely to bite you on a dedicated host.
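For the single-dead-box case, the recovery really can be a couple of API calls. A minimal sketch with boto; the AMI ID, key pair, and security group below are hypothetical placeholders, and error handling is omitted:

    # Launch a replacement for the dead box from a pre-baked image (placeholders throughout).
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')
    reservation = conn.run_instances('ami-12345678',      # hypothetical AMI with your app baked in
                                     instance_type='m1.small',
                                     key_name='my-keypair',
                                     security_groups=['web'])
    print(reservation.instances[0].id)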

If an AWS data center goes down it gets a lot of press, but does it actually outweigh the sum of all dedicated/shared/vps hosting issues on the equivalent volume?


There are some nice middle options out there. I'll use Softlayer as an example as I have provisioned a lot of machinery over there.

I can order machines online and SSH in 3-4 hours later. Even exotic stuff they turn around just as fast - we saw that speed on a quad octocore box with a raid 10 of Intel SSDs.

That's real metal too, with real IO (most of my work is IO-bound, so VMs and the cloud are not options). You get to pick the exact CPUs, disks, etc., and they slot them into solid Super Micro boards and use good Adaptec disk controllers. You pay monthly and can spin down the box at any time (though you must pay for full months; no per-minute pricing like AWS).

That's the dedicated hardware side; you can also spin up compute instances, and those can be cloned and fired up in bulk. But they also have the IO problems that all other VMs have.

In any case, just wanted to mention they are a decent middle ground. Not as automated and polished as Amazon on the VM side but you can spin up mixtures of metal and VMs to get combinations that make sense - pushing compute or RAM-only stuff to VMs and keeping DBs and persistence layers on real metal. They have a few different datacenters too so you can spread gear around physical locations.


I'm fairly sure that my downtime due to a hardware failure at softlayer would be less than the downtime AWS has had for huge numbers of people this year. And hardware failures on a given server happen less frequently than 1/year on average.

Problems are just not as common if you're running on a handful of dedicated machines, and a single dedicated machine at a good host can handle a LOT without having to do all the crazy reliability engineering that running on AWS requires. You need backups, but you don't need that same assumption that you need to be able to failover instantly or you will have guaranteed downtime sometime soon. I don't think that that difference can be overstated, since it lets you focus on more important things.


Speaking of Softlayer specifically, they've diagnosed then replaced failed hardware for me (hard disks and power supplies so far) in 15-30 minutes from the time I opened a support ticket. One of the incidents was around 2AM local time where the server is and their response time was the same.


For entities that have the CapEx money to build out their own hardware to handle expected growth, and do it a little cheaper due to volume, does it still make sense to engage in the cloud game?

Or is it a better option when you are starting up, and want to be able to quickly throw hardware at a problem, should the need arise?

Apologies if this sounds like a pretty ignorant question, but I haven't implemented cloud-based services before. It seems like there is a hardware cost vs. people cost due to the newer nature of AWS and the like, and that needs to be factored into development / maintenance time.

Saving people time by relying on a known quantity like arrays of Linux servers with failure tolerance seems preferable.


"Mutually exclusive" zones may all depend upon the same administrators, same decision making, same software, same architecture design.


I agree with your entire comment with the exception of one sentence. Disagree as strongly as I can here:

I've written vigorously (in previous comments) for using cloud servers like EC2 over dedicated hosting like SoftLayer. I'm less sure about that now.

An issue at Amazon, or Rackspace, or Linode, or Slicehost need not imply failure at other providers, and cloud as an alternative to dedicated is still as viable as ever. Amazon tanking does not mean everybody needs to run back to dedicated, and my pet peeve is that when one provider takes a crap, everyone paints the cloud as toxic.

When ThePlanet's facility exploded a few years ago I did not hear lamenting that dedicated hosting was doomed. When an airliner crashes we do not say air travel is doomed. I do not understand why people rush to paint cloud as a toxic choice in light of a failure of a certain player. Admittedly a big one but there are others too and you can move.

Providers like Linode are almost exactly equivalent to dedicated hosting. They just administer the hardware for you and pay the remote hands bills. Same for Slicehost and Rackspace. It is simply far easier to wipe your instance and start over, and for all intents and purposes it acts like a dedicated box. You need to administer it like one too. Most failures of the "cloud" are really cases of designing your application in violation of the fallacies linked elsewhere.


Show me one cloud offering that gives consistently good or even average disk performance. (hint: there isn't one)

Basically, if you're running a database that does not completely fit in memory you should be on dedicated hardware.

I'd also point out that a lot of advantages that people routinely cite as cloud strengths are more about cloud vs traditional hosting or colocation as opposed to cloud vs a place like softlayer. softlayer can provision a custom build in a few hours (yeah vs. minutes, but who really cares that much) and you pay month-to-month without a contract.


Show me one cloud offering that gives consistently good or even average disk performance.

You mean like newservers.com, SoftLayer Bare Metal Cloud, stormondemand, or one of the other metal clouds?


I believe the orionvm cloud would qualify as "good or even average disk performance" (http://orionvm.com.au/blog/3rd-Party-Performance-Benchmarks/ Benched by cloudharmony.com), as they are 60% faster than a dedicated server with 4 * 15k SAS disks.

Disclaimer: I'm a director at orionvm.


I run a database that does not fit in memory on AWS with great success. Generalizing is dangerous. If you work with and understand the constraints imposed by the environment you can do some pretty amazing stuff.


I have found Linode disks to have the best performance of all:

http://pastebin.ca/2049137

You are correct, I/O is the challenge in administering systems in a virtual environment. My database, which does not fit in memory, does fine on a high-load site because I cache it responsibly. For comparison, here are awful results from a new player called ChunkHost, which I signed up for purely to test:

http://pastebin.ca/2049142

The sequential write throughput there is troubling. This comparison from a couple years back is interesting too

http://journal.uggedal.com/vps-performance-comparison/

I've linked this URL before, but it really does the best job of breaking it down. What cloud providers have you tried? In my experience there are vast gaps between certain ones, Amazon no exception. It's hard to stereotype the cloud with gaps like those.

Even if SoftLayer could provision me a new box in ten minutes the improvement to my sleep from not waking up for every disk failure and submitting a remote hands ticket at who knows how much per pop far outweighs anything else.


I have also had good experience with linode performance (just a small instance I use for a few personal projects). However, AFAIK linode is just using local on-box disk, which is a whole different animal from EBS.


To say that a database has to fit entirely in memory to achieve good performance is a ridiculous proposition, and simply shows you have zero actual knowledge of modern database server internals or administration. Countless sites happily serve oodles of pageviews per day with actual memory usage far below the disk space used by their databases. Hint: they're not swapping, either.

In general, if you really believe what you're saying, you either (1) have a very poorly designed application, (2) have a very poorly designed database environment, or (3) are speaking to a specialized application that wouldn't reflect the majority of environments operating in real life. This isn't to say it isn't a combination of these options, mind you. I didn't even start on utilizing caching in applications, because it's clear there are other hurdles to overcome first.


I don't think he is saying that. He is saying that in a non-dedicated environment, you share the same spindles with other tenants who may have different I/O access patterns than your application. Careful choice of indexes, good data locality for fast reads, making sure writes are sequential - all that goes out the window if some other application is causing the disk to seek all over the place.


Yeah, but Amazon is the 800 lb gorilla in the room when it comes to the cloud. When most people say 'the cloud' they are referring to Amazon. So when Amazon has an issue, rightly or wrongly, the entire sector gets a black eye.


What I find really interesting is the implications that an outage like this could have for Amazon's business model. Specifically, what I would like to see is transparent, complete application duplication to other regions and availability zones for certain customer configurations of particular sizes, etc.

The application would be transparently mirrored to another region, and if an event such as this occurs, the mirror would be spun up.

The customer would choose the snapshot frequency desired, and would pay for that.

Certain sites, with less dynamic content, would be mirrored and continue to operate as normal with minimal impact or cost.

Other sites, where content creation by users is fairly real-time, would pose more complex and costly mirroring situations (a la reddit).

But the option should be there.

Also, remember to think of the evolution of Amazon's services, say, 24 months from now, when this type of offering will likely become more of a reality.

As too many others have noted, it is best not to be 100% reliant on Amazon for your entire service - but at this point in time it's a little hard to spread the load between competing offerings to AWS/EC2 etc.


AWS is a bucket of Legos; it is up to you to be smart enough to build something with it.

The option IS there. I know, because I had zero downtime today and am 100% on AWS.


A quick tl;dr: Availability Zones within a Region are supposed to fail independently (until the entire Region fails catastrophically). Any sites that were designed to that 'contract' were broken by this morning's incident, because multiple AZs failed simultaneously.

I've seen a lot of misinformation about this, with people suggesting that the sites (reddit/foursquare/heroku/quora) are to blame. I believe that the sites were designed to AWS's contract/specs, and AWS broke that contract.


The contract to which you refer is entirely inferred, is it not? Amazon claims the AZs should be independent[1]:

Each availability zone runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone.

Yet what Amazon guarantees, by way of their SLA, is only 99.95% for a region[2,3]:

The Amazon EC2 SLA guarantees 99.95% availability of the service within a Region over a trailing 365 day period.

[1] http://aws.amazon.com/ec2/faqs/#How_isolated_are_Availabilit...

[2] http://aws.amazon.com/ec2/faqs/#What_does_your_Amazon_EC2_Se...

[3] Of course, they're not even meeting that right now. :-(


Ah - sorry! I don't mean a legal contract, I mean more of a technical contract. e.g. "I won't pass a null pointer" style contract.

In fact, the first bit you quoted provides an even stricter technical contract than the one on the main EC2 page - it states some degree of natural disaster tolerance, heavily suggesting separate datacenters (not just different floors). Thanks for pointing that out.

Whatever the common point of failure turns out to be, it does seem to have been shared across AZs, in violation of their FAQ.


Every time someone bitched at me for not having a "cloud-based strategy", I kept asking how many 9s of reliability they thought the cloud would deliver.

We're down to 3 nines so far. A few more hours to 2 nines.
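For reference, a quick sketch of the arithmetic behind those nines, over a trailing year:

    # Allowed downtime per year at a given availability percentage.
    HOURS_PER_YEAR = 365 * 24
    for pct in (99.95, 99.9, 99.0):
        allowed_hours = (1 - pct / 100.0) * HOURS_PER_YEAR
        print('%.2f%% -> %.1f hours/year' % (pct, allowed_hours))
    # 99.95% -> 4.4 hours/year
    # 99.90% -> 8.8 hours/year
    # 99.00% -> 87.6 hours/year (about 3.6 days)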

The cloud is not for all businesses.


"The cloud", as I understand it, is the ubiquitous, cheap and near-instantaneous availability of computing power; as in minutes instead of hours or days for new servers.

"The cloud" is not (and never has been) a cure-all for reliability issues. It's just as easy to have single points of failure as any other hosting strategy, and is just as easy (or difficult) to plan for. Companies that have planned for high availability with multi-region or multi-provider strategies will continue to be available, regardless of whether or not they are using "the cloud".


> near-instantaneous availability

That implies something about reliability. The downtime today is real data about that availability.


That's an issue with one service in one region offered by Amazon Web Services, not "the cloud" as a concept.

Use this as an example of the reliability of EBS (or if you want to broaden the scope, Amazon Web Services) all you want, but this says nothing about "the cloud" as a concept.


I kept asking how many 9s of reliability

That's a nonsensical question to ask.

If your business is amongst the chosen few that can justify the cost to guarantee any number of nines then your availability strategy involves multiple vendors anyways.

The cloud is not for all businesses.

Whether Amazon can be part of an availability strategy has nothing to do with the number of nines.


We have our business website hosted out of the Amazon cloud. Our primary servers are actually located in their affected data center. But we also have a great data team behind it, so we aren't being (to the outside observer) affected at all by the outage.

Cloud is vulnerable? Of course it is. So plan accordingly.


The same steps you would take in your own datacenters to ensure high availability would work in the cloud to ensure the same availability so I'm not sure what your point is. Measuring the availability of a few zones from one provider and broadly labeling the cloud as unreliable is a flawed argument. Netflix, for example, is entirely on AWS and is still running well today.


> The same steps you would take in your own datacenters to ensure high availability would work in the cloud to ensure the same availability

If I'm engineering the same steps in the cloud as I am in the data center, then I'm going to skip a step and just engineer the data center, because adding machines on demand is not rocket science. But maybe that's just me.


Do you already have a proven API and hardware provider who will provide you with said machines? Got them racked up and powered? If not, there's something missing.


Hmm, "bitched at you" has the ring of feeling persecuted because you didn't jump up all dreamy eyed at the latest buzz word trend. Occupational hazard I suspect.

If someone says to you "We need to improve the efficiency of our IT by adopting a cloud based strategy." Rather than ask them the 'meta' question of what sort of reliability guarantees they have, have an actual and honest talk about what IT costs and why. And perhaps they will relax their uptime requirement which will let you reduce your costs, or they will come to understand what the costs are for the level of uptime you're providing. Annual reviews of those questions (how much downtime can we tolerate, how much are we paying to achieve our current availability?) should be de rigueur.

"The cloud is not for all businesses."

Of course it isn't. However it can (and does) run some businesses more efficiently. And while Quora might be down for a day while folks at Amazon scramble to fix whatever it is they did that brought it down, their "business" won't change all that much. There will be no mass exodus of users because they couldn't get their questions answered for one day. Now if you take someone's email away for a day, that is real money, or if you take away their ability to connect to the Internet, period.

For something like icanhascheezburger even two 9s is probably good enough. That would be offline for 3.6 days of the year.


These outages are very rough. Clearly a lot of the Internet is building out on AWS, and not using multiple zones correctly in the first place. But AWS can have multi-zone problems too as we see here. Nobody is perfect.

But what people forget is: AWS has a world class team of engineers first fixing the problem, and second making sure it will never happen again. Same with Heroku, EngineYard, etc.

Host stuff on dedicated boxes racked up somewhere and you will not go down with everyone else. But my dedicated boxes on ServerBeach go down for the same reasons: hard drive failure, power outages, hurricanes, etc. And I don't have anyone to help me bring them back up, nor the interest or capacity to build out redundant services myself.

My Heroku apps are down, but I can rest easy knowing that they will bring them back up without any action on my part.

The cloud might not be perfect but the baseline is already very good and should only get better. All without you changing your business applications. Economy of scale is what the cloud is about.


The cloud might not be perfect but the baseline is already very good and should only get better.

Do we have reason to believe that it will only get better? I think it's possible the complexity of the systems we are building and the traffic they encounter will outpace our ability to manage them. Not saying I think it's the most likely outcome, but I don't feel as confident as you.


Food for thought for sure. True, nothing can get better forever...

But do we believe in "economy of scale" for computer and Internet systems in this age? Google, Amazon, Facebook, etc. have already proven to me that they have enough human and financial capital to architect and run systems that show economies of scale.

It's a bit scary to think about what it will mean when this runs out, but for now I personally feel confident that things are getting much better, and will continue to do so.


I'd say your choice between Quora's engineers being incompetent or AWS being dishonest/incompetent is a completely false dichotomy. Anyone who has been around AWS (or basically any technology) will agree that the things that can really hurt you are not always the things you considered in your design. I just can't believe that many of the people who grok the cloud were running production sites under the assumption that there was no cross-AZ risk. They use the same API endpoints, auth, etc so it's obvious they're integrated at some level.

Perhaps for Quora and the like, engineering for the amount of availability needed to withstand this kind of event was simply not cost effective, but I seriously doubt the possibility didn't occur to them. It's not even obvious to me that there are many people who did follow the contract you reference who had serious downtime. All of the cases I've read about so far have been architectures that were not robust to a single AZ failure.

As for multi-AZ RDS, it's synchronous MySQL replication on what smells like standard EC2 instances, probably backed by EBS. Our multi-AZ failover actually worked fine this morning, but I am curious how typical that was.


Read how @learnboost, who run on AWS, were not affected by the AWS outages because of their architecture design: http://blog.learnboost.com/blog/availability-redundancy-and-...


Very interesting. If I'm reading this correctly, though, if all 4 Availability Zones that they're replicated across had gone down, they would've been in the same boat.


This is, again, the problem with centralized vs. distributed services, not just Amazon's infrastructure.

http://myownstream.com/blog#2010-05-21 :)


Good essay. Coming up with a workable, decentralized alternative for domain name registrars is even harder than decentralized social apps, though.


It's all relative. From the viewpoint of a non-cloud user, this is a pretty normal situation. Systems fail. Maybe we should think about the cloud as a service that is managed somewhat differently (to enable easier access to our wallets and budgets) but eventually fails the same way standard services do. That's how I saw it when the first headlines about cloud services appeared in front of me a couple of years ago.


It's pretty wild that this stuff happens. Similar to today's nasty outage, Google has had some massive problems with its app engine datastore...

I'm curious if anyone has any predictions about what the landscape will be like in a few years? Will these be solved problems? Will cloud services lose favor? Will everything just be designed more conservatively? Will engineers finally learn to read the RTFSLA?


The benefits of the cloud are just too great; we won't go back. Except that in a few years, when something goes down, instead of it being some random site that's down, it's going to be the 20,000 sites that are hosted on that hardware.


Right now, different cloud providers "speak" different languages. But I can see that in 5 or so years the cloud will speak a similar set of languages. One could use a storage cloud from this provider and a CPU cloud from another provider.

I could eventually see, with help from functional languages like Lisp or Erlang, an intra-company cloud running on and between networks. CPU could be bought from 3 providers, and storage could be bought from 4 providers, with GPU acceleration clusters for when big data needs to be crunched quickly.

Or, right now, companies can make their own clouds via Eucalyptus. Don't want Amazon to hold your keys? Load balance between your cloud and Amazon's.


One data point. I have one of my clients' servers in the east-1d availability zone. East coast region, zone d. So far things are holding up, no crash or no slow down. Fingers crossed.


Note: your "zone d" is not my "zone d". AWS shuffles zone IDs across users. See http://alestic.com/2009/07/ec2-availability-zones


We're spread throughout the eastern region and haven't had an issue yet. Lucky for us:)


I use DreamHost and have never had a failure like this Amazon one.

It's ironic.


Oh, but I've had all sorts of other fun failures with dreamhost in the early days. A number of us regularly called it "dreamhose". It seems to have matured, and I keep some material on there, but I'm still wary of putting anything mission critical on it.


I had one client project on DreamHost, and never again. My experience with them was that even when it was "up" it could be down. Lots of mysterious glitches and weirdness, stomping, restarts. Not 100% sure it was their fault. But didn't see any definite evidence it was ours either. In comparison, WebFaction and Linode have been great. Though I settled on Linode for all new projects for several reasons that I felt made them better in the general case.


The ending of this article came off as slanderous rather than just a report of why the problem occurred. Keep it.


[deleted]


Reddit goes down when a butterfly in India flaps her wings.


Hmm, not sure why I got downvoted on this... I didn't mean that Reddit went down, but that they froze the creation of new content because of it.

"reddit is in "emergency read-only mode" right now because Amazon is experiencing a degradation. they are working on it but we are still waiting for them to get to our volumes. you won't be able to log in. we're sorry and will fix the site as soon as we can."


If anything (and this may well be due to their experience with faults in the infrastructure), it shows that they have a fallback, as opposed to the other services that are completely unavailable, have no way to continue business at some reduced capacity, and are just playing a waiting game at this point. Credit goes out to Reddit in this scenario, imho.


I downvoted because the article points out that Reddit is down in the first paragraph.


That is a good point. I wonder how much of their infrastructure is actually on EC2.


@dwwoelfel Haha. That's my bad then since I didn't actually read the article and was just frustrated that part of Reddit was down because of Amazon..

thx for the explanation.


The Reddit people do read HN and do what they can with the resources and manpower available to them. It disgusts me to see so much knocking of Reddit on HN particularly when Amazon is usually the problem.

I would upvote your cheap shot if you administered a top-200 site with a technical staff of three.


I figured. :(

I don't have much karma to begin with so I didn't mean to "offend" anyone. Just thought it would be interesting discussion since I tried to post to Reddit today and realized it wasn't possible. Thanks for helping me recover some of my karma back :)


I was responding to parfe not you.


>The Reddit people do read HN

Well they certainly weren't reading Reddit.

A cheap shot is making fun of a person for having a birth defect or a dead mother, but ok, I guess I came a little close with this last one. I'll be sure to be nicer to the $10 million corporation.


What if he administers a Fortune 500 company with a staff of 33k and a client list in the millions?


Then I still wouldn't upvote his cheap shot? Nothing really hidden in that sentence.


Crazy how that crashed and brought down other sites like Reddit, Quora, etc.


Given that those sites are hosted on EC2, it's no more crazy than blowing up a car resulting in killing its passengers.



