Amazon makes it pretty clear that Availability Zones within the same region can ...

justinsb · on April 21, 2011

I have to disagree with you. The SLA is just a legal agreement that really serves to limit AWS's liability. Here's what the main EC2 page says:

"Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location."

http://aws.amazon.com/ec2/

That's the spec that everyone was building to, but that isn't what is happening. Of course you're right, multiple AZs can fail at the same time, but I read the above as saying that they should fail independently/coincidentally (until the entire Region fails).

bphogan · on April 21, 2011

We always, always use the SLA offered by a vendor as the basis for our information. We trust it more than any marketing page, sales pitch, tech support FAQ, or anything else. That's what they'll hide behind, so that's what I'll have in mind when I design my setup.

justinsb · on April 21, 2011

I think it's great to check the SLA. However, there's enough wiggle room in the AWS SLA that I think this outage could continue for the rest of the month, and Amazon would still not owe a penny. I don't even know that the SLA covers this outage, because network connectivity isn't affected.

Even if Amazon breach their SLA, I think they only have to refund 10% of one month's bill per year - i.e. roughly a 1% discount. I suspect they'd make a good profit even if they paid out a full 10% refund every month.

Unless an SLA is accelerated - i.e. >100% refund - I don't think it's worth taking particularly seriously.

Of course if an SLA only guarantees 95% uptime, that's probably a big hint to design for failure!
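To put rough numbers on the two figures above, here is a minimal sketch. It assumes the 10% credit on one month's bill and the hypothetical 95% uptime figure from this comment; neither is taken from the actual AWS SLA text, and the function names are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # ignoring leap years


def allowed_downtime_hours(uptime_fraction, period_hours=HOURS_PER_YEAR):
    """Hours of downtime that still satisfy the given uptime guarantee."""
    return (1.0 - uptime_fraction) * period_hours


def effective_annual_discount(credit_fraction, months_credited=1):
    """Credit on `months_credited` monthly bills, as a fraction of a year's spend.

    Assumes every monthly bill is the same size.
    """
    return credit_fraction * months_credited / 12


if __name__ == "__main__":
    # Hypothetical 95% guarantee from the comment above.
    print(f"95% uptime allows {allowed_downtime_hours(0.95):.0f} hours down per year")
    # Assumed 10% credit on a single month's bill.
    print(f"Effective discount: {effective_annual_discount(0.10):.2%} of annual spend")
```

On these assumptions a 95% guarantee tolerates about 438 hours (a bit over 18 days) of downtime a year, and a 10% credit on one month works out to a little under 1% of the annual bill.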

bphogan · on April 21, 2011

Yeah but I don't care about getting my money back as much as I care about how much they claim to be down.

It's like the hard disk maker that gives you a 1 year warranty vs a 5 year warranty... which one believes in their product more? :)

justinsb · on April 21, 2011

It's a good analogy and I certainly accept your point. It could just be a marketing thing though:

Suppose it's the same hard disk with a black sticker instead of a blue sticker. Drive with 1 yr warranty @ $100, 5 yr warranty @ $150, 20% additional failure rate over the extra 4 years, 50% redemption rate on failed drives. Cost per replaced drives = 20% * 50% * ($100 + $30 processing costs) = $13 = $37 profit.

Totally fictitious numbers to try to prove my point, of course :-) But as the SLA becomes increasingly low in value, the signalling value decreases in my book.

(Edit - fixed my math!)
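The arithmetic above can be checked with a short sketch. All inputs are the fictitious numbers from this comment, and the function name is invented for illustration. It assumes, as the comment does, that replacing a drive costs the base price plus the processing cost.

```python
def extended_warranty_profit(base_price, extended_price,
                             extra_failure_rate, redemption_rate,
                             processing_cost):
    """Expected extra profit per drive sold with the longer warranty."""
    # Extra revenue from charging more for the same drive.
    extra_revenue = extended_price - base_price
    # Expected cost: only drives that fail in the extra years AND are
    # actually returned cost the vendor a replacement plus processing.
    expected_cost = (extra_failure_rate * redemption_rate
                     * (base_price + processing_cost))
    return extra_revenue - expected_cost


if __name__ == "__main__":
    profit = extended_warranty_profit(
        base_price=100, extended_price=150,
        extra_failure_rate=0.20, redemption_rate=0.50,
        processing_cost=30,
    )
    print(f"Expected extra profit per drive: ${profit:.2f}")
```

With those inputs the expected replacement cost is $13 per drive against $50 of extra revenue, which matches the $37 figure in the comment.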

marshray · on April 21, 2011

It tells you very little.

One of them may be planning to be out of business, sell the HD business unit in 2 years, shove off the risk via financial wizardry, etc.

My guess is the great majority of users will not RMA a dead hard drive after 4.5 years regardless of the stated warranty. Even if they did, it would only represent replacement with a future smallest-possible-capacity drive.

polynomial · on April 24, 2011

> there's enough wiggle room in the AWS SLA that I think this outage could continue for the rest of the month, and Amazon would still not owe a penny.

While agreeing it's not about the money, it's about my site being up, I nevertheless was pretty shocked by this statement.

jeresig · on April 21, 2011

"Of course you're right, multiple AZs can fail at the same time, but I read the above as saying that they should fail independently/coincidentally."

As far as I know we've heard nothing to the contrary from Amazon - it's totally possible that multiple AZs happened to fail independently/coincidentally. Perhaps it was simultaneous equipment failure? Or maybe one AZ failed and a sufficient number of people attempted to "fail over" to another AZ causing a chain reaction of failure?

justinsb · on April 21, 2011

It is possible. I think it's exceptionally unlikely.

The one bit of information we have suggests that the root cause was a networking issue, which suggests SPOF.

leoc · on April 21, 2011

If I had to, I'd guess that AWS's messaging/monitoring/control infrastructure is likely to be the SPOF, as in the 2008 outage: http://status.aws.amazon.com/s3-20080720.html It's an obvious weak spot in the independence/isolation of AWS's nodes, and it would seem to be the one most likely to cause failure to reach across more than one AZ. (Apart from a stampede from one affected AZ to others, perhaps.)

pessimist · on April 21, 2011

I'm sorry, designing your service without taking into account the SLA is just stupid. See how Netflix survived the failure, for example.

Now if you understand the SLA and still choose not to do cross-region deployments, then you've taken a cost/complexity vs uptime trade-off, which may well be right for you. quora.com probably is ok - who cares if it's down for a day?

jpdoctor · on April 21, 2011

The SLA uses great legal weasel words: "AWS will use commercially reasonable efforts to make Amazon EC2 available"

So anything that is beyond commercially reasonable is outside the SLA.

In truth, as with all businesses, the reputation for uptime weighs more heavily than the written contract. It will be interesting to see how the AWS people attempt to make amends.

tomkarlo · on April 21, 2011

"Commercially reasonable" is a standard legal term used to define efforts short of "best efforts". It allows the party to also look out for its own commercial interest in a way that's consistent with industry practice. So, for example, if Amazon had to choose between fulfilling the SLA and keeping its own retail site up, it could be held liable under a "best efforts" standard but not under a "commercially reasonable" standard.

It's kind of unfair to describe these as "weasel words" when it's unlikely that any decent lawyer would let them sign up to something that exposes them to more liability than this. Customers who are using any cloud service provider have to expect reasonable steps to maintain availability, not an absolute promise.