Funny thing is, in the last couple of interviews I've had in Chicago and Silicon Valley, I've actually scored points by explaining that caution is necessary when using AWS for production.
I can't help but feel that Amazon are a bit of a Cassandra (the mythological figure, not the software) when these outages occur.
They recommend that people fail over to other availability zones, but no one puts any effort into doing it, and then they get annoyed when a datacenter goes offline.
It's not Amazon's fault that you didn't make your service failure tolerant - it's your fault!
I'm seeing a lot of this type of comment. The thing is, AWS completely crapped out. Don't believe their status updates that make it sound like it was a tiny little area of their data center. It was pretty much the entire zone, and whenever there is an outage affecting an entire zone it brings down global services and even other zones as well.
We had servers in the bad zone and started having load issues. When I went to use the cool cloud features that are made for this, the entire thing completely fell on its face. I couldn't launch new EC2 servers, either because the API was so bogged down or because the new zone I was launching in was restricted due to load.
Basically, the thing that nobody keeps in mind when they think it's so cool that you can spin up servers to work around outages is that EVERYONE IS DOING THAT. This is Amazon's entire selling point and when it comes to doing it, it doesn't work!
We were lucky to get some new servers launched before the API pretty much completely went down. They started giving everyone errors saying request limit exceeded. The forums were full of people asking about it.
ELB, Elastic IP, and other services not associated with a single availability zone completely failed. I keep seeing comments saying that if people designed their stuff right, they wouldn't have an issue. That's just complete bull: AWS has serious design flaws, and they show up in every outage. It's NOT just people relying on a single zone.
Totally agree. A lot of people don't know this, or substitute alternatives which are not necessarily viable. Among the tenets of reliability is isolation. The nature of Amazon's services is that it isolates at the datacenter level. One should isolate at the level at which they are comfortable taking on failures. Once there is an active dependency, a la EBS, the number of subsystems increases many-fold and the likelihood of failure and cascading failure dramatically increases.
Where getting a bit from disk to memory used to be: platter -> diskcontroller -> cpu -> memory,
now with SANs & NFS & virtualized block storage, it's:
platter -> diskcontroller -> cpu -> memory -> nic -> wire -> switch(es)/router(s)/network configs(human config item) -> wire -> nic -> cpu -> memory.
Not to say that centralized storage doesn't have its benefits, but the scope of isolation has drastically increased. Considering the combinatorial possibilities of failure in the former scenario versus the latter, the latter has many more failure modes, a significantly larger chance of failure, and failover that is much more difficult to automate programmatically.
TL;DR: With Amazon, the scope of isolation is the datacenter. To be on Amazon, one must architect and design to handle failure at the datacenter level, rather than at the host or cluster level.
We didn't have downtime, for various reasons, but the ELBs we were using failed, and the queue of starting instances was so backed up that it took ages for our few replacements to come up.
The main systemic issue in EC2 is EBS; take that away and it would remove downtime almost completely.
The problem with ELBs is that they are themselves EC2 instances and use many of the global services for detecting load, scaling up, etc.. Like all of AWS's value-added services, they are therefore more likely to fail during an outage event, not less likely, as they depend on more services.
Right. Blame it on the victim. How do you make a "fault tolerant" service when core services like ELB, together with the API behind it, start to fail? Multi-region? Multi-cloud? When is the "designed to make web-scale computing easier" part supposed to kick in? With half-baked products like ELB, or things like EIP that cease to work when you need them the most?
I actually asked the AWS Premium support regarding the ELB multi-AZ issues, in order to actually make things easier for everyone. This is the answer I got:
"As it stands right now, you would need to make a call to ELB to disable the failed AZ. It may be possible for you to programatically/script this process in the case of an event.
Going forward, this is something that we would like to address but I don't have any ETA for when something like this might be implemented."
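For what it's worth, the scripting they suggest is only a couple of API calls. A rough sketch with boto3 (the load balancer name and the impaired zone are placeholders; back then you'd have used boto 2 or the ELB command-line tools, but it's the same idea):

    import boto3

    elb = boto3.client("elb", region_name="us-east-1")

    LB_NAME = "my-load-balancer"   # placeholder
    BAD_AZ = "us-east-1b"          # the impaired zone for *your* account

    # Stop the ELB from routing traffic to the failed zone...
    elb.disable_availability_zones_for_load_balancer(
        LoadBalancerName=LB_NAME,
        AvailabilityZones=[BAD_AZ],
    )

    # ...and re-enable it once the zone recovers:
    # elb.enable_availability_zones_for_load_balancer(
    #     LoadBalancerName=LB_NAME, AvailabilityZones=[BAD_AZ])

The catch, as others in this thread point out, is that this only helps if the ELB API itself is still answering.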
IMO, there's plenty of blame to go around, however the onus really should be on the individuals who are making the decision to go on Amazon and trust that their services will always be up. Unfortunately, some people don't know better, so they will just blindly choose Amazon for the name recognition.
For the places that truly care about reliability and have the technical staff to make informed decisions, they should understand the limits of reliability with various architectures. As I mentioned before, one of the tenets of reliability is isolation. When the scope of isolation is increased (e.g. single host vs multi host), one must also handle failures at that scope. Amazon isolates at the datacenter level. So should those utilizing Amazon's offerings.
My question is this... why doesn't a service like Heroku, which acts as a PaaS, have this built in? I'm on their blog now trying to understand their complete setup.
Perhaps they should - though Rightscale.com provides a service that helps you do just that. (Disclaimer: I do not work for Rightscale or know anyone that works there)
It reminds me of my father: I used to interrupt him with "I know dad!" when he was chastising me. His response was simple "If you knew, then why did[n't] you do it?"
It's bizarre the way "cloud" makes so many people think disks never fail, networks are perfect and data centers always run smoothly. Now we'll get the backlash blog posts from people ditching the cloud - and I'm just waiting for the inevitable rebound outages when they learn that high availability requires geographic redundancy either way.
That's just silly given how much of Amazon's documentation strongly encourages you to use multiple AZs and regions for reliability. I've included a sampling of their whitepapers below; this is also what their salespeople tell you and what you have to click past any time you provision an RDS instance, ELB, etc.
“Be a pessimist when designing architectures in the cloud”
http://media.amazonwebservices.com/AWS_Web_Hosting_Best_Prac...
“As the AWS web hosting architecture diagram in this paper shows, we recommend that you deploy EC2 hosts across multiple Availability Zones to make your web application more fault-tolerant.”
“We have deployed critical components of our applications across multiple availability zones, are appropriately replicating data between zones, and have tested how failure within these components affects application availability”
I think the main problem is that "the cloud"'s primary audience is developers, not sysadmins. Many developers simply don't appreciate that what you're getting is a [much] easier path to automating your server provisioning and management, but you're still in exactly the same position as before regarding any bit of infrastructure's ability to fail at the least convenient moment.
As a start, be present in multiple EC2 availability zones (not just a single zone in us-east-1, basically) and regions (this is harder). Cross-region presence needn't be active-active: just a few read-only database slaves and some machines to handle SSL termination ("points of presence") for your customers on the east coast. Perform regular "fire drills" where you actually fail over live traffic and primary databases from one AZ/one region to another.
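For the database half of such a fire drill, the key step is promoting the cross-region read replica and repointing the app at it. A minimal sketch with boto3, assuming an RDS read replica with the placeholder identifier below (a real drill also covers DNS and config changes):

    import boto3

    rds = boto3.client("rds", region_name="us-west-2")

    REPLICA_ID = "myapp-db-replica-west"   # placeholder replica name

    # Promote the read replica to a standalone primary.
    rds.promote_read_replica(DBInstanceIdentifier=REPLICA_ID)

    # Wait for it to come up, then repoint the app at its endpoint.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=REPLICA_ID)
    info = rds.describe_db_instances(DBInstanceIdentifier=REPLICA_ID)
    print("new primary:", info["DBInstances"][0]["Endpoint"]["Address"])

Doing this regularly, on live traffic, is what separates a failover plan from a failover hope.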
"Building your own" is also something very few people (including Amazon itself up until fairly late, probably after the IPO) do: you can use a managed hosting provider (very common, usually cheaper than EC2) or lease colo space (which doesn't imply maintaining on-site personnel in the leased space: most colos provide "remote hands"). You can still use EC2 for async processing and offline computation, S3 for blob storage, etc... or even S3 for "points of presence" on different US coasts, Asia/Pacific, Europe, but run databases, et al in a leased colo or a managed hosting provider.
Yes, these options are more expensive than running a few instances in a single EC2 AZ: but that's the price of offering high availability SLA to your customers. It's a business decision.
We run gear in multiple physical locations, but both the application and data is stored/backed in S3. This allows us the redundancy of S3 without the cost and fragile nature of EC2.
Not to mention unless you have very unusual traffic patterns (spin up lots of servers for short periods of time), colo/dedicated servers will usually be vastly cheaper than EC2, especially because with a little bit of thought you can get servers that are substantially better fit for your use.
E.g. I'm currently about to install a new 2U chassis in one of our racks. It holds 4 independent servers, each with dual 6-core 2.6GHz Intel CPUs, 32GB RAM and an SSD RAID subsystem that easily gives 500MB/sec throughput.
Total leasing cost + cost of a half rack in that data centre + 100Mbps of bandwidth is ~ $2500/month. Oh, and that leaves us with 20U of space for other servers, so every additional one adds $1500/month for the next 7-8 or so of them (when counting some space for switches and PDUs). Amortized cost of putting 2U with 100Mbps in that data centre is more like $1700/month.
Amazon doesn't have anything remotely comparable in terms of performance. To be charitable to EC2, at the low end we'd be looking at 4 x High Mem Quadruple Extra Large instances + 4 x EBS volumes + bandwidth and end up in the $6k region (adding the extra memory to our servers would cost us an extra $100-$200/month in leasing cost, but we don't need it), but the EBS IO capacity is simply nowhere near what we see from a local high end RAID setup with high end SSDs, and disk IO is usually our limiting factor. More likely we'd be looking at $8k-$10k to get anything comparable through a higher number of smaller instances.
I get that developers like the apparent simplicity of deploying to AWS. But I don't get companies that stick with it for their base load once they grow enough that the cost overhead could easily fund a substantial ops team... Handling spikes or bulk jobs that are needed now and again, sure. As it is, our operations cost in man-hours spent, for 20+ chassis across two colos, is ~$120k/year: $10k/month, or ~$500 per chassis. So consider our fully loaded cost per box to be ~$2,200/month for a quad-server chassis of the level mentioned above, with reasonably full racks. Let's say $2,500 again to be charitable to EC2...
This is with operational support far beyond what Amazon provides, as it includes time from me and other members of staff who know the specifics of our applications, handle backups, and handle configuration and deployment, etc.
I've so far not worked on anything where I could justify the cost of EC2 for production use for base load, and I don't think that'll change anytime soon...
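For anyone who wants to sanity-check the arithmetic, here it is spelled out, using only the rough numbers quoted above (Python; all figures are the estimates given, not real quotes):

    # Colo-side estimates from the comment above (USD/month).
    first_chassis_2u = 2500    # chassis lease + half rack + 100Mbps
    additional_2u = 1500       # each further 2U in the same half rack
    amortized_2u = 1700        # blended cost quoted for a 2U with 100Mbps

    # Ops overhead: ~$120k/year spread over 20+ chassis.
    ops_per_chassis = 120000 / 12 / 20            # ~$500/month per chassis

    fully_loaded = amortized_2u + ops_per_chassis
    print(round(ops_per_chassis), round(fully_loaded))   # ~500, ~2200/month
    # vs. the charitable EC2 estimate above of ~$6k-$10k/month for comparable kit.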
If disk performance is important you can also take a look at the High IO instances, which give you 2x 1TB SSDs, 60GB of RAM and 35 ECUs across 16 virtual cores. At 24x7 for 3 years you end up with ~$656/mo per instance, plus whatever you would need for bandwidth. By the time you fill up an entire rack it still ends up being slightly more expensive than your amortized 2U cost, but you also don't need to scale it up in 2U increments.
Completely agree, building your own is cheaper, gives more control, etc. But what is more: you do NOT lose the ability to use the cloud for added reliability: it is pretty cheap to have an EC2 instance standing by that you fail over to.
If you are very database heavy, and you want to be able to replicate that to the cloud in real time it does get expensive, but if you can tolerate a little downtime while the database gets synced up and the instances spin up that's cheap too.
We have SQL Server 2008 boxes with 128GB+ of RAM; we're able to run all of our production databases right out of memory. This would be cost-prohibitive in a virtualized environment such as AWS, Linode, etc.
Did you know that many websites operated BEFORE Amazon Web Services existed? Perhaps going back to 2008 could give us some ideas for alternate deployment methodologies...
For the very early stage, perhaps. Once you're dealing with more than a handful of instances, it is extremely likely you'd save a substantial amount of money moving your base load off EC2.
I wish that DNS could just switch over to another availability zone when this happens. A datacenter with all the sql replicated. Sure you'd pay twice as much for EBS, but it could also double as a backup. As for the other resources at the availability zone, they are hardly utilized until they spring into action (EC2, etc.)
Take care when treating a high-availability setup such as this as a backup: if you are replicating all changes between two database servers and an application error (as opposed to a database outage) causes some kind of data corruption, you are hosed if the corruption replicates and you don't also have some previous "snapshot" of the data that you can roll back to.
You can do this, depending on your database. I wrote a blog post going over some of the techniques you can use with the AWS stack a while back - http://bit.ly/TD13iH
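The DNS side of this is basically flipping a low-TTL record to the standby endpoint. A rough illustration with boto3 and Route 53 (the zone ID, hostnames and TTL are placeholders; in practice you'd drive this from a health check rather than by hand):

    import boto3

    r53 = boto3.client("route53")

    ZONE_ID = "Z1EXAMPLE"                          # placeholder hosted zone
    RECORD = "app.example.com."
    STANDBY = "standby.us-west-2.example.com."     # endpoint in the other AZ/region

    r53.change_resource_record_sets(
        HostedZoneId=ZONE_ID,
        ChangeBatch={
            "Comment": "fail over to standby",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD,
                    "Type": "CNAME",
                    "TTL": 60,   # keep the TTL low so the flip takes effect quickly
                    "ResourceRecords": [{"Value": STANDBY}],
                },
            }],
        },
    )

The snapshot caveat in the parent comment still applies: replication is not a backup.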
Nothing has changed in the stack. Robert has discovered and eliminated a series of bottlenecks, causing performance to oscillate about tolerable. Finding bottlenecks is not trivial, because Arc has zilch in the way of profiling, but fortunately Robert is good at this sort of thing.
You can buy some seriously big boxes, and easily split off a lot of services onto multiple boxes. The big problem with the "single big box" strategy is being able to do upgrades -- I see hn go down frequently for 5-10 min at a time in the middle of the night, which I assume is upgrades/reboots.
The happy medium is probably splitting database (master/slave at least) and cdn (if needed) and some other services (AAA? logging?) out, and then having 2+ front end servers with load balancing for availability.
The Arc process hosting HN blows up at least once an hour (I wouldn't be surprised if there was a cronjob restarting it) and much more frequently in peak usage periods.
You wouldn't notice if it weren't for the use of closures for every form and all pagination: every time the process dies, all of them become invalid (except in the rare case that they lead to a random new place!).
There's no database; everything is in memory, loaded on demand from flat files. That wouldn't be so bad except that it's all then addressed by memory location rather than by content identifier! There can be only one server per app, and to keep it really interesting PG hosts all the apps on the same box; during YC application periods he regularly throttles HN to keep the other apps more available.
HN doesn't have a database capable of master/slave as such... so I think this would be harder if it ever became popular enough. From what I know, I don't think it gets enough traffic that it's ever likely to exceed what you can fit in a single box.
In the first 99 comments of this page, average comment text size is 231 bytes. Counting all comments in articles on the front page right now, there's 1678 of them, making somewhere around 388kb of comments for the past 12 hours.
So for safety's sake round that to 1mb/day and multiply by site age (5 years).
That gets us 1825mb, projecting forward it's difficult to imagine a time when a single recent SSD on a machine with even average RAM wouldn't be able to handle all of HN's traffic needs. Considering the recent beefy Micron P320h and its 785k IOPS, that could serve the entire comment history of Hacker News to the present day once every 2 seconds, assuming it wasn't already occupying a teensy <2gb of RAM.
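Spelling that back-of-envelope out (same assumed inputs: 231 bytes per comment, 1678 front-page comments per ~12 hours, 5 years of history):

    avg_comment_bytes = 231
    front_page_comments = 1678

    per_12h_kb = avg_comment_bytes * front_page_comments / 1000.0   # ~388 kB
    per_day_mb = 1.0                                                # rounded up
    total_mb = per_day_mb * 365 * 5                                 # 1825 MB

    print(per_12h_kb, total_mb)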
Even if Arc became a burden, a decent NAS box, gigabit Ethernet, and a few front end servers would probably take the site well into the future. Assuming exponential growth, Hacker News comments would max out a 512GB SSD sometime around 2020, or 2021 assuming gzip bought a final doubling.
Clearly pg should release the dataset and institute an annual round of HN golf, where participants compete by recreating HN and trying to get the best performance for a given (changing) deployment target (SSD vs HDD, different RAM & CPU).
Well, for one, the site is basically an engine for rendering out <table><tr><td><a> ..., without much in the way of complex or frequent client-side requests.
(this isn't to downplay the challenges faced by scaling a site with the amount of traffic HN gets)
The N. Virginia datacenter has been historically unreliable. I moved my personal projects to the West Coast (Oregon and N. California) and I have seen no significant issues in the past year.
N. Virginia is both cheaper and closer to the center of mass of the developed world. I'm surprised Amazon hasn't managed to make it more reliable.
One thing we discovered this morning: it appears the AWS console itself is hosted in N Virginia.
This means that if you were trying to make changes to your EC2 instances in the West using the GUI, you couldn't, even though the instances themselves were unaffected.
shouldn't amazon themselves have architected their own app to be able to move around?
I get tired of the snipes from people that "well, you're doing it wrong", as if this is trivial stuff. But if Amazon themselves aren't even making their AWS console redundant between locations, how easy/straightforward is it for anyone else?
To what extent is this just "the cobbler's kids have no shoes?"
You're close. Put another way, "inherent complexity is the problem."
What I mean by that is, the more your system is coupled, the more it is brittle.
Frankly, this is AWS's issue. It is too coupled: RDS relies on EBS, the console relies on both, etc. Any connection between two systems is a POF and must be architected to let those systems operate w/o that connection. This is why SMTP works the way it does. Real time service delivery isn't the problem, but counting on it is.
Depends. Generic interfaces and non-reliance have costs too. In general I agree that things should be decoupled, but it's not always easy or practical.
Surely true, but that's the purpose of a system in the first place: to manage complexity and make it predictable. You could argue that we have such a system in place, given how well the Internet works overall. The fact that this system has problems goes against what I believe is fully evident proof that such a system can, in fact, work even better.
We're not talking about a leap in order of magnitude of complexity here—just simple management of common human behavioral tendencies in order to promote more reliability. "The problem is inherently complex" is always true and will always be true, but it's no excuse for not designing a system to gracefully handle that complexity.
The internet works because it provides very weak consistency guarantees compared to what businesses might require out of an EC2 management console. (IMO.)
Their CoLo space is the same space shared by AOL and a few other big name tech companies. It's right next to the Greenway, just before you reach IAD going northeast. That CoLo facility seems pretty unreliable in the scheme of things; Verizon and Amazon both took major downtime this summer when a pretty hefty storm rolled through VA[1], but AOL's dedicated datacenters in the same 10 mile radius all experienced no downtime whatsoever.
To be fair, the entire region was decimated by that storm. I didn't have power for 5 days. Much of the area was out. There was a ton of physical damage. That's not excusing them, they should do better, but that storm was like nothing I've experienced living in the area for 20 years.
Yeah, it's got to be much larger than the other regions, so it makes sense that we see more errors. Since error_rate = machines * error_rate_per_machine.
No, I calculated the error rate for the region. If us-east-1 has 5 times the machines (or availability zones, or routers, or EBS backplanes, or other thing-that-can-fail) as us-west-1, we would expect to see us-east-1 have each type of error occur about 5 times as often as us-west-1.
I'm surprised Amazon hasn't built another region in the east. If you're in the west you get us-west-1 and us-west-2 and can fail over and distribute between the two; why don't they have that kind of duplication in the east?
us-east-1 was 11 different datacenters last time I bothered to check.
us-west-2 by comparison is two datacenters. The reason us-west-1 and us-west-2 exist as separate regions is that they are geographically diverse enough that low-latency interconnects between them aren't practical (and they also have dramatically different power costs, so they bill differently).
Then how come when east goes down, it always seems to take down all the AZs in the region, never just one AZ? As long as the region fails like a single datacenter, I'll think of it as a single datacenter.
They already expanded into a DC in Chantilly, one more in Ashburn and I believe one in Manassas. But they lean on Ashburn for everything they do, and a small problem results in a daisy-chain failure (which because everyone uses Amazon for every service imaginable, means even the smallest problem takes down whole websites)
I don't understand why anyone's site is only in one datacenter. I thought the point of AWS was that it was distributed with fault tolerance? Why don't they distribute all the sites/apps across all their centers?
It takes development/engineering resources, and additional hardware resources to make your architecture more fault-tolerant and to maintain this fault-tolerance over long periods of time.
Weigh this against the estimated costs of your application going down occasionally. It's really only economical for the largest applications (Netflix, etc.) to build these systems.
Disagree. The only area where it really hurts the wallet is multi-AZ on your RDS, because it doubles your cost no matter what, and RDS is the toughest thing to scale horizontally. The upside is that if you scale your data layer horizontally, you don't need RDS anymore.
Two c1.medium instances, which are very nice for webservers, are enough to host >1M pageviews a month (WordPress, not much caching) and cost around $120/mo each, effectively $97/mo if you prepay for 12 months at a time via reserved instances.
The other issue is that you can have redundant services, but when the control plane goes down - you are screwed.
Every day I have to build basic redundancy into my applications I wish that we could just go with a service provider (like Rackspace / Contegix) that offered more redundancy at the hardware level.
I know the cloud is awesome and all, but having to assume your disks will disappear, fail, go slow at random uncontrollable times is expensive to design around.
If you don't have an elastic load, then the cloud elasticity is pointless - and is ultimately an anchor around your infrastructure.
us-west-2 is about the same cost as us-east these days. And latency is only ~10ms more than us-west-1. I'm puzzled that people aren't flocking to us-west-2. I can't remember the last time there was an outage there, either.
To be fair, AWS downtime always makes the news because it affects a lot of major websites, but that doesn't mean an average sysadmin (or devops, whatever) would do better in terms of uptime with his own rack and his own toys.
But this is part of the problem: we have multiple web properties, and the fact that AWS issues can affect all of them at once is a huge downside. Certainly, if we ran on metal, we would have hardware fail, but failures would be likely to be better-isolated than at Amazon.
1. Calculate the odds that a company with the resources of Amazon will be able to provide you better overall uptime and fault tolerance than you yourself could.
2. Calculate the cost of moving to the Oregon AWS datacenter.
3. Reassure your investors that outsourcing non-core competencies is still the way to go.
> we can actually get into the datacenter to fix it.
But better and faster than Amazon?
I'd rather spend three hours at home saying "Shit. Well, we'll just wait for Amazon to fix that", than dropping my dinner, driving to the datacenter, and spending three hours setting up a new instance and restoring from backup.
For added fun, their EC2 console is down. I got this for a while:
<html><body><b>Http/1.1 Service Unavailable</b></body> </html>
... then an empty console saying "loading" for the last 20 minutes. Then recently it upgraded to saying "Request limit exceeded." in place of the loading message (because hey, I'd refreshed the page four times over the course of 20 minutes).
On the upside, their status page shows all green lights.
They have standardized icons to represent various levels of issues (orange = perf issue, red = disruption). But they don't even use them. Instead they add [i] to the green icons to indicate perf issues (Amazon Elastic Compute Cloud - N. Virginia) and disruptions of service (Amazon Relational Database Service - N. Virginia).
Maybe this status page is controlled by marketing bozos who want to pretend the situation is not so bad.
Does Amazon explain anywhere what regions, availability zones or whatever have single points of failure for each of their services? I guess that's supposed to be what an "availability zone" is but somehow that doesn't quite seem to capture it. It's pretty hard to build reliable apps without knowing where the single points of failure are in the underlying infrastructure.
All of our EC2 hosts appear to be functioning fine, but they can't connect to their RDS instance which renders our app useless. If you scroll down the page you'll also see that RDS instances are having connectivity issues. Not sure if it's related but for RDS users the impact is far worse.
EDIT: We are also using multi-AZ RDS, so either Amazon's claims for multi-AZ are bs, or their claims that this is only impacting a single zone is bs.
Because RDS is built on EBS, any slight issue with EBS manifests itself as a nastier issue for RDS.
Interestingly, EBS will never return an I/O error up to the attached OS, which is likely a good decision as most OSes choke on disk errors. What this means, however, is that if something even gets slow within EBS (let alone stuck), applications that depend on it will suffer. Most of these applications (such as databases) have connection/response timeouts for their clients, so while EBS might just be running slowly, a service like RDS will throw up connection errors instead of waiting a bit longer.
You can imagine the cascading errors that might result from such a situation (instance looks dead, start failover...etc)
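This is also why client-side timeouts matter so much here: a stalled volume reaches your app as a slow query, and whether that becomes a fast, handleable error or a pile-up of stuck workers depends on driver settings. An illustrative sketch (PyMySQL; the endpoint and timeout values are arbitrary):

    import pymysql

    # Fail fast instead of letting requests queue up behind a stalled volume.
    conn = pymysql.connect(
        host="mydb.example.com",   # placeholder DB endpoint
        user="app",
        password="secret",
        database="production",
        connect_timeout=5,   # give up on connecting after 5s
        read_timeout=10,     # give up on a query after 10s
        write_timeout=10,
    )

    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
    except pymysql.err.OperationalError:
        # The disk may only be slow, but past the timeout we treat it as down
        # and can fail over or shed load instead of hanging.
        pass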
I had one multi-AZ RDS instance fail over correctly, however the security group was refusing connections from the web servers' EC2 security group. I had to manually add the private IPs of the EC2 instances. It appears the API issue is affecting security-group-to-IP lookups.
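If the group-to-group lookups are what's broken, adding the raw private IPs is one API call per host. A sketch with boto3, assuming the database sits behind a VPC security group (group ID, port and addresses are placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    DB_SECURITY_GROUP = "sg-0123456789abcdef0"     # placeholder DB-side group
    WEB_PRIVATE_IPS = ["10.0.1.10", "10.0.1.11"]   # private IPs of the web hosts

    for ip in WEB_PRIVATE_IPS:
        ec2.authorize_security_group_ingress(
            GroupId=DB_SECURITY_GROUP,
            IpProtocol="tcp",
            FromPort=3306,        # MySQL, as an example
            ToPort=3306,
            CidrIp=ip + "/32",
        )

Remember to remove the temporary rules once the normal group references work again.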
Per the linked dashboard, some instances in a single AZ in a single Region are having storage issues. Calling EC2 "down" is a bit dramatic, provided AMZN are being sincere with their status reports. Any system that can competently fail over to another AZ will be unaffected.
I would agree with you, but Amazon is just downright dishonest in their reports, which makes me sad, because I love Amazon. Go look at the past reports: they've never shown a red marker, only "degraded performance", even when services for multiple availability zones went down at the same time due to their power outage (so had you architected for multiple AZs you were still fucked). When a single AZ goes down, they won't even give it a yellow marker on the status page; they'll just put a footnote on a green marker. It makes their status dashboard pretty much useless for at-a-glance checking (why even have colors if they don't mean anything?)
Read their report from the major outage earlier this year: they start out by saying "elevated error rates" when many services were in fact down, and it wasn't until hours later that they finally admitted to having an issue that affected more than just one availability zone.
From Forbes:
"We are investigating elevated error rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone." By 11:49 EST, it reported that, "Power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online." But by 12:20 EST the outage continued, "We are continuing to work to bring the instances and volumes back online. In addition, EC2 and EBS APIs are currently experiencing elevated error rates." At 12:54 AM EST, AWS reported that "EC2 and EBS APIs are once again operating normally. We are continuing to recover impacted instances and volumes."
It's like grade inflation. You can never give out an F (Mr. Admissions officer, are you so bad at your job that you would admit such an unqualified student?), so a Gentleman's C is handed around. In Amazon's case, it's a gentleman's B+ (green, with an info icon).
A: fine
A-: problems
B+: servers are on fire
I really like Amazon as a company, use a lot of their services, but this is dishonest.
The last Netflix post mortem mentioned they had a bug in their configuration where they kept sending traffic to already down ELB instances, which was the cause of the last outage for them if I remember correctly.
Not necessarily, it could be some element of the Netflix architecture that due to their size and/or design trade-offs has taken longer / is harder to eliminate than it would be for others.
Other services, like Twilio, have come through several of these major problems with US-EAST generally unscathed while Netflix has had issues repeatedly.
According to a site which doesn't document what its reports are based on. Given that Netflix worked for me during that period, I'm suspicious that downrightnow might be using EBS somewhere.
This kind of thinking is poisonous. I know it's in good fun and it's fun to look for connections in things, but it is actually preposterous to think that Amazon would purposely disrupt wide swaths of highly paying customers for much of a day to bury one story about bad customer relations. My guess is there are a lot of people working very hard to try and solve this problem right now, let's not belittle their efforts because of a conspiracy, let's belittle their efforts because of bad systems design.
He knows you were joking. It's a bad joke. "It's a joke" is not a magic bullet that means you can do no wrong.
Poisonous ideas spread as jokes. That is one of the ways they spread. A person thinking well about the issue wouldn't find the joke funny because it doesn't make sense. The joke relies on some poisonous, bad thinking to be understood. It has bad assumptions, and a bad way of looking at the world, built in.
Let me explain to you why it is a funny joke. It is funny because it involves Amazon undertaking massive technical measures, with huge reputational damage in order to try to kill a story which is primarily not spreading via Amazon-hosted sites anyway.
It's akin to a man with athlete's foot deciding to remedy it by discharging a shotgun into his leg.
It's easy to understand shooting a leg with a shotgun. That's a simple thing.
The Amazon thing in question is far more complicated, and far harder to understand.
Thinking they are "akin" is a mistake. It shows you're thinking about it wrong and failing to recognize how completely different they are.
One isn't going to confuse anyone or be misunderstood, the other will confuse most of the population and be misunderstood by most people.
One, if someone misunderstood, only involves one individual being an idiot. The other involves a large company being evil and thus can help feed conspiracy theories.
I'm not sure if you are aware of the difficulty of Amazon doing this. Suppose Jeff Bezos wants to do it. He can't simply order people to do it because they will refuse and leak it to the media and he'll look really bad and then he'll definitely have to make sure to try super hard for there to be no outages anytime soon.
Shooting yourself in the foot is stupid but easy. Doing this is stupid and essentially impossible. To think it's possible requires thinking that Amazon has a culture of unthinking obedience, or has an evil culture that all new hires are told about and don't leak to the media. Totally different.
Casually talking about impossible, evil conspiracies by big business, as if they are even possible, is a serious slander against those businesses, capitalism, and logic. Slandering a bunch of really good things -- especially ones that approximately 50% of US voters want to harm -- and then saying "it's just a joke, it's funny" is bad.
Casually joking about impossible, evil conspiracies by big business on the other hand is something completely different and also funny.
No one will believe it's related, and it's certainly not slander to joke about it. Also, you might want to leave the political opinions out of Hacker News... there isn't 50% of the US that dislikes those things; they just have different ideas about how to support them.
When I commented the two other comments replying to yours were taking it seriously, so I decided to take it seriously too. Text is a pretty bad medium for hearing tone, sorry I misinterpreted yours; for some reason the "the cynical side of me" bit made me think your comment wasn't entirely in jest.
I have to deal with a number of folks who will be overjoyed to read this news when their tech cartel vendor of choice forwards it this evening.
There's a huge contingent of currently endangered infrastructure folks (and vendors who feed off them) out there who throw a party every time AWS has a visible outage.
AWS sucking at availability (and especially specific parts like EBS, and then services built on top of EBS and on top of AWS) doesn't mean the correct option is to mine your own sand and dig up your own coal to run servers in your own datacenters in the basement.
Even if you're totally sold on the cloud, you can still have a requirement that things be transparent all the way down. AWS is one of the least transparent hosting options around.
If you're a customer of a regular colo, or even a managed hosting provider themselves based at a colo, it's pretty easy to dig into how the infrastructure is set up, identify areas where you need your own redundancy, etc. Essentially impossible within AWS -- there is no reason intelligible to me that ELB in multiple AZs should depend on EBS in a single AZ, but that's how they have it set up.
I'd question whether they are truly wrong to send out news like this. Basically you have to weigh the realities of the 'cloud' against the perception of it.
The perception many people have is that somehow the 'cloud' is a magical uptime device that will save you money in droves.
The reality is that for many companies that have mid-level traffic and aren't a start-up with a billion users, the 'old' style of tech might very well be the best option.
The problem is likely the same as usual: if the damn control plane is down, it doesn't matter how robust your failover architecture is, because your requests to bring up new machines go unanswered.
There's pretty much no way to architect around that one as an AWS user (apart from going fully multi-cloud, but "nobody" actually does that, at least at scale), and I'm kind of shocked that those bits of AWS are still not robust against "single AZ outages", given that they're involved in pretty much every one of these incidents and make them affect people on the entire cloud...
Currently, Netflix uses a service called "Chaos Monkey" to simulate service failure. Basically, Chaos Monkey is a service that kills other services. We run this service because we want engineering teams to be used to a constant level of failure in the cloud. Services should automatically recover without any manual intervention. We don't however, simulate what happens when an entire AZ goes down and therefore we haven't engineered our systems to automatically deal with those sorts of failures. Internally we are having discussions about doing that and people are already starting to call this service "Chaos Gorilla"."
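This isn't Netflix's actual tool, but the idea is simple enough to sketch: pick a random instance from a service group and kill it, then make sure nobody gets paged. A toy version with boto3 (the tag filter is hypothetical):

    import random
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Find running instances in a (hypothetical) service group.
    resp = ec2.describe_instances(Filters=[
        {"Name": "tag:service", "Values": ["my-service"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    instances = [i["InstanceId"]
                 for r in resp["Reservations"]
                 for i in r["Instances"]]

    if instances:
        victim = random.choice(instances)
        print("terminating", victim)
        ec2.terminate_instances(InstanceIds=[victim])
        # The service should recover on its own (auto scaling, health checks,
        # failover) without any manual intervention.

"Chaos Gorilla" is the same idea with a whole availability zone as the victim, which is a much bigger lift.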
I wish there was. Architecting a system for failover requires a mindset change, a culture change within your org, and the right technology that lets you build for it while not slowing yourself down too much. The last point being the hardest.
I would argue that none of the common full stack frameworks that startups use are fault tolerant enough for AWS. Most of them have multiple failure points that can quickly bring down entire apps.
Our app is down because it's hosted on Heroku and it's frustrating because it seems like N Virginia is the least reliable Amazon datacenter. Every year it seems to go down a couple times for at least a couple hours.
Heroku should offer a choice between N Virginia and Oregon hosting (I think they're almost comparable in price nowadays). That way people who want more uptime/reliability can choose Oregon. Sure it will be further from Europe (but then it will be closer to Asia) and people can make that choice on their own.
But basing an entire hosting service on N Virginia doesn't make sense anymore, considering the history of major downtime in that region.
Or better yet, Heroku should offer an add-on "instant failover" service that, for a premium of course, offers a package for multi-site (or, knowing they're 100% AWS, multi-datacenter) deployment with all of the best practices, etc. Seems like a logical next step for them (or a competitor) given the recent spate of outages.
Can't update my list fast enough, but other major services experiencing problems are Netflix and Pinterest. Lots of other (smaller) sites are starting to fail too.
One of the most frustrating issues here is that we have to deal with Amazon's status page for information. It's a complex page, divided by continent instead of region, which means at least 5 or 6 clicks to figure out progress. They should learn from these issues about how people want to be informed - to date, they haven't. Also, they have a twitter account, which would be the perfect fit to keep everyone up-to-date with what's going on; to at least show a human side to these issues. Alas, they're not updating that either.
I've been working with AWS since early 2006 when they first launched - I was lucky to be granted a VIP invite to try out EC2 before everyone else, and ended up launching the first public app on EC2. This might be the first time when frustration has overcome my love for these guys.
"Degraded performance," is a fairly off-handed way of putting what they're experiencing. First RDS connectivity went down the tube, and then EBS followed, finally the EC2 console is failing to operate properly (for me, in US-EAST region at least).
Of course, as soon as I read the report that the issue was confined to one AZ, I looked to move my server over to another AZ. Oh yes, two were full and refused new instances, and then, surprisingly, new requests for the other AZs were never received or acted on - and now the console is failing. It's a bit more than just "slow EBS", if that's what you were thinking.
- edit: said ec2 twice in the second sentence, corrected to say ebs.
Even if they do understate the issue, it's limited to US-EAST-1, and is an EBS issue. Saying that EC2 is "down" because of this is totally off the mark - I've got dozens of EBS volumes in EAST 1 that are unaffected, plus all of the other zones are operating normally...
From what I'm seeing, if your root disk is on EBS and your SSH keys are there, you cannot SSH into those hosts right now.
Also, the availability zones are disparate in terms of what they can support. A great number of my instances are in 1d because of unavailability in others.
Note that the region-1[a b c d & such] designations are randomized per account; my us-east-1d won't (necessarily) be the same as yours or anyone else's.
Does anyone know if this really is just one AZ? Seems like an awful lot of larger sites are down. I'd expect at least some of them to be multi-AZ by now.
This is affecting more than a single Availability Zone, but probably for reasons that have been seen before. One reason might be that an EBS failure in one AZ triggers a spike in EBS activity in other AZs which overwhelms EBS. (I believe this is what happened in April 2011).
Does anybody have any experience with migrating to Oregon or N. California in terms of speed and latency?
My dashboard says "Degraded EBS performance in a single Availability Zone". It then lists each of the five zones as "Availability zone is operating normally." http://cl.ly/image/202F3B0I371g
I was seeing that too, but it looks like Amazon has now updated the availability zone status. When I run ec2-describe-availability-zones from the command line, it's telling me that us-east-1b is impaired. (Availability zone mapping is different for each account, so my 1b may be some other availability zone for you.)
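Same check from code, for anyone who'd rather poll than refresh the console (the boto3 equivalent of ec2-describe-availability-zones):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        # State is "available" or "impaired"; remember the letter suffixes
        # are shuffled per account, so my 1b isn't necessarily your 1b.
        print(az["ZoneName"], az["State"], az.get("Messages", []))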
I'm sorry, am I in a time machine? Didn't this exact thing happen last year, and didn't these exact people explain exactly how they were going to make sure it never happens again?
Ironically, at the first Hacking Health event, our server was running on Heroku... and Heroku went down. Three months later, we have the second Hacking Health in Toronto, and all of AWS is going down.
ELB is also fucked. I've noticed that nobody mentions it. Some of our load balancers are completely unreachable.
Just one of our multi-AZ RDS instances reported the automatic failover. However, the 200+ alerts triggered during that failover, caused by the internal DNS change pointing to the new master, show that things aren't as tidy as the RDS DB events log makes them look.
Some instances reported high disk I/O (EBS backed) via the New Relic agent (the console still has some issues).
I reacted quickly enough to get a new instance spun up and added it to my site's load balancer, but the load balancer is failing to recognize the new instance and send traffic to it... yay. Console is totally unresponsive on the load balancer's "instances" tab. If I point my browser at the ec2 public DNS of the new instance it seems to be running just fine. So much for the load balancer being helpful.
Amazon runs all of the Amazon.com web servers on EC2 hosts (and has since 2010: https://www.youtube.com/watch?feature=player_detailpage&...). They've just made sure that they have enough hosts in each of the other AZs to withstand an outage.
That would be the ultimate testament, maybe the only stunt they could pull to PR themselves out of this mess.. if www.amazon.com ran on the same set of EC2 servers and systems as us muggles
2:20 PM PDT We've now restored performance for about half of the volumes that experienced issues. Instances that were attached to these recovered volumes are recovering. We're continuing to work on restoring availability and performance for the volumes that are still degraded.
We also want to add some detail around what customers using ELB may have experienced. Customers with ELBs running in only the affected Availability Zone may be experiencing elevated error rates and customers may not be able to create new ELBs in the affected Availability Zone. For customers with multi-AZ ELBs, traffic was shifted away from the affected Availability Zone early in this event and they should not be seeing impact at this time.
This is infuriating. Our instances aren't ELB-backed, so we have an entire AZ sitting idle while the rest of our instances are overwhelmed with traffic. Why did they make this decision for us?
I'm normally an Amazon apologist when outages happen, but this is absolutely ridiculous.
You have a virtualized platform, on top of which are many pieces like load-balancing, EBS, RDS, the control-plane itself, etc.
You have burstable network connections which by their nature, will have hard limits (you can't burst above 10Gbps on a 10Gbps pipe, for example; even assuming the host machine is connected to a 10Gbps port).
Burstable (meaning quite frankly, over-provisioned) disk and CPU resources.
And if any piece fails, you may well have downtime...
It is always surprising to me that people feel layering complexity upon complexity will result in greater reliability.
Hosted on Heroku with a hosted Heroku Postgres dev instance, I observed a drop in the DB response time:
- from a consistent average of 10ms over the last week
- to a new consistent average of ~2.5ms
between 17:31 and 17:35 UTC. AWS started to report the current issue at 17:38 UTC. My app then experienced some intermittent issues (reported by New Relic, which pings it every 30s). I don't know if it's related - could it be some sort of hardware upgrade that went wrong?
I did a push affecting my most-used queries, but that was an hour and a half earlier, at 15:56 UTC, so it's probably unrelated.
I thought at least Reddit had learned the "don't use EBS" rule from past outages - I was bitten by the April/2011 outage and switched everything over to instance storage shortly thereafter.
For most applications I think architecting EBS out should be straightforward - instance storage doesn't put a huge single-point-of-failure in the middle of your app if you're doing replication, failover, and backups properly.
And EBS seems to be the biggest component of the recent AWS failures upon which they've built a lot of their other systems.
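The trade-off with instance storage is that it vanishes with the instance, so the "backups properly" part has to be boring and reliable. A minimal sketch of the idea: ship a dump off-box to S3 on a schedule (the bucket, paths and dump command are placeholders):

    import datetime
    import subprocess
    import boto3

    BUCKET = "myapp-backups"   # placeholder bucket
    dumpfile = "/mnt/backups/db-%s.sql.gz" % datetime.date.today()

    # Dump the database that lives on instance storage (command is illustrative).
    subprocess.check_call(
        "mysqldump --single-transaction mydb | gzip > %s" % dumpfile,
        shell=True,
    )

    # S3 survives the instance (and usually the AZ) going away.
    boto3.client("s3").upload_file(dumpfile, BUCKET, "db/" + dumpfile.split("/")[-1])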
I am frankly surprised so many still use EC2 considering how frequently it breaks. It's not cheap, so the only reasons to use it would be reliability or scale, right? Why not just get lots of boxes at Hetzner and OVH (40 EUR a month for 32GB of RAM and 4x2x3.4GHz cores) and scale up / build in redundancy that way?
They've been down for several hours actually. I tried to get to their site this morning to check out the jedburg talk, and got the "We're on the case" message.
Building infrastructure on top of Amazon with global replication across multiple availability zones can sidestep such failures and guarantee uninterrupted operations to customers - www.pubnub.com is unaffected by AWS being down.
Not wanting to shill for Google, but we've been very happy with App Engine's reliability - particularly since moving to the HR datastore. Have other App Engine users had any significant downtime apart from the M/S (master/slave) datastore issues?
I read in my book that Amazon is the best cloud service provider, but seeing this, it's hard to believe.
I am still confused about which cloud service I should use. Amazon is the only one that Indians currently rely on, since latency is a bit lower for Amazon servers compared to other providers.
If you go through the pains of architecting your system to span multiple AZs, or you avoid using EBS, then you probably dodge most of the EC2 outages. (Remains to be seen if that is the case here.)
That said, I don't think most people think using the cloud means that downtime is a thing of the past. I think the more attractive proposition is when hardware breaks, or meteors hit the datacenter, etc, it is their problem, not yours. You still have to deal with software-level operations, but hardware-level operations is effectively outsourced. The question is if you think you can do a better job than Amazon -- some companies think they can, most startups know they can't.
Yeah. Even with this, they still do better than I would. My record: misconfigured air-conditioning unit alarm leading to servers being baked at high temperature over a weekend, leading to much wailing and gnashing of teeth. I now know to be really careful to set up air conditioning units properly, but what other lessons am I still waiting to learn? The main lesson that I took from this is that I should stick to what I am good at: cutting code & chewing data. :-)
I always understood the cloud to mean a black box of sorts that automatically handles failover, among other things. The cloud being just a fuzzy representation of the infrastructure.
S3 probably fits the description of a cloud service. You send your data, and the service worries about making it redundant without your intervention. If data in NE USA is unavailable, the service will automatically serve you the data from somewhere else. You don't need to know how it works.
EC2 and some of these other building blocks, however, I would not consider to be cloud services. Merely tools for building out your own cloud services to other customers who then shouldn't have to think about failover and other such concerns.
If you know you are using a server that is physically located in a certain geographic location, it need not be represented by a cloud. It is a distinct point on the network.
pg, or someone else from HN... Could you please edit this title for accuracy? Maybe, "Poorly designed sites taken out because of problems in one Amazon availability zone."
For many sites, a single server in a single zone (e.g., a non redundant server, an instance, a slice, a VM, whatever) is the right decision for ROI.
For many sites, the money spent on redundancy could be better spent on, say, Google Adwords, until they're big enough that a couple hours downtime has irreplaceable costs higher than the added costs of redundancy (dev, hosting, admin) for a year.
Yes, it is editorializing. My point, which I guess was too subtle, is that the current link text is very much an editorial comment, especially since the content at that link has nothing to do with the sites mentioned in the link text.
This issue is affecting both an EC2 Zone and Amazon's RDS servers, which are technically multi-zone. There are a ton of well-architected apps and sites that have been affected. Unconstructive...