Yeah, I don't think I'd go with less than RAID-6 (or full system redundancy plus one drive of redundancy in each). Rebuilds just take too long, even with an in-chassis spare on RAID-5.
Unfortunately, Areca is really the only controller I've found that is well supported and does RAID-6 fast.
Would those be managed at that price? Because it's a hell of a lot more expensive when you factor in the cost of devops to make sure it stays working and fails over properly.
I know what you mean. I have a lot of issues with AWS, but the AWS console is exactly what my manager needs so he can do things himself. Simple things, such as AWS load balancing, fail when we get any decent amount of traffic.
Luckily my RDS wasn't affected, but ELB merrily sent traffic to the affected zone for 30 minutes. (Either that or part of the ELB was in the affected zone and was not removed from rotation.)
We pay a lot to stay multi-AZ, and it seems Amazon keeps finding ways to show us their single points of failure.
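If you'd rather not wait for the ELB to catch up next time, you can pull the bad zone out of rotation yourself. A minimal boto3 sketch, assuming a classic ELB; the name "my-elb" and the zone are placeholders:

    # Rough sketch: manually drop an impaired AZ from a classic ELB's rotation.
    # "my-elb" and the zone name are placeholders for your own setup.
    import boto3

    elb = boto3.client('elb', region_name='us-east-1')

    # See which zones the load balancer is currently routing to.
    desc = elb.describe_load_balancers(LoadBalancerNames=['my-elb'])
    print(desc['LoadBalancerDescriptions'][0]['AvailabilityZones'])

    # Remove the impaired zone until the outage clears.
    elb.disable_availability_zones_for_load_balancer(
        LoadBalancerName='my-elb',
        AvailabilityZones=['us-east-1a'],
    )

Not a fix for the underlying problem, obviously, but it beats watching traffic get blackholed for half an hour.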
Similar thing happened to me a while ago with a vendor. When your management team summons you to ask why the hell their site is down, you can't just point fingers at the vendor, even if their marketing literature says it never goes down.
Unless you host your data in several alternative dimensions, so that the same events wouldn't transpire in all of them, why not assume you'll encounter the occasional outage?
Did/does your standby replica in another AZ have any instance notifications stating there was a failure? The outage report claims there were EBS problems in only one AZ.
No, nothing unusual with our standby replica. It's not even clear if it was the standby or our primary that was in the affected AZ.
Multi-AZ RDS does synchronous replication to the standby instance -- I'm guessing something broke in there. Hopefully AWS will follow up with a post-mortem as they usually do. There are lots of frustrated Multi-AZ RDS customers on their forums.
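For anyone else trying to work out which side was affected, the RDS event stream is worth pulling; failovers and replication problems usually show up there. A rough boto3 sketch (the instance identifier "mydb" is a placeholder):

    # Rough sketch: list recent RDS events for one instance to see whether any
    # failover/replication events were recorded. "mydb" is a placeholder.
    import boto3

    rds = boto3.client('rds', region_name='us-east-1')

    events = rds.describe_events(
        SourceIdentifier='mydb',
        SourceType='db-instance',
        Duration=1440,  # look back 24 hours (value is in minutes)
    )
    for e in events['Events']:
        print(e['Date'], e['EventCategories'], e['Message'])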
Yeah, unfortunately it looks to be an EBS problem, and if the underlying EBS volume housing your primary DB instance takes a dump, that's going to cause replication to fall over too.
A Multi-AZ RDS deployment is supposed to protect you from exactly that, though. That's why it's 2x the price. We should have failed over to a different AZ without EBS issues.
If your source EBS volume is horked, then you aren't going to be replicating any data to your backup host while that volume is messed up (since your source data is unavailable). EBS volumes also don't cross AZ boundaries or fail over between them.
Maybe there was something bad with your replication server before the outage? It's hard to guess without knowing exactly what was happening at the time...
The whole point is to protect you from problems in one AZ by keeping a hot standby in another AZ. It doesn't matter whether the cause is EBS, power, or anything else. This is one of the primary reasons to use RDS instead of running MySQL yourself on an instance.
Yes... what also sounds plausible is that, since this was an EBS outage, the underlying EBS volume wasn't detected as unavailable (if it in fact did become unavailable), so no failover to your other RDS instance was initiated.
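If detection was the problem, you can also kick the failover off yourself rather than wait for RDS to notice. A minimal boto3 sketch, assuming a Multi-AZ instance ("mydb" is a placeholder identifier):

    # Rough sketch: force a Multi-AZ failover by rebooting with ForceFailover.
    # Only valid for Multi-AZ deployments; "mydb" is a placeholder identifier.
    import boto3

    rds = boto3.client('rds', region_name='us-east-1')
    rds.reboot_db_instance(
        DBInstanceIdentifier='mydb',
        ForceFailover=True,  # promote the standby in the other AZ
    )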
Every time (two out of two), by the time I click on an "X is down" link, the service/website is working again. Surely there is a better platform for alerting about outages than ycombinator?
I was down for approximately three hours this morning. I don't know when this submission was posted, but I made one shortly after discovering the outage myself.
Either way, if you're using RDS, this is discussion-worthy even if it didn't affect you. I was affected, and we're building a not-yet-launched product, which gives us time to consider "Is Amazon really where we want to be?" The more failures I'm aware of, the more informed that decision is.
Pingdom does a good job of it, if you point it at a public-facing web site you particularly care about. I'm not affiliated with them; I've just been woken up by them.
I got notified by Pingdom that my domain was down before AWS had any info on that status page of theirs. IMHO, they should improve the latency of their alerts.
Same here. In fact, the AWS dashboard was still showing 2/2 checks passed for some 20 minutes after Pingdom told me my site was down.
Then the AWS dashboard finally updated and told me that my instances had become unreachable 3 minutes earlier. That's pretty poor. AWS should be able to know right away and email me themselves.
SNS sent me an e-mail of my instance alarms pretty quickly.
EDIT: My status checks were slow to update, like the sibling comment stated, although the alarms that measure system resources triggered almost immediately when everything blew up. I think the status checks refresh on a fixed interval, but those aren't really meant for real-time monitoring AFAIK.
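If anyone wants the same setup, here's roughly what an alarm on the system status check wired to SNS looks like with boto3. The instance ID and topic ARN below are placeholders, and this is just a sketch, not a drop-in config:

    # Rough sketch: alarm on the EC2 system status check and notify via SNS.
    # The instance ID and SNS topic ARN are placeholders.
    import boto3

    cw = boto3.client('cloudwatch', region_name='us-east-1')
    cw.put_metric_alarm(
        AlarmName='system-status-check-i-12345678',
        Namespace='AWS/EC2',
        MetricName='StatusCheckFailed_System',
        Dimensions=[{'Name': 'InstanceId', 'Value': 'i-12345678'}],
        Statistic='Maximum',
        Period=60,               # status check metrics are per-minute
        EvaluationPeriods=1,
        Threshold=1.0,
        ComparisonOperator='GreaterThanOrEqualToThreshold',
        AlarmActions=['arn:aws:sns:us-east-1:123456789012:ops-alerts'],
    )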
EC2 comes with a free Chaos Monkey service. It's called EC2.
I know, they're trying to make it reliable and they've got a bunch of very hard problems to solve. That doesn't change the fact that sometimes some of my servers just permanently stop responding to pings until I stop-start them, or get crazy-slow I/O, or get hit by these once-in-a-while-and-always-at-night outages.
It's great when you suddenly need a hundred more servers, though.
I feel like you can't really say you're in the green when you still have customers unable to use your service. My instance is still stuck in failover.
"9:39 AM PDT Networking connectivity has been restored to most of the affected RDS Database Instances in the single Availability Zone in the US-EAST-1 region. New instance launches are completing normally. We are continuing to work on restoring connectivity to the remaining affected RDS Database Instances."
"9:32 AM PDT Connectivity has been restored to the affected subset of EC2 instances and EBS volumes in the single Availability Zone in the US-EAST-1 region. New instance launches are completing normally. Some of the affected EBS volumes are still re-mirroring causing increased IO latency for those volumes."
I'm still seeing issues: some instances aren't starting, and others I still can't connect to. So I'm not sure what they're talking about.
It probably is the most used, being a cheaper alternative to us-west, but are you suggesting it fails more because it is used more? It does seem that the big AWS outages (in the US) have been concentrated in us-east. I have wondered if it's just because us-east is newer, so they haven't had as much time to work things out, or whether the us-west team is a little better?
edit: btw, I am not dismissing "used more" as a valid theory. More use = more hardware = more complexity, which could lead to more failures.
Same in that they are both AWS and sometimes generate errors - yes. Not the same in that East has had four significant outages in the last 16 months and West has not.
Issue #3298392 for EC2 this month. This is ridiculous; so many websites rely on EC2, and it's proving to be extremely unreliable. Cloud computing is definitely not the answer to everything, it would seem.
    Cpu0 : 0.3%us, 0.0%sy, 0.0%ni, 0.0%id, 99.7%wa, 0.0%hi, 0.0%si, 0.0%st

That 99.7%wa is the EBS subsystem being completely unreachable. I/O wait times are tanked across the board for me (I'm in US-EAST-1).
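If you want to confirm it's the volumes rather than the instance, the EBS volume status API will usually say so. A quick boto3 sketch (the volume ID is a placeholder):

    # Rough sketch: check EBS volume status to see which volumes are impaired.
    # The volume ID is a placeholder.
    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')
    statuses = ec2.describe_volume_status(VolumeIds=['vol-12345678'])
    for s in statuses['VolumeStatuses']:
        print(s['VolumeId'], s['AvailabilityZone'], s['VolumeStatus']['Status'])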
Ahh. Now that you mention it, I think I recall reading that before. It struck me as weird though because 1a was obviously the first one and 1e was recently added. So, would they rebalance my labels in that case?
Yeah, for me, 1d sees the lowest load of all zones. According to the spot instance pricing history, 1d has fewer price spikes than 1a and 1b. I'd be interested to see if other users have noticed the same thing for their zones.
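If anyone wants to eyeball this for their own zones, the spot price history API makes the comparison easy. A rough boto3 sketch; the instance type, zones, and time window are just example values:

    # Rough sketch: compare spot price ranges across zones for one instance type.
    # Instance type, zone list, and time window are example values only.
    import boto3
    from datetime import datetime, timedelta

    ec2 = boto3.client('ec2', region_name='us-east-1')
    for zone in ['us-east-1a', 'us-east-1b', 'us-east-1d']:
        history = ec2.describe_spot_price_history(
            InstanceTypes=['m1.large'],
            ProductDescriptions=['Linux/UNIX'],
            AvailabilityZone=zone,
            StartTime=datetime.utcnow() - timedelta(days=7),
            EndTime=datetime.utcnow(),
        )
        prices = [float(p['SpotPrice']) for p in history['SpotPriceHistory']]
        if prices:
            print(zone, 'min:', min(prices), 'max:', max(prices))

Keep in mind the zone labels are randomized per account, so my 1d isn't necessarily your 1d.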
I find that random(5) is the best performing. For okay but consistent performance, random(5) is decent, but you should definitely avoid random(5) due to high load.
Well, they also return errors if a zone is out of capacity. That seems like it would guard against the issue a bit. But maybe they just don't want to have to field loads of questions about that.
I have instances in two different zones - both are down, although I don't know if AWS's randomization means that my 1a and 1d are actually located in the same logical zone.
Actually, if you do the normalization to make it apples to apples (and adjust for the difference in RAM), it looks price-competitive. My numbers make it look slightly more expensive than AWS EAST (teh suck) and slightly less expensive than AWS WEST.
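For anyone curious, the normalization itself is trivial; here's a toy sketch in Python with made-up numbers just to show the arithmetic (swap in the real hourly prices and RAM sizes you're comparing):

    # Toy sketch of the normalization: hourly price per GB of RAM.
    # All numbers below are made-up placeholders, not real pricing.
    offerings = {
        'other-vendor': {'price_per_hour': 0.29, 'ram_gb': 8.0},
        'aws-east':     {'price_per_hour': 0.26, 'ram_gb': 7.5},
        'aws-west':     {'price_per_hour': 0.30, 'ram_gb': 7.5},
    }
    for name, o in offerings.items():
        print(name, round(o['price_per_hour'] / o['ram_gb'], 4), '$/GB-hour')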