Yeah, I don't think I'd go with less than RAID-6 (or full system redundancy plus one drive of redundancy in each). Rebuilds just take too long, even with an in-chassis spare on RAID-5.
Unfortunately, Areca is really the only controller I've found that is well supported and does RAID-6 fast.
Would those be managed at that price? Because it's a hell of a lot more expensive when you factor in the cost of devops to make sure it stays working and fails over properly.
I know what you mean. I have a lot of issues with AWS, but the AWS console is exactly what my manager needs so he can do things himself. Simple things, such as AWS load balancing, fail when we get any decent amount of traffic.
Luckily my RDS wasn't affected, but ELB merrily sent traffic to the affected zone for 30 minutes. (Either that or part of the ELB was in the affected zone and was not removed from rotation.)
We pay a lot to stay multi-AZ, and it seems Amazon keeps finding ways to show us their single points of failure.
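If you'd rather not wait for the ELB to catch up next time, you can pull the bad zone out of rotation yourself. A minimal boto3 sketch, assuming a classic ELB; the name "my-elb" and the zone are placeholders:

    # Rough sketch: manually drop an impaired AZ from a classic ELB's rotation.
    # "my-elb" and the zone name are placeholders for your own setup.
    import boto3

    elb = boto3.client('elb', region_name='us-east-1')

    # See which zones the load balancer is currently routing to.
    desc = elb.describe_load_balancers(LoadBalancerNames=['my-elb'])
    print(desc['LoadBalancerDescriptions'][0]['AvailabilityZones'])

    # Remove the impaired zone until the outage clears.
    elb.disable_availability_zones_for_load_balancer(
        LoadBalancerName='my-elb',
        AvailabilityZones=['us-east-1a'],
    )

Not a fix for the underlying problem, obviously, but it beats watching traffic get blackholed for half an hour.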
Similar thing happened to me a while ago with a vendor. When your management team summons you to ask why the hell their site is down, you can't just point fingers at the vendor, even if their marketing literature says it never goes down.
Unless you host your data in several alternative dimensions, so that the same events wouldn't transpire in all of them, why not assume you'll encounter the occasional outage?
Did/does your standby replica in another AZ have any instance notifications stating there was a failure? The outage report claims there were EBS problems in only one AZ.
No, nothing unusual with our standby replica. It's not even clear if it was the standby or our primary that was in the affected AZ.
Multi-AZ RDS does synchronous replication to the standby instance -- I'm guessing something broke in there. Hopefully AWS will follow up with a post-mortem as they usually do. There are lots of frustrated Multi-AZ RDS customers on their forums.
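For anyone else trying to work out which side was affected, the RDS event stream is worth pulling; failovers and replication problems usually show up there. A rough boto3 sketch (the instance identifier "mydb" is a placeholder):

    # Rough sketch: list recent RDS events for one instance to see whether any
    # failover/replication events were recorded. "mydb" is a placeholder.
    import boto3

    rds = boto3.client('rds', region_name='us-east-1')

    events = rds.describe_events(
        SourceIdentifier='mydb',
        SourceType='db-instance',
        Duration=1440,  # look back 24 hours (value is in minutes)
    )
    for e in events['Events']:
        print(e['Date'], e['EventCategories'], e['Message'])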
Yeah, unfortunately it looks to be an EBS problem, and if the underlying EBS volume housing your primary DB instance takes a dump, that's going to cause replication to fall over too.
A Multi-AZ RDS deployment is supposed to protect you from exactly that, though. That's why it's 2x the price. We should have failed over to a different AZ without EBS issues.
If your source EBS volume is horked, then you aren't going to be replicating any data to your backup host while that volume is messed up (since your source data is unavailable). EBS volumes also don't cross AZ boundaries or fail over between them.
Maybe there was something bad with your replication server before the outage? It's hard to guess without knowing exactly what was happening at the time...
The whole point is to protect you from problems in one AZ by keeping a hot standby in another AZ. It doesn't matter whether the cause is EBS, power, or anything else. This is one of the primary reasons to use RDS instead of running MySQL yourself on an instance.
Yes... what also sounds plausible is that, since this was an EBS outage, the underlying EBS volume wasn't detected as unavailable (if it in fact did become unavailable), so no failover to your other RDS instance was initiated.
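If detection was the problem, you can also kick the failover off yourself rather than wait for RDS to notice. A minimal boto3 sketch, assuming a Multi-AZ instance ("mydb" is a placeholder identifier):

    # Rough sketch: force a Multi-AZ failover by rebooting with ForceFailover.
    # Only valid for Multi-AZ deployments; "mydb" is a placeholder identifier.
    import boto3

    rds = boto3.client('rds', region_name='us-east-1')
    rds.reboot_db_instance(
        DBInstanceIdentifier='mydb',
        ForceFailover=True,  # promote the standby in the other AZ
    )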
Every time (two out of two), by the time I click on an "X is down" link, the service/website is working again. Surely there is a better platform for alerting about outages than ycombinator?
I was down for approximately three hours this morning. I don't know when this submission was posted, but I made one shortly after discovering the outage myself.
Either way, if you're using RDS, this is discussion-worthy even if it didn't affect you. I was affected, and we're building a not-yet-launched product, which gives us time to consider "Is Amazon really where we want to be?" The more failures I'm aware of, the more informed that decision is.
Pingdom does a good job of it, if you point it at a public-facing web site you particularly care about. I'm not affiliated with them; I've just been woken up by them.
I got notified by Pingdom that my domain was down before AWS had any info on that status page of theirs. IMHO, they should improve the latency of their alerts.
Same here. In fact, the AWS dashboard was still showing 2/2 checks passed for some 20 minutes after Pingdom told me my site was down.
Then the AWS dashboard finally updated and told me that my instances had become unreachable 3 minutes earlier. That's pretty poor. AWS should be able to know right away and email me themselves.
SNS sent me an e-mail of my instance alarms pretty quickly.
EDIT: My status checks were slow to update, like the sibling comment stated, although the alarms that measure system resources triggered almost immediately when everything blew up. I think the status checks refresh on a fixed interval, but those aren't really meant for real-time monitoring AFAIK.
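If anyone wants the same setup, here's roughly what an alarm on the system status check wired to SNS looks like with boto3. The instance ID and topic ARN below are placeholders, and this is just a sketch, not a drop-in config:

    # Rough sketch: alarm on the EC2 system status check and notify via SNS.
    # The instance ID and SNS topic ARN are placeholders.
    import boto3

    cw = boto3.client('cloudwatch', region_name='us-east-1')
    cw.put_metric_alarm(
        AlarmName='system-status-check-i-12345678',
        Namespace='AWS/EC2',
        MetricName='StatusCheckFailed_System',
        Dimensions=[{'Name': 'InstanceId', 'Value': 'i-12345678'}],
        Statistic='Maximum',
        Period=60,               # status check metrics are per-minute
        EvaluationPeriods=1,
        Threshold=1.0,
        ComparisonOperator='GreaterThanOrEqualToThreshold',
        AlarmActions=['arn:aws:sns:us-east-1:123456789012:ops-alerts'],
    )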
EC2 comes with a free Chaos Monkey service. It's called EC2.
I know, they're trying to make it reliable and they've got a bunch of very hard problems to solve. That doesn't change the fact that sometimes some of my servers just permanently stop responding to pings until I stop-start them, or get crazy-slow I/O, or get hit by these once-in-a-while-and-always-at-night outages.
It's great when you suddenly need a hundred more servers, though.
I feel like you can't really say you're in the green when you still have customers unable to use your service. My instance is still stuck in failover.
"9:39 AM PDT Networking connectivity has been restored to most of the affected RDS Database Instances in the single Availability Zone in the US-EAST-1 region. New instance launches are completing normally. We are continuing to work on restoring connectivity to the remaining affected RDS Database Instances."
"9:32 AM PDT Connectivity has been restored to the affected subset of EC2 instances and EBS volumes in the single Availability Zone in the US-EAST-1 region. New instance launches are completing normally. Some of the affected EBS volumes are still re-mirroring causing increased IO latency for those volumes."
I'm still seeing issues: some instances aren't starting, and others I still can't connect to. So I'm not sure what they're talking about.
It probably is the most used, being a cheaper alternative to us-west, but are you suggesting it fails more because it is used more? It does seem that the big AWS outages (in the US) have been concentrated in us-east. I have wondered if it's just because us-east is newer, so they haven't had as much time to work things out, or whether the us-west team is a little better?
edit: btw, I am not dismissing "used more" as a valid theory. More use = more hardware = more complexity, which could lead to more failures.
Same in that they are both AWS and sometimes generate errors - yes. Not the same in that East has had four significant outages in the last 16 months and West has not.
Issue #3298392 for EC2 this month. This is ridiculous; so many websites rely on EC2, and it's proving to be extremely unreliable. Cloud computing is definitely not the answer to everything, it would seem.
    Cpu0 : 0.3%us, 0.0%sy, 0.0%ni, 0.0%id, 99.7%wa, 0.0%hi, 0.0%si, 0.0%st

That 99.7%wa is the EBS subsystem being completely unreachable. I/O wait times are tanked across the board for me (I'm in US-EAST-1).
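If you want to confirm it's the volumes rather than the instance, the EBS volume status API will usually say so. A quick boto3 sketch (the volume ID is a placeholder):

    # Rough sketch: check EBS volume status to see which volumes are impaired.
    # The volume ID is a placeholder.
    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')
    statuses = ec2.describe_volume_status(VolumeIds=['vol-12345678'])
    for s in statuses['VolumeStatuses']:
        print(s['VolumeId'], s['AvailabilityZone'], s['VolumeStatus']['Status'])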
Ahh. Now that you mention it, I think I recall reading that before. It struck me as weird though because 1a was obviously the first one and 1e was recently added. So, would they rebalance my labels in that case?
Yeah, for me, 1d sees the lowest load of all zones. According to the spot instance pricing history, 1d has fewer price spikes than 1a and 1b. I'd be interested to see if other users have noticed the same thing for their zones.
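If anyone wants to eyeball this for their own zones, the spot price history API makes the comparison easy. A rough boto3 sketch; the instance type, zones, and time window are just example values:

    # Rough sketch: compare spot price ranges across zones for one instance type.
    # Instance type, zone list, and time window are example values only.
    import boto3
    from datetime import datetime, timedelta

    ec2 = boto3.client('ec2', region_name='us-east-1')
    for zone in ['us-east-1a', 'us-east-1b', 'us-east-1d']:
        history = ec2.describe_spot_price_history(
            InstanceTypes=['m1.large'],
            ProductDescriptions=['Linux/UNIX'],
            AvailabilityZone=zone,
            StartTime=datetime.utcnow() - timedelta(days=7),
            EndTime=datetime.utcnow(),
        )
        prices = [float(p['SpotPrice']) for p in history['SpotPriceHistory']]
        if prices:
            print(zone, 'min:', min(prices), 'max:', max(prices))

Keep in mind the zone labels are randomized per account, so my 1d isn't necessarily your 1d.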
I find that random(5) is the best performing. For okay but consistent performance, random(5) is decent, but you should definitely avoid random(5) due to high load.
Well, they also return errors if a zone is out of capacity. That seems like it would guard against the issue a bit. But maybe they just don't want to have to field loads of questions about that.
I have instances in two different zones - both are down, although I don't know if AWS's randomization means that my 1a and 1d are actually located in the same logical zone.
Actually, if you do the normalization to make it apples to apples (and adjust for the difference in RAM), it looks price-competitive. My numbers make it look slightly more expensive than AWS EAST (teh suck) and slightly less expensive than AWS WEST.
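For anyone curious, the normalization itself is trivial; here's a toy sketch in Python with made-up numbers just to show the arithmetic (swap in the real hourly prices and RAM sizes you're comparing):

    # Toy sketch of the normalization: hourly price per GB of RAM.
    # All numbers below are made-up placeholders, not real pricing.
    offerings = {
        'other-vendor': {'price_per_hour': 0.29, 'ram_gb': 8.0},
        'aws-east':     {'price_per_hour': 0.26, 'ram_gb': 7.5},
        'aws-west':     {'price_per_hour': 0.30, 'ram_gb': 7.5},
    }
    for name, o in offerings.items():
        print(name, round(o['price_per_hour'] / o['ram_gb'], 4), '$/GB-hour')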