Funny thing is, in the last couple of interviews I've had in Chicago and Silicon Valley, I've actually scored points by explaining that caution is necessary when using AWS for production.
I can't help but feel that Amazon are a bit of a Cassandra (the mythological figure, not the software) when these outages occur.
They recommend that people fail over to other availability zones, but no one puts any effort into doing it, and then they get annoyed when a datacenter goes offline.
It's not Amazon's fault that you didn't make your service failure tolerant - it's your fault!
I'm seeing a lot of this type of comment. The thing is, AWS completely crapped out. Don't believe their status updates that make it sound like it was a tiny little area of their data center. It was pretty much the entire zone, and whenever there is an outage affecting an entire zone it brings down global services and even other zones as well.
We had servers in the bad zone and started having load issues. When I went to use the cool cloud features that are made for this, the entire thing completely fell on its face. I couldn't launch new EC2 servers, either because the API was so bogged down or because the new zone I was launching in was restricted due to load.
Basically, the thing that nobody keeps in mind when they think it's so cool that you can spin up servers to work around outages is that EVERYONE IS DOING THAT. This is Amazon's entire selling point and when it comes to doing it, it doesn't work!
We were lucky to get some new servers launched before the API pretty much completely went down. They started giving everyone errors saying request limit exceeded. The forums were full of people asking about it.
ELB, Elastic IP, and other services not associated with a single availability zone completely failed. I keep seeing comments saying that if people designed their stuff right, they wouldn't have an issue. That's just complete bull: AWS has serious design flaws, and they show up in every outage. It's NOT just people relying on a single zone.
Totally agree. A lot of people don't know this, or substitute alternatives which are not necessarily viable. Among the tenets of reliability is isolation. The nature of Amazon's services is that it isolates at the datacenter level. One should isolate at the level at which they are comfortable taking on failures. Once there is an active dependency, a la EBS, the number of subsystems increases many-fold and the likelihood of failure and cascading failure dramatically increases.
Where getting a bit from disk to memory used to be: platter -> diskcontroller -> cpu -> memory,
now with SANs & NFS & virtualized block storage, it's:
platter -> diskcontroller -> cpu -> memory -> nic -> wire -> switch(es)/router(s)/network configs(human config item) -> wire -> nic -> cpu -> memory.
Not to say that centralized storage doesn't have its benefits, but the scope of isolation has drastically increased. Considering the combinatorial possibilities of failure in the former scenario versus the latter, the latter has many more failure modes, a significantly larger chance of failure, and failover that is much more difficult to automate programmatically.
TL;DR: With Amazon, the scope of isolation is the datacenter. To be on Amazon, one must architect and design to handle failure at the datacenter level, rather than at the host or cluster level.
We didn't have downtime, for various reasons, but the ELBs we were using failed, and the queue of starting instances was so backed up that it took ages for our few replacements to come up.
The main systemic issue in EC2 is EBS; take that away and it would remove downtime almost completely.
The problem with ELBs is that they are themselves EC2 instances and use many of the global services for detecting load, scaling up, etc.. Like all of AWS's value-added services, they are therefore more likely to fail during an outage event, not less likely, as they depend on more services.
Right. Blame it on the victim. How do you make a "fault tolerant" service when core services like ELB, together with the API behind it, start to fail? Multi-region? Multi-cloud? When is the "designed to make web-scale computing easier" part supposed to kick in? With half-baked products like ELB, or things like EIP that cease to work when you need them the most?
I actually asked the AWS Premium support regarding the ELB multi-AZ issues, in order to actually make things easier for everyone. This is the answer I got:
"As it stands right now, you would need to make a call to ELB to disable the failed AZ. It may be possible for you to programatically/script this process in the case of an event.
Going forward, this is something that we would like to address but I don't have any ETA for when something like this might be implemented."
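For what it's worth, the scripting they suggest is only a couple of API calls. A rough sketch with boto3 (the load balancer name and the impaired zone are placeholders; back then you'd have used boto 2 or the ELB command-line tools, but it's the same idea):

    import boto3

    elb = boto3.client("elb", region_name="us-east-1")

    LB_NAME = "my-load-balancer"   # placeholder
    BAD_AZ = "us-east-1b"          # the impaired zone for *your* account

    # Stop the ELB from routing traffic to the failed zone...
    elb.disable_availability_zones_for_load_balancer(
        LoadBalancerName=LB_NAME,
        AvailabilityZones=[BAD_AZ],
    )

    # ...and re-enable it once the zone recovers:
    # elb.enable_availability_zones_for_load_balancer(
    #     LoadBalancerName=LB_NAME, AvailabilityZones=[BAD_AZ])

The catch, as others in this thread point out, is that this only helps if the ELB API itself is still answering.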
IMO, there's plenty of blame to go around, however the onus really should be on the individuals who are making the decision to go on Amazon and trust that their services will always be up. Unfortunately, some people don't know better, so they will just blindly choose Amazon for the name recognition.
For the places that truly care about reliability and have the technical staff to make informed decisions, they should understand the limits of reliability with various architectures. As I mentioned before, one of the tenets of reliability is isolation. When the scope of isolation is increased (e.g. single host vs multi host), one must also handle failures at that scope. Amazon isolates at the datacenter level. So should those utilizing Amazon's offerings.
My question is this... why doesn't a service like Heroku, which acts as a PaaS, have this built in? I'm on their blog now trying to understand their complete setup.
Perhaps they should - though Rightscale.com provides a service that helps you do just that. (Disclaimer: I do not work for Rightscale or know anyone that works there)
It reminds me of my father: I used to interrupt him with "I know dad!" when he was chastising me. His response was simple "If you knew, then why did[n't] you do it?"
It's bizarre the way "cloud" makes so many people think disks never fail, networks are perfect and data centers always run smoothly. Now we'll get the backlash blog posts from people ditching the cloud - and I'm just waiting for the inevitable rebound outages when they learn that high availability requires geographic redundancy either way.
That's just silly given how much of Amazon's documentation strongly encourages you to use multiple AZs and regions for reliability. I've included a sampling of their whitepapers below; this is also what their salespeople tell you and what you have to click past any time you provision an RDS instance, ELB, etc.
“Be a pessimist when designing architectures in the cloud”
http://media.amazonwebservices.com/AWS_Web_Hosting_Best_Prac...
“As the AWS web hosting architecture diagram in this paper shows, we recommend that you deploy EC2 hosts across multiple Availability Zones to make your web application more fault-tolerant.”
“We have deployed critical components of our applications across multiple availability zones, are appropriately replicating data between zones, and have tested how failure within these components affects application availability”
I think the main problem is that "the cloud"'s primary audience is developers, not sysadmins. Many developers simply don't appreciate that what you're getting is a [much] easier path to automating your server provisioning and management, but you're still in exactly the same position as before regarding any bit of infrastructure's ability to fail at the least convenient moment.
As a start, be present in multiple EC2 availability zones (not just a single zone in us-east-1, basically) and regions (this is harder). Cross-region presence needn't be active-active: just a few read-only database slaves and some machines to handle SSL termination ("points of presence") for your customers on the east coast. Perform regular "fire drills" where you actually fail over live traffic and primary databases from one AZ/one region to another.
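For the database half of such a fire drill, the key step is promoting the cross-region read replica and repointing the app at it. A minimal sketch with boto3, assuming an RDS read replica with the placeholder identifier below (a real drill also covers DNS and config changes):

    import boto3

    rds = boto3.client("rds", region_name="us-west-2")

    REPLICA_ID = "myapp-db-replica-west"   # placeholder replica name

    # Promote the read replica to a standalone primary.
    rds.promote_read_replica(DBInstanceIdentifier=REPLICA_ID)

    # Wait for it to come up, then repoint the app at its endpoint.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=REPLICA_ID)
    info = rds.describe_db_instances(DBInstanceIdentifier=REPLICA_ID)
    print("new primary:", info["DBInstances"][0]["Endpoint"]["Address"])

Doing this regularly, on live traffic, is what separates a failover plan from a failover hope.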
"Building your own" is also something very few people (including Amazon itself up until fairly late, probably after the IPO) do: you can use a managed hosting provider (very common, usually cheaper than EC2) or lease colo space (which doesn't imply maintaining on-site personnel in the leased space: most colos provide "remote hands"). You can still use EC2 for async processing and offline computation, S3 for blob storage, etc... or even S3 for "points of presence" on different US coasts, Asia/Pacific, Europe, but run databases, et al in a leased colo or a managed hosting provider.
Yes, these options are more expensive than running a few instances in a single EC2 AZ: but that's the price of offering high availability SLA to your customers. It's a business decision.
We run gear in multiple physical locations, but both the application and data is stored/backed in S3. This allows us the redundancy of S3 without the cost and fragile nature of EC2.
Not to mention unless you have very unusual traffic patterns (spin up lots of servers for short periods of time), colo/dedicated servers will usually be vastly cheaper than EC2, especially because with a little bit of thought you can get servers that are substantially better fit for your use.
E.g. I'm currently about to install a new 2U chassis in one of our racks. It holds 4 independent servers, each with dual 6-core 2.6GHz Intel CPUs, 32GB RAM and an SSD RAID subsystem that easily gives 500MB/sec throughput.
Total leasing cost + cost of a half rack in that data centre + 100Mbps of bandwidth is ~ $2500/month. Oh, and that leaves us with 20U of space for other servers, so every additional one adds $1500/month for the next 7-8 or so of them (when counting some space for switches and PDUs). Amortized cost of putting 2U with 100Mbps in that data centre is more like $1700/month.
Amazon doesn't have anything remotely comparable in terms of performance. To be charitable to EC2, at the low end we'd be looking at 4 x High Mem Quadruple Extra Large instances + 4 x EBS volumes + bandwidth and end up in the $6k region (adding the extra memory to our servers would cost us an extra $100-$200/month in leasing cost, but we don't need it), but the EBS IO capacity is simply nowhere near what we see from a local high end RAID setup with high end SSDs, and disk IO is usually our limiting factor. More likely we'd be looking at $8k-$10k to get anything comparable through a higher number of smaller instances.
I get that developers like the apparent simplicity of deploying to AWS. But I don't get companies that stick with it for their base load once they grow enough that the cost overhead could easily fund a substantial ops team... Handling spikes or bulk jobs that are needed now and again, sure. As it is, our operations cost in man-hours spent, for 20+ chassis across two colos, is ~$120k/year: $10k/month, or ~$500 per chassis. So consider our fully loaded cost per box to be ~$2,200/month for a quad-server chassis of the level mentioned above, with reasonably full racks. Let's say $2,500 again to be charitable to EC2...
This is with operational support far beyond what Amazon provides, as it includes time from me and other members of staff who know the specifics of our applications, handle backups, and handle configuration and deployment, etc.
I've so far not worked on anything where I could justify the cost of EC2 for production use for base load, and I don't think that'll change anytime soon...
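For anyone who wants to sanity-check the arithmetic, here it is spelled out, using only the rough numbers quoted above (Python; all figures are the estimates given, not real quotes):

    # Colo-side estimates from the comment above (USD/month).
    first_chassis_2u = 2500    # chassis lease + half rack + 100Mbps
    additional_2u = 1500       # each further 2U in the same half rack
    amortized_2u = 1700        # blended cost quoted for a 2U with 100Mbps

    # Ops overhead: ~$120k/year spread over 20+ chassis.
    ops_per_chassis = 120000 / 12 / 20            # ~$500/month per chassis

    fully_loaded = amortized_2u + ops_per_chassis
    print(round(ops_per_chassis), round(fully_loaded))   # ~500, ~2200/month
    # vs. the charitable EC2 estimate above of ~$6k-$10k/month for comparable kit.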
If disk performance is important you can also take a look at the High IO instances, which give you 2x 1TB SSDs, 60GB of RAM and 35 ECUs across 16 virtual cores. At 24x7 for 3 years you end up with ~$656/mo per instance, plus whatever you would need for bandwidth. By the time you fill up an entire rack it still ends up being slightly more expensive than your amortized 2U cost, but you also don't need to scale it up in 2U increments.
Completely agree, building your own is cheaper, gives more control, etc. But what is more: you do NOT lose the ability to use the cloud for added reliability: it is pretty cheap to have an EC2 instance standing by that you fail over to.
If you are very database heavy, and you want to be able to replicate that to the cloud in real time it does get expensive, but if you can tolerate a little downtime while the database gets synced up and the instances spin up that's cheap too.
We have SQL Server 2008 boxes with 128GB+ of RAM; we're able to run all of our production databases right out of memory. This would be cost-prohibitive in a virtualized environment such as AWS, Linode, etc.
Did you know that many websites operated BEFORE Amazon Web Services existed? Perhaps going back to 2008 could give us some ideas for alternate deployment methodologies...
For the very early stage, perhaps. Once you're dealing with more than a handful of instances, it is extremely likely you'd save a substantial amount of money moving your base load off EC2.
I wish that DNS could just switch over to another availability zone when this happens. A datacenter with all the sql replicated. Sure you'd pay twice as much for EBS, but it could also double as a backup. As for the other resources at the availability zone, they are hardly utilized until they spring into action (EC2, etc.)
Take care when treating a high-availability setup such as this as a backup: if you are replicating all changes between two database servers and an application error (as opposed to a database outage) causes some kind of data corruption, you are hosed if the corruption replicates and you don't also have some previous "snapshot" of the data that you can roll back to.
You can do this, depending on your database. I wrote a blog post going over some of the techniques you can use with the AWS stack a while back - http://bit.ly/TD13iH
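The DNS side of this is basically flipping a low-TTL record to the standby endpoint. A rough illustration with boto3 and Route 53 (the zone ID, hostnames and TTL are placeholders; in practice you'd drive this from a health check rather than by hand):

    import boto3

    r53 = boto3.client("route53")

    ZONE_ID = "Z1EXAMPLE"                          # placeholder hosted zone
    RECORD = "app.example.com."
    STANDBY = "standby.us-west-2.example.com."     # endpoint in the other AZ/region

    r53.change_resource_record_sets(
        HostedZoneId=ZONE_ID,
        ChangeBatch={
            "Comment": "fail over to standby",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD,
                    "Type": "CNAME",
                    "TTL": 60,   # keep the TTL low so the flip takes effect quickly
                    "ResourceRecords": [{"Value": STANDBY}],
                },
            }],
        },
    )

The snapshot caveat in the parent comment still applies: replication is not a backup.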
Nothing has changed in the stack. Robert has discovered and eliminated a series of bottlenecks, causing performance to oscillate about tolerable. Finding bottlenecks is not trivial, because Arc has zilch in the way of profiling, but fortunately Robert is good at this sort of thing.
You can buy some seriously big boxes, and easily split off a lot of services onto multiple boxes. The big problem with the "single big box" strategy is being able to do upgrades -- I see hn go down frequently for 5-10 min at a time in the middle of the night, which I assume is upgrades/reboots.
The happy medium is probably splitting database (master/slave at least) and cdn (if needed) and some other services (AAA? logging?) out, and then having 2+ front end servers with load balancing for availability.
The Arc process hosting HN blows up at least once an hour (I wouldn't be surprised if there was a cronjob restarting it) and much more frequently in peak usage periods.
You wouldn't notice if it weren't for the use of closures for every form and all pagination: every time the process dies, all of them become invalid (except in the rare case that they lead to a random new place!).
There's no database; everything is in memory, loaded on demand from flat files. That wouldn't be so bad except that it's all then addressed by memory location rather than by content identifier! There can be only one server per app, and to keep it really interesting PG hosts all the apps on the same box; during YC application periods he regularly throttles HN to keep the other apps more available.
HN doesn't have a database capable of master/slave as such... so I think this would be harder if it ever became popular enough. From what I know, I don't think it gets enough traffic that it's ever likely to exceed what you can fit in a single box.
In the first 99 comments of this page, average comment text size is 231 bytes. Counting all comments in articles on the front page right now, there's 1678 of them, making somewhere around 388kb of comments for the past 12 hours.
So for safety's sake round that to 1mb/day and multiply by site age (5 years).
That gets us 1825mb, projecting forward it's difficult to imagine a time when a single recent SSD on a machine with even average RAM wouldn't be able to handle all of HN's traffic needs. Considering the recent beefy Micron P320h and its 785k IOPS, that could serve the entire comment history of Hacker News to the present day once every 2 seconds, assuming it wasn't already occupying a teensy <2gb of RAM.
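Spelling that back-of-envelope out (same assumed inputs: 231 bytes per comment, 1678 front-page comments per ~12 hours, 5 years of history):

    avg_comment_bytes = 231
    front_page_comments = 1678

    per_12h_kb = avg_comment_bytes * front_page_comments / 1000.0   # ~388 kB
    per_day_mb = 1.0                                                # rounded up
    total_mb = per_day_mb * 365 * 5                                 # 1825 MB

    print(per_12h_kb, total_mb)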
Even if Arc became a burden, a decent NAS box, gigabit Ethernet, and a few front end servers would probably take the site well into the future. Assuming exponential growth, Hacker News comments would max out a 512GB SSD sometime around 2020, or 2021 assuming gzip bought a final doubling.
Clearly pg should release the dataset and institute an annual round of HN golf, where participants compete by recreating HN and trying to get the best performance for a given (changing) deployment target (SSD vs HDD, different RAM & CPU).
Well, for one, the site is basically an engine for rendering out <table><tr><td><a> ..., without much in the way of complex or frequent client-side requests.
(this isn't to downplay the challenges faced by scaling a site with the amount of traffic HN gets)
The N. Virginia datacenter has been historically unreliable. I moved my personal projects to the West Coast (Oregon and N. California) and I have seen no significant issues in the past year.
N. Virginia is both cheaper and closer to the center of mass of the developed world. I'm surprised Amazon hasn't managed to make it more reliable.
One thing we discovered this morning: it appears the AWS console itself is hosted in N Virginia.
This means that if you were trying to make changes to your EC2 instances in the West using the GUI, you couldn't, even though the instances themselves were unaffected.
shouldn't amazon themselves have architected their own app to be able to move around?
I get tired of the snipes from people that "well, you're doing it wrong", as if this is trivial stuff. But if Amazon themselves aren't even making their AWS console redundant between locations, how easy/straightforward is it for anyone else?
To what extent is this just "the cobbler's kids have no shoes?"
You're close. Put another way, "inherent complexity is the problem."
What I mean by that is, the more your system is coupled, the more it is brittle.
Frankly, this is AWS's issue. It is too coupled: RDS relies on EBS, the console relies on both, etc. Any connection between two systems is a POF and must be architected to let those systems operate w/o that connection. This is why SMTP works the way it does. Real time service delivery isn't the problem, but counting on it is.
Depends. Generic interfaces and non-reliance have costs too. In general I agree that things should be decoupled, but it's not always easy or practical.
Surely true, but that's the purpose of a system in the first place: to manage complexity and make it predictable. You could argue that we have such a system in place, given how well the Internet works overall. The fact that this system has problems goes against what I believe is fully evident proof that such a system can, in fact, work even better.
We're not talking about a leap in order of magnitude of complexity here—just simple management of common human behavioral tendencies in order to promote more reliability. "The problem is inherently complex" is always true and will always be true, but it's no excuse for not designing a system to gracefully handle that complexity.
The internet works because it provides very weak consistency guarantees compared to what businesses might require out of an EC2 management console. (IMO.)
Their CoLo space is the same space shared by AOL and a few other big name tech companies. It's right next to the Greenway, just before you reach IAD going northeast. That CoLo facility seems pretty unreliable in the scheme of things; Verizon and Amazon both took major downtime this summer when a pretty hefty storm rolled through VA[1], but AOL's dedicated datacenters in the same 10 mile radius all experienced no downtime whatsoever.
To be fair, the entire region was decimated by that storm. I didn't have power for 5 days. Much of the area was out. There was a ton of physical damage. That's not excusing them, they should do better, but that storm was like nothing I've experienced living in the area for 20 years.
Yeah, it's got to be much larger than the other regions, so it makes sense that we see more errors. Since error_rate = machines * error_rate_per_machine.
No, I calculated the error rate for the region. If us-east-1 has 5 times the machines (or availability zones, or routers, or EBS backplanes, or other thing-that-can-fail) as us-west-1, we would expect to see us-east-1 have each type of error occur about 5 times as often as us-west-1.
I'm surprised Amazon hasn't built another region in the east. If you're in the west you get us-west-1 and us-west-2 and can fail over and distribute between the two; why don't they have that kind of duplication in the east?
us-east-1 was 11 different datacenters last time I bothered to check.
us-west-2 by comparison is two datacenters. The reason us-west-1 and us-west-2 exist as separate regions is that they are geographically diverse enough that low-latency interconnects between them aren't practical (and they also have dramatically different power costs, so they bill differently).
Then how come when east goes down, it always seems to take down all the AZs in the region, never just one AZ? As long as the region fails like a single datacenter, I'll think of it as a single datacenter.
They already expanded into a DC in Chantilly, one more in Ashburn and I believe one in Manassas. But they lean on Ashburn for everything they do, and a small problem results in a daisy-chain failure (which because everyone uses Amazon for every service imaginable, means even the smallest problem takes down whole websites)
I don't understand why anyone's site is only in one datacenter. I thought the point of AWS was that it was distributed with fault tolerance? Why don't they distribute all the sites/apps across all their centers?
It takes development/engineering resources, and additional hardware resources to make your architecture more fault-tolerant and to maintain this fault-tolerance over long periods of time.
Weigh this against the estimated costs of your application going down occasionally. It's really only economical for the largest applications (Netflix, etc.) to build these systems.
Disagree. The only area where it really hurts the wallet is multi-AZ on your RDS, because it doubles your cost no matter what, and RDS is the toughest thing to scale horizontally. The upside is that if you scale your data layer horizontally, you don't need RDS anymore.
Two c1.medium instances, which are very nice for webservers, are enough to host >1M pageviews a month (WordPress, not much caching) and cost around $120/mo each, effectively $97/mo if you prepay for 12 months at a time via reserved instances.
The other issue is that you can have redundant services, but when the control plane goes down - you are screwed.
Every day I have to build basic redundancy into my applications I wish that we could just go with a service provider (like Rackspace / Contegix) that offered more redundancy at the hardware level.
I know the cloud is awesome and all, but having to assume your disks will disappear, fail, go slow at random uncontrollable times is expensive to design around.
If you don't have an elastic load, then the cloud elasticity is pointless - and is ultimately an anchor around your infrastructure.
us-west-2 is about the same cost as us-east these days. And latency is only ~10ms more than us-west-1. I'm puzzled that people aren't flocking to us-west-2. I can't remember the last time there was an outage there, either.
To be fair, AWS downtime always makes the news because it affects a lot of major websites, but that doesn't mean an average sysadmin (or devops, whatever) would do better in terms of uptime with his own rack and his own toys.
But this is part of the problem: we have multiple web properties, and the fact that AWS issues can affect all of them at once is a huge downside. Certainly, if we ran on metal, we would have hardware fail, but failures would be likely to be better-isolated than at Amazon.
1. Calculate the odds that a company with the resources of Amazon will be able to provide you better overall uptime and fault tolerance than you yourself could.
2. Calculate the cost of moving to the Oregon AWS datacenter.
3. Reassure your investors that outsourcing non-core competencies is still the way to go.
> we can actually get into the datacenter to fix it.
But better and faster than Amazon?
I'd rather spend three hours at home saying "Shit. Well, we'll just wait for Amazon to fix that", than dropping my dinner, driving to the datacenter, and spending three hours setting up a new instance and restoring from backup.
For added fun, their EC2 console is down. I got this for a while:
<html><body><b>Http/1.1 Service Unavailable</b></body> </html>
... then an empty console saying "loading" for the last 20 minutes. Then recently it upgraded to saying "Request limit exceeded." in place of the loading message (because hey, I'd refreshed the page four times over the course of 20 minutes).
On the upside, their status page shows all green lights.
They have standardized icons to represent various levels of issues (orange = perf issue, red = disruption). But they don't even use them. Instead they add [i] to the green icons to indicate perf issues (Amazon Elastic Compute Cloud - N. Virginia) and disruptions of service (Amazon Relational Database Service - N. Virginia).
Maybe this status page is controlled by marketing bozos who want to pretend the situation is not so bad.
Does Amazon explain anywhere what regions, availability zones or whatever have single points of failure for each of their services? I guess that's supposed to be what an "availability zone" is but somehow that doesn't quite seem to capture it. It's pretty hard to build reliable apps without knowing where the single points of failure are in the underlying infrastructure.
All of our EC2 hosts appear to be functioning fine, but they can't connect to their RDS instance which renders our app useless. If you scroll down the page you'll also see that RDS instances are having connectivity issues. Not sure if it's related but for RDS users the impact is far worse.
EDIT: We are also using multi-AZ RDS, so either Amazon's claims for multi-AZ are bs, or their claims that this is only impacting a single zone is bs.
Because RDS is built on EBS, any slight issue with EBS manifests itself as a nastier issue for RDS.
Interestingly, EBS will never return an I/O error up to the attached OS, which is likely a good decision as most OSes choke on disk errors. What this means, however, is that if something even gets slow within EBS (let alone stuck), applications that depend on it will suffer. Most of these applications (such as databases) have connection/response timeouts for their clients, so while EBS might just be running slowly, a service like RDS will throw up connection errors instead of waiting a bit longer.
You can imagine the cascading errors that might result from such a situation (instance looks dead, start failover...etc)
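This is also why client-side timeouts matter so much here: a stalled volume reaches your app as a slow query, and whether that becomes a fast, handleable error or a pile-up of stuck workers depends on driver settings. An illustrative sketch (PyMySQL; the endpoint and timeout values are arbitrary):

    import pymysql

    # Fail fast instead of letting requests queue up behind a stalled volume.
    conn = pymysql.connect(
        host="mydb.example.com",   # placeholder DB endpoint
        user="app",
        password="secret",
        database="production",
        connect_timeout=5,   # give up on connecting after 5s
        read_timeout=10,     # give up on a query after 10s
        write_timeout=10,
    )

    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
    except pymysql.err.OperationalError:
        # The disk may only be slow, but past the timeout we treat it as down
        # and can fail over or shed load instead of hanging.
        pass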
I had one multi-AZ RDS instance fail over correctly, however the security group was refusing connections from the web servers' EC2 security group. I had to manually add the private IPs of the EC2 instances. It appears the API issue is affecting security-group-to-IP lookups.
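If the group-to-group lookups are what's broken, adding the raw private IPs is one API call per host. A sketch with boto3, assuming the database sits behind a VPC security group (group ID, port and addresses are placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    DB_SECURITY_GROUP = "sg-0123456789abcdef0"     # placeholder DB-side group
    WEB_PRIVATE_IPS = ["10.0.1.10", "10.0.1.11"]   # private IPs of the web hosts

    for ip in WEB_PRIVATE_IPS:
        ec2.authorize_security_group_ingress(
            GroupId=DB_SECURITY_GROUP,
            IpProtocol="tcp",
            FromPort=3306,        # MySQL, as an example
            ToPort=3306,
            CidrIp=ip + "/32",
        )

Remember to remove the temporary rules once the normal group references work again.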
Per the linked dashboard, some instances in a single AZ in a single Region are having storage issues. Calling EC2 "down" is a bit dramatic, provided AMZN are being sincere with their status reports. Any system that can competently fail over to another AZ will be unaffected.
I would agree with you, but Amazon is just downright dishonest in their reports, which makes me sad, because I love Amazon. Go look at the past reports: they've never shown a red marker, only "degraded performance", even when services for multiple availability zones went down at the same time due to their power outage (so had you architected for multiple AZs you were still fucked). When a single AZ goes down, they won't even give it a yellow marker on the status page; they'll just put a footnote on a green marker. It makes their status dashboard pretty much useless for at-a-glance checking (why even have colors if they don't mean anything?)
Read their report from the major outage earlier this year: they start out by saying "elevated error rates" when many services were in fact down, and it wasn't until hours later that they finally admitted to having an issue that affected more than just one availability zone.
From Forbes:
"We are investigating elevated error rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone." By 11:49 EST, it reported that, "Power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online." But by 12:20 EST the outage continued, "We are continuing to work to bring the instances and volumes back online. In addition, EC2 and EBS APIs are currently experiencing elevated error rates." At 12:54 AM EST, AWS reported that "EC2 and EBS APIs are once again operating normally. We are continuing to recover impacted instances and volumes."
It's like grade inflation. You can never give out an F (Mr. Admissions officer, are you so bad at your job that you would admit such an unqualified student?), so a Gentleman's C is handed around. In Amazon's case, it's a gentleman's B+ (green, with an info icon).
A: fine
A-: problems
B+: servers are on fire
I really like Amazon as a company, use a lot of their services, but this is dishonest.
The last Netflix post mortem mentioned they had a bug in their configuration where they kept sending traffic to already down ELB instances, which was the cause of the last outage for them if I remember correctly.
Not necessarily, it could be some element of the Netflix architecture that due to their size and/or design trade-offs has taken longer / is harder to eliminate than it would be for others.
Other services, like Twilio, have come through several of these major problems with US-EAST generally unscathed while Netflix has had issues repeatedly.
According to a site which doesn't document what its reports are based on. Given that Netflix worked for me during that period, I'm suspicious that downrightnow might be using EBS somewhere.
This kind of thinking is poisonous. I know it's in good fun and it's fun to look for connections in things, but it is actually preposterous to think that Amazon would purposely disrupt wide swaths of highly paying customers for much of a day to bury one story about bad customer relations. My guess is there are a lot of people working very hard to try and solve this problem right now, let's not belittle their efforts because of a conspiracy, let's belittle their efforts because of bad systems design.
He knows you were joking. It's a bad joke. "It's a joke" is not a magic bullet that means you can do no wrong.
Poisonous ideas spread as jokes. That is one of the ways they spread. A person thinking well about the issue wouldn't find the joke funny because it doesn't make sense. The joke relies on some poisonous, bad thinking to be understood. It has bad assumptions, and a bad way of looking at the world, built in.
Let me explain to you why it is a funny joke. It is funny because it involves Amazon undertaking massive technical measures, with huge reputational damage in order to try to kill a story which is primarily not spreading via Amazon-hosted sites anyway.
It's akin to a man with athlete's foot deciding to remedy it by discharging a shotgun into his leg.
It's easy to understand shooting a leg with a shotgun. That's a simple thing.
The Amazon thing in question is far more complicated, and far harder to understand.
Thinking they are "akin" is a mistake. It shows you're thinking about it wrong and failing to recognize how completely different they are.
One isn't going to confuse anyone or be misunderstood, the other will confuse most of the population and be misunderstood by most people.
One, if someone misunderstood, only involves one individual being an idiot. The other involves a large company being evil and thus can help feed conspiracy theories.
I'm not sure if you are aware of the difficulty of Amazon doing this. Suppose Jeff Bezos wants to do it. He can't simply order people to do it because they will refuse and leak it to the media and he'll look really bad and then he'll definitely have to make sure to try super hard for there to be no outages anytime soon.
Shooting yourself in the foot is stupid but easy. Doing this is stupid and essentially impossible. To think it's possible requires thinking that Amazon has a culture of unthinking obedience, or has an evil culture that all new hires are told about and don't leak to the media. Totally different.
Casually talking about impossible, evil conspiracies by big business, as if they are even possible, is a serious slander against those businesses, capitalism, and logic. Slandering a bunch of really good things -- especially ones that approximately 50% of US voters want to harm -- and then saying "it's just a joke, it's funny" is bad.
Casually joking about impossible, evil conspiracies by big business on the other hand is something completely different and also funny.
No one will believe it's related, and it's certainly not slander to joke about it. Also, you might want to leave the political opinions out of Hacker News... there isn't 50% of the US that dislikes those things; they just have different ideas about how to support them.
When I commented the two other comments replying to yours were taking it seriously, so I decided to take it seriously too. Text is a pretty bad medium for hearing tone, sorry I misinterpreted yours; for some reason the "the cynical side of me" bit made me think your comment wasn't entirely in jest.
I have to deal with a number of folks who will be overjoyed to read this news when their tech cartel vendor of choice forwards it this evening.
There's a huge contingent of currently endangered infrastructure folks (and vendors who feed off them) out there who throw a party every time AWS has a visible outage.
AWS sucking at availability (and especially specific parts like EBS, and then services built on top of EBS and on top of AWS) doesn't mean the correct option is to mine your own sand and dig up your own coal to run servers in your own datacenters in the basement.
Even if you're totally sold on the cloud, you can still have a requirement that things be transparent all the way down. AWS is one of the least transparent hosting options around.
If you're a customer of a regular colo, or even a managed hosting provider themselves based at a colo, it's pretty easy to dig into how the infrastructure is set up, identify areas where you need your own redundancy, etc. Essentially impossible within AWS -- there is no reason intelligible to me that ELB in multiple AZs should depend on EBS in a single AZ, but that's how they have it set up.
I'd question whether they are truly wrong to send out news like this. Basically you have to weigh the realities of the 'cloud' against the perception of it.
The perception many people have is that somehow the 'cloud' is a magical uptime device that will save you money in droves.
The reality is that for many companies that have mid-level traffic and aren't a start-up with a billion users, the 'old' style of tech might very well be the best option.
The problem is likely the same as usual: if the damn control plane is down, it doesn't matter how robust your failover architecture is, because your requests to bring up new machines go unanswered.
There's pretty much no way to architect around that one as an AWS user (apart from going fully multi-cloud, but "nobody" actually does that, at least at scale), and I'm kind of shocked that those bits of AWS are still not robust against "single AZ outages", given that they're involved in pretty much every one of these incidents and make them affect people on the entire cloud...
Currently, Netflix uses a service called "Chaos Monkey" to simulate service failure. Basically, Chaos Monkey is a service that kills other services. We run this service because we want engineering teams to be used to a constant level of failure in the cloud. Services should automatically recover without any manual intervention. We don't however, simulate what happens when an entire AZ goes down and therefore we haven't engineered our systems to automatically deal with those sorts of failures. Internally we are having discussions about doing that and people are already starting to call this service "Chaos Gorilla"."
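This isn't Netflix's actual tool, but the idea is simple enough to sketch: pick a random instance from a service group and kill it, then make sure nobody gets paged. A toy version with boto3 (the tag filter is hypothetical):

    import random
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Find running instances in a (hypothetical) service group.
    resp = ec2.describe_instances(Filters=[
        {"Name": "tag:service", "Values": ["my-service"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    instances = [i["InstanceId"]
                 for r in resp["Reservations"]
                 for i in r["Instances"]]

    if instances:
        victim = random.choice(instances)
        print("terminating", victim)
        ec2.terminate_instances(InstanceIds=[victim])
        # The service should recover on its own (auto scaling, health checks,
        # failover) without any manual intervention.

"Chaos Gorilla" is the same idea with a whole availability zone as the victim, which is a much bigger lift.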
I wish there was. Architecting a system for failover requires a mindset change, a culture change within your org, and the right technology that lets you build for it while not slowing yourself down too much. The last point being the hardest.
I would argue that none of the common full stack frameworks that startups use are fault tolerant enough for AWS. Most of them have multiple failure points that can quickly bring down entire apps.
Our app is down because it's hosted on Heroku and it's frustrating because it seems like N Virginia is the least reliable Amazon datacenter. Every year it seems to go down a couple times for at least a couple hours.
Heroku should offer a choice between N Virginia and Oregon hosting (I think they're almost comparable in price nowadays). That way people who want more uptime/reliability can choose Oregon. Sure it will be further from Europe (but then it will be closer to Asia) and people can make that choice on their own.
But basing an entire hosting service on N Virginia doesn't make sense anymore, considering the history of major downtime in that region.
Or better yet, Heroku should offer an add-on "instant failover" service that, for a premium of course, offers a package for multi-site (or, knowing they're 100% AWS, multi-datacenter) deployment with all of the best practices, etc. Seems like a logical next step for them (or a competitor) given the recent spate of outages.
Can't update my list fast enough, but other major services experiencing problems are Netflix and Pinterest. Lots of other (smaller) sites are starting to fail too.
One of the most frustrating issues here is that we have to deal with Amazon's status page for information. It's a complex page, divided by continent instead of region, which means at least 5 or 6 clicks to figure out progress. They should learn from these issues about how people want to be informed - to date, they haven't. Also, they have a twitter account, which would be the perfect fit to keep everyone up-to-date with what's going on; to at least show a human side to these issues. Alas, they're not updating that either.
I've been working with AWS since early 2006 when they first launched - I was lucky to be granted a VIP invite to try out EC2 before everyone else, and ended up launching the first public app on EC2. This might be the first time when frustration has overcome my love for these guys.
"Degraded performance," is a fairly off-handed way of putting what they're experiencing. First RDS connectivity went down the tube, and then EBS followed, finally the EC2 console is failing to operate properly (for me, in US-EAST region at least).
Of course, as soon as I read the report that the issue was confined to one AZ, I looked to move my server over to another AZ. Oh yes, two were full and refused new instances, and then, surprisingly, new requests for the other AZs were never received or acted on - and now the console is failing. It's a bit more than just "slow EBS", if that's what you were thinking.
- edit: said ec2 twice in the second sentence, corrected to say ebs.
Even if they do understate the issue, it's limited to US-EAST-1, and is an EBS issue. Saying that EC2 is "down" because of this is totally off the mark - I've got dozens of EBS volumes in EAST 1 that are unaffected, plus all of the other zones are operating normally...
From what I'm seeing, if your root disk is on EBS and your SSH keys are there, you cannot SSH into those hosts right now.
Also, the availability zones are disparate in terms of what they can support. A great number of my instances are in 1d because of unavailability in others.
Note that the region-1[a b c d & such] designations are randomized per account; my us-east-1d won't (necessarily) be the same as yours or anyone else's.
Does anyone know if this really is just one AZ? Seems like an awful lot of larger sites are down. I'd expect at least some of them to be multi-AZ by now.
This is affecting more than a single Availability Zone, but probably for reasons that have been seen before. One reason might be that an EBS failure in one AZ triggers a spike in EBS activity in other AZs which overwhelms EBS. (I believe this is what happened in April 2011).
Does anybody have any experience with migrating to Oregon or N. California in terms of speed and latency?
My dashboard says "Degraded EBS performance in a single Availability Zone". It then lists each of the five zones as "Availability zone is operating normally." http://cl.ly/image/202F3B0I371g
I was seeing that too, but it looks like Amazon has now updated the availability zone status. When I run ec2-describe-availability-zones from the command line, it's telling me that us-east-1b is impaired. (Availability zone mapping is different for each account, so my 1b may be some other availability zone for you.)
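Same check from code, for anyone who'd rather poll than refresh the console (the boto3 equivalent of ec2-describe-availability-zones):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        # State is "available" or "impaired"; remember the letter suffixes
        # are shuffled per account, so my 1b isn't necessarily your 1b.
        print(az["ZoneName"], az["State"], az.get("Messages", []))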
I'm sorry, am I in a time machine? Didn't this exact thing happen last year, and didn't these exact people explain exactly how they were going to make sure it never happens again?
Ironically, at the first Hacking Health event, our server was running on Heroku... and Heroku went down. Three months later, we have the second Hacking Health in Toronto, and all of AWS is going down.
ELB is also fucked. I've noticed that nobody mentions it. Some of our load balancers are completely unreachable.
Just one of our multi-AZ RDS instances reported the automatic failover. However, the 200+ alerts triggered during that failover, caused by the internal DNS change pointing to the new master, show that things aren't as tidy as the RDS DB events log makes them look.
Some instances reported high disk I/O (EBS backed) via the New Relic agent (the console still has some issues).
I reacted quickly enough to get a new instance spun up and added it to my site's load balancer, but the load balancer is failing to recognize the new instance and send traffic to it... yay. Console is totally unresponsive on the load balancer's "instances" tab. If I point my browser at the ec2 public DNS of the new instance it seems to be running just fine. So much for the load balancer being helpful.
Amazon runs all of the Amazon.com web servers on EC2 hosts (and has since 2010: https://www.youtube.com/watch?feature=player_detailpage&...). They've just made sure that they have enough hosts in each of the other AZs to withstand an outage.
That would be the ultimate testament, maybe the only stunt they could pull to PR themselves out of this mess.. if www.amazon.com ran on the same set of EC2 servers and systems as us muggles
2:20 PM PDT We've now restored performance for about half of the volumes that experienced issues. Instances that were attached to these recovered volumes are recovering. We're continuing to work on restoring availability and performance for the volumes that are still degraded.
We also want to add some detail around what customers using ELB may have experienced. Customers with ELBs running in only the affected Availability Zone may be experiencing elevated error rates and customers may not be able to create new ELBs in the affected Availability Zone. For customers with multi-AZ ELBs, traffic was shifted away from the affected Availability Zone early in this event and they should not be seeing impact at this time.
This is infuriating. Our instances aren't ELB-backed, so we have an entire AZ sitting idle while the rest of our instances are overwhelmed with traffic. Why did they make this decision for us?
I'm normally an Amazon apologist when outages happen, but this is absolutely ridiculous.
You have a virtualized platform, on top of which are many pieces like load-balancing, EBS, RDS, the control-plane itself, etc.
You have burstable network connections which by their nature, will have hard limits (you can't burst above 10Gbps on a 10Gbps pipe, for example; even assuming the host machine is connected to a 10Gbps port).
Burstable (meaning quite frankly, over-provisioned) disk and CPU resources.
And if any piece fails, you may well have downtime...
It is always surprising to me that people feel layering complexity upon complexity will result in greater reliability.
Hosted on Heroku with a hosted Heroku Postgres dev instance, I observed a drop in the DB response time:
- from a consistent average of 10ms over the last week
- to a new consistent average of ~2.5ms
between 17:31 and 17:35 UTC. AWS started to report the current issue at 17:38 UTC. My app then experienced some intermittent issues (reported by New Relic, which pings it every 30s). I don't know if it's related - could it be some sort of hardware upgrade that went wrong?
I did a push affecting my most-used queries, but that was an hour and a half earlier, at 15:56 UTC, so it's probably unrelated.
I thought at least Reddit had learned the "don't use EBS" rule from past outages - I was bitten by the April/2011 outage and switched everything over to instance storage shortly thereafter.
For most applications I think architecting EBS out should be straightforward - instance storage doesn't put a huge single-point-of-failure in the middle of your app if you're doing replication, failover, and backups properly.
And EBS seems to be the biggest component of the recent AWS failures upon which they've built a lot of their other systems.
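The trade-off with instance storage is that it vanishes with the instance, so the "backups properly" part has to be boring and reliable. A minimal sketch of the idea: ship a dump off-box to S3 on a schedule (the bucket, paths and dump command are placeholders):

    import datetime
    import subprocess
    import boto3

    BUCKET = "myapp-backups"   # placeholder bucket
    dumpfile = "/mnt/backups/db-%s.sql.gz" % datetime.date.today()

    # Dump the database that lives on instance storage (command is illustrative).
    subprocess.check_call(
        "mysqldump --single-transaction mydb | gzip > %s" % dumpfile,
        shell=True,
    )

    # S3 survives the instance (and usually the AZ) going away.
    boto3.client("s3").upload_file(dumpfile, BUCKET, "db/" + dumpfile.split("/")[-1])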
I am frankly surprised so many still use EC2 considering how frequently it breaks. It's not cheap, so the only reasons to use it would be reliability or scale, right? Why not just get lots of boxes at Hetzner and OVH (40 EUR a month for 32GB of RAM and 4x2x3.4GHz cores) and scale up / build in redundancy that way?
They've been down for several hours actually. I tried to get to their site this morning to check out the jedburg talk, and got the "We're on the case" message.
Building infrastructure on top of Amazon with global replication across multiple availability zones can sidestep such failures and guarantee uninterrupted operations to customers - www.pubnub.com is unaffected by AWS being down.
Not wanting to shill for Google, but we've been very happy with App Engine's reliability - particularly since moving to the HR datastore. Have other App Engine users had any significant downtime apart from the M/S (master/slave) datastore issues?
I read in my book that Amazon is the best cloud service provider, but seeing this, it's hard to believe.
I am still confused about which cloud service I should use. Amazon is the only one that Indians currently rely on, since latency is a bit lower for Amazon servers compared to other providers.
If you go through the pains of architecting your system to span multiple AZs, or you avoid using EBS, then you probably dodge most of the EC2 outages. (Remains to be seen if that is the case here.)
That said, I don't think most people think using the cloud means that downtime is a thing of the past. I think the more attractive proposition is when hardware breaks, or meteors hit the datacenter, etc, it is their problem, not yours. You still have to deal with software-level operations, but hardware-level operations is effectively outsourced. The question is if you think you can do a better job than Amazon -- some companies think they can, most startups know they can't.
Yeah. Even with this, they still do better than I would. My record: misconfigured air-conditioning unit alarm leading to servers being baked at high temperature over a weekend, leading to much wailing and gnashing of teeth. I now know to be really careful to set up air conditioning units properly, but what other lessons am I still waiting to learn? The main lesson that I took from this is that I should stick to what I am good at: cutting code & chewing data. :-)
I always understood the cloud to mean a black box of sorts that automatically handles failover, among other things. The cloud being just a fuzzy representation of the infrastructure.
S3 probably fits the description of a cloud service. You send your data, and the service worries about making it redundant without your intervention. If data in NE USA is unavailable, the service will automatically serve you the data from somewhere else. You don't need to know how it works.
EC2 and some of these other building blocks, however, I would not consider to be cloud services. Merely tools for building out your own cloud services to other customers who then shouldn't have to think about failover and other such concerns.
If you know you are using a server that is physically located in a certain geographic location, it need not be represented by a cloud. It is a distinct point on the network.
pg, or someone else from HN... Could you please edit this title for accuracy? Maybe, "Poorly designed sites taken out because of problems in one Amazon availability zone."
For many sites, a single server in a single zone (e.g., a non redundant server, an instance, a slice, a VM, whatever) is the right decision for ROI.
For many sites, the money spent on redundancy could be better spent on, say, Google Adwords, until they're big enough that a couple hours downtime has irreplaceable costs higher than the added costs of redundancy (dev, hosting, admin) for a year.
Yes, it is editorializing. My point, which I guess was too subtle, is that the current link text is very much an editorial comment, especially since the content at that link has nothing to do with the sites mentioned in the link text.
This issue is affecting both an EC2 Zone and Amazon's RDS servers, which are technically multi-zone. There are a ton of well-architected apps and sites that have been affected. Unconstructive...