Amazon EC2 currently down. Affecting Heroku, Reddit, Others (amazon.com)
602 points by fredoliveira on Oct 22, 2012 | 295 comments



Convenient that we're too backward to use AWS. That means everyone can at least talk about it here when AWS is down.


Funny thing is, in the last couple of interviews I've had in Chicago and Silicon Valley, I actually get points for explaining that caution is necessary when using AWS for production.

A magic bullet it isn't.


I can't help feeling that Amazon are a bit of a Cassandra (mythological, not the software) when these outages occur.

They recommend that people fail over to other availability zones, but no one puts any effort into doing it, and then they get annoyed when a datacenter goes offline.

It's not Amazon's fault that you didn't make your service failure tolerant - it's your fault!


I'm seeing a lot of these types of comments. The thing is, AWS completely crapped out. Don't believe their status updates that make it sound like it was a tiny little area of their data center. It was pretty much the entire zone, and whenever there is an outage affecting an entire zone it brings down global services and even other zones as well.

We had servers in the bad zone and started having load issues. When I went to use the cool cloud features that are made for this, the entire thing completely fell on its face. I couldn't launch new EC2 servers, either because the API was so bogged down or because the new zone I was launching in was restricted due to load.

Basically, the thing that nobody keeps in mind when they think it's so cool that you can spin up servers to work around outages is that EVERYONE IS DOING THAT. This is Amazon's entire selling point and when it comes to doing it, it doesn't work!

We were lucky to get some new servers launched before the API pretty much completely went down. They started giving everyone errors saying request limit exceeded. The forums were full of people asking about it.

ELB, Elastic IP, and other services not associated with a single availability zone completely failed. I keep seeing comments saying that if people designed their stuff right, they wouldn't have an issue. That's just completely bull, AWS has serious design flaws and they show up at every outage. It's NOT just people relying on a single zone.


Totally agree. A lot of people don't know this, or substitute alternatives which are not necessarily viable. Among the tenets of reliability is isolation. The nature of Amazon's services is that it isolates at the datacenter level. One should isolate at the level at which they are comfortable taking on failures. Once there is an active dependency, a la EBS, the number of subsystems increases multi-fold and the likelihood of failure & cascading failure dramatically increases.

Where getting a bit from disk to memory used to be: platter -> diskcontroller -> cpu -> memory,

now with SANs & NFS & virtualized block storage, it's: platter -> diskcontroller -> cpu -> memory -> nic -> wire -> switch(es)/router(s)/network configs(human config item) -> wire -> nic -> cpu -> memory.

Not to say that centralized storage doesn't have its benefits, but the scope of isolation has drastically increased. Considering the combinatorial possibilities of failure in the former scenario versus the latter, the latter has a significantly larger chance of failure, more modes of failure, and failover that is significantly more difficult to automate programmatically.

TLDR: With Amazon, the scope of isolation is the datacenter. To be on Amazon, one must architect and design to handle failure at the datacenter level, rather than at the host or cluster level.


We didn't have downtime, for various reasons, but the ELBs we were using failed, and the queue of starting instances was too backed up for our few to get restarted in any reasonable time.

The main systemic issue in EC2 is EBS; take that away and it would almost completely eliminate these downtimes.


The problem with ELBs is that they are themselves EC2 instances and use many of the global services for detecting load, scaling up, etc.. Like all of AWS's value-added services, they are therefore more likely to fail during an outage event, not less likely, as they depend on more services.


Right. Blame the victim. How do you make a "fault tolerant" service when core services like ELB, together with the API behind it, start to fail? Multi-region? Multi-cloud? When is the "designed to make web-scale computing easier" part supposed to kick in? With half-baked products like ELB, or things like EIP that cease to work when you need them the most?

I actually asked AWS Premium Support about the ELB multi-AZ issues, in order to make things easier for everyone. This is the answer I got:

"As it stands right now, you would need to make a call to ELB to disable the failed AZ. It may be possible for you to programatically/script this process in the case of an event.

Going forward, this is something that we would like to address but I don't have any ETA for when something like this might be implemented."
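
For anyone wanting to script that call today, a rough sketch with boto is below. The load balancer name and zone are placeholders; the underlying API call is DisableAvailabilityZonesForLoadBalancer.

  # Sketch: pull a failed AZ out of an ELB with boto (names are placeholders).
  import boto.ec2.elb

  REGION = 'us-east-1'
  LB_NAME = 'my-load-balancer'   # hypothetical ELB name
  BAD_ZONE = 'us-east-1b'        # the AZ you want to stop routing to

  conn = boto.ec2.elb.connect_to_region(REGION)
  lb = conn.get_all_load_balancers(load_balancer_names=[LB_NAME])[0]

  # Calls DisableAvailabilityZonesForLoadBalancer under the hood; the ELB
  # stops sending traffic to instances registered in that zone.
  lb.disable_zones([BAD_ZONE])
  print 'Disabled %s on %s' % (BAD_ZONE, LB_NAME)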


IMO, there's plenty of blame to go around; however, the onus really should be on the individuals making the decision to go with Amazon and trust that their services will always be up. Unfortunately, some people don't know, so they will just blindly choose Amazon for its name recognition.

For the places that truly care about reliability and have the technical staff to make informed decisions, they should understand the limits of reliability of various architectures. As I mentioned before, one of the tenets of reliability is isolation. When the scope of isolation is increased (e.g. single host vs multi host), one must also handle failures at that scope. Amazon isolates at the datacenter level. So should those utilizing Amazon's offerings.


My question is this... why doesn't a service like Heroku, which acts as a PaaS, have this built in? I'm on their blog now trying to understand their complete setup.


Perhaps they should make it easier to do so, as in having some default option you could select, at a premium of course.


Perhaps they should - though Rightscale.com provides a service that helps you do just that. (Disclaimer: I do not work for Rightscale or know anyone that works there)


Just like: passwords, backups, etc.

It reminds me of my father: I used to interrupt him with "I know dad!" when he was chastising me. His response was simple "If you knew, then why did[n't] you do it?"


It most certainly is a magic bullet.

Slays millions with a single round.


It's bizarre the way "cloud" makes so many people think disks never fail, networks are perfect, and data centers always run smoothly. Now we'll get the backlash blog posts from people ditching the cloud - and I'm just waiting for the inevitable rebound outages when they learn that high availability requires geographic redundancy either way.


You find it bizarre that the cloud providers' marketing strategy has worked? I find that bizarre!


That's just silly given how much of Amazon's documentation strongly encourages you to use multiple AZs and regions for reliability. I've included a sampling of their whitepapers below; this is also what their salespeople tell you and what you have to click past any time you provision an RDS instance, ELB, etc.

http://media.amazonwebservices.com/AWS_Cloud_Best_Practices....

“Be a pessimist when designing architectures in the cloud”

http://media.amazonwebservices.com/AWS_Web_Hosting_Best_Prac... “As the AWS web hosting architecture diagram in this paper shows, we recommend that you deploy EC2 hosts across multiple Availability Zones to make your web application more fault-tolerant.”

http://media.amazonwebservices.com/AWS_Operational_Checklist...

“We have deployed critical components of our applications across multiple availability zones, are appropriately replicating data between zones, and have tested how failure within these components affects application availability”


Fair enough, I stand corrected. The problem is in the culture surrounding cloud services, not with the providers themselves.

So it seems that the only real benefit to utilising cloud services is to make scaling up easier and save money.


I think the main problem is that "the cloud"'s primary audience is developers, not sysadmins. Many developers simply don't appreciate that what you're getting is a [much] easier path to automating your server provisioning and management, but you're still in exactly the same position as before regarding any bit of infrastructure's ability to fail at the least convenient moment.


Caution and understanding are necessary when choosing any infrastructure provider; AWS is not a special case.


Can we trust the cloud services?


Downtime is inevitable for most companies. The only question is how much work you want to put in to achieve your target uptime.


What's the alternative? Building your own is certainly not.


As a start, be present in multiple EC2 availability zones (not just us-east-1, basically) and regions (this is harder). Cross-region presence needn't be active-active: just a few read-only database slaves and some machines to handle SSL termination ("points of presence") for your customers on the east coast. Perform regular "fire drills" where you actually fail over live traffic and primary databases from one AZ/one region to another.
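
A fire drill doesn't have to be elaborate; even a cron-driven check that both regions actually answer before you ever need them catches a lot of rot. A minimal sketch (the health-check endpoints below are hypothetical):

  # Sketch: verify that both the primary and the standby regions answer
  # health checks, so the standby is known-good before a real failover.
  # The endpoints are hypothetical placeholders.
  import urllib2

  ENDPOINTS = {
      'us-east-1 (primary)': 'http://east.example.com/healthz',
      'us-west-2 (standby)': 'http://west.example.com/healthz',
  }

  def healthy(url, timeout=5):
      try:
          return urllib2.urlopen(url, timeout=timeout).getcode() == 200
      except Exception:
          return False

  for name, url in ENDPOINTS.items():
      print '%-22s %s' % (name, 'OK' if healthy(url) else 'FAILING')
  # Wire the FAILING case into your pager or DNS-flip script of choice.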

"Building your own" is also something very few people (including Amazon itself up until fairly late, probably after the IPO) do: you can use a managed hosting provider (very common, usually cheaper than EC2) or lease colo space (which doesn't imply maintaining on-site personnel in the leased space: most colos provide "remote hands"). You can still use EC2 for async processing and offline computation, S3 for blob storage, etc... or even S3 for "points of presence" on different US coasts, Asia/Pacific, Europe, but run databases, et al in a leased colo or a managed hosting provider.

Yes, these options are more expensive than running a few instances in a single EC2 AZ: but that's the price of offering high availability SLA to your customers. It's a business decision.


We run gear in multiple physical locations, but both the application and the data are stored/backed up in S3. This gives us the redundancy of S3 without the cost and fragile nature of EC2.


Indeed, there are many ways to complement physical colocation with AWS.


Old school colo/dedicated servers/etc. There's something delightfully simple about only having to deal with "standard" hardware failures.


Not to mention that unless you have very unusual traffic patterns (spinning up lots of servers for short periods of time), colo/dedicated servers will usually be vastly cheaper than EC2, especially because with a little bit of thought you can get servers that are a substantially better fit for your use.

E.g. I'm currently about to install a new 2U chassis in one of our racks. It holds 4 independent servers, each with dual 6-core 2.6GHz Intel CPUs, 32GB RAM and an SSD RAID subsystem that easily gives 500MB/sec throughput.

Total leasing cost + cost of a half rack in that data centre + 100Mbps of bandwidth is ~$2500/month. Oh, and that leaves us with 20U of space for other servers, so every additional one adds $1500/month for the next 7-8 or so of them (when counting some space for switches and PDUs). Amortized cost of putting 2U with 100Mbps in that data centre is more like $1700/month.

Amazon doesn't have anything remotely comparable in terms of performance. To be charitable to EC2, at the low end we'd be looking at 4 x High-Memory Quadruple Extra Large instances + 4 x EBS volumes + bandwidth and end up in the $6k region (adding the extra memory to our servers would cost us an extra $100-$200/month in leasing cost, but we don't need it), but the EBS I/O capacity is simply nowhere near what we see from a local high-end RAID setup with high-end SSDs, and disk I/O is usually our limiting factor. More likely we'd be looking at $8k-$10k to get anything comparable through a higher number of smaller instances.

I get that developers like the apparent simplicity of deploying to AWS. But I don't get companies that stick with it for their base load once they grow enough that the cost overhead could easily fund a substantial ops team... Handling spikes or bulk jobs that are needed now and again, sure. As it is, our operations cost in man-hours spent, for 20+ chassis across two colos, is ~$120k/year: $10k/month, or $500 per chassis. So consider our fully loaded cost per box at ~$2200/month for quad-server chassis of the level mentioned above with reasonably full racks. Let's say $2500 again to be charitable to EC2...
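
Roughly reproducing the numbers above (these are my own figures from this comment, not anyone's price list):

  # Back-of-envelope from the figures above (not a price list).
  ops_per_year = 120000.0            # ops cost in man-hours for 20+ chassis, $/year
  chassis = 20
  ops_per_chassis = ops_per_year / 12 / chassis       # ~$500/month
  hosting_amortized = 1700.0         # 2U + share of rack + 100Mbps, $/month
  fully_loaded = hosting_amortized + ops_per_chassis  # ~$2200/month per 2U
  print 'ops/chassis: $%.0f/mo, fully loaded: $%.0f/mo' % (ops_per_chassis, fully_loaded)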

This is with operational support far beyond what Amazon provides, as it includes time from me and other members of staff who know the specifics of our applications, handle backups, handle configuration and deployment, etc.

I've so far not worked on anything where I could justify the cost of EC2 for production use for base load, and I don't think that'll change anytime soon...


If disk performance is important you can also take a look at the High IO instances, which give you 2x 1TB SSDs, 60GB of RAM and 35 ECUs across 16 virtual cores. At 24x7 for 3 years you end up with ~$656/mo per instance, plus whatever you would need for bandwidth. By the time you fill up an entire rack it still ends up being slightly more expensive than your amortized 2U cost, but you also don't need to scale it up in 2U increments.


Completely agree: building your own is cheaper, gives more control, etc. But what's more, you do NOT lose the ability to use the cloud for added reliability: it is pretty cheap to have an EC2 instance standing by that you can fail over to.

If you are very database-heavy and want to be able to replicate to the cloud in real time, it does get expensive; but if you can tolerate a little downtime while the database gets synced up and the instances spin up, that's cheap too.


We have SQL Server 2008 boxes with 128GB+ of RAM; we're able to run all of our production databases right out of memory. This would be cost-prohibitive in a virtualized environment such as AWS, Linode, etc.


Did you know that many websites operated BEFORE Amazon Web Services existed? Perhaps going back to 2008 could give us some ideas for alternate deployment methodologies...



Well, the reason seed money is so low these days is that people expect you not to spend all of it building your own cloud.


For the very early stage, perhaps. Once you're dealing with more than a handful of instances, it is extremely likely you'd save a substantial amount of money moving your base load off EC2.


Building your own what?


I wish that DNS could just switch over to another availability zone when this happens - a second datacenter with all the SQL replicated. Sure, you'd pay twice as much for EBS, but it could also double as a backup. As for the other resources in that availability zone, they'd hardly be utilized until they spring into action (EC2, etc.).


"...but it could also double as a backup."

Take care when treating a high-availability setup such as this as a backup: if you are replicating all the changes between 2 database servers and an application error (i.e. not a database outage) causes some kind of data corruption, you are hosed if the corruption replicates and you don't also have some previous "snapshot" of the data that you can roll back to.
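
One cheap way to get that roll-back point on AWS is a periodic EBS snapshot alongside the replication. A minimal boto sketch (the volume ID is a placeholder, and for a busy database you'd want to quiesce or flush before snapshotting):

  # Sketch: point-in-time snapshot of an EBS volume, to complement (not
  # replace) replication. The volume ID is a placeholder.
  import datetime
  import boto.ec2

  conn = boto.ec2.connect_to_region('us-east-1')
  stamp = datetime.datetime.utcnow().strftime('%Y-%m-%d-%H%M')
  snap = conn.create_snapshot('vol-12345678',
                              description='nightly rollback point %s' % stamp)
  print 'Started snapshot:', snap.id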


You can do this, depending on your database. I wrote a blog post going over some of the techniques you can use with the AWS stack a while back - http://bit.ly/TD13iH


I'm not familiar with HN's technical stack (other than Arc); how has it scaled as the community has grown over the years?


Barely.

Nothing has changed in the stack. Robert has discovered and eliminated a series of bottlenecks, causing performance to oscillate about tolerable. Finding bottlenecks is not trivial, because Arc has zilch in the way of profiling, but fortunately Robert is good at this sort of thing.


I think it's a case of "well optimized code which does as little as possible can handle an awful lot of users on a single beefy box".


Until it can't. Then five years later you get Spanner.


You can buy some seriously big boxes, and easily split off a lot of services onto multiple boxes. The big problem with the "single big box" strategy is being able to do upgrades -- I see hn go down frequently for 5-10 min at a time in the middle of the night, which I assume is upgrades/reboots.

The happy medium is probably splitting database (master/slave at least) and cdn (if needed) and some other services (AAA? logging?) out, and then having 2+ front end servers with load balancing for availability.


The Arc process hosting HN blows up at least once an hour (I wouldn't be surprised if there was a cronjob restarting it) and much more frequently in peak usage periods.

You wouldn't notice if it weren't for the use of closures for every form and all pagination: every time the process dies, all of them become invalid (except in the rare case that they lead to a random new place!).

There's no database; everything is in memory, loaded on demand from flat files. That wouldn't be so bad except that it's all then addressed by memory locations rather than content identifiers! There can be only one server per app, and to keep it really interesting, PG hosts all the apps on the same box; during YC application periods he regularly limits HN to keep the other apps more available.


HN doesn't have a database capable of master/slave as such... so I think this will be harder if it ever becomes popular enough. From what I know, I don't think it gets enough traffic that it's ever likely to exceed what you can fit in a single box.


Couldn't help but do a little digging..

In the first 99 comments of this page, the average comment text size is 231 bytes. Counting all comments in articles on the front page right now, there are 1678 of them, making somewhere around 388KB of comments for the past 12 hours.

So for safety's sake round that to 1MB/day and multiply by site age (5 years).

That gets us 1825MB. Projecting forward, it's difficult to imagine a time when a single recent SSD in a machine with even average RAM wouldn't be able to handle all of HN's traffic needs. Considering the recent beefy Micron P320h and its 785K IOPS, that could serve the entire comment history of Hacker News to the present day once every 2 seconds, assuming it wasn't already occupying a teensy <2GB of RAM.

Even if Arc became a burden, a decent NAS box, gigabit Ethernet, and a few front end servers would probably take the site well into the future. Assuming exponential growth, Hacker News comments would max out a 512GB SSD sometime around 2020, or 2021 assuming gzip bought a final doubling.
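
For anyone who wants to check my arithmetic, the back-of-envelope math above, spelled out:

  # The back-of-envelope math above, spelled out.
  avg_comment_bytes = 231
  comments_last_12h = 1678
  per_12h_kb = avg_comment_bytes * comments_last_12h / 1000.0  # ~388 KB
  per_day_mb = 1.0                    # rounded up for safety
  site_age_years = 5
  total_mb = per_day_mb * 365 * site_age_years                 # 1825 MB
  print '%.0f KB / 12h, ~%.0f MB of comments over %d years' % (
      per_12h_kb, total_mb, site_age_years)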


Clearly pg should release the dataset and institute an annual round of HN golf, where participants compete by recreating HN and trying to get the best performance for a given (changing) deployment target (SSD vs HDD, different RAM & CPU).


HN traffic just needs to grow more slowly than computing power, which seems reasonably likely


Well, for one, the site is basically an engine for rendering out <table><tr><td><a> ..., without much in the way of complex and frequent client-side requests.

(this isn't to downplay the challenges faced by scaling a site with the amount of traffic HN gets)


never fix a working system right?


Backwards are the sites with unplanned downtime.


The N. Virginia datacenter has been historically unreliable. I moved my personal projects to the West Coast (Oregon and N. California) and I have seen no significant issues in the past year.

N. Virginia is both cheaper and closer to the center of mass of the developed world. I'm surprised Amazon hasn't managed to make it more reliable.


One thing we discovered this morning: it appears the AWS console itself is hosted in N Virginia.

This means that if you were trying to make changes to your EC2 instances in the West using the GUI, you couldn't, even though the instances themselves were unaffected.


shouldn't amazon themselves have architected their own app to be able to move around?

I get tired of the snipes from people saying "well, you're doing it wrong", as if this is trivial stuff. But if Amazon themselves aren't even making their AWS console redundant between locations, how easy/straightforward is it for anyone else?

To what extent is this just "the cobbler's kids have no shoes?"


If it's systematically difficult to do it correctly, then the system is wrong.


. . . or the problem is inherently complex.


> . . . or the problem is inherently complex.

You're close. Put another way, "inherent complexity is the problem."

What I mean by that is, the more your system is coupled, the more it is brittle.

Frankly, this is AWS's issue. It is too coupled: RDS relies on EBS, the console relies on both, etc. Any connection between two systems is a POF and must be architected to let those systems operate w/o that connection. This is why SMTP works the way it does. Real time service delivery isn't the problem, but counting on it is.

Uncouple all the things!


Depends. Generic interfaces and non-reliance have costs too. In general I agree that things should be decoupled, but it's not always easy or practical.


Surely true, but that's the purpose of a system in the first place: to manage complexity and make it predictable. You could argue that we have such a system in place, given how well the Internet works overall. The fact that this system has problems goes against what I believe is fully evident proof that such a system can, in fact, work even better.

We're not talking about a leap in order of magnitude of complexity here—just simple management of common human behavioral tendencies in order to promote more reliability. "The problem is inherently complex" is always true and will always be true, but it's no excuse for not designing a system to gracefully handle that complexity.


The internet works because it provides very weak consistency guarantees compared to what businesses might require out of an EC2 management console. (IMO.)


That's what Twilio + Heroku are for: abstract up another layer. There's even a site where you just give it a GitHub location and it does the rest.


Well the Heroku abstraction was leaking like a sieve today.


Hardly


Their colo space is the same space shared by AOL and a few other big-name tech companies. It's right next to the Greenway, just before you reach IAD going northeast. That colo facility seems pretty unreliable in the scheme of things; Verizon and Amazon both took major downtime this summer when a pretty hefty storm rolled through VA [1], but AOL's dedicated datacenters within the same 10-mile radius experienced no downtime whatsoever.

Edited: [1] http://www.datacenterknowledge.com/archives/2012/06/30/amazo...


To be fair, the entire region was decimated by that storm. I didn't have power for 5 days. Much of the area was out. There was a ton of physical damage. That's not excusing them, they should do better, but that storm was like nothing I've experienced living in the area for 20 years.


Realistically, it's at least in part because everyone defaults to the East region. So it's the most crowded and demanding on the system.


Yep, according to the most recent estimate I saw[1], us-east was more than twice the size of all other regions combined.

[1] http://huanliu.wordpress.com/2012/03/13/amazon-data-center-s...


It's not just because it's crowded. Everyone I know who's worked in that DC hated it. Aside from that, storms regularly knock out the grid in NoVa.


Yeah, it's got to be much larger than the other regions, so it makes sense that we see more errors. Since error_rate = machines * error_rate_per_machine.


The whole region is down, you just calculated the chance of at least one machine having an error.


No, I calculated the error rate for the region. If us-east-1 has 5 times the machines (or availability zones, or routers, or EBS backplanes, or other thing-that-can-fail) as us-west-1, we would expect to see us-east-1 have each type of error occur about 5 times as often as us-west-1.
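
The point in numbers (the 5x size ratio and the per-unit rate here are made up purely for illustration):

  # Illustration only: expected failure counts scale with the number of
  # things that can fail. The sizes and per-unit rate are made up.
  failure_rate_per_unit = 0.01      # expected failures per unit per month
  units = {'us-east-1': 500, 'us-west-1': 100}   # hypothetical relative sizes
  for region, n in units.items():
      print '%s: %d units -> ~%.1f expected failures/month' % (
          region, n, n * failure_rate_per_unit)
  # Same per-unit reliability, ~5x the visible incidents in the bigger region.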


I believe this is because North Virginia is also their historical first facility.


And largest, and busiest.


I'm surprised Amazon hasn't built another region in the east. If you're in the west you get us-west-1 and us-west-2 and can fail over and distribute between the two; why don't they have that kind of duplication in the east?


Stop thinking about regions as datacenters.

us-east-1 was 11 different datacenters last time I bothered to check.

us-west-2 by comparison is two datacenters. The reason west-1 and west-2 exist as separate regions is that they are geographically diverse enough that low-latency interconnection isn't possible (and they also have dramatically different power costs, so they bill differently).


Then how come when east goes down, it always seems to take down all the AZs in the region, never just one AZ? As long as the region fails like a single datacenter, I'll think of it like a single datacenter.


They already expanded into a DC in Chantilly, one more in Ashburn, and I believe one in Manassas. But they lean on Ashburn for everything they do, and a small problem results in a daisy-chain failure (which, because everyone uses Amazon for every service imaginable, means even the smallest problem takes down whole websites).


I don't understand why anyone's site is in only one datacenter. I thought the point of AWS was that it was distributed with fault tolerance? Why don't they distribute all the sites/apps across all their centers?


It takes development/engineering resources, and additional hardware resources to make your architecture more fault-tolerant and to maintain this fault-tolerance over long periods of time.

Weigh this against the estimated costs of your application going down occasionally. It's really only economical for the largest applications (Netflix, etc.) to build these systems.


Disagree. The only area where it really hurts the wallet is multi-AZ on your RDS, because it doubles your cost no matter what, and RDS is the toughest thing to scale horizontally. The upside is that if you scale your data layer horizontally, you don't need to use RDS anymore.

Two c1.mediums, which are very nice for webservers, are enough to host >1M pageviews a month (WordPress, not much caching) and cost around $120/mo each - effectively $97/mo if you prepay for 12 months at a time via reserved instances.


The other issue is that you can have redundant services, but when the control plane goes down - you are screwed.

Every day I have to build basic redundancy into my applications I wish that we could just go with a service provider (like Rackspace / Contegix) that offered more redundancy at the hardware level.

I know the cloud is awesome and all, but having to assume your disks will disappear, fail, go slow at random uncontrollable times is expensive to design around.

If you don't have an elastic load, then the cloud elasticity is pointless - and is ultimately an anchor around your infrastructure.


Heroku only uses one AZ, apparently. Which is completely awful for a PaaS...


They sell it as a feature.


us-west-2 is about the same cost as us-east these days, and latency is only ~10ms more than us-west-1. I'm puzzled that people aren't flocking to us-west-2. I can't remember the last time there was an outage there, either.


You can move your projects on demand?


As https://twitter.com/DEVOPS_BORAT says, At conference you can able tell cloud devops by they are always leave dinner for respond to pager.

Also, What is happen in cloud is stay in cloud because nobody can able reproduce outside of cloud.

(And many other relevant quotes.)


Possibly the most relevant:

"Source of Amazon is tell me best monitoring strategy is watch Netflix. If is up, they can able blame customer. If is down, they are fuck."


so true: "In devops is turtle all way down but at bottom is perl script. "

https://twitter.com/DEVOPS_BORAT/status/248770195580125185


To be fair, AWS downtime always makes the news because it affects a lot of major websites, but that doesn't mean an average sysadmin (or devops, whatever) would do better in terms of uptime with his own bay and his own toys.


SoftLayer, which we use, seems to be much more reliable than Amazon - at least more reliable than that particular Amazon datacenter in Virginia.


I agree. I've been with SL for 3 years and never had an outage, apart from a drive failure one time, and that was fixed within an hour.


But this is part of the problem: we have multiple web properties, and the fact that AWS issues can affect all of them at once is a huge downside. Certainly, if we ran on metal, we would have hardware fail, but failures would be likely to be better-isolated than at Amazon.


@override: You are hellbanned.


When our gear is down, we can actually get into the datacenter to fix it.

What do you do when Amazon is down other than sweat?


1. Calculate the odds that a company with the resources of Amazon will be able to provide you better overall uptime and fault tolerance than you yourself could.

2. Calculate the cost of moving to the Oregon AWS datacenter.

3. Reassure your investors that outsourcing non-core competencies is still the way to go.

4. Try to, er, control, your inner control-freak.

;)


> we can actually get into the datacenter to fix it.

But better and faster than Amazon?

I'd rather spend three hours at home saying "Shit. Well, we'll just wait for Amazon to fix that" than drop my dinner, drive to the datacenter, and spend three hours setting up a new instance and restoring from backup.


When our datacenter is down (cause of our last two outages), we can actually get into the datacenter ... and watch them fix it. Or not.


There's not much sweating to do, as it always comes up relatively quickly.


How long was Amazon AWS "degraded" today?


2 minutes if you checked the "multi-AZ" box on your RDS instances or ELBs.


For added fun, their EC2 console is down. I got this for a while:

  <html><body><b>Http/1.1 Service Unavailable</b></body> </html>
... then an empty console saying "loading" for the last 20 minutes. Then recently it upgraded to saying "Request limit exceeded." in place of the loading message (because hey, I'd refreshed the page four times over the course of 20 minutes).

On the upside, their status page shows all green lights.


They've acknowledged that at http://status.aws.amazon.com/ for a while now with a tiny "i" status icon (I can't load my instances pane):

12:07 PM PDT We are experiencing elevated error rates with the EC2 Management Console.


Amazon misrepresents reality on this status page!

They have standardized icons to represent various levels of issues (orange = perf issue, red = disruption). But they don't even use them. Instead they add [i] to the green icons to indicate perf issues (Amazon Elastic Compute Cloud - N. Virginia) and disruptions of service (Amazon Relational Database Service - N. Virginia).

Maybe this status page is controlled by marketing bozos who want to pretend the situation is not so bad.


Does Amazon explain anywhere which regions, availability zones, or whatever have single points of failure for each of their services? I guess that's what an "availability zone" is supposed to capture, but somehow it doesn't quite. It's pretty hard to build reliable apps without knowing where the single points of failure are in the underlying infrastructure.


All of our EC2 hosts appear to be functioning fine, but they can't connect to their RDS instance which renders our app useless. If you scroll down the page you'll also see that RDS instances are having connectivity issues. Not sure if it's related but for RDS users the impact is far worse.

EDIT: We are also using multi-AZ RDS, so either Amazon's claims for multi-AZ are BS, or their claim that this is only impacting a single zone is BS.


Because RDS is built on EBS, any slight issue with EBS manifests itself as a nastier issue for RDS.

Interestingly, EBS will never return an I/O error up to the attached OS, which is likely a good decision as most OSes choke on disk errors. What this means, however, is that if something gets even a little slow within EBS (let alone stuck), applications that depend on it will suffer. Most of these applications (such as databases) have connection/response timeouts for their clients, so while EBS might just be running slowly, a service like RDS will throw connection errors instead of waiting a bit longer.

You can imagine the cascading errors that might result from such a situation (instance looks dead, start failover...etc)


Our multi-zoned RDS instance was able to fail over to another zone with minimal downtime. It took about 2 minutes.


Lucky you? :)


Our multi-AZ RDS instance did not fail over correctly this time, although it has in the past.


same here, no fail-over


I had one multi-AZ instance fail over correctly; however, the security group was refusing connections from the web servers' EC2 security group. I had to manually add the private IPs of the EC2 instances. It appears the API issue is affecting security-group-to-IP lookups.
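
For the record, that manual fix can be scripted. A sketch with boto, assuming the classic RDS DB security group model; the tag filter, security group name and /32 CIDRs are placeholders:

  # Sketch: when the SG-to-SG lookup is broken, authorize the web tier's
  # private IPs on the RDS security group directly. Names are placeholders.
  import boto.ec2
  import boto.rds

  ec2 = boto.ec2.connect_to_region('us-east-1')
  rds = boto.rds.connect_to_region('us-east-1')

  reservations = ec2.get_all_instances(filters={'tag:role': 'web'})
  web_ips = [i.private_ip_address
             for r in reservations for i in r.instances
             if i.private_ip_address]

  for ip in web_ips:
      # Equivalent to adding a /32 CIDR rule in the RDS console.
      rds.authorize_dbsecurity_group('my-db-sg', cidr_ip='%s/32' % ip)
      print 'authorized', ip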


Same here, our servers are still up, but our RDS instance is down.


Per the linked dashboard, some instances in a single AZ in a single Region are having storage issues. Calling EC2 "down" is a bit dramatic, provided AMZN are being sincere with their status reports. Any system that can competently fail over to another AZ will be unaffected.


I would agree with you, but Amazon is just downright dishonest in their reports, which makes me sad, because I love Amazon. Go look at the past reports: they've never shown a red marker, only "degraded performance", even when services in multiple availability zones went down at the same time due to their power outage (so had you architected for multiple AZs you were still fucked). When they have a single AZ go down, they won't even give it a yellow marker on the status page; they'll just put a footnote on a green marker. It makes their status dashboard pretty much useless for at-a-glance checking (why even have colors if they don't mean anything?)

Read their report from the major outage earlier this year, they start out by saying "elevated error rates", when many services were in fact down, and it wasn't until hours later they finally admitted to having an issue that affected more than just one availability zone.

From Forbes: ”We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.” By 11:49 EST, it reported that, ”Power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online.” But by 12:20 EST the outage continued, “We are continuing to work to bring the instances and volumes back online. In addition, EC2 and EBS APIs are currently experiencing elevated error rates.” At 12:54 AM EST, AWS reported that “EC2 and EBS APIs are once again operating normally. We are continuing to recover impacted instances and volumes.”


It's like grade inflation. You can never give out an F (Mr. Admissions officer, are you so bad at your job that you would admit such an unqualified student?), so a Gentleman's C is handed around. In Amazon's case, it's a gentleman's B+ (green, with an info icon).

A: fine
A-: problems
B+: servers are on fire

I really like Amazon as a company, use a lot of their services, but this is dishonest.


They reserve the red mark for "The world just exploded". Orange is reserved for "The datacenter got bombed".


Amazon is downplaying the problem. It is affecting many large sites.


Many large sites which aren't properly architecting their infrastructure to deal with the commodity nature of the cloud.


http://downrightnow.com/netflix

If Netflix is down, then it's something that most companies who know how to design failover can't cope with.


The last Netflix post-mortem mentioned they had a bug in their configuration where they kept sending traffic to already-down ELB instances, which was the cause of their last outage, if I remember correctly.


AirBnB is also down.


Yeap, it's down the day I really need it!


Not necessarily, it could be some element of the Netflix architecture that due to their size and/or design trade-offs has taken longer / is harder to eliminate than it would be for others.

Other services, like Twilio, have come through several of these major problems with US-EAST generally unscathed while Netflix has had issues repeatedly.


Netflix seems to be operating just fine for me in the southwest USA.


According to a site which doesn't document what its reports are based on. Given that Netflix worked for me during that period, I'm suspicious that downrightnow might be using EBS somewhere.


I'm seeing issues across multiple AZ's. Everything is still up, but that doesn't stop me from getting paged.


Including companies like Amazon that aren't properly architecting their control plane.


They should outsource it.


too bad their own dashboard and management interface don't fail over... http://db.tt/BcuoSnPu


I'm also having API and Console timeouts, so I would disagree that it's limited to a single AZ.


The API and Console timeouts are likely due to high load from everyone logging in to see what's going on.


Heh, the cynical side of me would like to point out that this is a great way to get people to stop talking about wiping a user's kindle. :-)


This kind of thinking is poisonous. I know it's in good fun and it's fun to look for connections in things, but it is actually preposterous to think that Amazon would purposely disrupt wide swaths of highly paying customers for much of a day to bury one story about bad customer relations. My guess is there are a lot of people working very hard to try and solve this problem right now, let's not belittle their efforts because of a conspiracy, let's belittle their efforts because of bad systems design.


It was a joke. I made the same joke earlier today. No-one is seriously going to believe this.


He knows you were joking. It's a bad joke. "It's a joke" is not a magic bullet that means you can do no wrong.

Poisonous ideas spread as jokes. That is one of the ways they spread. A person thinking well about the issue wouldn't find the joke funny because it doesn't make sense. The joke relies on some poisonous, bad thinking to be understood. It has bad assumptions, and a bad way of looking at the world, built in.


Let me explain to you why it is a funny joke. It is funny because it involves Amazon undertaking massive technical measures, with huge reputational damage in order to try to kill a story which is primarily not spreading via Amazon-hosted sites anyway.

It's akin to a man with athlete's foot deciding to remedy it by discharging a shotgun into his leg.


It's easy to understand shooting a leg with a shotgun. That's a simple thing.

The Amazon thing in question is far more complicated, and far harder to understand.

Thinking they are "akin" is a mistake. It shows you're thinking about it wrong and failing to recognize how completely different they are.

One isn't going to confuse anyone or be misunderstood, the other will confuse most of the population and be misunderstood by most people.

One, if someone misunderstood, only involves one individual being an idiot. The other involves a large company being evil and thus can help feed conspiracy theories.

I'm not sure if you are aware of the difficulty of Amazon doing this. Suppose Jeff Bezos wants to do it. He can't simply order people to do it because they will refuse and leak it to the media and he'll look really bad and then he'll definitely have to make sure to try super hard for there to be no outages anytime soon.

Shooting yourself in the foot is stupid but easy. Doing this is stupid and essentially impossible. To think it's possible requires thinking that Amazon has a culture of unthinking obedience, or has an evil culture that all new hires are told about and don't leak to the media. Totally different.

Casually talking about impossible, evil conspiracies by big business, as if they are even possible, is a serious slander against those businesses, capitalism, and logic. Slandering a bunch of really good things -- especially ones that approximately 50% of US voters want to harm -- and then saying "it's just a joke, it's funny" is bad.


Casually joking about impossible, evil conspiracies by big business on the other hand is something completely different and also funny.

No one will believe it's related, and it's certainly not slander to joke about it. Also, you might want to leave the political opinions out of Hacker News... there is no 50% of the US who dislike those things; they just have different ideas about how to support them.


I think it was a pretty good joke, it's funny specifically because amazon obviously won't be doing it deliberately, no poisonous thinking required.


HNers are extremely bad at getting jokes. See http://news.ycombinator.com/item?id=4677335


Actually, there are probably a few idiots who will believe it.


I'd like to think the smiley face would make that obvious, but I guess not.


When I commented, the two other comments replying to yours were taking it seriously, so I decided to take it seriously too. Text is a pretty bad medium for hearing tone; sorry I misinterpreted yours. For some reason the "the cynical side of me" bit made me think your comment wasn't entirely in jest.

Cheers


Well if that was their goal then it has been a success as they have taken out a lot of the forums and discussion sites which were critical of them.

Except HN.


Because reddit and ycombinator are the center of the tech sociosphere?


For the less specialized groups, kind of, yeah.


Unfortunate.

I have to deal with a number of folks who will be overjoyed to read this news when their tech cartel vendor of choice forwards it this evening.

There's a huge contingent of currently endangered infrastructure folks (and vendors who feed off them) out there who throw a party every time AWS has a visible outage.


AWS sucking at availability (and especially specific parts like EBS, and then services built on top of EBS and on top of AWS) doesn't mean the correct option is to mine your own sand and dig up your own coal to run servers in your own datacenters in the basement.

Even if you're totally sold on the cloud, you can still have a requirement that things be transparent all the way down. AWS is one of the least transparent hosting options around.

If you're a customer of a regular colo, or even a managed hosting provider themselves based at a colo, it's pretty easy to dig into how the infrastructure is set up, identify areas where you need your own redundancy, etc. Essentially impossible within AWS -- there is no reason intelligible to me that ELB in multiple AZs should depend on EBS in a single AZ, but that's how they have it set up.


I'd question whether they are truly wrong to send out news like this. Basically you have to weigh the realities of the 'cloud' against the perception.

The perception many people have is that somehow the 'cloud' is a magical uptime device that will save you money in droves.

The reality is that for many companies that have mid-level traffic and aren't a start-up with a billion users, the 'old' style tech might very well be your best option.


> who throw a party every time AWS has a visible outage.

At the going rate they'll be AA members before long.


This would be a great time to post a guide to architecting systems for failover using AWS. Anyone got a great guide?


We (Twilio) have released a number of articles & presentations in this area:

http://www.slideshare.net/twilio/highavailability-infrastruc...

http://www.twilio.com/engineering/2011/04/22/why-twilio-wasn...

It's strategy as opposed to how-to but the principles apply.


And Netflix's Simian Army is awesome:

http://techblog.netflix.com/2011/07/netflix-simian-army.html

They've even released the "Chaos Monkey" open source: http://techblog.netflix.com/2012/07/chaos-monkey-released-in...


Yet netflix is currently down, so maybe there is a problem with the chimp army?


The problem is likely the same as usual: if the damn control plane is down, it doesn't matter how robust your failover architecture is, because your requests to bring up new machines go unanswered.

There's pretty much no way to architect around that one as an AWS user (apart from going fully multi-cloud, but "nobody" actually does that, at least at scale), and I'm kind of shocked that those bits of AWS are still not robust against "single AZ outages", given that they're involved in pretty much every one of these incidents and make them affect people on the entire cloud...


> apart from going fully multi-cloud, but "nobody" actually does that, at least at scale

Pirate Bay might disagree with that sentence: http://torrentfreak.com/pirate-bay-moves-to-the-cloud-become...


Apparently iCloud is multi cloud (AWS and Azure).

But regardless, it's not like all of EC2 went down, just one or two AZs. So why couldn't traffic be migrated transparently to other AZs/regions?


Netflix not down for me here (Eastern US).


I'm watching Netflix from Austin, so it's not entirely down in the US.


Sweet, thanks for sharing!


We (Netflix) have done a bunch of presentations on it which are on our slideshare page and across the internet.

After this issue is over I can give a longer answer. In short, we've just evacuated the affected zone and are mostly recovered.


Netflix recently launched here, and I've been unaffected so far.

And +1 for the slideshare page.

Their techblog is also worth following: http://techblog.netflix.com/


I'm guessing your Chaos Gorilla helped to harden your architecture against this threat.

Since you've mostly recovered, how did your system do? Are there side-cases that Chaos Gorilla didn't touch?


EDIT: I WAS WRONG. Chaos Monkey and Chaos Gorilla both exist and simulate different forms of chaos.


"Create More Failures

Currently, Netflix uses a service called "Chaos Monkey" to simulate service failure. Basically, Chaos Monkey is a service that kills other services. We run this service because we want engineering teams to be used to a constant level of failure in the cloud. Services should automatically recover without any manual intervention. We don't however, simulate what happens when an entire AZ goes down and therefore we haven't engineered our systems to automatically deal with those sorts of failures. Internally we are having discussions about doing that and people are already starting to call this service "Chaos Gorilla"."

http://techblog.netflix.com/2011_04_01_archive.html
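
The core idea fits in a few lines; a toy sketch with boto is below. This is nothing like Netflix's actual tooling, and the tag filter is a placeholder.

  # Toy "chaos monkey": pick one instance from an opted-in group at random
  # and terminate it, so recovery paths get exercised constantly. Not the
  # Netflix tool; the tag filter is a placeholder. Don't point this at
  # production without understanding it.
  import random
  import boto.ec2

  conn = boto.ec2.connect_to_region('us-east-1')
  reservations = conn.get_all_instances(
      filters={'tag:chaos': 'opt-in', 'instance-state-name': 'running'})
  instances = [i for r in reservations for i in r.instances]

  if instances:
      victim = random.choice(instances)
      print 'Terminating', victim.id
      conn.terminate_instances(instance_ids=[victim.id])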


My apologies! I was wrong.


Chaos Monkey takes down instances and such. Chaos Gorilla takes down entire AZs. :)


Monkey takes down single hosts; Gorilla simulates the loss of an entire AZ.


I'd love to hear that longer answer, fwiw!


I wish there was. Architecting a system for failover requires a mindset change, a culture change within your org, and the right technology that lets you build for it while not slowing yourself down too much. The last point being the hardest.

I would argue that none of the common full stack frameworks that startups use are fault tolerant enough for AWS. Most of them have multiple failure points that can quickly bring down entire apps.


PagerDuty's CTO gave a talk that mentions it: http://blog.pagerduty.com/2012/10/ensuring-the-call-goes-out...


+1


The reference architectures and white papers here are helpful: http://aws.amazon.com/architecture/


Our app is down because it's hosted on Heroku, and it's frustrating because it seems like N. Virginia is the least reliable Amazon datacenter. Every year it seems to go down a couple of times for at least a couple of hours.

Heroku should offer a choice between N Virginia and Oregon hosting (I think they're almost comparable in price nowadays). That way people who want more uptime/reliability can choose Oregon. Sure it will be further from Europe (but then it will be closer to Asia) and people can make that choice on their own.

But basing an entire hosting service on N Virginia doesn't make sense anymore, considering the history of major downtime in that region.


Or better yet, Heroku should offer an add-on "instant failover" service that, for a premium of course, offers a package for multi-site (or, knowing they're 100% AWS, multi-datacenter) deployment with all of the best practices, etc. Seems like a logical next step for them (or a competitor) given the recent spate of outages.


Can't update my list fast enough, but other major services experiencing problems are Netflix and Pinterest. Lots of other (smaller) sites are starting to fail too.

http://www.forbes.com/sites/kellyclay/2012/10/22/amazon-aws-...


Why the fuck do the two most critical services (ELB and Console) have dependencies on their historically most unreliable pile of shit (EBS)?


Seriously THIS has to be addressed.

I can tolerate EC2/EBS going down, but why on earth do ELB and the Console always go down at the same time?


One of the most frustrating issues here is that we have to rely on Amazon's status page for information. It's a complex page, divided by continent instead of region, which means at least 5 or 6 clicks to figure out progress. They should learn from these incidents how people want to be informed - to date, they haven't. Also, they have a Twitter account, which would be the perfect fit for keeping everyone up-to-date with what's going on, and for at least showing a human side to these issues. Alas, they're not updating that either.

I've been working with AWS since early 2006 when they first launched - I was lucky to be granted a VIP invite to try out EC2 before everyone else, and ended up launching the first public app on EC2. This might be the first time when frustration has overcome my love for these guys.


It's degraded performance for some EBS volumes in a single availability zone - isn't this title a bit sensationalistic?


"Degraded performance," is a fairly off-handed way of putting what they're experiencing. First RDS connectivity went down the tube, and then EBS followed, finally the EC2 console is failing to operate properly (for me, in US-EAST region at least).

Of course, as soon as I read the report that the issue was confined to one AZ, I looked to move my server over to another AZ. Oh yes, two were full and refused new instances, and then, surprisingly, new requests for the other AZs were never received or acted on - and now the console is failing. It's a bit more than just "slow EBS", if that's what you were thinking.

- edit: said ec2 twice in the second sentence, corrected to say ebs.


No. Amazon just understates the issues. Reddit, Heroku etc. all have problems.


Even if they do understate the issue, it's limited to US-EAST-1, and it's an EBS issue. Saying that EC2 is "down" because of this is totally off the mark - I've got dozens of EBS volumes in EAST-1 that are unaffected, plus all of the other zones are operating normally...


EC2 instances are affected by the EBS issues. But you're right, it's not correct that EC2 is "down".


Not all instances are EBS backed.


EC2 is not down. My EC2 sites are all functioning.


All my instances across 3 zones are functioning normally.


From what I'm seeing, if your root disk is on EBS and your SSH keys are there, you cannot SSH into those hosts right now.

Also, the availability zones are disparate in terms of what they can support. A great number of my instances are in 1d because of unavailability in others.


Note that the region-1[a b c d & such] designations are randomized per account; my us-east-1d won't (necessarily) be the same as yours or anyone else's.


Ah, I didn't know that. Thanks!


If you're relying on individual servers to have drives which never fail you're doing reliability wrong.


Does anyone know if this really is just one AZ? Seems like an awful lot of larger sites are down. I'd expect at least some of them to be multi-AZ by now.


I'm multi-AZ and am being impacted. I also use multi-AZ RDS and it's being impacted. So I'm calling BS on the single-AZ impact.


One of our Multi-AZ RDS instances failed over successfully. It was down for about 5-10 minutes.


Looks like multi-AZ to us too


Multiple AZs here.


This is affecting more than a single Availability Zone, but probably for reasons that have been seen before. One reason might be that an EBS failure in one AZ triggers a spike in EBS activity in other AZs which overwhelms EBS. (I believe this is what happened in April 2011).

Does anybody have any experience with migrating to Oregon or N. California in terms of speed and latency?


My dashboard says "Degraded EBS performance in a single Availability Zone". It then lists each of the five zones as "Availability zone is operating normally." http://cl.ly/image/202F3B0I371g


I was seeing that too, but it looks like Amazon has now updated the availability zone status. When I run ec2-describe-availability-zones from the command line, it's telling me that us-east-1b is impaired. (Availability zone mapping is different for each account, so my 1b may be some other availability zone for you.)
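
The boto equivalent, if you'd rather not shell out to the API tools (same caveat about per-account zone mapping):

  # Same check as ec2-describe-availability-zones, via boto. Remember the
  # zone letters are shuffled per account.
  import boto.ec2

  conn = boto.ec2.connect_to_region('us-east-1')
  for zone in conn.get_all_zones():
      print zone.name, zone.state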


CloudApp seems to have trouble too


I'm sorry, am I in a time machine? Didn't this exact thing happen last year, and didn't these exact people explain exactly how they were going to make sure it never happens again?

What the hell is happening here?


Ironically, at the first Hacking Health event our server was running on Heroku... and Heroku went down. Three months later, we have the second Hacking Health in Toronto and all of AWS is going down.


ELB is also fucked. I've noticed that nobody mentions it. Some of our load balancers are completely unreachable.

Just one multi-AZ RDS instance reported an automatic failover. However, the 200+ alerts triggered by the failover's internal DNS changes pointing to the new master show that things aren't as tidy as the RDS DB Events log describes.

Some instances reported high disk I/O (EBS backed) via the New Relic agent (the console still has some issues).

So far, this is what I see from my side.


I reacted quickly enough to get a new instance spun up and added to my site's load balancer, but the load balancer is failing to recognize the new instance and send traffic to it... yay. The console is totally unresponsive on the load balancer's "instances" tab. If I point my browser at the EC2 public DNS of the new instance, it seems to be running just fine. So much for the load balancer being helpful.


Now is a good time to stroke your dev pair :)

http://news.ycombinator.com/item?id=4680178


Maybe I'm being daft, but why does this sh*t never seem to affect www.amazon.com?


Amazon doesn't run their site off the same set of EC2 servers and systems that everyone else does. They just used their experience to build it.


Amazon runs all of the Amazon.com web servers on EC2 hosts (and has since 2010: https://www.youtube.com/watch?feature=player_detailpage&...). They've just made sure that they have enough hosts in each of the other AZs to withstand an outage.


That would be the ultimate testament - maybe the only stunt they could pull to PR themselves out of this mess: if www.amazon.com ran on the same set of EC2 servers and systems as us muggles.


Update from http://status.aws.amazon.com/:

2:20 PM PDT We've now restored performance for about half of the volumes that experienced issues. Instances that were attached to these recovered volumes are recovering. We're continuing to work on restoring availability and performance for the volumes that are still degraded.

We also want to add some detail around what customers using ELB may have experienced. Customers with ELBs running in only the affected Availability Zone may be experiencing elevated error rates and customers may not be able to create new ELBs in the affected Availability Zone. For customers with multi-AZ ELBs, traffic was shifted away from the affected Availability Zone early in this event and they should not be seeing impact at this time.


This is infuriating. Our instances aren't ELB-backed, so we have an entire AZ sitting idle while the rest of our instances are overwhelmed with traffic. Why did they make this decision for us?

I'm normally an Amazon apologist when outages happen, but this is absolutely ridiculous.


This reminds me of an ad campaign we sometimes run on reddit:

http://www.reddit.com/comments/hg9oa/your_platform_is_on_aws...

... which you'll be able to click as soon as reddit is back up :)


Countly mobile analytics platform servers are also affected by this issue (http://count.ly). Waiting and waiting. :)

@caseysoftware thanks for the links, we now have something to read and implement this week.


You have a virtualized platform, on top of which are many pieces like load-balancing, EBS, RDS, the control-plane itself, etc.

You have burstable network connections which by their nature, will have hard limits (you can't burst above 10Gbps on a 10Gbps pipe, for example; even assuming the host machine is connected to a 10Gbps port).

Burstable (meaning, quite frankly, oversubscribed) disk and CPU resources.

And if any piece fails, you may well have downtime...

It always surprises me that people feel layering complexity upon complexity will result in greater reliability.


Hosted on Heroku with a hosted Heroku Postgres dev instance, I observed a drop in DB response time:

- from a consistent average of 10ms over the last week

- to a new consistent average of ~2.5ms

between 17:31 and 17:35 UTC. AWS started reporting the current issue at 17:38 UTC. My app then experienced some intermittent issues (reported by New Relic, which pings it every 30s). I don't know if it's related; could it be some sort of hardware upgrade that went wrong?

I did push a change affecting my most-used queries, but that was an hour and a half earlier, at 15:56 UTC, so it's probably unrelated.


I thought at least Reddit had learned the "don't use EBS" rule from past outages; I was bitten by the April 2011 outage and switched everything over to instance storage shortly thereafter.

For most applications, I think architecting EBS out should be straightforward: instance storage doesn't put a huge single point of failure in the middle of your app if you're doing replication, failover, and backups properly.

And EBS, on top of which Amazon has built a lot of its other services, seems to be the biggest component of the recent AWS failures (a quick way to audit your own exposure is sketched below).
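
For that audit, a minimal boto sketch that flags EBS-rooted instances (assumes boto 2.x and configured credentials):

    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')

    # List every instance whose root volume lives on EBS, i.e. the ones
    # that share fate with an EBS outage like this one.
    for reservation in conn.get_all_instances():
        for instance in reservation.instances:
            if instance.root_device_type == 'ebs':
                print("{0} {1} {2}".format(instance.id, instance.placement, instance.state))

Note this only catches EBS root volumes; additional attached EBS volumes show up under instance.block_device_mapping instead.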


The Console itself is down: <html><body><b>Http/1.1 Service Unavailable</b></body> </html>

It's high time AWS did something about this.


I noticed that Rackspace's site was down for a bit too; I presume they're getting hammered on their www as people consider an infrastructure switch?


I jokingly said the internet was broken this morning because I'd been having weird connectivity issues all morning. Funny that I was sort of right.


Dropbox runs on S3; are they unaffected? And Heroku's website seems okay. Only Reddit is down. Is that because of this?


I am frankly surprised so many still use EC2 considering how frequently it breaks. It's not cheap, so the only reasons to use it would be reliability or scale, right? Why not just get lots of boxes at Hetzner and OVH (40 EUR a month for 32GB of RAM and 4x2x3.4GHz cores) and scale up / add redundancy that way?


I suppose this explains why Airbnb is down now, as well: http://aws.amazon.com/solutions/case-studies/airbnb/

I was just in the middle of booking a stay in a Palo Alto startup embassy for this week, too!


They've been down for several hours actually. I tried to get to their site this morning to check out the jedburg talk, and got the "We're on the case" message.


Seems like they should have a backup.


Building infrastructure on top of Amazon with global replication across multiple availability zones can sidestep such failures and keep operations uninterrupted for customers; www.pubnub.com is unaffected by AWS being down.


Not wanting to shill for Google, but we've been very happy with App Engine's reliability, particularly since moving to the High Replication (HR) datastore. Have other App Engine users had any significant downtime apart from the master/slave (M/S) datastore issues?


What site are you running on App Engine?


I made this to express my outrage, and I hope it helps others easily express theirs too! http://robbiet480.github.com/StopLyingToUsAmazon/


Isn't it about time they supported a multi-regional failover system?


Coursera is also down.


FWIW here's Google App Engine status page: http://code.google.com/status/appengine


Two of our four EC2 servers are down. Both have RDS connections and EBS storage. Looks like the RDS connection has something to do with it...


RDS is built on EBS, so if EBS has problems you can bet that RDS will feel the effects.


Here comes another round of Hacker News posts about how you should never host your app in just one availability zone...


How does Twilio manage to stay up when AWS goes down, and how much of that can the average developer reasonably do?


See Keith's (of Twilio) response here: http://news.ycombinator.com/item?id=4684766


Nice, thanks!


My site is currently down because of this.


This is why you should think about a multi-region architecture to prevent these types of service interruptions.


I read in my book that Amazon is the best cloud service provider, and seeing this, that claim doesn't really hold up. I am still confused about which cloud service I should use. Amazon is the only one that Indians currently rely on, since latency is a bit lower to Amazon's servers compared to other providers.


Adding insult to injury: when trying to view instances in the console I now only see "Rate limit exceeded".


Disappointing, but I guess not totally unexpected. Our prod site was down for a bit with many others.


These semi-regular outages are why I am going to use the Google Cloud Platform for my next project.


A ton of other stuff is down too: www.reddit.com, Netflix, www.freewebsite.com, and a bunch more.


As of now, I don't have a problem with my site: Elastic Beanstalk with EC2 small instances.


Isn't it worrying that there are so many services/sites that are so dependent on Amazon EC2?


Thanks. I have a friend whose site is down because of this, and this helps shed some light on it.


EIP remapping is all messed up as well: API errors, and the console for it doesn't work either.


I just signed up for EC2 recently (three weeks ago) and my small site is also down. :-(


Perfectly timed with our product launch: now our site is down too (www.zenwheels.com).


My instances just came back up.


The tech industry is manufacturing its own version of a "too big to fail" entity.


The emperor has no clothes!


I wonder whether this will turn out to be an attack or a hardware/software failure...


Does this affect just standard EBS, or EBS with provisioned IOPS as well?


At least HN isn't down! :D


When is Google Compute Engine supposed to be open to everyone?


There's Azure.


Somewhere Steve Wozniak is saying "I told you so"


Some of our sites are affected. Good times!


Time to switch to Google's Compute Engine.


I fixed EC2 using this one weird trick..


Woot, back online: www.zenwheels.com


Sigh, my heroku site is down.


Single point of failure?


It's back up now.


Just checked my site and it is up now!



back up :)


As far as I can see Reddit is still down.


This is the kind of stuff I think about when I hear people talk about the cloud and promise that downtime is a thing of the past.

Cloud hosting is not drastically different from any other type of service and is still vulnerable to the same problems.


If you go through the pains of architecting your system to span multiple AZs, or you avoid using EBS, then you probably dodge most of the EC2 outages. (Remains to be seen if that is the case here.)

That said, I don't think most people believe using the cloud means downtime is a thing of the past. I think the more attractive proposition is that when hardware breaks, or a meteor hits the datacenter, etc., it is their problem, not yours. You still have to deal with software-level operations, but hardware-level operations are effectively outsourced. The question is whether you think you can do a better job than Amazon; some companies think they can, most startups know they can't.


Yeah. Even with this, they still do better than I would. My record: a misconfigured air-conditioning alarm led to servers being baked at high temperature over a weekend, and to much wailing and gnashing of teeth. I now know to be really careful about setting up air-conditioning units properly, but what other lessons am I still waiting to learn? The main lesson I took from this is that I should stick to what I am good at: cutting code & chewing data. :-)


Yeah this is another important point. Part of the cost of AWS is also a bit of an insurance policy against you physically breaking your servers :)


I always understood the cloud to mean a black box of sorts that automatically handles failover, among other things. The cloud being just a fuzzy representation of the infrastructure.

S3 probably fits the description of a cloud service. You send your data, and the service worries about making it redundant without your intervention. If data in NE USA is unavailable, the service will automatically serve you the data from somewhere else. You don't need to know how it works.

EC2 and some of these other building blocks, however, I would not consider to be cloud services. Merely tools for building out your own cloud services to other customers who then shouldn't have to think about failover and other such concerns.

If you know you are using a server that is physically located in a certain geographic location, it need not be represented by a cloud. It is a distinct point on the network.


Workplace productivity is going to skyrocket today


WAT!?


Worst infrastructure ever.


[deleted]


Did you try their status page? https://status.heroku.com/


There have been multiple tweets mirroring the status page: http://twitter.com/herokustatus


They've given both status updates and tweets. So much for actually reading first.


This is why I trust Google's data centers http://www.google.com/about/datacenters/gallery/#/


HN really needs to hold a "dumbest comments of the year" award.

This has to be a strong contender.


HN really needs to hold a "snarkiest response of the year" award.


Snap!


pg, or someone else from HN... Could you please edit this title for accuracy? Maybe, "Poorly designed sites taken out because of problems in one Amazon availability zone."


That's editorializing.

For many sites, a single server in a single zone (e.g., a non-redundant server, an instance, a slice, a VM, whatever) is the right decision for ROI.

For many sites, the money spent on redundancy could be better spent on, say, Google AdWords, until they're big enough that a couple of hours of downtime has irreplaceable costs higher than the added costs of redundancy (dev, hosting, admin) for a year.


Yes, it is editorializing. My point, though I guess it was too subtle, is that the current link text is itself very much an editorial comment, especially since the content at that link has nothing to do with the sites mentioned in the title.


This issue is affecting both an EC2 availability zone and Amazon's RDS servers, which are technically multi-zone. A ton of well-architected apps and sites have been affected. Unconstructive...


Tons has been written about Netflix's architecture, and they're down as well. I believe Amazon isn't being totally transparent.


Personally, I'm starting to believe Amazon is quickly becoming public enemy #2.


And who is #1?


Completely depends on the day, for me. IMHO, I'd sum it all up as "digital rights":

- The right not to have your traffic limited and controlled by your ISP

- The right to purchase a DRM-free "file" and use it on your phone, computer, etc., without getting burned by some company

- The ability to install whatever you want on your $600+ device

* Edited for formatting and an additional thought.


GOOG?



