Heroku learns the hard way from Amazon EC2 outage (techtarget.com)
43 points by blasdel on Jan 12, 2010 | 22 comments



> A 15-person start up like Heroku could never support its thousands of users for a measly few million in venture capital with traditional hosting

I wonder if this is true. If they work with 22 virtualized instances, maybe 10 good servers could provide more or less the same performance. Setting aside for a moment whether EC2 is the way to go, and all of its benefits, I can't see how a company with a few million can't afford to run 10 big Linux boxes instead of using EC2.


I don't think it's true.

The article also mentioned $20,000 per month in hosting fees. The average T1 costs around $250/month, less if you're savvy. You need a physical location, because colo rates are still ridiculous IMO, so ... say another $2,000 / month, which is plenty for a small office in a warehouse district somewhere.

Even after lots of other goodies, you'd still have, let's say, $10,000 a month to spend on ... what? More machines? A bigger generator? Huge parties?
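
To make that back-of-envelope explicit, here it is in Python (the $20,000/month is from the article; the T1, office, and "other goodies" figures are my own guesses, not Heroku's actual costs):

    # Rough monthly budget sketch; every line item except the $20k is assumed.
    cloud_bill = 20_000   # reported monthly hosting spend (from the article)
    t1_line    = 250      # single T1
    office     = 2_000    # small office in a warehouse district
    other_gear = 7_750    # servers, power, UPS, spares -- the "other goodies"

    left_over = cloud_bill - (t1_line + office + other_gear)
    print(f"left over each month: ${left_over:,}")   # -> $10,000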

I don't think that "the cloud" is appropriate for certain applications -- and theirs would be one of them -- where you have large amounts of persistent data and large, sustained, ongoing bandwidth requirements.

Realize that, for what Amazon charges currently, you can buy an equivalent 1TB disk drive every single month. (Their monthly charges for storage alone, not including bandwidth, work out to a little more than the cost-per-GB for a good quality terabyte disk at current prices.)
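
Spelled out with illustrative prices (both numbers below are my assumptions, not figures from the article or Amazon's price list):

    # Cloud block storage rental vs. buying the disk (illustrative prices).
    cloud_per_gb_month = 0.10    # assumed $/GB-month for provisioned storage
    disk_price         = 90.00   # assumed street price of a 1TB SATA drive
    disk_capacity_gb   = 1000

    monthly_rental_1tb = cloud_per_gb_month * disk_capacity_gb
    print(f"1TB rented in the cloud: ${monthly_rental_1tb:.0f}/month")
    print(f"1TB drive bought outright: ${disk_price:.0f}, once")
    # the monthly rental alone is roughly the price of the drive, every month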


Another commenter addressed the point about 1TB drives. I'd like to help on the bandwidth side.

A T1 is simply not adequate for a web company, for two reasons: 1) it is a high-impact single point of failure (any issue there results in a 100% outage); 2) it is limited to 1.5Mbps. I guarantee you Heroku is beyond 1.5Mbps at peak.

To make this a fair comparison, you need to look at burstable bandwidth. Unless you want to manage your own redundant BGP routers and multiple carrier lines, you'll want blended bandwidth provided by your datacenter or a specialized company (InterNAP, etc). This bandwidth will run you $50-100/Mbps/mo (let's say $75) and is billed at 95th percentile of usage. A year and a half ago when my company's EC2 network transfer costs were ~$8,000/mo (with a total EC2 bill of ~$30,000/mo), I worked out a bandwidth need of ~60Mbps (at 95th%) - this would represent a bandwidth cost of $4500/mo. Let's assume Heroku's EC2 cost breakdown is similar, and that their needs are 2/3 of what ours were then - we're left with $3,000/mo for bandwidth.
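
The arithmetic from that last step in Python, for anyone who wants to plug in their own numbers (the rate and the 60Mbps figure are mine from above; the 2/3 scaling for Heroku is pure assumption):

    # 95th-percentile bandwidth cost estimate.
    rate_per_mbps  = 75        # $/Mbps/month, middle of the $50-100 range
    need_mbps      = 60        # our measured 95th-percentile usage back then
    heroku_scaling = 2 / 3     # assumed: Heroku needs ~2/3 of what we did

    our_cost    = rate_per_mbps * need_mbps          # $4,500/mo
    heroku_cost = our_cost * heroku_scaling          # $3,000/mo
    print(f"our bandwidth:          ${our_cost:,.0f}/mo")
    print(f"Heroku-sized estimate:  ${heroku_cost:,.0f}/mo")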

Yes, you're still paying a premium, but the cost savings of colo hosting are quickly offset by the additional manpower required to maintain physical servers. Whatever difference still remains is easily justifiable given the amount of flexibility the cloud affords you. This is exponentially more important for a startup where agility can make or break you.


If you're paying $50-$100/Mbit at the 60Mbps level, you're paying list price (aka getting ripped off). You should be paying $40 at most, and if you are any good at negotiating and making vendors compete, you can get them down to $15 or so. They're paying in the <$5 range, so they're still making some margin at $15; it's just a matter of how much they want your business.


Your math is flawed.

Realize that, for what Amazon charges currently, you can buy an equivalent 1TB disk drive every single month.

You forget that disks fail. You forget that you always have to buy them in pairs for availability. You forget that someone has to be there (physically) to swap them. You forget that storage arrays that can truly scale seamlessly (akin to the experience EBS gives you) cost a fortune, not to mention the accompanying fibre channel infrastructure. You forget that maintaining a large-scale Xen deployment is non-trivial (hint: there are companies making a living off that alone!).

In short, you forget many things that bite large chunks out of your imaginary generator/party budget.


You're right, my math was flawed, though possibly not in the way you were implying, because:

I added a T1 ($250/month) to a building rental ($2000/month), and then said that they'd have $10,000/month left over.

Why would I do such a thing? Why, so that:

They could upgrade to redundant DS3s (or better!)

They could buy extra drives!

They could pay someone to live within a few miles and carry a cell phone that accepts text messages!

...etc.

Doubly funny, to me, is that "disk failure" is being mentioned in the comments on an article about how all of their disks failed all at once. So, no, I certainly did not forget about disk failure. Indeed it's one of the reasons that I'd rather not host certain things as EC2 instances.

As far as scalability: Well, how far would 67TB carry you? Because you could deploy 67TB every single month and still have money left over. [1]
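
For the curious, the arithmetic behind that claim, using the pod figures from the post in [1] (roughly 67TB for about $7,867; treat the price as approximate):

    # Raw capacity you could add with the leftover budget from my earlier comment.
    monthly_budget  = 10_000   # leftover from the earlier back-of-envelope
    pod_price       = 7_867    # approximate cost of one storage pod, per [1]
    pod_capacity_tb = 67

    pods_per_month = monthly_budget // pod_price     # 1 pod/month
    print(f"{12 * pods_per_month * pod_capacity_tb} TB of raw capacity per year,")
    print(f"with ${monthly_budget - pods_per_month * pod_price:,} left over each month")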

To get to the meat of my argument: I think that what's both great and horrible about EC2-like services is that they allow programmers to deploy Really Big Things without having to employ or deal with system administrators.

As a result, it's a solution that's being misapplied in some cases.

[1]: http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-h... -- it even has pretty pictures


Well, I've been doing the infrastructure thing for a living for a while and it's always half amusing, half worrying to see comments like this.

So, no, I certainly did not forget about disk failure. Indeed it's one of the reasons that I'd rather not host certain things as EC2 instances.

I was mentioning disk failure as part of ongoing maintenance, which is a different thing from a once-in-a-year blackout that you (as the reseller of the service) don't have to debug, recover from, or actively protect against.

As far as scalability: Well, how far would 67TB carry you?

See, that's where the amusement starts. For a service like Heroku, it would probably not carry very far. Their service is not about storage. The Backblaze pod gives you 45 spindles for cheap. 45 spindles translate to roughly 4,500 IOPS. That sounds like a significant rate. It's also a rate that a few database-intensive customers can exceed with ease.

Now you're suddenly in an entirely different game called "Storage Engineering". It's about things like balancing IO hotspots, redundant data paths, and many other pesky little details that were not exactly your core business in the first place.

You sure can scale out with Backblaze pods, but it's rather unlikely that you can do it cheaper than what Amazon has on offer (economies of scale, y'know), because scaling IOPS is quite a bit harder than scaling capacity.
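
To make the spindle arithmetic concrete (the per-disk and per-customer IOPS figures below are rules of thumb I'm assuming, not measurements):

    # Why IOPS, not capacity, is the limiting factor for a pod of SATA disks.
    spindles         = 45      # drives in one Backblaze-style pod
    iops_per_spindle = 100     # ballpark for a 7200rpm SATA drive
    pod_iops         = spindles * iops_per_spindle   # ~4,500 IOPS total

    busy_db_iops = 2_000       # assumed load from one database-heavy customer
    print(f"one pod: {pod_iops} IOPS, i.e. room for about "
          f"{pod_iops // busy_db_iops} busy database customers")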

Here's a quiz for you: why do you think none of the cloud providers other than Amazon has an offer anywhere near the flexibility/price of EBS?


I'm curious, how do the Intel X25-E's change the IOPS scaling challenge?


> The average T1 costs around $250/month

What's the relevance of this T1? How many of the 45k apps do you think you can serve concurrently over 1.544Mbps?

I don't have as much bandwidth as most people I know, but I can't imagine having any issues saturating a T1 on my own.
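
For scale, here's what that works out to per application (the even split is purely illustrative; real traffic is nothing like uniform):

    # One T1 divided across Heroku's claimed application count.
    t1_bps = 1_544_000     # T1 line rate, bits per second
    apps   = 45_000        # running applications claimed on Heroku's front page

    per_app_bps = t1_bps / apps
    print(f"~{per_app_bps:.0f} bits/sec per app")    # ~34 bits/sec
    # at ~34 bits/sec, even a 10KB response would take tens of minutes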


Something seems to be off here. Their frontpage claims 45k running applications. There's no way to do that with only 22 instances. One of these figures must be wrong.


This article contains quite a bit of inaccurate information.

Despite the size of 22 double-XL instances, they were a small portion of our overall footprint in EC2; it takes well over that kind of capacity to run the platform.

Those instance types all happened to be in one availability zone for a variety of reasons. Our platform overall does not live in a single zone.

Losing machines is not a problem for us (we cycle them constantly, in fact). Normally losing even that many machines would not even be noticed by our customers; this was an unusual case in which several factors cascaded into a larger problem.

To be clear, this downtime (45 minutes or so, with full normal state by 90 minutes) was, unfortunately, our fault - not Amazon's. EC2 instances vaporizing is an expected part of using the service.

We've made a couple of operational changes that will prevent these issues in the future, and we sincerely apologize to any customers who were affected.


You're right, that's not all they have. 22 is the number of m2.2xlarge nodes they run. There's no mention of how many other nodes they have running.

That being said, given how many toy, no-traffic apps are likely running (for free!) on the platform, I think we can safely assume there's a very high degree of multi-tenancy.


I think that most of the 45K instances are small experiments that only get a few HTTP requests per day. I would guess that in this case an instance that does not get a request for a while gets 'swapped out.'

I have no firm data, but a customer last year put a very low-volume web app on Heroku, and I think this is exactly what we saw: the first request after a quiescent period took many seconds, then subsequent requests were quick. If I am right about this, customers with high-volume web apps would not notice it.


m2.2xlarge are High-Memory Double Extra Large instances.

    34.2 GB of memory
    13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each)
    850 GB of instance storage
    64-bit platform
    I/O Performance: High
Which means each application has 16MB of RAM. Does chroot (or whatever jail mechanism they use) allow shared memory for standard libraries?
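
That figure falls out of spreading the 22 instances' memory across the claimed 45k apps; my reconstruction below (it assumes every app lives on those 22 instances, which Heroku's reply in this thread says is not the case):

    # Reconstructing the ~16MB-per-app figure.
    instances   = 22
    ram_gb_each = 34.2     # memory per m2.2xlarge
    apps        = 45_000   # running applications claimed on the front page

    mb_per_app = instances * ram_gb_each * 1000 / apps
    print(f"~{mb_per_app:.1f} MB of RAM per app")    # ~16.7 MB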


I'm truly shocked that there was not one person in their crew with the capacity to realize that the very existence of an "availability zone" indicates that you should spread your resources across several of them. How many other "lessons" are waiting to pounce on them? They really need to cultivate their sense of paranoia if they plan to deliver consistent value to their clients.


You run a hosted application environment. You experience an outage that affects all of your customers. Do you a) admit that you encountered a problem that sysadmins have known about since the dawn of the Internet, or b) spin it as a novel web 2.0 problem ("unknown unknown", seriously?), solvable only by your crack team of cloud computing experts?

For those of us that still host our own servers, it's a little bit frustrating to see this spun as something new. It's not. If you design services that require high uptime, you incorporate the notion of losing a datacenter, which is effectively what happened here.


My guess is the reason Heroku hasn't done it yet is latency between availability zones. Ping times can be 6x higher between availability zones than within one. http://orensol.com/2009/05/24/network-latency-inside-and-acr...

More and more cloud services are running on AWS deliberately so there is very low latency. I can host a SOLR index on one service, a MongoDB on another service, and still get reasonable performance.


Yeah, it's expected that you're going to get microsecond pings within the same building and millisecond pings to buildings across the country. I guess I can admit that they may have had their reasons, but they can't be good enough to justify this lapse. If it's as inexpensive to run this setup as the article says, they should be able to afford two completely separate stacks of servers, one in each availability zone. It's not going to be easy to figure out how to sync the data across each one, but the fact remains that you have to exist at n+1 at least if you're going to deliver reliability.


Agreed, that's a whole lot of eggs in one basket. I'd be interested to hear their reason for not spreading across multiple availability zones. My first thought was that m2.2xlarge nodes, since they're relatively new, might still be restricted to a single AZ; on further investigation, however, this does not appear to be the case.


"A server failing was normal, he said, but it was unheard of for a whole class of resources to suddenly vanish. "

Whole centers go out all the time, for lots of different reasons. Using multiple data centers from different providers is the only solution.


Agreed. Also: Hardware failures are (close to) statistically independent across separate instances. Software bugs are not. And when you run on a virtualized platform with a complex infrastructure, everything acts like software.


We had just deployed a new version of our app to Heroku as this happened. I thought our release was incredibly buggy and constantly crashing until I noticed Heroku's status update. They came back quickly, and for us it is a known risk of having our server in the cloud.



