Half the startup world is offline right now, yet the AWS status page is all green with one "info" notice about EBS (which I doubt is the issue here). I'm glad to have moved off a fully AWS stack back to my own servers over a year ago. My uptime, bills, and stress level are all much improved.
That's hardly surprising. The groupthink on cloud services is that they magically make everything scalable, cheap and easy. Making redundant software is hard work.
Amazon posted an update: "We continue to investigate this issue. We can confirm that there is both impact to volumes and instances in a single AZ in US-EAST-1 Region. We are also experiencing increased error rates and latencies on the EC2 APIs in the US-EAST-1 Region."
Also, Amazon Relational Database Service (N. Virginia) is unavailable.
Dear Heroku -- I know it's my job to make sure my site is available (/thread). However, I think I speak for most enterprise customers when I say I will throw money at your company the second you come up with a multi-zone, highly available option.
Everyone should take a page from the book of Netflix right now. It's pretty embarrassing to be anyone that's entirely down and can't do a thing about it due to an EC2 outage.
How do you explain to your customers/users/etc that you were down and have absolutely no control of when you will be back online? How can you explain it to yourself?
"We're sorry about the current downtime. We know some of you are frustrated, so we thought we'd take a moment to explain why this happened.
"Running a web server is very expensive. After we've built the site, if we want to keep it running, someone needs to be on-call 24 hours a day. That means at least one full-time staff member who does nothing else-- more if we want them to stay sane.
"To save us and you some money, we've contracted maintenance of our servers out to a third-party service. This is great for us, since they run it more reliably than we could, and it's great for you, because it costs a lot less. But the downside is that things still break sometimes, and when they do, it's completely out of our hands. We're left waiting for things to get fixed just like you are.
"So we understand your frustration; we're frustrated too. But unfortunately, downtimes do happen. Guaranteeing our service 100% of the time would cost hundreds of thousands of extra dollars per year, and for most of our users, that's simply not worth the cost. Our provider guarantees 99.[nines]% reliability for much less money, and this is the 0.01%.
"If you have something that absolutely must get done, shoot us an email right now and we'll take care of it for you as soon as the site is active again.
"Although this is technically out of our hands, we aren't trying to shift the blame; we made this decision with open eyes, and we stand by that decision. Again, we sincerely apologize for the temporary inconvenience. We hope we can make it up to you with some new features we'll be rolling out this month :)"
Yes, it's actually easier to apologize to your customers when you have AWS to blame for the outage. After all, you're in good company ("even Heroku is down - what do you want from us?")
Because this shit is really hard especially when you're trying to build a product at the same time.
It's not like you can just wake up one day and say "I'm gonna go build a fully fault tolerant distributed system that works across multiple data centers!" and then you're done by the time you go to sleep.
Go actually talk to some Netflix engineers. They'll tell you the same thing.
Yes, you're absolutely right. However, Netflix is distributed across multiple AZs, while Heroku has spent the two years since their $212MM acquisition in the same AZ.
That makes it sound like Netflix has a more reliable platform than the PaaS company.
That's my point exactly. Everyone relies entirely (almost) on EC2 for mission critical business, and then they're left there with no outs as soon as a big outage occurs.
This. Stop talking about how great an MVP is and then complaining when people haven't built multi-region distributed services that are fault-tolerant to major platform outages.
Maybe you tell them that they won't be able to trade cat pictures for some time. And that the outage happened because you didn't overinvest in infrastructure so that you could keep prices down and roll out some new features faster.
Even if you aren't a cat picture site, many startups can deal with some downtime occasionally and it's the right tradeoff to make.
That was their uptime for May, but looking at June it's going to be a worse picture. They're already down to 99.63% if you only include "red" incidents and down to 99.25% if you include "amber" incidents as well (as of 39m of downtime for this latest incident and assuming they don't have any more downtime for the month).
They could have done it to the extreme -- show the numbers for the past hour. Then they could almost always report 100% uptime. And if they ever went down, wait an hour, then go back to reporting 100% uptime again.
It doesn't matter whether gauged by the month or year. It's a percentage. If they have 99.97% uptime every month for a year, they'll still have 99.97% uptime for the year.
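Right, because realized downtime adds up over the window rather than compounding like independent probabilities. A quick sketch of the arithmetic (months simplified to 30 days for illustration):

```shell
# Uptime is (total minutes - downtime minutes) / total minutes, so twelve
# months at 99.97% each still average out to 99.97% for the year.
awk 'BEGIN {
  total = 0; down = 0
  for (m = 1; m <= 12; m++) {
    minutes = 30 * 24 * 60        # one simplified month
    total += minutes
    down  += minutes * 0.0003     # 0.03% downtime per month
  }
  printf "yearly uptime: %.2f%%\n", 100 * (1 - down / total)
}'
# prints: yearly uptime: 99.97%
```

The window size only changes how quickly a single incident washes out of the headline number, which is exactly why the "past hour" trick above would be misleading.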
I just got up (it's morning in Israel), and a client of mine in the US with a major, mission-critical application was screaming (rightly so) that things are down.
We're already looking into alternatives -- perhaps not leaving Heroku altogether, but certainly not depending on them 100 percent. There's no way that we can entrust the business to something that can just catastrophically fail at any moment. I've been running my own servers for years, and they've never had such unpredictable issues.
I increasingly have to think that a few servers, on different providers, with the application deployed via Capistrano, will be more fault-tolerant than Heroku. At least, it seems that way right now.
> There's no way that we can entrust the business to something that can just catastrophically fail at any moment.
Anything, including service providers, can catastrophically fail at any moment. Fault-tolerant architectures are based on redundancy (including infrastructure provider redundancy, as you mention), not on "guaranteed" SLAs.
Provider redundancy goes against the concept of PaaS IMO (ignoring the sci-fi future where there are multiple 100% compatible providers). Heroku needs to become internally redundant to really live up to its promise.
I know AWS EC2 supports multiple global regions, and within each region, multiple availability zones. You're supposed to host your application across more than one AZ if you want good fault tolerance. Not sure whether Heroku does this, though.
I've had so much more downtime with heroku/AWS than I ever had back when I was running my sites on slicehost.
I also feel like I've let my admin skills deteriorate because I've been dependent on heroku. Back when I was running everything myself, worst case scenario I could set up a new VPS from a backup in another datacenter. Now if heroku goes out I just have to twiddle my thumbs while I wait for updates.
I understand what you are saying. I host most of my projects on Rackspace. I've been intending to convert current ideas to Heroku because learning SysAdmin stuff doesn't seem to have much upside potential. However, these multiple outages while Rackspace (seemingly) keeps chugging along are discouraging. That being said, every Heroku outage makes front page of HN, and there could have been Rackspace downtimes much more often that I didn't notice.
Eh, I'll probably continue converting to Heroku and not look back.
I haven't seen any RS-wide outages in my datacenter (IAD) for many years (I vaguely remember some networking issue, but that was something like 5 years back).
So they have their infrastructure (network, power) working very well.
The downtime we've had were our own unique hardware/software issues that come with a complex bare metal installation.
Seems like someone posted a link to a project that made self-hosting a previously Heroku hosted site simple, but I can't find it now...
...would be cool if there was a Linux package (or distro) that you could boot-up and then just change your git remote to and have your app up-and-running on your own hardware.
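The core of that workflow is just a bare git repo with a post-receive hook, which you can set up by hand today. A minimal sketch (all paths and the restart command are illustrative, not from any particular project):

```shell
# On your own server: a bare repo plus a post-receive hook gives you
# Heroku-style "git push to deploy". /srv/myapp is a hypothetical path.
mkdir -p /srv/myapp.git /srv/myapp
git init --bare /srv/myapp.git

cat > /srv/myapp.git/hooks/post-receive <<'EOF'
#!/bin/sh
# Check out the pushed master branch into the live directory.
GIT_WORK_TREE=/srv/myapp git checkout -f master
# Restart step is app-specific, e.g.: service myapp restart
EOF
chmod +x /srv/myapp.git/hooks/post-receive

# Then, from your development machine:
#   git remote add production deploy@yourserver:/srv/myapp.git
#   git push production master
```

What the packaged version would add on top is the process management, routing, and dependency setup that Heroku otherwise does for you.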
BTW, while the point is to enable private PaaS, you won't get around the issues that hit sites like this without heeding all the warnings and recommendations about building in redundancy for high availability.
"""It is a lot cheaper to add 1% uptime to a 95% SLA than it is to add 0.09% to a 99.9% SLA. Cloud application vendors (SaaS) need to pay very close attention to the additional resources that are invested in order to support a 99.9XX…% uptime SLA, and perhaps build it into their pricing plans."""
Yeah, I'm just evaluating options. If nothing else, it would be excellent to have the equivalent of a "donut spare" on private hardware that we could throw on when events like this occur (although, having done plenty of self-hosted sites during the first boom, I'm always running the numbers for either option).
I made Dokuen a few weeks ago, maybe that's what you're thinking of? If you're willing to live with a few warts, it's working pretty well for my personal use. My blog is still on Heroku so you can't really read about it now, but you can check out the code.
I saw this a while ago and will actually be using it very shortly to deploy 3-4 internal apps on our own mini cloud. I love Heroku and can't stand all of the open source "alternatives" like Cloud Foundry. Yours is exactly what I wanted and I can't wait to really start using it and contributing via github! Thanks again!
At their size they need to be running and managing their own hardware. I'd use them more if it wasn't hundreds of dollars a month to host a few apps that might not be up when my customers are.
It may be an AWS outage as the cause, but ultimately it's Heroku's problem. They're the ones touting:
"Erosion-resistant architecture.
Heroku takes full responsibility for your app's health,
keeping it up and running through thick and thin..."
The thick happened today, something eroded, and people are holding them to their word: full responsibility for their app's health. Heroku could do load balancing between multiple independent providers rather than be solely dependent on [one region of?] AWS.
Another update from Amazon: "9:55 PM PDT We have identified the issue and are currently working to bring affected instances and volumes in the impacted Availability Zone back online. We continue to see increased API error rates and latencies in the US-East-1 Region." I had been thinking that most startups see cloud computing as the most reliable way to go, but today has me reconsidering and looking at keeping another type of backup server. I just hope there's no data loss in the apps.
Related: Right now when I try to cat /proc/mdstat or run mdadm to look at my RAID status, it just hangs. It seems to be trying to contact the EBS volumes and failing. Any way to actually view my RAID status?
It's probably best not to muck around with the RAID right now when the drives it thinks are there aren't actually there. If it were me, I wouldn't touch anything until Amazon fixes itself.
It's more a theoretical question than anything else. Say this outage had gone on for days: I'd have needed to see which volumes had failed and drop them from the array. How can I do that when I can't view the RAID status? I have these problems even if I purposely detach a volume to test.
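For reference, these are the usual mdadm steps for that scenario (device names are illustrative, and note that the inspection commands can block just like /proc/mdstat if the kernel is stuck waiting on I/O to the missing EBS volume):

```shell
# Inspect the array; may hang under the same conditions as /proc/mdstat.
# /dev/md0 and /dev/xvdf below are hypothetical device names.
mdadm --detail /dev/md0

# Mark the dead member as failed, then remove it from the array:
mdadm /dev/md0 --fail /dev/xvdf
mdadm /dev/md0 --remove /dev/xvdf

# Later, attach a fresh EBS volume and add it back so the array resyncs:
mdadm /dev/md0 --add /dev/xvdg
```

If the kernel itself is wedged on the dead device, even --fail can block, which is part of why the advice above is to leave it alone until Amazon recovers.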
Other datacenters can go down, too. In many cases, the complexity of running an application across different platforms (say, AWS + Rackspace Cloud) might not be worth it.
Anyone know if Amazon Fresh is hosted on EC2? I had the worst connectivity and performance issues with their site earlier today...and now all of my sites are down.
During the previous outage (which wasn't AWS related), Heroku's status page was down entirely (among other things, it relied on static assets from heroku.com), so I can't say I agree with that.
Various software used to hardcode us-east-1a, so 1a received disproportionate load. Now each account's zone letters (a through e) are randomized across the actual zones, so even if everyone hardcodes 1a, the load is still evenly distributed.
You can serve your own error pages instead of Heroku's and show your boss whatever you want. They're specified as URLs, so they can be hosted on some other platform. I don't know if that feature works when AWS is failing Heroku, but if Heroku's up enough to serve the error page, maybe it's up enough to serve the error page you configured instead of its own.
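If memory serves, this is configured through a Heroku config var (check their docs for the current name; ERROR_PAGE_URL is what I recall). The key is hosting the page itself somewhere independent of the stack that can fail:

```shell
# Host the static error page on a different provider than your app,
# then point Heroku at it. The URL and app name here are hypothetical.
heroku config:add ERROR_PAGE_URL=http://static.example.com/error.html --app myapp
```

If the error page lives in the same region that just went down, you're back to showing whatever the platform defaults to.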
Honestly, I haven't the resources to guarantee site uptime, and have accepted this will happen as a result; my complaint was more targeted at the default error messages.
It seems this can be changed, though, fortunately for Heroku.
EDIT: Various non-Heroku EC2-East-based sites (e.g. Quora) seem to be down as well, lending more evidence to this being an EC2/EBS outage.
1: http://status.aws.amazon.com/