Half the startup world is offline right now, yet the AWS status page is all green with one "info" notice about EBS (which I doubt is the issue here). I'm glad to have moved off a fully AWS stack back to my own servers over a year ago. My uptime, bills, and stress level are all much improved.
That's hardly surprising. The groupthink on cloud services is that they magically make everything scalable, cheap and easy. Making redundant software is hard work.
Amazon posted an update: "We continue to investigate this issue. We can confirm that there is both impact to volumes and instances in a single AZ in US-EAST-1 Region. We are also experiencing increased error rates and latencies on the EC2 APIs in the US-EAST-1 Region."
Also, Amazon Relational Database Service (N. Virginia) is unavailable.
Dear Heroku -- I know it's my job to make sure my site is available (/thread). However, I think I speak for most enterprise customers when I say I will throw money at your company the second you come up with a multi-zone, highly available option.
Everyone should take a page from the book of Netflix right now. It's pretty embarrassing to be anyone that's entirely down and can't do a thing about it due to an EC2 outage.
How do you explain to your customers/users/etc that you were down and have absolutely no control of when you will be back online? How can you explain it to yourself?
"We're sorry about the current downtime. We know some of you are frustrated, so we thought we'd take a moment to explain why this happened.
"Running a web server is very expensive. After we've built the site, if we want to keep it running, someone needs to be on-call 24 hours a day. That means at least one full-time staff member who does nothing else-- more if we want them to stay sane.
"To save us and you some money, we've contracted maintenance of our servers out to a third-party service. This is great for us, since they run it more reliably than we could, and it's great for you, because it costs a lot less. But the downside is that things still break sometimes, and when they do, it's completely out of our hands. We're left waiting for things to get fixed just like you are.
"So we understand your frustration; we're frustrated too. But unfortunately, downtimes do happen. Guaranteeing our service 100% of the time would cost hundreds of thousands of extra dollars per year, and for most of our users, that's simply not worth the cost. Our provider guarantees 99.[nines]% reliability for much less money, and this is the 0.01%.
"If you have something that absolutely must get done, shoot us an email right now and we'll take care of it for you as soon as the site is active again.
"Although this is technically out of our hands, we aren't trying to shift the blame; we made this decision with open eyes, and we stand by that decision. Again, we sincerely apologize for the temporary inconvenience. We hope we can make it up to you with some new features we'll be rolling out this month :)"
Yes, it's actually easier to apologize to your customers when you have AWS to blame for the outage. After all, you're in good company ("even Heroku is down - what do you want from us?")
Because this shit is really hard especially when you're trying to build a product at the same time.
It's not like you can just wake up one day and say "I'm gonna go build a fully fault tolerant distributed system that works across multiple data centers!" and then you're done by the time you go to sleep.
Go actually talk to some Netflix engineers. They'll tell you the same thing.
Yes, you're absolutely right. However, Netflix is distributed across multiple AZs, while Heroku has spent the two years since their $212MM acquisition in the same AZ.
That makes it sound like Netflix has a more reliable platform than the PaaS company.
That's my point exactly. Everyone relies entirely (almost) on EC2 for mission critical business, and then they're left there with no outs as soon as a big outage occurs.
This. Stop talking about how great an MVP is and then complaining when people haven't built multi-region distributed services that are fault-tolerant to major platform outages.
Maybe you tell them that they won't be able to trade cat pictures for some time. And that the outage happened because you didn't overinvest in infrastructure so that you could keep prices down and roll out some new features faster.
Even if you aren't a cat picture site, many startups can deal with some downtime occasionally and it's the right tradeoff to make.
That was their uptime for May, but looking at June it's going to be a worse picture. They're already down to 99.63% if you only include "red" incidents and down to 99.25% if you include "amber" incidents as well (as of 39m of downtime for this latest incident and assuming they don't have any more downtime for the month).
They could have done it to the extreme -- show the numbers for the past hour. Then they could almost always report 100% uptime. And if they ever went down, wait an hour, then go back to reporting 100% uptime again.
It doesn't matter whether gauged by the month or year. It's a percentage. If they have 99.97% uptime every month for a year, they'll still have 99.97% uptime for the year.
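Right, because realized downtime adds up over the window rather than compounding like independent probabilities. A quick sketch of the arithmetic (months simplified to 30 days for illustration):

```shell
# Uptime is (total minutes - downtime minutes) / total minutes, so twelve
# months at 99.97% each still average out to 99.97% for the year.
awk 'BEGIN {
  total = 0; down = 0
  for (m = 1; m <= 12; m++) {
    minutes = 30 * 24 * 60        # one simplified month
    total += minutes
    down  += minutes * 0.0003     # 0.03% downtime per month
  }
  printf "yearly uptime: %.2f%%\n", 100 * (1 - down / total)
}'
# prints: yearly uptime: 99.97%
```

The window size only changes how quickly a single incident washes out of the headline number, which is exactly why the "past hour" trick above would be misleading.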
I just got up (it's morning in Israel), and a client of mine in the US with a major, mission-critical application was screaming (rightly so) that things are down.
We're already looking into alternatives -- perhaps not leaving Heroku altogether, but certainly not depending on them 100 percent. There's no way that we can entrust the business to something that can just catastrophically fail at any moment. I've been running my own servers for years, and they've never had such unpredictable issues.
I increasingly have to think that a few servers, on different providers, with the application deployed via Capistrano, will be more fault-tolerant than Heroku. At least, it seems that way right now.
> There's no way that we can entrust the business to something that can just catastrophically fail at any moment.
Anything, including service providers, can catastrophically fail at any moment. Fault-tolerant architectures are based on redundancy (including infrastructure provider redundancy, as you mention), not on "guaranteed" SLAs.
Provider redundancy goes against the concept of PaaS IMO (ignoring the sci-fi future where there are multiple 100% compatible providers). Heroku needs to become internally redundant to really live up to its promise.
I know AWS EC2 supports multiple global regions, and within each region, multiple availability zones. You're supposed to host your application across more than one AZ if you want good fault tolerance. Not sure whether Heroku does this, though.
I've had so much more downtime with heroku/AWS than I ever had back when I was running my sites on slicehost.
I also feel like I've let my admin skills deteriorate because I've been dependent on heroku. Back when I was running everything myself, worst case scenario I could set up a new VPS from a backup in another datacenter. Now if heroku goes out I just have to twiddle my thumbs while I wait for updates.
I understand what you are saying. I host most of my projects on Rackspace. I've been intending to convert current ideas to Heroku because learning SysAdmin stuff doesn't seem to have much upside potential. However, these multiple outages while Rackspace (seemingly) keeps chugging along are discouraging. That being said, every Heroku outage makes front page of HN, and there could have been Rackspace downtimes much more often that I didn't notice.
Eh, I'll probably continue converting to Heroku and not look back.
I haven't seen any RS-wide outages in my datacenter (IAD) for many years (I vaguely remember some networking issue, but that was something like 5 years back).
So they have their infrastructure (network, power) working very well.
The downtime we've had were our own unique hardware/software issues that come with a complex bare metal installation.
Seems like someone posted a link to a project that made self-hosting a previously Heroku hosted site simple, but I can't find it now...
...would be cool if there was a Linux package (or distro) that you could boot-up and then just change your git remote to and have your app up-and-running on your own hardware.
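The core of that workflow is just a bare git repo with a post-receive hook, which you can set up by hand today. A minimal sketch (all paths and the restart command are illustrative, not from any particular project):

```shell
# On your own server: a bare repo plus a post-receive hook gives you
# Heroku-style "git push to deploy". /srv/myapp is a hypothetical path.
mkdir -p /srv/myapp.git /srv/myapp
git init --bare /srv/myapp.git

cat > /srv/myapp.git/hooks/post-receive <<'EOF'
#!/bin/sh
# Check out the pushed master branch into the live directory.
GIT_WORK_TREE=/srv/myapp git checkout -f master
# Restart step is app-specific, e.g.: service myapp restart
EOF
chmod +x /srv/myapp.git/hooks/post-receive

# Then, from your development machine:
#   git remote add production deploy@yourserver:/srv/myapp.git
#   git push production master
```

What the packaged version would add on top is the process management, routing, and dependency setup that Heroku otherwise does for you.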
BTW, while the point is to enable private PaaS, you won't get around the issues that hit sites like this without heeding all the warnings and recommendations about building in redundancy for high availability.
"""It is a lot cheaper to add 1% uptime to a 95% SLA than it is to add 0.09% to a 99.9% SLA. Cloud application vendors (SaaS) need to pay very close attention to the additional resources that are invested in order to support a 99.9XX…% uptime SLA, and perhaps build it into their pricing plans."""
Yeah, I'm just evaluating options. If nothing else, it would be excellent to have the equivalent of a "donut spare" on private hardware that we could throw on when events like this occur (although, having done plenty of self-hosted sites during the first boom, I'm always running the numbers for either option).
I made Dokuen a few weeks ago, maybe that's what you're thinking of? If you're willing to live with a few warts, it's working pretty well for my personal use. My blog is still on Heroku so you can't really read about it now, but you can check out the code.
I saw this a while ago and will actually be using it very shortly to deploy 3-4 internal apps on our own mini cloud. I love Heroku and can't stand all of the open source "alternatives" like Cloud Foundry. Yours is exactly what I wanted and I can't wait to really start using it and contributing via github! Thanks again!
At their size they need to be running and managing their own hardware. I'd use them more if it wasn't hundreds of dollars a month to host a few apps that might not be up when my customers are.
It may be an AWS outage as the cause, but ultimately it's Heroku's problem. They're the ones touting:
"Erosion-resistant architecture.
Heroku takes full responsibility for your app's health,
keeping it up and running through thick and thin..."
The thick happened today, something eroded, and people are holding them to their word: full responsibility for their app's health. Heroku could do load balancing between multiple independent providers rather than be solely dependent on [one region of?] AWS.
Another update from Amazon: "9:55 PM PDT We have identified the issue and are currently working to bring affected instances and volumes in the impacted Availability Zone back online. We continue to see increased API error rates and latencies in the US-East-1 Region." I had been thinking that most startups see cloud computing as the most reliable way to go, but today has me reconsidering and looking at keeping another type of backup server. I just hope there's no data loss in the apps.
Related: Right now when I try to cat /proc/mdstat or run mdadm to look at my RAID status, it just hangs. It seems to be trying to contact the EBS volumes and failing. Any way to actually view my RAID status?
It's probably best not to muck around with the RAID right now when the drives it thinks are there aren't actually there. If it were me, I wouldn't touch anything until Amazon fixes itself.
It's more a theoretical question than anything else. Say this outage had gone on for days: I'd have needed to see which volumes had failed and drop them from the array. How can I do that when I can't view the RAID status? I have these problems even if I purposely detach a volume to test.
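For reference, these are the usual mdadm steps for that scenario (device names are illustrative, and note that the inspection commands can block just like /proc/mdstat if the kernel is stuck waiting on I/O to the missing EBS volume):

```shell
# Inspect the array; may hang under the same conditions as /proc/mdstat.
# /dev/md0 and /dev/xvdf below are hypothetical device names.
mdadm --detail /dev/md0

# Mark the dead member as failed, then remove it from the array:
mdadm /dev/md0 --fail /dev/xvdf
mdadm /dev/md0 --remove /dev/xvdf

# Later, attach a fresh EBS volume and add it back so the array resyncs:
mdadm /dev/md0 --add /dev/xvdg
```

If the kernel itself is wedged on the dead device, even --fail can block, which is part of why the advice above is to leave it alone until Amazon recovers.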
Other datacenters can go down, too. In many cases, the complexity of running an application across different platforms (say, AWS + Rackspace Cloud) might not be worth it.
Anyone know if Amazon Fresh is hosted on EC2? I had the worst connectivity and performance issues with their site earlier today...and now all of my sites are down.
During the previous outage (which wasn't AWS related), Heroku's status page was down entirely (among other things, it relied on static assets from heroku.com), so I can't say I agree with that.
Various software used to hardcode us-east-1a, so 1a received disproportionate load. Now each account's zone letters (a through e) are randomized across the actual zones, so even if everyone hardcodes 1a, the load is still evenly distributed.
You can serve your own error pages instead of Heroku's and show your boss whatever you want. They're specified as URLs, so they can be hosted on some other platform. I don't know if that feature works when AWS is failing Heroku, but if Heroku's up enough to serve the error page, maybe it's up enough to serve the error page you configured instead of its own.
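If memory serves, this is configured through a Heroku config var (check their docs for the current name; ERROR_PAGE_URL is what I recall). The key is hosting the page itself somewhere independent of the stack that can fail:

```shell
# Host the static error page on a different provider than your app,
# then point Heroku at it. The URL and app name here are hypothetical.
heroku config:add ERROR_PAGE_URL=http://static.example.com/error.html --app myapp
```

If the error page lives in the same region that just went down, you're back to showing whatever the platform defaults to.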
Honestly, I haven't the resources to guarantee site uptime, and have accepted this will happen as a result; my complaint was more targeted at the default error messages.
It seems this can be changed, though, fortunately for Heroku.
EDIT: Various non-Heroku EC2-East-based sites (e.g. Quora) seem to be down as well, lending more evidence to this being an EC2/EBS outage.
1: http://status.aws.amazon.com/