
Amazon has probably designed the core infrastructure correctly, so that these things shouldn't happen if you're in multiple Availability Zones. I'm guessing that means different power sources, backup generators, network hookups, etc. for the different Availability Zones. However, there's also the issue of Amazon's management software. In this case, it seems that some network issues triggered a huge reorganization of their EBS storage, which would have meant transferring all that stored data over the network, a lot more EBS hosts coming online, and a stampede problem.

I've argued vigorously (in previous comments) for using cloud servers like EC2 over dedicated hosting like SoftLayer. I'm less sure about that now. The issue is that EC2 is still beholden to the traditional points of failure (power, cooling, network issues), but it has the additional problem of Amazon's management software. I don't want to sound too down on Amazon's ability to make good software, but Amazon's status site shows that EBS and EC2 also had issues on March 17th for about 2.5 hours each (at different times), and Reddit has also just been having trouble on EC2/EBS. I don't want this to sound like "Amazon is unreliable", but it does seem more hiccup-y.

The question I'm left with is what one gains from the management software Amazon is introducing. Well, one can launch a new box in minutes rather than a couple of hours; one can dynamically expand a storage volume rather than dealing with the size of physical disks; one can template a server so that you don't have to set it up from scratch when you want a new one. But if you're a site with 5 boxes, does that help you much? SoftLayer's pricing is competitive against EC2's 1-year reserved instances, and SoftLayer throws in several TB of bandwidth and persistent storage. Even if you have to over-buy on storage because you can't just dynamically expand volumes, it's still competitively priced. If you're only running 5 boxes, the server templates aren't of much help either - virtually none if you're running, say, 3 app servers and a database replicated across two boxes.

I'm still a huge fan of S3. Building a replicated storage system is a pain once you need to store huge volumes of assets. Likewise, if you need 50 boxes for 24 hours at a time, EC2 is awesome. I'm less smitten with it for general-purpose web app hosting, where the fancy footwork done to make it possible to launch 100 boxes for a short time doesn't really help you if you're just looking to keep 5 instances running all the time.

Maybe it's just bad timing that I suggested we look at Amazon's new live streaming and a day later EC2 is suffering a half-day outage.




I'm responsible for a relatively large site ( http://www.foreignpolicy.com ) that was down for 12+ hours today because of this failure.

One fallacy I think many people fall into in the whole cloud debate is the idea that a given cloud provider is any more or less failure-prone than a given dedicated server host.

We have assets on Amazon, Slicehost, and Linode. Sometimes these go down; whether it's our fault, the software's fault, the hardware's fault, or a construction crew hitting a fiber drop, things happen. If you're not backed up in a fully tested way on not just another server or availability zone, but on a whole different hosting infrastructure (preferably in a different time zone), then you're not really backed up. Being on a host like Amazon, or even a fully managed host like a Cadillac Rackspace plan, doesn't remove the need for good BCP.

What these cloud services allow you to do, in theory, is have that backup infrastructure ready to go on relatively short notice _without_ keeping it running all the time. We can't reasonably afford to replicate all of our servers and hot data to the Western region or the Rackspace cloud 24/7. We can, however, afford to set up the infrastructure and spin it up on the fly within an hour with slightly stale data, once a month to test it and whenever things actually break. Requisitioning that kind of hardware and then dumping it, for only a few tens of dollars a month, is difficult if not impossible with a traditional VPS or dedicated host.
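
Roughly what that spin-up looks like in code, by the way. A minimal sketch using the boto3 Python SDK - treat it as an illustration, not our actual scripts; the AMI, key pair, and security group IDs are placeholders:

    import boto3

    # Client pinned to the backup region (us-west-1 here), not the region that failed.
    ec2 = boto3.client("ec2", region_name="us-west-1")

    # Launch one app server from a pre-baked image of our stack.
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder: pre-built AMI of the app server
        InstanceType="m1.large",           # whatever size you actually need
        MinCount=1,
        MaxCount=1,
        KeyName="ops-key",                 # placeholder key pair
        SecurityGroupIds=["sg-0abc1234"],  # placeholder security group
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # Block until it's actually running before repointing DNS at it.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    print("standby instance up:", instance_id)

Restore the slightly stale data onto it, point DNS over, and you're limping along while the primary region sorts itself out.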

The big question is not 'Is the cloud more reliable?', but 'Do I need what only the cloud can offer?'. If your current infrastructure can handle getting Drudged or Reddited fine, and you're only on a few servers, you're probably better off just paying to keep a hot spare up at SoftLayer.

On the other hand, if you 1) have occasional traffic bursts that you don't want to pay to handle most days and 2) can accept a few minutes of downtime, then the solutions offered by cloud hosts blow the competition out of the water. I guess what you're gaining is not the management software, it's the ability to turn capacity off and on quickly when something goes wrong (or, in the case of a Redditing, right).

Part of figuring out the right hosting solution involves asking the right questions.

(And for reference, we were all ready to go with a backup... and then we learned that our hosting company was storing our nightlies on S3 and couldn't retrieve them, and that our offsite DB solution was having an unrelated issue.) Had we run proper tests (I'm brand new to the job), we would've been ready for this one. I also worry big time about DNS and load balancing being a big SPOF, but that's a plan for another day.


What about hardware failure? On AWS you just commission a new instance and your downtime is minutes rather than hours, plus you don't have to keep extra hardware on hand just to avoid days of downtime. There are also smaller, more localized issues like network switch failures and other things that you probably never even notice on Amazon, but that might be more likely to bite you on a dedicated host.
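
The "commission a new instance" step really is just an API call or two. A minimal sketch with the boto3 Python SDK of moving an Elastic IP onto a freshly launched replacement, so clients never see a new address (both IDs below are placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Re-point the Elastic IP that clients hit onto the replacement instance,
    # so nothing outside your account has to change when hardware dies.
    ec2.associate_address(
        AllocationId="eipalloc-0abc12345",  # placeholder Elastic IP allocation
        InstanceId="i-0fedcba987654321",    # placeholder replacement instance
    )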

If an AWS data center goes down it gets a lot of press, but does it actually outweigh the sum of all dedicated/shared/vps hosting issues on the equivalent volume?


There are some nice middle options out there. I'll use Softlayer as an example as I have provisioned a lot of machinery over there.

I can order machines online and SSH in 3-4 hours later. Even exotic stuff they turn around just as fast - we saw that speed on a quad octo-core box with a RAID 10 of Intel SSDs.

That's real metal too, with real IO (most of my work is IO-bound, so VMs and the cloud are not options). You get to pick the exact CPUs, disks, etc., and they slot them into solid Super Micro boards and use good Adaptec disk controllers. You pay monthly and can spin down the box at any time (though you pay for full months; there's no per-minute pricing like AWS).

That's the dedicated hardware side; you can also spin up compute instances, and those can be cloned and fired up in bulk. But they also have the IO problems that all other VMs have.

In any case, just wanted to mention they are a decent middle ground. Not as automated and polished as Amazon on the VM side but you can spin up mixtures of metal and VMs to get combinations that make sense - pushing compute or RAM-only stuff to VMs and keeping DBs and persistence layers on real metal. They have a few different datacenters too so you can spread gear around physical locations.


I'm fairly sure that my downtime due to a hardware failure at softlayer would be less than the downtime AWS has had for huge numbers of people this year. And hardware failures on a given server happen less frequently than 1/year on average.

Problems are just not as common if you're running on a handful of dedicated machines, and a single dedicated machine at a good host can handle a LOT without all the crazy reliability engineering that running on AWS requires. You need backups, but you don't need the same assumption that you must be able to fail over instantly or you'll have guaranteed downtime sometime soon. I don't think that difference can be overstated, since it lets you focus on more important things.


Speaking of SoftLayer specifically, they've diagnosed and then replaced failed hardware for me (hard disks and power supplies so far) within 15-30 minutes of my opening a support ticket. One of the incidents was around 2 AM local time where the server is, and their response time was the same.


For entities that have the CapEx money to build out their own hardware to handle expected growth, and do it a little cheaper due to volume, does it still make sense to engage in the cloud game?

Or is it a better option when you are starting up, and want to be able to quickly throw hardware at a problem, should the need arise?

Apologies if this sounds like a pretty ignorant question, but I haven't implemented cloud-based services before. It seems like there's a hardware-cost vs. people-cost tradeoff, given the newer nature of AWS and the like, that needs to be factored into development and maintenance time.

Saving people time by relying on a known quantity like arrays of Linux servers with failure tolerance seems preferable.


"Mutually exclusive" zones may all depend upon the same administrators, same decision making, same software, same architecture design.


I agree with your entire comment with the exception of one sentence. Disagree as strongly as I can here:

"I've argued vigorously (in previous comments) for using cloud servers like EC2 over dedicated hosting like SoftLayer. I'm less sure about that now."

An issue at Amazon, or Rackspace, or Linode, or Slicehost need not imply failure at other providers, and cloud as an alternative to dedicated is still as viable as ever. Amazon tanking does not mean everybody needs to run back to dedicated, and my pet peeve is that when one provider takes a crap, everyone paints the cloud as toxic.

When ThePlanet's facility exploded a few years ago, I did not hear lamenting that dedicated hosting was doomed. When an airliner crashes, we do not say air travel is doomed. I do not understand why people rush to paint the cloud as a toxic choice in light of a failure at one particular player. Admittedly a big one, but there are others too, and you can move.

Providers like Linode are almost exactly equivalent to dedicated hosting. They just administer the hardware for you and pay the remote-hands bills. Same for Slicehost and Rackspace. It is simply far easier to wipe your instance and start over, and for all intents and purposes it acts like a dedicated box. You need to administer it like one too. Most failures of the "cloud" really come from designing your application in violation of the fallacies linked elsewhere.


Show me one cloud offering that gives consistently good or even average disk performance. (hint: there isn't one)

Basically, if you're running a database that does not completely fit in memory you should be on dedicated hardware.

I'd also point out that a lot of the advantages people routinely cite as cloud strengths are more about cloud vs. traditional hosting or colocation, as opposed to cloud vs. a place like SoftLayer. SoftLayer can provision a custom build in a few hours (yeah, vs. minutes, but who really cares that much) and you pay month-to-month without a contract.


"Show me one cloud offering that gives consistently good or even average disk performance."

You mean like newservers.com, SoftLayer Bare Metal Cloud, stormondemand, or one of the other metal clouds?


I believe the OrionVM cloud would qualify as "good or even average disk performance" (benchmarked by cloudharmony.com: http://orionvm.com.au/blog/3rd-Party-Performance-Benchmarks/ ), as they came out 60% faster than a dedicated server with 4 x 15k SAS disks.

Disclaimer: I'm a director at OrionVM.


I run a database that does not fit in memory on AWS with great success. Generalizing is dangerous. If you work with and understand the constraints imposed by the environment you can do some pretty amazing stuff.
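
To make "work with the constraints" concrete: the usual trick is to keep the hot set in a cache so the slow virtualized disks only see misses. A minimal sketch of a read-through cache with a TTL - in practice you'd use memcached or similar rather than an in-process dict, and the TTL and query function here are placeholders:

    import time

    CACHE = {}          # key -> (expires_at, value)
    TTL_SECONDS = 300   # placeholder; tune per workload

    def fetch_from_db(key):
        # Stand-in for the real, disk-bound database query.
        return "value-for-%s" % key

    def get(key):
        """Serve hot keys from memory; only go to the database on a miss or expiry."""
        entry = CACHE.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]
        value = fetch_from_db(key)
        CACHE[key] = (time.monotonic() + TTL_SECONDS, value)
        return value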


I have found Linode disks to have the best performance of all:

http://pastebin.ca/2049137

You are correct, I/O is the challenge in administering systems in a virtual environment. My database, which does not fit in memory, does fine on a high-load site because I cache it responsibly. For comparison, here are awful results from a new player called ChunkHost, which I signed up for just to test:

http://pastebin.ca/2049142

The sequential write throughput there is troubling. This comparison from a couple of years back is interesting too:

http://journal.uggedal.com/vps-performance-comparison/

I've linked this URL before, but it really does the best job of breaking it down. What cloud providers have you tried? In my experience there are vast gaps between certain ones, Amazon being no exception. It's hard to stereotype the cloud with gaps like those.
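
If anyone wants to reproduce rough numbers like the ones in those pastebins, a crude sequential-write check is only a few lines of Python. This is a sanity check rather than a proper benchmark, and the file size and block size are arbitrary:

    import os
    import time

    PATH = "testfile.bin"
    BLOCK = b"\0" * (1 << 20)   # 1 MiB per write
    BLOCKS = 256                # 256 MiB total; adjust to taste

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    start = time.monotonic()
    for _ in range(BLOCKS):
        os.write(fd, BLOCK)
    os.fsync(fd)                # make sure the data actually reached the disk
    os.close(fd)
    elapsed = time.monotonic() - start

    print("sequential write: %.1f MiB/s" % (BLOCKS / elapsed))
    os.remove(PATH)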

Even if SoftLayer could provision me a new box in ten minutes, the improvement to my sleep from not waking up for every disk failure and submitting a remote-hands ticket at who knows how much per pop far outweighs anything else.


I have also had a good experience with Linode performance (just a small instance I use for a few personal projects). However, AFAIK Linode is just using local on-box disk, which is a whole different animal from EBS.


To say that a database has to fit entirely in memory to achieve good performance is a ridiculous proposition, and simply shows you have zero actual knowledge of modern database server internals or administration. Countless sites happily serve oodles of pageviews per day with actual memory usage far below the disk space used by their databases. Hint: they're not swapping, either.

In general, if you really believe what you're saying, you either (1) have a very poorly designed application, (2) have a very poorly designed database environment, or (3) are speaking to a specialized application that wouldn't reflect the majority of environments operating in real life. This isn't to say it isn't a combination of these options, mind you. I didn't even start on utilizing caching in applications, because it's clear there are other hurdles to overcome first.


I don't think he is saying that. He is saying that in a non-dedicated environment, you share the same spindles with other tenants who may have different I/O access patterns than your application. Careful choice of indexes, good data locality for fast reads, making sure writes are sequential - all that goes out the window if some other application is causing the disk to seek all over the place.


Yeah, but Amazon is the 800 lb gorilla in the room when it comes to the cloud. When most people say "the cloud" they are referring to Amazon. So when Amazon has an issue, rightly or wrongly, the entire sector gets a black eye.


What I find really interesting is the implications an outage like this could have for Amazon's business model. Specifically, what I would like to see is transparent, complete application duplication to other regions and availability zones for certain customer configurations of particular sizes, etc.

The application would be transparently mirrored to another region, and if an event such as this occurs, the mirror would be spun up.

The customer would choose the desired snapshot frequency and would pay for that.

Certain sites, with less dynamic content, would be mirrored and continue to operate as normal with minimal impact or cost.

Other sites, where content is created by users in fairly real-time, would pose more complex and costly mirroring situations (a la Reddit).

But the option should be there.

Also, remember to think of the evolution of Amazon's services, say, 24 months from now, when this type of offering will likely become more of a reality.
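
The obvious building block for that is copying EBS snapshots between regions on whatever schedule you're willing to pay for. A rough sketch of what driving it could look like with the boto3 Python SDK - the snapshot ID and regions are placeholders:

    import boto3

    # The copy is requested from the destination region's endpoint.
    ec2_west = boto3.client("ec2", region_name="us-west-1")

    resp = ec2_west.copy_snapshot(
        SourceRegion="us-east-1",
        SourceSnapshotId="snap-0123456789abcdef0",  # placeholder: last night's EBS snapshot
        Description="cross-region copy for disaster recovery",
    )
    print("copy started:", resp["SnapshotId"])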

As many others have noted, it is best not to be 100% reliant on Amazon for your entire service, but at this point in time it's a little hard to spread the load between AWS/EC2 and competing offerings.


AWS is a bucket of Legos; it's up to you to be smart enough to build something.

The option IS there. I know, because I had zero downtime today and am 100% on AWS.



