Reddit's software is not very impressive. They should really focus on consolidating to a more manageable setup. I'm aware that big sites often have complex environments to facilitate scaling, but frankly, Reddit's is a big, monstrous mess. It'd be a lot simpler to move hosts if they removed some of those extraneous dependencies and put some simplicity and sanity into their data storage, instead of the half Cassandra/half PgSQL thing they do now.
Actually, there was a heated discussion on this very issue on Reddit a month ago [1], when AWS started acting up. Someone asked why Reddit couldn't just buy beefy servers and co-lo them. The argument was that, given the traffic Reddit handles, doing it yourself would be technically very difficult, if not impossible. Looking back, co-lo still seems like the way to go, especially for big sites like Reddit. Anyway, if StackOverflow can make it work, why can't Reddit? The two have comparable traffic.
Whether you co-lo or run in the cloud, it seems to me that you'd still want the same kinds of redundancy/failover processes in place. The advantage of EC2 is that you can easily set up servers in multiple places across the globe (or at least across the country, if you just want to host in the US).
And there are many downsides to EC2: greater dependence on a third party, shitty disk I/O, instances that go down more frequently than real servers, inflexibility with hardware (no SSDs), etc., etc.
There are co-lo providers that have data centers in different places.
I've discovered through a bunch of similar lessons that buying really good hardware and hosting yourself sometimes requires the least amount of your time of any option. It provides a stable base on which to build additional redundancy such as shared network storage or database replication if you wish (and this is generally how expensive "enterprise" solutions work).
In practice, and on cheap hardware, networked storage is flaky and has umpteen failure modes. Database replication is even worse. Both require babysitting by developers or sysadmins, and hours of repair when they go wrong. What is the point of outsourcing hardware and scaling to EC2 if you end up with even more work monitoring and fixing the infrastructure you build on top of it?
It's actually very easy to beat the price/performance of EC2 using real hardware. The draw of EC2/AWS is the "everything is a monthly charge and an API call away" operational instant gratification. If you need to auto-scale by the hour, there are few other places you can do it. If you can generally capacity-plan your setup a month or more into the future, then a 1-week latency on RunInstances isn't actually a problem.
Money quote: "In response, we started upgrading some of our databases to use a software RAID of EBS disks, which gives drastically increased performance (at a higher cost of course)."
RAIDing EBS volumes seems like a really, really bad idea. Any single EBS volume has a non-trivial failure rate, and if you stripe several together for performance (RAID 0), the failure rate of the array goes up accordingly. Am I understanding that correctly?
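A quick back-of-the-envelope check of that intuition, assuming independent volume failures and a plain stripe (RAID 0), where losing any one volume loses the whole array; the per-volume rate below is made up purely for illustration:

    def raid0_failure_prob(p_single, n_volumes):
        # Probability that at least one of n independent volumes fails,
        # which for a RAID 0 stripe means the whole array is lost.
        return 1 - (1 - p_single) ** n_volumes

    # With a hypothetical 0.5% per-volume failure rate over some period:
    print(raid0_failure_prob(0.005, 4))  # ~0.0199, roughly 4x the single-volume risk

Mirrored setups (RAID 1/10) cut the data-loss risk instead, but at double the EBS cost, and the performance win reddit describes presumably comes from striping.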
If they fix that, could it be a 'silver bullet' for these outages?
They certainly think there's a problem with EBS: "Since that last failure, we have been doing everything we can to move ourselves off of the EBS product. We're about half way there. All of our Cassandra nodes are now using only local disk, and we hope to have all of postgres on local disk soon."
I wouldn't expect them to buy hardware, but I definitely could see them going to a dedicated hosting service like Softlayer (where you still just pay monthly and can usually bring up more capacity fairly fast, but you get whole machines instead of virtualized machines).
It seems like that would make a lot more sense for Reddit, since I/O is so slow and flaky on EC2 (from what I hear), and it's not like they're really taking full advantage of the elasticity that EC2 provides (by massively scaling up and down to fit major load variance).
They would probably replicate to a different region/zone and then implement their own snapshot/backup mechanism on those replica machines, so they can use them to re-create instances when they lose them. PostgreSQL can also ship its write-ahead logs, which you can store in S3 and use to bring up an instance in another region starting from a known backup point.
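A minimal sketch of that log-shipping idea, assuming boto and a hypothetical bucket name; base backups, retries, and permissions are left out:

    # archive_wal.py - copy one PostgreSQL WAL segment up to S3.
    # Hooked in via postgresql.conf:
    #   archive_mode    = on
    #   archive_command = 'python /usr/local/bin/archive_wal.py %p %f'
    import sys
    import boto

    def main():
        wal_path, wal_name = sys.argv[1], sys.argv[2]  # %p = path, %f = file name
        bucket = boto.connect_s3().get_bucket('my-wal-archive')  # assumed bucket
        key = bucket.new_key('wal/' + wal_name)
        key.set_contents_from_filename(wal_path)

    if __name__ == '__main__':
        main()

Recovery on a fresh instance is the mirror image: restore the last base backup, then point restore_command at a script that pulls each segment back down from the bucket.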
Unfortunately, EBS is a pretty bad product when it comes to reliability and consistency of performance; it's better to design your system without it.
Rebuilding by reading logs back out of S3 can be very slow. It should be seen as a last-ditch effort, not really a hot-swap or failover solution (depending on data size, obviously).
If you can't use any persistent storage, then these machines become pure processing nodes. In that case, I feel like it would be better not to design your system around EC2 at all. :-(
To me, EBS is what made EC2 a very powerful solution compared with its competitors. Without durable, consistent EBS there's no real point of differentiation; it's just non-functional fluff we end up paying extra for.
If I were them, I'd look at an approach based on SimpleDB or S3. I've been working on a couple of prototypes of similar systems for my own stuff at work, and I've been toying with a setup that uses S3 to store what I'm calling "absolute fallback" copies of conversation data (basically, JSON documents) and SimpleDB as a front-line store.
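To make that concrete, here's a minimal sketch of the split using boto; the bucket, domain, and attribute names are all assumptions, not anything reddit or Amazon prescribes:

    import json
    import boto

    def store_conversation(conv_id, conversation):
        # Absolute fallback: the full conversation as a JSON document in S3.
        bucket = boto.connect_s3().get_bucket('conversation-fallback')  # assumed bucket
        key = bucket.new_key('conversations/%s.json' % conv_id)
        key.set_contents_from_string(json.dumps(conversation))

        # Front-line store: a few queryable attributes in SimpleDB, pointing back at S3.
        domain = boto.connect_sdb().get_domain('conversations')  # assumed domain
        domain.put_attributes(conv_id, {
            'author': conversation['author'],
            'updated': conversation['updated'],
            's3_key': key.name,
        })

Reads would hit SimpleDB first; if it's unavailable or an item looks stale, you fall back to the JSON document in S3.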
My general take on this issue is that if you're running your app on EC2 and your persistence medium is something that's also on EC2, you really have no ideal high availability scenario. Of course, even in my case, if SimpleDB and S3 go down, I'm still in trouble, but at least I have the option of throwing Akamai in front of it.
Has anyone else noticed that the Amazon status page says there are still problems today, but the status history shows no problems yesterday, only on the 21st?
Why would you want to take a chance on that? Every instance should be thought of as expendable. They can go down at any time.
EBS is really just like any other NAS/iSCSI volume, and it wouldn't be such a problem if the volumes did what they're supposed to do: be consistent in reads, writes, and durability.
If you accept that any of your instances can go down at any time, you accept that all of them can be down at the same time.
You might think you're safe with reserved instances too, figuring you've reserved dedicated capacity with EC2. Well, what happens when the entire network stack goes down, or the block storage that the reserved instances' rack depends on goes down? So do all 10 or 20 reserved instances you had up, at once.
Also, it was this very replication/snapshot mirroring feature of EBS that cascaded into the (network, etc.) congestion.
I wrote a long post about it here: http://www.deserettechnology.com/journal/reddit-the-open-sou... .