Having no experience with database/website administration myself, I'm struck by just how little I'm able to translate the work and concepts in this post into a sense of the actual, manual labor involved.
For each and every thing that Jason talked about...upgrading Cassandra, moving off EBS, embarking on self-heal and auto-scale projects...what took the reader a few seconds to read and cognise undoubtedly represented hours and hours of work on the part of the Reddit admins.
I guess it's just the nature of the human mind. I don't think I could ever fully appreciate the amount of work that goes into any project unless I've been through it myself (and even then, the brain is awesome at minimizing the memory of pain). So Reddit admins, if you're reading this, while I certainly can't fully appreciate the amount of labor and life-force you've dedicated to the site, I honestly do appreciate it, and I wish you guys nothing but success in the future!
It's interesting to see that they're sticking with Cassandra, and that they're having a much better experience with 0.8. I've been hearing so many fellow coders in SF hate on Cassandra that I had stopped considering it for projects. Has anybody worked with 0.8 or 1.0? Would you recommend Cassandra?
I got to work with Riak a lot while I was at DotCloud, but the speed issue was pretty frustrating (it can be painfully slow).
This is because people came to the table with unrealistic expectations. They were used to dealing with mature software built on decades-old, proven ideas, and they came into very experimental territory expecting a smooth experience.
Cassandra has enabled Reddit to manage a highly scalable distributed data store with a tiny staff. This is not to say it has been trouble free, but it has enabled them to do something that would have been infeasible without pioneers in this space (Cassandra, Riak, Voldemort, etc) making these tools available.
I respect the Reddit team, but I don't think they need to use Cassandra at their scale. I mean they only have 2TB of data in total. They should easily be able to use a simple caching system to keep the last 2 weeks of data in RAM and basically never read from the database.
That said, they may be freaked out based on their growth curve and simply thinking ahead.
They said that they had 2TB in postgres not 2TB of total data. I imagine all of their data is probably about an order of magnitude larger. Additionally, the challenges are not as much around how much data you have, but how you want to access that data (indexes).
Indeed. It boils down to their need for a durable cache. It's simply too expensive to try to cache every comment tree in RAM, and Cassandra's data model and disk storage layout are a really good fit for the structure of their data.
You don't need every comment tree in RAM, just the last few days' worth plus a few older threads that get linked back to. They are currently using 200 machines, so let's say 10 of them are used to cache one week's comments. 30 GB of RAM * 10 machines = 300 GB of cache. I would be very surprised if they generate 200 GB/week, or 10 TB of comment data a year.
Edit: For comparison, Slashdot ran for a long time on just 6 less powerful machines vs. the 200+ Reddit is using. Reddit may have more traffic, but not 40x as much. And last I heard, HN just uses one machine.
PS: The average comment is small, and they can compress most comments after a day or so. They can probably get away with storing a second copy of most old threads as a blob of data in case people actually open it, which costs a little space but cuts down on processing time.
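To make the caching idea concrete, here's a rough cache-aside sketch with python-memcache; the hosts, key scheme, and DB helper are all made up for illustration, not anything reddit actually runs:

    import memcache

    # Toy illustration only: keep rendered comment trees hot for a week and
    # fall back to the database on a miss. Hosts and helpers are hypothetical.
    mc = memcache.Client(['cache1:11211', 'cache2:11211'])
    ONE_WEEK = 7 * 24 * 3600

    def load_comment_tree_from_db(thread_id):
        # stand-in for the real query against the database
        return {'thread': thread_id, 'comments': []}

    def get_comment_tree(thread_id):
        key = 'comments:%s' % thread_id
        tree = mc.get(key)
        if tree is None:
            tree = load_comment_tree_from_db(thread_id)
            mc.set(key, tree, time=ONE_WEEK)
        return tree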
The one great thing with Cassandra is how easy it is to expand your cluster. You just start a new server up, point it to the existing cluster, and it automatically joins, streams the sharded data it should have to itself, and starts serving requests.
Balancing your cluster requires a little bit more handholding, and if something goes wrong or you fuck it up, it can be pretty challenging. But most of the time it's pretty painless.
There are a lot of other warts, though: the data model is slightly weird, secondary indexing is slow, and eventual consistency is hard to wrap your head around. But it doesn't require much effort to run and operate a large cluster, and if that's important to you and your application, you should check it out.
The NoSQL space is pretty interesting, but there is no clear winner; each of the competing solutions has its own niche and its own specialities, so it's impossible to give general recommendations right now.
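To give a feel for the data model, here's a tiny pycassa sketch; the keyspace, column family, and hosts are invented, not reddit's actual schema. One row per thread and one column per comment, which is the kind of wide-row layout people mean when they say it fits comment trees:

    import pycassa

    # Minimal sketch, not a real schema: keyspace, column family, and hosts
    # below are made up. One row per thread, one column per comment, so a
    # whole tree lives in a single wide row.
    pool = pycassa.ConnectionPool('Example', ['node1:9160', 'node2:9160'])
    comments = pycassa.ColumnFamily(pool, 'CommentTree')

    comments.insert('thread_42', {'comment_1001': 'first!',
                                  'comment_1002': 'a reply'})
    print(comments.get('thread_42', column_count=1000))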
That's what we hear from our customers as well. They complain about excessive CPU and memory usage.
The two phases we've seen are:
1/ It's flexible and it works! Problem solved!
2/ 21st century called, they want their performance back.
The problem with phase 2 is that you may not be able to solve it by throwing more computing power at it.
Unfortunately if you really need map-reduce, at the moment I don't know what to recommend. Riak isn't better performance-wise and our product doesn't support map-reduce (yet).
However, if you don't need map-reduce I definitely recommend not using Cassandra. There are a lot of non-relational databases out there that are an order of magnitude faster.
It reminds me of Slashdot circa 1998/99, back when we watched those guys grow their then-newfound popularity out of a dorm-room Linux box, at a time when the web was a mere fraction of the size it is today.
The Amazon servers have local disks physically attached. They are wiped between customers, on machine failure, etc., hence "ephemeral". EBS (Elastic Block Store) is accessed as a disk but sits at the other end of a network connection. Amazon does more to ensure the contents are available and durable (e.g. replication, backups to S3). The problem with EBS is that performance, especially latency, is highly variable and unpredictable.
We have the same setup: local ephemeral disks on EC2 with Postgres. We never even tried EBS, as we had heard too many negative things about it, namely its variance in performance.
So our approach is to RAID-10 four local volumes together. We then use replication to at least 3 slaves, all of which are configured the same and can become master in the event of a failover.
We use WAL-E[0] to ship WAL logs to S3. WAL-E is totally awesome. Love it!
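For anyone wondering what the ship-to-S3 part boils down to, here's a toy boto sketch; WAL-E itself does a lot more (compression, base backups, and so on), and the bucket, path, and segment name here are made up:

    import boto

    # Toy version of "push one WAL segment to S3"; nothing like WAL-E's real code.
    conn = boto.connect_s3()  # reads AWS credentials from the environment
    bucket = conn.get_bucket('example-wal-archive')  # hypothetical bucket
    segment = '000000010000000000000042'             # hypothetical segment name
    key = bucket.new_key('wal/%s' % segment)
    key.set_contents_from_filename(
        '/var/lib/postgresql/9.1/main/pg_xlog/%s' % segment)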
I'm glad you like wal-e. I tried rather hard to make it as easy as I could to set up.
Please send feedback or patches; I'm happy to work with someone if they have an itch in mind.
If one has a lot of data, EBS becomes much more attractive because swapping the disks in the case of a common failure (instance goes away) is so much faster than having to actually duplicate the data at the time, presuming no standby. Although a double-failure of ephemerals seems unlikely and the damage is hopefully mitigated by continuous archiving, the time to replay logs in a large and busy database can be punishing. I think there is a lot of room for optimization in wal-e's wal-fetching procedure (pipelining and parallelism come to mind).
Secondly, EBS seek times are pretty good: one can, in aggregate, control a lot more disk heads via EBS. The latency is a bit noisy, but last I checked (not recently) it was considerably better than what RAID-0 on ephemerals would allow one to do on some instance types.
Thirdly, EBS volumes share one's slice of the network interface on the physical machine. That means larger instance sizes can have less noisy-neighbor effect and more bandwidth overall, while RAID 1/1+0 are going to be punishing. I'm reasonably sure (but not 100% sure) that mdadm is not smart enough to let a disk with decayed performance "fall behind", demoting it from the array and using a mirrored partner in preference. Overall, use RAID-0 and archiving instead.
When an EBS volume suffers a crash/remirroring event it will get slow, though, and if you are particularly performance sensitive that would be a good time to switch to a standby that possesses an independent copy of the data.
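To make the earlier wal-fetch pipelining/parallelism point concrete, the idea is roughly the sketch below: fetch several upcoming segments concurrently rather than strictly one at a time while the standby replays. The fetch helper and segment names are made up; this is not wal-e's actual code:

    from multiprocessing.dummy import Pool  # thread pool; fetching is I/O bound

    def fetch_segment(name):
        # hypothetical helper: download one WAL segment from S3 into pg_xlog
        print('fetched %s' % name)

    # Segment names the standby will ask for next (invented for the example).
    upcoming = ['000000010000000000000040',
                '000000010000000000000041',
                '000000010000000000000042',
                '000000010000000000000043']
    Pool(4).map(fetch_segment, upcoming)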
Lots and lots of replication, more or less. I don't know how it works in Postgres, but with something like Mongo, you set up a replication cluster and presume that up to (half - 1) of the nodes can fail and still maintain uptime. Postgres, being a relational database rather than a document store, likely has an additional set of challenges to overcome there, but it's very possible to do.
You just copy the WAL to another server and replay it. It takes a day to set up and test. Once that is set up you have two options: async replication (which means you'll lose about 100ms of data in the event of a crash), or sync replication, which means the transaction doesn't commit until the WAL is replicated on the other server (that adds latency but doesn't really affect throughput).
I'm not exactly sure how the failover system works in Postgres; the last time I set up replication on Postgres it would only copy a WAL segment after it was fully written, but I know they have a much more fine-grained system now.
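If you go the async route it's worth keeping an eye on how far the standby lags. A minimal psycopg2 sketch using the 9.x WAL-location functions (hostnames and dbnames are placeholders):

    import psycopg2

    # Compare WAL positions on the master and a streaming standby.
    master = psycopg2.connect(host='master.example.com', dbname='postgres')
    standby = psycopg2.connect(host='standby.example.com', dbname='postgres')

    cur = master.cursor()
    cur.execute("SELECT pg_current_xlog_location()")
    master_pos = cur.fetchone()[0]

    cur = standby.cursor()
    cur.execute("SELECT pg_last_xlog_replay_location()")
    standby_pos = cur.fetchone()[0]

    print('master at %s, standby replayed to %s' % (master_pos, standby_pos))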
If you use SQL Server you can add a 3rd monitoring server, and your connections fail over to the new master pretty much automatically as long as you add the 2nd server to your connection string. Using the setup with a 3rd server can create some very strange failure modes, though.
Another possibility on the AWS cloud is to put the logs on S3; Amazon S3 has high durability. Heroku has published a tool for doing this: https://github.com/heroku/WAL-E , which they use to manage their database product.
If this is for Postgres, refer to this:
http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial
See the section "Starting Replication with only a Quick Master Restart"; I tested this with a DB <10 GB and the setup time for replication was on the order of minutes. On a related note, it would be awesome if someone could point out real-world experiences with Postgres 9.1 synchronous replication.
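One quick sanity check when experimenting with 9.1 synchronous replication is the pg_stat_replication view on the master, which shows whether each standby is attached synchronously. A small psycopg2 sketch (connection details are placeholders):

    import psycopg2

    # On the master: list connected standbys and whether each is sync or async.
    conn = psycopg2.connect(host='master.example.com', dbname='postgres')
    cur = conn.cursor()
    cur.execute("SELECT application_name, state, sync_state"
                " FROM pg_stat_replication")
    for name, state, sync_state in cur.fetchall():
        print('%s: %s (%s)' % (name, state, sync_state))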
By "reliable" Amazon mean "won't lose your data", and they deliver on that. The issue in the article is around latency, and Amazon aren't making any claims in that area. High-throughput databases need steady latency guarantees, so they're not a great fit for EBS. EBS is great for many other scenarios though.
EBS is fine until it isn't. The problem isn't general EBS suckage, it's unpredictable and sporadic suckage. When your DB server is blocking while it tries to write to a disk that isn't responding, things get really hairy really quickly.
I'm unfamiliar with hosting costs or really any costs running a site as popular as reddit. Anyone with experience in this area have a ballpark figure for how much it would cost per month to run this sort of setup?
My back-of-the-envelope estimate: it's based on the figures from last year and the fact that they currently have 240 EC2 instances, some large (guessed 70), more x-large (guessed 170).
A year and a half ago, it was calculated and then confirmed by an admin[1] that the monthly cost was around 22K/month, or 270K/year. jedberg added that they were projecting to be around 350K/year by the end of 2010.
Supposing that the cost increased linearly with the number of users (which sounds like a bad hypothesis, but is a start), the cost at the end of 2011 could be around 1M/year... That's impressive, but nowhere near the 300K/month proposed by rdouble.
So I would say that the monthly cost of reddit's infrastructure is around 90K, which is really impressive.
You're probably right as I calculated with expensive instances. Also, when I made my estimate I was guessing at image storage costs, forgetting that the images are coming from image sharing sites.
Many/most page views on Reddit don't have any ads. Promoted stories only appear on story lists, not individual story pages, and don't appear 100% of the time (the space is also used to promote random new submissions). The graphical slot in the sidebar is almost 100% non-paid in-house ads.
I've always wondered why they don't contract with other ad networks when they cannot fill the ad inventory themselves. Say, for example, their self-serve ads can't fill the page request: why not put in a Google text ad on the right side where the banner is? That to me seems like a straightforward way to massively increase revenues.
They don't do it because they really care about the user. Just sticking up random Google ads isn't going to make anybody happier, and with an internet-savvy crowd like reddit's, ad clicks are likely to be low. Sure, very targeted ads like the ones self-serve currently delivers work, because it's redditors advertising to redditors.
I wonder how much of that 2TB dataset is necessary for the common daily functionality of reddit. Probably less than 1%, with the rest being historical data accessed by almost no one, except perhaps by the submission-dupe-checking algorithms and similar?
Are you suggesting moving that down the ladder and not having that data everywhere? I.e. if someone wants to see an old post, there would be one extra step required to load the data (so CDN, Cassandra, now a subset of Postgres with not-so-old data, then "full" Postgres). I think Facebook does something similar, but they really have to, considering their size.
In terms of status updates (i.e., stories which may mention check-ins or photos or similar, but not Facebook Messages), before Facebook's Timeline launch there were multiple stores of data depending on age. With Timeline, all the different versions of data over all ages were put back together into a single (logical) store. More about that process at:
Those are staggering numbers; glad I invested my time in reddit last year. We must be cautious of overheating though: signs of a bubble, or a possible subreddit crisis.
Running a DB on a single spindle, and they have performance problems?
I couldn't imagine why.
2 TB, OMG, that's almost a decent-sized SQL Server instance. Yeah, it should take about an hour or two to replicate. I'm assuming they have 10Gb Ethernet on their DB server.