Having no experience with database/website administration myself, I'm struck by just how little I'm able to translate the work and concepts in this post into a sense of the actual, manual labor involved.
For each and every thing that Jason talked about...upgrading Cassandra, moving off EBS, embarking on self-heal and auto-scale projects...what took the reader a few seconds to read and cognise undoubtedly represented hours and hours of work on the part of the Reddit admins.
I guess it's just the nature of the human mind. I don't think I could ever fully appreciate the amount of work that goes into any project unless I've been through it myself (and even then, the brain is awesome at minimizing the memory of pain). So Reddit admins, if you're reading this, while I certainly can't fully appreciate the amount of labor and life-force you've dedicated to the site, I honestly do appreciate it, and I wish you guys nothing but success in the future!
It's interesting to see that they're sticking with Cassandra, and that they're having a much better experience with 0.8. I've been hearing so many fellow coders in SF hate on Cassandra that I had stopped considering it for projects. Has anybody worked with 0.8 or 1.0? Would you recommend Cassandra?
I got to work with Riak a lot while I was at DotCloud, but the speed issue was pretty frustrating (it can be painfully slow).
This is because people came to the table with unrealistic expectations. They were used to dealing with mature software built on decades-old, proven ideas, and they came into very experimental territory expecting a smooth experience.
Cassandra has enabled Reddit to manage a highly scalable distributed data store with a tiny staff. This is not to say it has been trouble free, but it has enabled them to do something that would have been infeasible without pioneers in this space (Cassandra, Riak, Voldemort, etc) making these tools available.
I respect the Reddit team, but I don't think they need to use Cassandra at their scale. I mean they only have 2TB of data in total. They should easily be able to use a simple caching system to keep the last 2 weeks of data in RAM and basically never read from the database.
That said, they may be freaked out based on their growth curve and simply thinking ahead.
They said that they had 2TB in postgres not 2TB of total data. I imagine all of their data is probably about an order of magnitude larger. Additionally, the challenges are not as much around how much data you have, but how you want to access that data (indexes).
Indeed. It boils down to their need for a durable cache. It's simply too expensive to try to cache every comment tree in RAM, and Cassandra's data model and disk storage layout are a really good fit for the structure of their data.
You don't need every comment tree in RAM, just the last few days' worth plus a few older threads that get linked back to. They are currently using 200 machines, so let's say 10 of them are used to cache one week's comments. 30 GB of RAM * 10 machines = 300 GB of cache. I would be very surprised if they generate 200 GB/week, or 10 TB of comment data a year.
Edit: For comparison, Slashdot ran for a long time on just 6 less powerful machines vs. the 200+ Reddit is using. Reddit may have more traffic, but not 40x as much. And last I heard, HN just uses one machine.
PS: The average comment is small, and they can compress most comments after a day or so. They can probably get away with storing a second copy of most old threads as a blob of data in case people actually open it, which costs a little space but cuts down on processing time.
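To make the caching idea concrete, here's a rough cache-aside sketch with python-memcache; the hosts, key scheme, and DB helper are all made up for illustration, not anything reddit actually runs:

    import memcache

    # Toy illustration only: keep rendered comment trees hot for a week and
    # fall back to the database on a miss. Hosts and helpers are hypothetical.
    mc = memcache.Client(['cache1:11211', 'cache2:11211'])
    ONE_WEEK = 7 * 24 * 3600

    def load_comment_tree_from_db(thread_id):
        # stand-in for the real query against the database
        return {'thread': thread_id, 'comments': []}

    def get_comment_tree(thread_id):
        key = 'comments:%s' % thread_id
        tree = mc.get(key)
        if tree is None:
            tree = load_comment_tree_from_db(thread_id)
            mc.set(key, tree, time=ONE_WEEK)
        return tree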
The one great thing with Cassandra is how easy it is to expand your cluster. You just start a new server up, point it to the existing cluster, and it automatically joins, streams the sharded data it should have to itself, and starts serving requests.
Balancing your cluster requires a little bit more handholding, and if something goes wrong or you fuck it up, it can be pretty challenging. But most of the time it's pretty painless.
There are a lot of other warts, though: the data model is slightly weird, secondary indexing is slow, and eventual consistency is hard to wrap your head around. But it doesn't require much effort to run and operate a large cluster, and if that's important to you and your application, you should check it out.
The NoSQL space is pretty interesting, but there is no clear winner; each of the competing solutions has its own niche and its own specialities, so it's impossible to give general recommendations right now.
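To give a feel for the data model, here's a tiny pycassa sketch; the keyspace, column family, and hosts are invented, not reddit's actual schema. One row per thread and one column per comment, which is the kind of wide-row layout people mean when they say it fits comment trees:

    import pycassa

    # Minimal sketch, not a real schema: keyspace, column family, and hosts
    # below are made up. One row per thread, one column per comment, so a
    # whole tree lives in a single wide row.
    pool = pycassa.ConnectionPool('Example', ['node1:9160', 'node2:9160'])
    comments = pycassa.ColumnFamily(pool, 'CommentTree')

    comments.insert('thread_42', {'comment_1001': 'first!',
                                  'comment_1002': 'a reply'})
    print(comments.get('thread_42', column_count=1000))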
That's what we hear from our customers as well. They complain about excessive CPU and memory usage.
The two phases we've seen are:
1/ It's flexible and it works! Problem solved!
2/ 21st century called, they want their performance back.
The problem with phase 2 is that you may not be able to solve it by throwing more computing power at it.
Unfortunately if you really need map-reduce, at the moment I don't know what to recommend. Riak isn't better performance-wise and our product doesn't support map-reduce (yet).
However, if you don't need map-reduce I definitely recommend not using Cassandra. There are a lot of non-relational databases out there that are an order of magnitude faster.
It reminds me of Slashdot circa 1998/99, back when we watched those guys grow their then-newfound popularity out of a dorm-room Linux box, at a time when the web was a mere fraction of the size it is today.
The Amazon servers have local disks physically attached. They are wiped between customers, on machine failure, etc., hence "ephemeral". EBS (Elastic Block Store) is accessed as a disk but sits at the other end of a network connection. Amazon does more to ensure the contents are available and durable (e.g. replication, backups to S3). The problem with EBS is that performance, especially latency, is highly variable and unpredictable.
We have the same setup: local ephemeral disks on EC2 with Postgres. We never even tried EBS, as we had heard too many negative things about it, namely its variance in performance.
So our approach is to RAID-10 four local volumes together. We then use replication to at least 3 slaves, all of which are configured the same and can become master in the event of a failover.
We use WAL-E[0] to ship WAL logs to S3. WAL-E is totally awesome. Love it!
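For anyone wondering what the ship-to-S3 part boils down to, here's a toy boto sketch; WAL-E itself does a lot more (compression, base backups, and so on), and the bucket, path, and segment name here are made up:

    import boto

    # Toy version of "push one WAL segment to S3"; nothing like WAL-E's real code.
    conn = boto.connect_s3()  # reads AWS credentials from the environment
    bucket = conn.get_bucket('example-wal-archive')  # hypothetical bucket
    segment = '000000010000000000000042'             # hypothetical segment name
    key = bucket.new_key('wal/%s' % segment)
    key.set_contents_from_filename(
        '/var/lib/postgresql/9.1/main/pg_xlog/%s' % segment)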
I'm glad you like wal-e. I tried rather hard to make it as easy as I could to set up.
Please send feedback or patches; I'm happy to work with someone if they have an itch in mind.
If one has a lot of data, EBS becomes much more attractive because swapping the disks in the case of a common failure (instance goes away) is so much faster than having to actually duplicate the data at the time, presuming no standby. Although a double-failure of ephemerals seems unlikely and the damage is hopefully mitigated by continuous archiving, the time to replay logs in a large and busy database can be punishing. I think there is a lot of room for optimization in wal-e's wal-fetching procedure (pipelining and parallelism come to mind).
Secondly, EBS seek times are pretty good: one can, in aggregate, control a lot more disk heads via EBS. The latency is a bit noisy, but last I checked (not recently) it was considerably better than what RAID-0 on ephemerals would allow one to do on some instance types.
Thirdly, EBS volumes share one's slice of the network interface on the physical machine. That means larger instance sizes can have less noisy-neighbor effect and more bandwidth overall, while RAID 1/1+0 are going to be punishing. I'm reasonably sure (but not 100% sure) that mdadm is not smart enough to let a disk with decayed performance "fall behind", demoting it from the array and using a mirrored partner in preference. Overall, use RAID-0 and archiving instead.
When an EBS volume suffers a crash/remirroring event it will get slow, though, and if you are particularly performance sensitive that would be a good time to switch to a standby that possesses an independent copy of the data.
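To make the earlier wal-fetch pipelining/parallelism point concrete, the idea is roughly the sketch below: fetch several upcoming segments concurrently rather than strictly one at a time while the standby replays. The fetch helper and segment names are made up; this is not wal-e's actual code:

    from multiprocessing.dummy import Pool  # thread pool; fetching is I/O bound

    def fetch_segment(name):
        # hypothetical helper: download one WAL segment from S3 into pg_xlog
        print('fetched %s' % name)

    # Segment names the standby will ask for next (invented for the example).
    upcoming = ['000000010000000000000040',
                '000000010000000000000041',
                '000000010000000000000042',
                '000000010000000000000043']
    Pool(4).map(fetch_segment, upcoming)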
Lots and lots of replication, more or less. I don't know how it works in Postgres, but with something like Mongo, you set up a replication cluster and presume that up to (half - 1) of the nodes can fail and still maintain uptime. Postgres, being a relational database rather than a document store, likely has an additional set of challenges to overcome there, but it's very possible to do.
You just copy the WAL to another server and replay it. It takes a day to set up and test. Once that is set up you have two options: async replication (which means you'll lose about 100ms of data in the event of a crash), or sync replication, which means the transaction doesn't commit until the WAL is replicated on the other server (that adds latency but doesn't really affect throughput).
I'm not exactly sure how the failover system works in Postgres; the last time I set up replication on Postgres it would only copy a WAL segment after it was fully written, but I know they have a much more fine-grained system now.
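If you go the async route it's worth keeping an eye on how far the standby lags. A minimal psycopg2 sketch using the 9.x WAL-location functions (hostnames and dbnames are placeholders):

    import psycopg2

    # Compare WAL positions on the master and a streaming standby.
    master = psycopg2.connect(host='master.example.com', dbname='postgres')
    standby = psycopg2.connect(host='standby.example.com', dbname='postgres')

    cur = master.cursor()
    cur.execute("SELECT pg_current_xlog_location()")
    master_pos = cur.fetchone()[0]

    cur = standby.cursor()
    cur.execute("SELECT pg_last_xlog_replay_location()")
    standby_pos = cur.fetchone()[0]

    print('master at %s, standby replayed to %s' % (master_pos, standby_pos))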
If you use SQL Server you can add a 3rd monitoring server, and your connections fail over to the new master pretty much automatically as long as you add the 2nd server to your connection string. Using the setup with a 3rd server can create some very strange failure modes, though.
Another possibility on the AWS cloud is to put the logs on S3; Amazon S3 has high durability. Heroku has published a tool for doing this: https://github.com/heroku/WAL-E , which they use to manage their database product.
If this is for Postgres, refer to this:
http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial
See the section "Starting Replication with only a Quick Master Restart"; I tested this with a DB <10 GB and the setup time for replication was on the order of minutes. On a related note, it would be awesome if someone could point out real-world experiences with Postgres 9.1 synchronous replication.
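One quick sanity check when experimenting with 9.1 synchronous replication is the pg_stat_replication view on the master, which shows whether each standby is attached synchronously. A small psycopg2 sketch (connection details are placeholders):

    import psycopg2

    # On the master: list connected standbys and whether each is sync or async.
    conn = psycopg2.connect(host='master.example.com', dbname='postgres')
    cur = conn.cursor()
    cur.execute("SELECT application_name, state, sync_state"
                " FROM pg_stat_replication")
    for name, state, sync_state in cur.fetchall():
        print('%s: %s (%s)' % (name, state, sync_state))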
By "reliable" Amazon mean "won't lose your data", and they deliver on that. The issue in the article is around latency, and Amazon aren't making any claims in that area. High-throughput databases need steady latency guarantees, so they're not a great fit for EBS. EBS is great for many other scenarios though.
EBS is fine until it isn't. The problem isn't general EBS suckage, it's unpredictable and sporadic suckage. When your DB server is blocking while it tries to write to a disk that isn't responding, things get really hairy really quickly.
I'm unfamiliar with hosting costs or really any costs running a site as popular as reddit. Anyone with experience in this area have a ballpark figure for how much it would cost per month to run this sort of setup?
My back-of-the-envelope estimate: it's based on the figures from last year and the fact that they currently have 240 EC2 instances, some large (guessed 70), more x-large (guessed 170).
A year and a half ago, it was calculated and then confirmed by an admin[1] that the monthly cost was around 22K/month, or 270K/year. jedberg added that they were projecting to be around 350K/year by the end of 2010.
Supposing that the cost increased linearly with the number of users (which sounds like a bad hypothesis, but is a start), the cost at the end of 2011 could be around 1M/year... That's impressive, but nowhere near the 300K/month proposed by rdouble.
So I would say that the monthly cost of reddit's infrastructure is around 90K, which is really impressive.
You're probably right as I calculated with expensive instances. Also, when I made my estimate I was guessing at image storage costs, forgetting that the images are coming from image sharing sites.
Many/most page views on Reddit don't have any ads. Promoted stories only appear on story lists, not individual story pages, and don't appear 100% of the time (the space is also used to promote random new submissions). The graphical slot in the sidebar is almost 100% non-paid in-house ads.
I've always wondered why they don't contract with other ad networks when they cannot fill the ad inventory themselves. Say, for example, their self-serve ads can't fill the page request: why not put in a Google text ad on the right side where the banner is? That to me seems like a straightforward way to massively increase revenues.
They don't do it because they really care about the user. Just sticking up random Google ads isn't going to make anybody happier, and with an internet-savvy crowd like reddit's, ad clicks are likely to be low. Sure, very targeted ads like the ones self-serve currently delivers work, because it's redditors advertising to redditors.
I wonder how much of that 2TB dataset is necessary for the common daily functionality of reddit. Probably less than 1%, with the rest being historical data accessed by almost no one, except perhaps by the submission-dupe-checking algorithms and similar?
Are you suggesting moving that down the ladder and not having that data everywhere? I.e. if someone wants to see an old post, there would be one extra step required to load the data (so CDN, Cassandra, now a subset of Postgres with not-so-old data, then "full" Postgres). I think Facebook does something similar, but they really have to, considering their size.
In terms of status updates (i.e., stories which may mention check-ins or photos or similar, but not Facebook Messages), before Facebook's Timeline launch there were multiple stores of data depending on age. With Timeline, all the different versions of data over all ages were put back together into a single (logical) store. More about that process at:
Those are staggering numbers; glad I invested my time in reddit last year. We must be cautious of overheating though: signs of a bubble, or a possible subreddit crisis.
Running a DB on a single spindle, and they have performance problems?
I couldn't imagine why.
2 TB, OMG, that's almost a decent-sized SQL Server instance. Yeah, it should take about an hour or two to replicate. I'm assuming they have 10Gb Ethernet on their DB server.