
If I were designing reddit, I would enforce the following rule to handle their scaling problems: you cannot touch the database when you are reading the homepage. Instead, every minute we generate the canonical list of ~500 links that is "reddit". Cache the HTML for that and throw it on a CDN. The "next" and "previous" timestamped links should reference when the page was first loaded so you can continue paging and avoid the duplicate-link problem. You can cheat for logged-in users by loading their data after page load with JS, but the site would never fail entirely if it were overloaded. Votes should never modify another record; they should always create a new record in your datastore to avoid locking problems.
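
Roughly, as a sketch (all names here are assumptions, nothing like reddit's actual code): a periodic job renders the canonical listing once a minute and caches the HTML keyed by generation time, while every vote appends a new row instead of updating the link in place.

  import time

  votes = []          # append-only vote log: a new row per vote, no updates
  page_cache = {}     # stands in for the CDN, keyed by generation timestamp

  def record_vote(user_id, link_id, direction):
      # Never touches the link record itself, so readers and the rebuild
      # job never contend on the same row.
      votes.append({"user": user_id, "link": link_id,
                    "dir": direction, "ts": time.time()})

  def rebuild_front_page(top_links):
      # top_links: the canonical ~500 links computed by the minutely job.
      generated_at = int(time.time())
      html = "\n".join(f"<li>{link['title']}</li>" for link in top_links)
      # Keying by generated_at lets "next"/"previous" paging stay consistent
      # for readers who loaded an older snapshot.
      page_cache[generated_at] = html
      return generated_at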



That's exactly how it has worked for 10+ years. If you have no reddit cookie, the site will never be down (unless the CDN has failed too).

The problem is that when you do have a cookie, until recently there was no good way for the CDN to know if you needed data from the servers or not, so all requests with cookies got forwarded to the servers. In most cases, you'd still never hit the database on a home page load, except possibly to read your user data out and find out what subreddits you subscribe to. But you may have a failed page load because the servers were overloaded for other reasons.
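
Roughly the decision the edge layer was making, as a sketch (the cookie name and function are illustrative, not reddit's actual config):

  def serve_from_edge_cache(cookies):
      # Anonymous requests never reach the origin; the CDN serves the
      # cached page directly.
      if "reddit_session" not in cookies:    # assumed cookie name
          return True
      # Requests with a session cookie historically had to be forwarded,
      # because the edge couldn't tell which parts were user-specific.
      return False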

I believe they are now fixing this by doing more on the frontend, allowing for caching of logged-in requests as well.


Cool, glad to hear from an insider about how things are/were run. Do you have any more insight into what your performance budget is/was for a typical request/response cycle? We aim for sub-millisecond response time at the 99th percentile.


I haven't worked at reddit for seven years now, but back then, we didn't have a performance budget. We tracked median and 99th percentile for each API call, and then made targeted optimizations for the slowest requests, but we didn't have the resources to set a specific budget. The goal was just to make sure that the most popular requests remained as fast as possible and didn't get worse over time.


I think you are really confused here.

The problem with Reddit isn't the subreddit home pages; it's the article comment pages. There are so many of them, people routinely comment/upvote months-old articles, and the distribution is unpredictable, so it can be difficult to know what to cache.

And there isn't a datastore on this planet that could handle an event log like what you are proposing. Reddit would receive hundreds of thousands/millions of votes a second.


"Reddit would receive hundreds of thousands/millions of votes a second. "

Nitpick: According to the article, they receive 75 million votes a day. If their traffic were evenly distributed (which of course it isn't), that would be less than one thousand votes a second.
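
Back-of-the-envelope check of that figure:

  >>> round(75_000_000 / 86_400)   # votes per day / seconds per day
  868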


Nah, definitely not confused about the architecture; it's fairly simple to wrap your head around. What I'm advocating for is making the general site semi-dynamic instead of fully dynamic (break expensive caches at regular, predefined time intervals instead of every time data changes).

The key insight is that reddit still functions as you expect if you are seeing stale data from other people so long as your posts/comments/votes are visible when you navigate.

Shard the databases by subreddit and you will be able to horizontally scale the load. Use JS to post to a separate subdomain for each service, i.e. everything under www.reddit.com/r/funny points to funny.reddit.com, which can live as its own independent service.

Internally I would treat reddit as a group of individual high traffic sites instead of one massively trafficked site. Then you have another service that queries each site to build the main page.
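
A rough sketch of that split (hosts and functions are hypothetical): each subreddit maps deterministically to its own backend, and the front page is built by fanning out to those backends and merging their top links.

  import hashlib

  SHARDS = ["shard-a.internal", "shard-b.internal"]   # hypothetical hosts

  def shard_for(subreddit):
      # Deterministically pin a subreddit to one backend/database.
      h = int(hashlib.md5(subreddit.encode()).hexdigest(), 16)
      return SHARDS[h % len(SHARDS)]

  def build_front_page(subreddits, fetch_top):
      # fetch_top(host, subreddit) -> list of {"title": ..., "score": ...};
      # assumed to call that subreddit's own service (e.g. funny.reddit.com).
      links = []
      for sub in subreddits:
          links.extend(fetch_top(shard_for(sub), sub))
      return sorted(links, key=lambda l: l["score"], reverse=True)[:25]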


  Shard the databases by subreddit and you will be able to horizontally scale the load. Use JS to post to a separate subdomain for each service, i.e. everything under www.reddit.com/r/funny points to funny.reddit.com, which can live as its own independent service.

  Internally I would treat reddit as a group of individual high traffic sites instead of one massively trafficked site. Then you have another service that queries each site to build the main page.

this is terrifying to read, in that this is exactly what the media companies did for the last decade to try, and totally fail, to compete with facebook.

having a unified platform is a competitive advantage.

also your "sharding and subdomains" trick is straight out of 2010 and a mix of irrelevant (everything's sharded now) and anachronistic (h2 is everywhere now).

also-also, you clearly have missed the best use case of reddit, which is to have unsub'd from all/most of the "default" reddits and replaced them with your own customized set.

you are optimizing for everyone reading the same thing (newspaper homepage) not everyone reading their custom thing (social network). that has been a proven losing strategy for a solid decade now, arguably two.


Everyone is reading the same thing on reddit, just different combinations of the same thing. Reddit is just a combination of multiple newspapers masquerading as a single custom newspaper. No reason why the internals should leak to what is shown to the user. You could keep the exact same user experience that you have today and guarantee that the site never falls over.

You need to make the distinction between UX and Operations. Everything the user sees still lives under the same structure as today but internally it is structured so that you don't overload yourself and if you do fail, those failures are isolated and don't take down the rest of the site.


  Everyone is reading the same thing on reddit, just different combinations of the same thing.
this is categorically untrue, you don't seem to understand reddit.

  Reddit is just a combination of multiple newspapers masquerading as a single custom newspaper. 
that is exactly the anachronistic thinking that leads you down this path.

it's movable type vs wordpress all over again. wordpress won.

the engineering constraint of pre-calculating caches is worse, in the long run, than the occasional outage for casual/non-logged-in users. it's optimizing for the wrong audience; it is a long slow path to failure.


We're taking a sort of hybrid approach to this on https://threadbase.io, where you can get a subdomain for your community (or custom domain in the future) and put your own ads in (so moderators can make money). Eventually we'll create a shared homepage for site creators who opt in, and allow you to filter communities you want to see there.


> the distribution is unpredictable so it can be difficult to know what to cache.

Reddit's data is shockingly small[0]; cache eviction should be a non-issue as it would be trivial to keep it all in RAM.

[0] https://www.reddit.com/r/datasets/comments/65o7py/updated_re...


What I'm advocating for is not caching in RAM but storing the page as a file in something like S3. Build the full HTML page for a time period and store it as a blob, serve it to the user, and have JS take over on the front end. You can use local storage to cache information on the client's computer and purge that cache after another predetermined time period.

This way you reduce the number of times you bust the cache unnecessarily, since it doesn't matter whether you are getting up-to-the-second or up-to-the-minute updates from reddit.
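
A sketch of that blob cache, assuming boto3 and a hypothetical bucket name: the rendered page for each minute is written as a single object, so the cache only refreshes on a fixed schedule no matter how many votes arrive in between.

  import time
  import boto3

  s3 = boto3.client("s3")

  def publish_snapshot(subreddit, html):
      # One object per subreddit per minute; readers get the latest full
      # minute and client-side JS layers user-specific data on top.
      minute = int(time.time()) // 60
      s3.put_object(
          Bucket="reddit-page-snapshots",      # hypothetical bucket
          Key=f"{subreddit}/{minute}.html",
          Body=html.encode("utf-8"),
          ContentType="text/html",
      )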


That seems like it would totally break the model, where you subscribe to sub-Reddits you want to read and don't subscribe to ones you don't want to read.


As written it would, but you could trivially adapt it to the subreddit level. Cache each subreddit the same way, then coalesce the results dynamically, which would still remove the main bottleneck of connecting to the db.


So we are keeping thousands of subreddits cached that then get dynamically composed together to determine an individual user's front page? How is that not just creating a second database?


I'm basically advocating for a system similar to Erlang Term Storage (ETS), but for those who don't use Erlang. Given that rewriting reddit in Erlang/Elixir would be unpalatable, we can take the lessons from how other systems maintain high availability and apply them to our environment.


If I understand you correctly, there’s a new row entry for every vote?


I would absolutely expect every vote to be recorded in a row with at least up/down, by whom, and when. You can browse your own votes on posts, I don't know if there's a way to browse votes on comments.

https://www.reddit.com/user/USERNAME/upvoted/
https://www.reddit.com/user/USERNAME/downvoted/
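
Something like the following per-vote record, with field names as assumptions rather than reddit's actual schema:

  from dataclasses import dataclass
  from datetime import datetime

  @dataclass
  class Vote:
      user_id: int
      thing_id: str        # the post or comment being voted on
      direction: int       # +1 for an upvote, -1 for a downvote
      created_at: datetime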


You realise that Reddit is over 10 years old. There would have been trillions of votes since then.

What database are you thinking of that could store all of those and then, I assume, aggregate them? Kind of curious about this proposed architecture.


Postgres. And that's exactly how it has been. Every vote is a row that gets post-processed.


Wouldn’t that database be just positively massive? How fast does the database grow?


I would expect vote sums to be stored separately, you don't need to query a table containing all votes on every page load any more than a bank needs to query all transactions to show your account balance.
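
In other words, the vote log gets post-processed into a cached score per item, something like this sketch (table and column names are assumptions, using psycopg2 against Postgres):

  import psycopg2

  def refresh_scores(dsn):
      # Roll the append-only vote log up into a denormalized score column,
      # so page loads read one number instead of scanning the vote table.
      with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
          cur.execute("""
              UPDATE things t
                 SET score = v.total
                FROM (SELECT thing_id, SUM(direction) AS total
                        FROM votes
                       GROUP BY thing_id) v
               WHERE t.id = v.thing_id
          """)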



