I love reading about how companies scale their BigHuge data, but it bothers me that we still haven't reached the point where scalability is a commodity instead of a patchwork of technology that every actor solves in their own way.
It's because "how to make a website scale" depends heavily upon which website, what it does, and how big it needs to be. Making a messaging queue like Twitter or G+ scale is very different from making Google Search scale. Hell, making the indexing system of Google search scale is a very different problem from making the serving system scale.
You can't really avoid having a patchwork of technology, because it's a patchwork of problems. Instead, there're a bunch of tools at your disposal, a few "best practices" which are highly contextual, and you have to use your judgment and knowledge of the problem domain to put them together.
>It's because "how to make a website scale" depends heavily upon which website, what it does, and how big it needs to be. Making a messaging queue like Twitter or G+ scale is very different from making Google Search scale. Hell, making the indexing system of Google search scale is a very different problem from making the serving system scale.
For most websites it's not THAT different.
Actually, most have pretty similar needs, and you can sum those up in 3-5 different website architectural styles anyway.
There is far more duplication of work and ad-hoc solutions to the SAME problems than there are "heavily different" needs.
News/Magazine/Portal like (read heavy),
Game site (evented, concurrent users, game engine computations),
Social Platform (read-write heavy),
etc.
Most needs are bog standard. If you really look at most successful sites, they use same-ish architectures, only with different components/languages/libs each.
Basically, all high-volume sites use something like the notions behind Google App Engine and the services it offers. The various AWS tools are also similar (S3, the table they offer, etc).
I think you're missing a lot of complexity of the considerations that actually go into implementing any of the above. I can think of 3 subsystems within Reddit alone (reading, voting, and messages) that all have different usage patterns and (if they're doing it right) require different approaches to scaling.
Where's e-commerce on your list? The approaches for scaling eBay are completely different than for scaling Reddit or YouTube, because eBay can't afford eventual consistency. You can't rely on caching to show a buyer a page whose price is an hour out-of-date.
Here's something else to think about: why do (the now-defunct) Google real-time search and GMail Chat have completely different architectures, despite both of them having the same basic structure of "a message comes in, and gets displayed in a scrolling window on the screen"? The answer is latency. With real-time search, a latency of 30 seconds is acceptable, since you aren't going to know when the tweet was posted in the first place. With GChat, it has to be immediate, because it's frequently used in the context of someone verbally saying "I'll ping you this link" and it's kinda embarrassing if the link doesn't arrive for 30 seconds. Real-time search also has to run much more computationally-intensive algorithms to determine relevance & ranking than GChat does.
I've personally worked on Google Search, Google+, and Google Fiber. I can tell you that they do not all use something like the notions behind Google AppEngine. There's no way you could build Google Search on AppEngine, and G+ would be a stretch.
>I can think of 3 subsystems within Reddit alone (reading, voting, and messages) that all have different usage patterns and (if they're doing it right) require different approaches to scaling.
Yes, and those would be the same in all social-bookmarking-type sites, and in the similar (voting, etc) components of a social site a la Facebook, G+, etc.
>The answer is latency. With real-time search, a latency of 30 seconds is acceptable, since you aren't going to know when the tweet was posted in the first place. With GChat, it has to be immediate, because it's frequently used in the context of someone verbally saying "I'll ping you this link" and it's kinda embarrassing if the link doesn't arrive for 30 seconds.
Yes, so that's one use case for one architecture (low latency message queue), and the other is another. I gave the example of a Game site that also has similar latency concerns.
>Where's e-commerce on your list? The approaches for scaling eBay are completely different than for scaling Reddit or YouTube, because eBay can't afford eventual consistency.
My list wasn't supposed to be exhaustive -- I spoke of 4-5 common styles and only mentioned 2. That said, the approaches behind eBay might not resemble Reddit or YouTube, but they will resemble others like Amazon, Etsy, etc.
>I've personally worked on Google Search, Google+, and Google Fiber. I can tell you that they do not all use something like the notions behind Google AppEngine.
I don't think we mean the same things. For one, nobody makes search engines or a Google competitor. So what it takes to do Google Search is a moot point when discussing common architectural patterns behind big sites.
I meant high-level stuff on one hand, like denormalised data, map reduce, workers, share-nothing, sharding and such, and common infrastructure on the other hand, like the relational DB, memcached, a BigTable-like datastore, an abstract filesystem (S3, BlobStore, GridFS, etc), ElasticSearch, Hadoop, Node, Redis, message queues, etc.
Google Search or Facebook might have needs way beyond those, but the above are shared by 99% of big sites out there.
A common system to build on top of them that is higher level than Heroku (and more expansive and accommodating than GAE) should exist, and it would cater to more than 80% of big websites' needs. Of course each will need some custom stuff, but not 80% custom stuff.
Scalability could only be a commodity if all big sites were built the same way. But they never are. A scalable read-only system is entirely different from a system with a mix of reads and writes, which is different again from a message routing system like Twitter or Facebook.
Well, because every site is made differently, uses different tech, and has different bottlenecks.
But with SSDs and RAM becoming a commodity (both prices have been dropping sharply), I/O is much easier to deal with.
And with every release of PostgreSQL / MySQL, easier database replication, and more common scaling practices, we are much, much better at it than, say, 2-3 years ago.
"Instead, they keep a Thing Table and a Data Table. Everything in Reddit is a Thing: users, links, comments, subreddits, awards, etc. Things keep common attribute like up/down votes, a type, and creation date. The Data table has three columns: thing id, key, value."
I hope they introduced some NoSQL sweetness by now.
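The Thing/Data layout quoted above is the classic entity-attribute-value (EAV) pattern. A minimal sketch of the idea in Python with sqlite3 -- table and column names here are illustrative, not Reddit's actual schema:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE thing (
            id         INTEGER PRIMARY KEY,
            type       TEXT,                 -- 'link', 'comment', 'user', ...
            ups        INTEGER DEFAULT 0,
            downs      INTEGER DEFAULT 0,
            created_at TEXT
        );
        CREATE TABLE data (
            thing_id INTEGER REFERENCES thing(id),
            key      TEXT,
            value    TEXT,
            PRIMARY KEY (thing_id, key)
        );
    """)

    # one 'link' Thing plus its free-form attributes
    conn.execute("INSERT INTO thing (id, type, ups, created_at) VALUES (1, 'link', 10, '2010-05-17')")
    conn.executemany("INSERT INTO data VALUES (1, ?, ?)",
                     [("title", "reddit lessons learned"), ("url", "http://example.com/")])

    # reassembling a Thing means folding its key/value rows back into a dict
    attrs = dict(conn.execute("SELECT key, value FROM data WHERE thing_id = 1"))

The appeal is that adding a new attribute never requires a schema migration; the cost is that every multi-attribute read turns into a join or an extra query, which is part of why it pairs so naturally with aggressive caching.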
They turned their RDBMS into a NoSQL database that's still slowed-down by all the relational machinery.
Though I'm sure this was an informed choice at the time, I seriously hope nobody finds this advice actionable anymore. Use Cassandra. (Or your NoSQL DB of choice)
They turned Postgres into an ad-hoc key-value store, but in reality they have Cassandra running as a "permacache" in front of it and, in practice, everything really just hits Cassandra. It looks like Postgres could at some point be phased out.
Source: I've hacked at the codebase to produce arxaliv
I remember when the Bigtable paper was released. It was very early in my career and I remember it sounding so alien to me. Sure, I had Memcached in my stack, but no SQL? It seemed like something they had to trade off to be able to build the kind of services they offer. I felt the same way after reading Dynamo.
Sure, I thought a lot about data design. I thought about usage patterns to inform how we denormalize. And I grew into using, eg, Gearman, to pre-compute dozens of tables every night. I evolved, a bit.
But a few years ago, a little before this OP was written, I had a great experience with some Facebook engineers and had an a-ha moment that has made me a much better software engineer. Basically, I realized that I needed to let my data be itself. If I have inherently relational data, then it should be in a relational database. But I've built EAVs, queues, heaps, lists, all of these on top of MySQL and Postgres. Let that data be itself.
We have more options now than ever before. K/V stores, column stores, etc. I use a lot of Cassandra. A lot of Redis. Some Mongo. And I put a lot more in flat files than I ever thought I would.
I know a lot of people that are smarter than me left the womb knowing these things. But for me it was transformational and has made me much happier. I realized how much energy I wasted fighting my own tools.
Does it mean that in a single project you may use two different storage solutions? SQL DB for relational data and NoSQL for queues, heaps, etc. ?
I'm asking because for me it is kind of a paradigm shift; I have used plain files and a SQL DB in the same project before, but never two different databases.
The other two commenters here beat me to it, but yes, absolutely. But be sure to right-size. If you don't have scalability issues, then your life is easier. Postgres has a k/v store option. Use that. Get more complex if you need to.
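That k/v option is presumably the hstore extension. A rough sketch with psycopg2, using a placeholder DSN and a made-up table, just to show the shape of it:

    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect("dbname=app")      # placeholder connection string
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS hstore")   # needs sufficient privileges
        cur.execute("""CREATE TABLE IF NOT EXISTS prefs (
                           user_id integer PRIMARY KEY,
                           attrs   hstore)""")
    conn.commit()

    psycopg2.extras.register_hstore(conn)      # hstore columns now map to/from Python dicts

    with conn.cursor() as cur:
        cur.execute("INSERT INTO prefs VALUES (%s, %s)",
                    (42, {"theme": "dark", "lang": "en"}))
        cur.execute("SELECT attrs -> 'theme' FROM prefs WHERE user_id = %s", (42,))
        print(cur.fetchone()[0])               # -> 'dark'
    conn.commit()

You get schemaless attributes per row without leaving Postgres, which fits the "get more complex only if you need to" advice.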
It's also perfectly acceptable to duplicate data -- store everything in Postgres or MySQL as your "master", cache all query results in Redis to avoid hitting the database whenever possible, and then build out lists/indexes/copies in a NoSQL database. Because yeah, corruption happens and it's great to be able to wipe and rebuild.
A common pattern would be.. i dunno.. think of a newsfeed.
* Each item is one entry in an "items" table, maybe with a "likes" int column.
* For each item, you have a sorted-set in redis of Likers.
* For every user you have a sorted-set for their newsfeed, with ID's of each item.
* When they hit the page, you load that sorted set, then query for the first 20 items from your database by ID (so it's a primary key lookup, and it probably won't even need to hit the Database because your redis cache layer will see it has that ID in cache already)
* You'd use redis MultiGet feature to reduce round-trip calls.
* When somebody "likes" something you write a job to a queue, which will eventually add the liker to the bottom of the sorted-set, then probably stores the details of the Like itself (timestamp, user, etc) to a slower datastore, which could even be a transactional log that you'd never read again unless you had some significant issue.
I've never actually built a newsfeed, so there are certainly issues here, but maybe it will be helpful as a general idea. Remember, somebody (can't remember whom) said "The only truly hard things in computer science are cache invalidation and naming things." Cache invalidation (knowing when to go back to source data because your cache is stale) is still an "analog" problem. By that I mean, there are a thousand shades of grey there and you have to pick a strategy that works for you.
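To make the read path above concrete, here's a rough Python/redis-py sketch. Key names like feed:<user_id> and item:<id> are invented for illustration, and load_items_from_db stands in for the primary-key lookup against the "items" table:

    import json
    import redis

    r = redis.Redis(decode_responses=True)

    def push_to_feed(user_id, item_id, timestamp):
        # one sorted set per user, scored by timestamp, read newest-first
        r.zadd("feed:%d" % user_id, {str(item_id): timestamp})

    def read_feed(user_id, limit=20):
        # 1. first page of item ids from the user's sorted set
        item_ids = r.zrevrange("feed:%d" % user_id, 0, limit - 1)
        if not item_ids:
            return []
        # 2. one MGET for all cached item bodies instead of N round trips
        cached = r.mget(["item:%s" % i for i in item_ids])
        items, misses = [], []
        for item_id, blob in zip(item_ids, cached):
            if blob is not None:
                items.append(json.loads(blob))
            else:
                misses.append(item_id)
        # 3. anything not cached falls through to primary-key lookups in the database
        if misses:
            items.extend(load_items_from_db(misses))
        return items

    def load_items_from_db(ids):
        # stand-in for SELECT ... FROM items WHERE id IN (...) plus re-warming the cache
        return []

The write side (the "like" job) would be the mirror image: ZADD the liker onto the item's sorted set, bump the counter, and append the full detail record to the slower store.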
In the stack I am working on we have a variety of databases all serving a different type of data storage:
- memcache: for caching of data that doesn't persist
- redis: caching of data that needs to be persisted short term but not long term (eg. sessions)
- MySQL: for user-like data (account details, addresses, projects, ...)
- DynamoDB: for millions of data points that only need to be queried in 1 dimension, so they are not related or compared to one another. eg. give me all values from this table containing a given datatype, between 2 dates
- MongoDB: for millions of datapoints that need to be queried on deeper levels
- etc.
Great, thank you for the detailed overview. If you don't mind me asking, do you ever end up having problems with consistency? (I imagine it could happen if some data was written to 2 databases and the second one rejected the transaction.)
Well, the downside of using multiple DB stores is that the logic of keeping everything consistent is in the hands of the developer. So you have to make sure that everything is written correctly.
For instance, if you write to MySQL and Mongo, but your Mongo is down, you'll either have to queue the data item somewhere for a write once the system is back up, or you have a migration system in place that gets everything from MySQL since the downtime and writes it back to Mongo.
Depending on the type of data we have a few easing factors: for some data stores it is not that big of a deal if something doesn't get written to its 2nd layer (eg. cache), as we can rewrite it the next time it is requested in layer 1.
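A hedged sketch of that "queue it for later" fallback in Python, using a Redis list as the holding pen -- the queue name and the write_to_* helpers are made up, and a real setup might use a proper message queue instead:

    import json
    import redis

    r = redis.Redis(decode_responses=True)
    RETRY_QUEUE = "mongo_retry"            # invented queue name

    def save(record):
        write_to_mysql(record)             # system of record; let failures here bubble up
        try:
            write_to_mongo(record)         # best-effort secondary store
        except Exception:
            # park the write so a worker can replay it once Mongo is reachable again
            r.rpush(RETRY_QUEUE, json.dumps(record))

    def drain_retry_queue():
        # run periodically from a worker/cron once Mongo is back up
        while True:
            blob = r.lpop(RETRY_QUEUE)
            if blob is None:
                break
            write_to_mongo(json.loads(blob))

    def write_to_mysql(record):
        pass                               # placeholder

    def write_to_mongo(record):
        pass                               # placeholder

A production version would want at-least-once delivery (e.g. not discarding the item until the Mongo write succeeds), but the overall shape is the same.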
Yes. Always use the right tool for the job. It's okay to have some duplication (an object stored in Postgres and also part of a Mongo doc or cached in Redis). Don't forget storage is cheap.
> I hope they introduced some NoSQL sweetness by now.
Postgres has good support for being used as a key/value store. http://blog.reddit.com/2012/01/january-2012-state-of-servers... indicates they're still using Postgres for the main backend, and Cassandra next to it as both cache (it replaced memcached back in 2010) and to store data for some features (moderation log and flairs)
Their most recent infrastructure blog post was from January 2012, and shows that they're using Postgres 9, Cassandra 0.8, and local disk only (no more EBS). I'm curious if the recently-announced provisioned IOPS would enable them to go back to EBS.
Only disagreement (although I feel like I'm arguing with Linus about git) is: don't memcache session data (lesson 5). Memcache's 1MB max block size (exceeding it removes too many performance perks to be considered viable) introduces an "I need to constantly worry about my sessions getting too big" mental overhead that isn't worth it.
Even better: Don't use sessions. If there's data you're caching in a session, pre-compute it and store it in Redis (or other suitable NoSQL database).
Going stateless wherever possible almost always wins. (And I only say "almost" because I'm not smart enough to know if it really always wins. But for me, it has.)
> Even better: Don't use sessions. If there's data you're caching in a session, pre-compute it and store it in Redis (or other suitable NoSQL database).
What do you mean by data you are caching in a session? Sessions are for maintaining state, viz. logged in or not, language preference, etc.
How is storing in Redis (or any other data store) any different from storing in Memcache?
> Going stateless wherever possible almost always wins.
Using sessions with a data store which can be accessed from multiple machines (no files or in-memory stores) is stateless. The simplest implementation will have a signed cookie with a session id. Your application will have a hook to load the user session from the backing data store before processing the request.
I know about memcache key eviction. For most of the applications, that is an acceptable compromise. For others, back up the sessions in the db every n seconds as mentioned in the article. Session is loaded every request and hitting the db every request is not something I want.
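For concreteness, a bare-bones Python sketch of that scheme -- the cookie carries only a signed session id, the session body lives in Redis/memcache, and in practice a framework's session middleware does all of this for you. The signing key and key prefix are placeholders:

    import hashlib
    import hmac
    import json
    import os
    import redis

    SECRET = b"replace-me"                      # placeholder signing key
    store = redis.Redis(decode_responses=True)  # or a memcache client; same idea

    def _sign(session_id):
        mac = hmac.new(SECRET, session_id.encode(), hashlib.sha256).hexdigest()
        return "%s.%s" % (session_id, mac)

    def _unsign(cookie_value):
        try:
            session_id, mac = cookie_value.rsplit(".", 1)
        except ValueError:
            return None
        expected = hmac.new(SECRET, session_id.encode(), hashlib.sha256).hexdigest()
        return session_id if hmac.compare_digest(mac, expected) else None

    def create_session(user_id):
        # at login: the session body goes in the store, only the signed id goes in the cookie
        session_id = os.urandom(16).hex()
        store.set("sess:" + session_id, json.dumps({"user_id": user_id}), ex=86400)
        return _sign(session_id)                # value for the Set-Cookie header

    def load_session(cookie_value):
        # the before-request hook: no cookie / bad signature / unknown id -> no session
        session_id = _unsign(cookie_value) if cookie_value else None
        if session_id is None:
            return None
        blob = store.get("sess:" + session_id)
        return json.loads(blob) if blob else None

Nothing here cares which webserver handled the previous request, which is the sense in which it stays stateless.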
Once you move from storing session data in-memory on the webserver, and add a network call, why not just store the data alongside the users other data in a fast datastore? That could be Redis, Cassandra, whatever.
This isn't babble -- it's honestly a pretty common technique. If your site sees millions of users, storing a session for visitors that aren't logged in is prohibitive and unnecessary. You can store UI customizations and basic memoization in a client-side cookie if you need to.
If you're building something small or basic, then you probably won't have multiple webservers and you can use fast in-memory sessions without concern. This only applies once you need to worry about scale.
A cookie will eat up latency on every request. If the amount of data is non-trivial (and UI customizations can be; there are some fairly demanding users, and oftentimes it's helpful to store a history of what they've done recently so you don't prompt them for things they have no interest in doing), that can be a big usability hit, and even run afoul of HTTP request-size limits.
Using local storage, or manipulating the path so the cookie isn't sent to the server, is a solid option if you need to store UI state for non-logged-in visitors of a highly trafficked site.
Obviously solutions are not zero-sum but having to deal with a session store that brings its own latency, state, garbage collection issues, etc, is something I've found to be sub-optimal. But to be clear: We're talking about specialized techniques that don't apply to most websites.
Though most of my experience is at a social network (not the social network) and building out advertising platforms. I've certainly never had to worry about storing so much UI state data that I'd be concerned about its size. Your experience seems a bit different and obviously at this level everything is subjective.
> Once you move from storing session data in-memory on the webserver, and add a network call,
I don't move from storing session data in-memory; I never store session data in-memory. Well, I do indirectly, but that's always some distributed memory system. And there isn't necessarily a network call. A single webserver accessing memcache running on the same machine is as fast as accessing an in-memory cache. When there is more than one memcache server, there has got to be a network call, but I have yet to see an application where that is an overhead. Compared to disk access, that is a floating point error.
> why not just store the data alongside the users other data in a fast datastore? That could be Redis, Cassandra, whatever.
What advantage would I have from storing the session alongside the user data? My datastore isn't Redis or Cassandra; it's PostgreSQL. Redis is my cache/data-structure server. The only NoSQL solution I would consider for user data is Mongo.
> This isn't babble -- it's honestly a pretty common technique.
Other than your insistence on using Redis, I don't see how your technique is any different from mine.
> If your site sees millions of users, storing a session for visitors that aren't logged in is prohibitive and unnecessary.
If the user isn't logged in and doesn't have any user data, there is no session to be loaded. When the user logs in, set the secure cookie with the session id. When a new request comes in, if the secure cookie has the session id, load it. For a not-logged-in user, there is no session id and nothing is loaded.
> You can store UI customizations and basic memoization in a client-side cookie if you need to.
The only thing I will store in the cookie is the session id. Cookie length constraints and the network payload on every request are more overhead than loading the session on the server side.
> If you're building something small or basic, then you probably won't have multiple webservers and you can use fast in-memory sessions without concern. This only applies once you need to worry about scale.
For sessions, session id in cookie and memcache/redis as session store works for all scale.
If you've got a single webserver then we're speaking a different language. That's not an insult or anything, it's just that I'm talking about techniques we use to support millions of pageviews a day.
There is no one way to build a system of course. I'd love to hear more about your lessons-learned, but this really isn't a good forum for that. Though even at your scale you should consider redis as a drop-in replacement for memcache. It benchmarks faster in many cases, and has support for data structures (lists, sorted sets, etc) that make your life easier.
And see my comment below clarifying what you said about cookies.
> For sessions, session id in cookie and memcache/redis as session store works for all scale.
> If you've got a single webserver then we're speaking a different language. That's not an insult or anything, it's just that I'm talking about techniques we use to support millions of pageviews a day.
And where did I talk about a single webserver? You said "Once you move from storing session data in-memory on the webserver, and add a network call", to which I said I never do in-memory sessions, even for a single webserver. Get over yourself - you aren't a special snowflake who deals with more than one webserver.
> Though even at your scale you should consider redis as a drop-in replacement for memcache. It benchmarks faster in many cases, and has support for data structures (lists, sorted sets, etc) that make your life easier.
"Though even at your scale "
As I said, get over yourself. All you have offered is somehow storing session in Redis makes you scale.
How on earth would you know what scale I am talking about? I don't remember mentioning it.
And I know what redis does. Cargo-cult mentality, viz. "redis is really better than memcached", is the main reason behind fucked up systems.
> And see my comment below clarifying what you said about cookies.
Yes, I saw your comment. Local storage is not the solution for the coming year or so. I am paid to design systems that work, not systems that might work.
> For sessions, session id in cookie and memcache/redis as session store works for all scale.
>> Kind of a bold statement? GL with that.
I don't know where you are getting your numbers from, but a million pageviews/day, as you mention again and again, is a very nominal number for a generic webapp (unless you are very cache-unfriendly, viz. reddit). That isn't something you even have to think about. A standard Rails app sitting behind nginx with 4-5 webservers will do it just fine.
And for the last time, "session id in cookies and then load the session before the request" fucking works for everyone, including Facebook and Google. The only thing that differs is the choice of session store, and no, redis isn't the catch-all solution. For most cases, memcache is faster.
You have taken this all very personally. I apologize for offering a different take on system design than what you apparently believe in very strongly? I've said a few times there's "no one way."
The systems we've built have served over-billion-page-view months. That's not common or easy. HN is a site that values a back-and-forth about that kind of experience and I'd have loved to hear your tips thrown-out there too. But this has become some strange ego thing for you so it's time for me to bow out.
> You have taken this all very personally. I apologize for offering a different take on system design than what you apparently believe in very strongly?
No, you haven't offered anything other than "you should use Redis". You started with "if you are caching in sessions". I don't know where you get the idea the sessions are caches.
> The systems we've built have served over-billion-page-view months. That's not common or easy.
Good for you. But have you actually used local storage for storing user specific data? And have you compared using redis and memcached for session storage? I don't have a problem with difference in opinion - it's just that your opinions aren't valid.
Well yeah it's from early 2010, it's been 2 years. In January (2012)'s "state of the servers" post, the reddit blog noted traffic was 2 billion pageviews per month in December 2011, up from 870m or so in December 2010.
"Because it happens" is not a very good technical answer. Anything can happen, even 1GB session data.
But anything approaching even half a MB of session data I would take as an architectural failure, and work to fix it, instead of letting it dictate what tools I use (i.e. redis vs memcached).
>The point is that there are solutions offering equivalent speed without that barrier, so why select a technology that has limitations?
Because all decisions have tradeoffs, and it is wiser to choose based on your particular use case than on a scenario (> 1MB session data) that can only happen with a seriously fucked up application design.
Why in sessions, though? Why not cache everything as uniquely-identified resources and store only the resource ID in the session? That way, resources can be shared across different sessions.
I'm trying to think of a case where a user's actions (that's what's supposed to be in sessions, right?) can result in 1MB of data, and I'm drawing a blank. Maybe if they're uploading images or video transiently, like in Google's Search-by-Image feature, but even that seems like a case where you should upload the image permanently and just store a reference to it in the session.
Edit: Answered my own question: oftentimes it's useful to store a history of the user's recent actions in their session, because then you can build a more responsive, more tailored UI. You can save transient data they've already entered, avoid showing them promos they've already indicated no interest in, and potentially use it to predict useful things to show them next. Even then, though, it'd take 500 pageviews at 2K apiece (which is a very generous estimate) to hit a meg of data.
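A small sketch of that resource-id idea in Python -- names are hypothetical, and "session" is assumed to be whatever dict-like object your session machinery already persists:

    import uuid
    import redis

    cache = redis.Redis()                      # binary-safe client (no decode_responses)

    def remember_upload(session, image_bytes):
        # store the heavyweight transient resource once, under its own id...
        resource_id = uuid.uuid4().hex
        cache.set("resource:%s" % resource_id, image_bytes, ex=3600)
        # ...and keep only the small reference inside the session
        session.setdefault("resource_ids", []).append(resource_id)
        return resource_id

    def load_resource(resource_id):
        # any session (or none) holding the id reads the same cached copy
        return cache.get("resource:%s" % resource_id)

The session stays tiny, and identical resources aren't duplicated per visitor.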
NodeJS is insanely easy to do that with, it's one of the best reasons to use it in my experience:
    var queue = [];

    // request handling
    module.exports = function(request, response) {
        queue.push("bla");
        response.end("bye");
    };

    // out-of-request work
    setInterval(function() {
        // do something with your queue either locally
        // or moving it to an external job queue that can
        // leverage the same script on a different thread
    }, 1000);
It comes down to whether you're going to dominate node's thread or not; if the work is heavy, or of course blocking, then you'll have to push it off to be handled by a different thread.
In my case I periodically query mongodb for new or updated data since the last run, to be stored locally, push batches of data into redis, track and save load information, etc., all without any negative repercussions.
There are 2,592,000 seconds in a 30-day month, so actually that's more like 104 page views per second. But that also does not count peaks (peak is likely 250+), or that each page view requires multiple queries.
Either way, no need to hate on the traffic figures of what everyone knows is a very large website.
> At the peak of the IAMA reddit was receiving over 100,000 pageviews per minute.
> In preparation for the IAMA, we initially added 30 dedicated servers (20%~ increase) just for the comment thread. This turned out not to be enough, so we added another 30 dedicated servers to the mix. At peak, we were transferring 48 MB per second of reddit to the internet. This much traffic overwhelmed our load balancers which caused a lot of the slowness you probably experienced on reddit.