I love reading about how companies scale their BigHuge data, but it bothers me that we still haven't reached the point where scalability is a commodity instead of a patchwork of technology that every actor solves in their own way.
It's because "how to make a website scale" depends heavily upon which website, what it does, and how big it needs to be. Making a messaging queue like Twitter or G+ scale is very different from making Google Search scale. Hell, making the indexing system of Google search scale is a very different problem from making the serving system scale.
You can't really avoid having a patchwork of technology, because it's a patchwork of problems. Instead, there're a bunch of tools at your disposal, a few "best practices" which are highly contextual, and you have to use your judgment and knowledge of the problem domain to put them together.
>It's because "how to make a website scale" depends heavily upon which website, what it does, and how big it needs to be. Making a messaging queue like Twitter or G+ scale is very different from making Google Search scale. Hell, making the indexing system of Google search scale is a very different problem from making the serving system scale.
For most websites it's not THAT different.
Actually, most have pretty similar needs, and you can sum those up in 3-5 different website architectural styles anyway.
There is far more duplication of work and ad-hoc solutions to the SAME problems than there are "heavily different" needs.
News/Magazine/Portal like (read heavy),
Game site (evented, concurrent users, game engine computations),
Social Platform (read-write heavy),
etc.
Most needs are bog standard. If you really look at most successful sites, they mostly use same-ish architectures, only with different components/languages/libs each.
Basically all high volume sites use something like the notions behind Google App Engine and the services it offers. The various AWS tools are also similar (S3, their table store, etc).
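To make that concrete (my own illustration, not something from the parent): the "bog standard" read-heavy path is usually just cache-aside in front of a relational DB. Everything below is a made-up sketch -- the function names, the memcached address, and the 60-second TTL are assumptions, not any particular site's stack.

```python
# Minimal cache-aside sketch for a read-heavy site (news/portal style).
# Assumes the python-memcached client and a memcached on localhost.
import json
import memcache

cache = memcache.Client(["127.0.0.1:11211"])

def db_fetch_article(article_id):
    # Placeholder for the real relational-DB query.
    return {"id": article_id, "title": "...", "body": "..."}

def get_article(article_id):
    key = "article:%s" % article_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    article = db_fetch_article(article_id)
    cache.set(key, json.dumps(article), time=60)  # short TTL; staleness is fine here
    return article
```

Swap memcached for Redis or the GAE memcache service and the shape doesn't change, which is kind of the point.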
I think you're missing a lot of complexity of the considerations that actually go into implementing any of the above. I can think of 3 subsystems within Reddit alone (reading, voting, and messages) that all have different usage patterns and (if they're doing it right) require different approaches to scaling.
Where's e-commerce on your list? The approaches for scaling eBay are completely different than for scaling Reddit or YouTube, because eBay can't afford eventual consistency. You can't rely on caching to show a buyer a page whose price is an hour out-of-date.
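Rough sketch of that distinction, under my own assumptions (hypothetical table and function names): a browse page can serve a cached, possibly stale price, but checkout reads the authoritative store every time.

```python
# Browse pages tolerate a stale cached price; checkout must not.
import sqlite3
import time

_price_cache = {}  # item_id -> (price, cached_at)
CACHE_TTL = 300    # seconds; staleness acceptable on browse pages only

def db_current_price(conn, item_id):
    # Read against the primary/authoritative database.
    row = conn.execute("SELECT price FROM items WHERE id = ?", (item_id,)).fetchone()
    return row[0] if row else None

def price_for_listing_page(conn, item_id):
    hit = _price_cache.get(item_id)
    if hit and time.time() - hit[1] < CACHE_TTL:
        return hit[0]
    price = db_current_price(conn, item_id)
    _price_cache[item_id] = (price, time.time())
    return price

def price_for_checkout(conn, item_id):
    # No cache: the buyer is about to be charged, so an hour-old price won't do.
    return db_current_price(conn, item_id)
```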
Here's something else to think about: why do (the now-defunct) Google real-time search and GMail Chat have completely different architectures, despite both of them having the same basic structure of "a message comes in, and gets displayed in a scrolling window on the screen"? The answer is latency. With real-time search, a latency of 30 seconds is acceptable, since you aren't going to know when the tweet was posted in the first place. With GChat, it has to be immediate, because it's frequently used in the context of someone verbally saying "I'll ping you this link" and it's kinda embarrassing if the link doesn't arrive for 30 seconds. Real-time search also has to run much more computationally-intensive algorithms to determine relevance & ranking than GChat does.
I've personally worked on Google Search, Google+, and Google Fiber. I can tell you that they do not all use something like the notions behind Google AppEngine. There's no way you could build Google Search on AppEngine, and G+ would be a stretch.
>I can think of 3 subsystems within Reddit alone (reading, voting, and messages) that all have different usage patterns and (if they're doing it right) require different approaches to scaling.
Yes, and those would be the same across all social bookmarking type sites, and similar to the (voting, etc.) components of a social site a la Facebook, G+, etc.
>The answer is latency. With real-time search, a latency of 30 seconds is acceptable, since you aren't going to know when the tweet was posted in the first place. With GChat, it has to be immediate, because it's frequently used in the context of someone verbally saying "I'll ping you this link" and it's kinda embarrassing if the link doesn't arrive for 30 seconds.
Yes, so that's one use case for one architecture (low latency message queue), and the other is another. I gave the example of a Game site that also has similar latency concerns.
>Where's e-commerce on your list? The approaches for scaling eBay are completely different than for scaling Reddit or YouTube, because eBay can't afford eventual consistency.
My list wasn't supposed to be exhaustive -- I spoke of 3-5 common styles and only listed a few of them. That said, the approaches behind eBay might not resemble Reddit or YouTube, but they will resemble others like Amazon, Etsy, etc.
>I've personally worked on Google Search, Google+, and Google Fiber. I can tell you that they do not all use something like the notions behind Google AppEngine.
I don't think we mean the same things. For one, nobody makes search engines, or a Google competitor. So what it takes to do Google Search is a moot point when discussing the common architectural patterns behind big sites.
I meant high level stuff on one hand, like denormalised data, map reduce, workers, shared-nothing, sharding and such, and common infrastructure on the other hand, like the relational DB, memcached, BigTable-like datastores, abstract filesystems (S3, BlobStore, GridFS, etc), ElasticSearch, Hadoop, Node, Redis, message queues, etc.
Google Search or Facebook might have needs way beyond those, but the above are shared by 99% of big sites out there.
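As a toy illustration of one item from that list (sharding) -- not anyone's actual setup, and the shard count and connection URLs are invented:

```python
# Toy hash-based sharding. Real systems add replication, resharding,
# and consistent hashing; this only shows the routing idea.
import hashlib

SHARDS = [
    "postgres://db-shard-0.internal/app",
    "postgres://db-shard-1.internal/app",
    "postgres://db-shard-2.internal/app",
    "postgres://db-shard-3.internal/app",
]

def shard_for(user_id):
    # Stable hash so the same user always routes to the same shard.
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

shard_for(42) always routes to the same shard, so a user's rows live together and each shard can be scaled or replicated on its own.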
A common system to build on top of them that is higher level than Heroku (and more expansive and accommodating than GAE) should exist, and it would cater to more than 80% of big-website needs. Of course each site will need some custom stuff, but not 80% custom stuff.
Scalability can only be a commodity if all big sites were built the same way. But they never are. A scalable read-only system is entirely different from a system with a mix of read/write which is different again from a message routing system like Twitter or Facebook.
Well, because every site is made differently, uses different tech, and has different bottlenecks.
But with SSDs and RAM becoming a commodity (prices for both have been dropping sharply), I/O is much easier to deal with.
And with every release of PostgreSQL / MySQL, easier database replication, and more common scaling practices, we are much, much better at it than, say, 2-3 years ago.