How Facebook deals with memcache consistency in multiple datacenters (facebook.com)
43 points by jimgreer on Aug 20, 2008 | 18 comments



I've always been curious why someone hasn't prototyped (or why I haven't heard of someone prototyping) an improvement to MySQL's query cache that invalidates the cache less frequently and gives more dependable behavior. Currently it invalidates whenever any table referenced by the query changes. It seems like you could improve this algorithm without too much trouble. For example, if the query's rows were located solely through indexes, it should be possible to invalidate on updates to those index hits rather than on any change to the entire table.

I'll have to look through the code one day and see why this makes no sense.
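
To make the index idea above concrete, here's a toy sketch of index-aware invalidation. It has nothing to do with MySQL's actual internals; every name in it is made up for illustration.

    # Toy model: each cached result remembers the index values its rows were
    # found through, and a write only evicts entries whose index values it
    # actually touches, instead of everything that references the table.
    class QueryCache:
        def __init__(self):
            self.results = {}     # sql -> cached rows
            self.index_deps = {}  # sql -> set of (index_column, value) pairs

        def store(self, sql, rows, index_hits):
            self.results[sql] = rows
            self.index_deps[sql] = set(index_hits)

        def get(self, sql):
            return self.results.get(sql)

        def invalidate_for_write(self, touched_index_values):
            """Evict only queries whose index hits overlap the written rows."""
            touched = set(touched_index_values)
            for sql in list(self.results):
                if self.index_deps[sql] & touched:
                    del self.results[sql]
                    del self.index_deps[sql]

    cache = QueryCache()
    cache.store("SELECT * FROM users WHERE id = 7", [{"id": 7}], [("users.id", 7)])

    # A write to a different row leaves the cached result alone; table-level
    # invalidation (what the query cache does today) would have evicted it.
    cache.invalidate_for_write([("users.id", 99)])
    assert cache.get("SELECT * FROM users WHERE id = 7") is not None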


Interesting use of the 20-second cookie. What happens when they scale to data centers in Europe, Asia, and South America? Does it become an arbitrary-second cookie depending on where the user is?

Nonetheless, I can appreciate how a simple approach solves the problem. If these are the most dramatic scaling issues that Facebook faces, then this provides strong support for the argument that you shouldn't focus too much on scaling and optimization when first building your site. There are probably better uses of your early time.
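
For the curious, the 20-second cookie trick could be sketched roughly like this in the app tier. The request/response objects, the cookie name, and the helper names are hypothetical stand-ins, not Facebook's actual code:

    import time

    REPLICATION_WINDOW_SECONDS = 20  # the "20-second cookie" from the article

    def handle_write(response, run_write):
        """After any write, mark the user so their next reads hit the master."""
        run_write()
        # Hypothetical cookie API; real frameworks differ.
        response.set_cookie("recently_wrote", str(time.time()),
                            max_age=REPLICATION_WINDOW_SECONDS)

    def choose_db(request, master, local_replica):
        """Read from the nearby replica unless the user wrote recently."""
        wrote_at = request.cookies.get("recently_wrote")
        if wrote_at and time.time() - float(wrote_at) < REPLICATION_WINDOW_SECONDS:
            return master  # the replica may not have the user's write yet
        return local_replica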


I'd guess that they'd need to partition their data somehow by then, so the Europe DB is master for European users and slave for the rest, and so on.
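
A sketch of what that routing could look like, with purely hypothetical hostnames and regions:

    # Each region's master owns writes for its own users; every region keeps
    # a local replica for reads, which may be slightly stale.
    MASTERS  = {"us": "db-master.us.example", "eu": "db-master.eu.example"}
    REPLICAS = {"us": "db-replica.us.example", "eu": "db-replica.eu.example"}

    def pick_host(user_home_region, request_region, is_write):
        if is_write:
            return MASTERS[user_home_region]   # writes go to the user's home master
        return REPLICAS[request_region]        # reads stay local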


note to self: don't start companies that require multiple synchronized data centers. :-)


Actually, these are really interesting problems. Multi-homing isn't an issue for most startups though (not that this is true multi-homing since their second DC is read-only, meaning that they can't really survive the loss of the primary).


Presumably, if there were a major failure at the primary data center, they could quickly switch the secondary one over to be the master.

But yes, I suppose it wouldn't be considered true multi-homing.


But due to the replication delay, that would mean losing data on the primary, and a long delay in bringing the old primary back in sync with the new one (since they have effectively forked). Therefore they would only switch to the secondary if the failure at the primary site was catastrophic -- it's more like a backup.


While there's a ton of talk about scaling on here, it isn't needed by the vast majority of people and even Facebook's way isn't that complicated when you get down to it.

Basically, the only change needed is to have MySQL trigger the delete from memcached on replication: the slave database receives an updated record, saves it to its tables, and then hits memcached with a delete command for that key. While it does mean diving into the MySQL code, by the time you're at the size where you need such a feature, you can just hire a MySQL expert.
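
A rough sketch of that flow, done by tailing the slave's replication stream from the outside. The binlog_events() iterator and the event fields are stand-ins for whatever hook you add (Facebook's change was inside MySQL itself), and the memcached client is just pymemcache:

    from pymemcache.client.base import Client

    mc = Client(("localhost", 11211))

    def on_replicated_row(event):
        # e.g. a replicated update to the users table invalidates "users:<pk>"
        mc.delete(f"{event.table}:{event.primary_key}")

    def run(binlog_events):
        # binlog_events yields one event per row applied by the slave
        for event in binlog_events:
            if event.kind in ("insert", "update", "delete"):
                on_replicated_row(event)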


This is probably a bad idea in a lot of cases, and fb's replication is probably rather complicated, most likely relying on some system that serves stale data when it can and asks for fresh data when it needs it. First, if you're after high scalability, you don't want to run a query when you already know how the underlying data has changed, so any kind of invalidation trigger would be ill advised. Further, if you're replicating data over a large distance, what's the chance that you'll ever have the right data in the right state? From the moment the data is changed on the master you probably need to fire off an invalidation, if not an update to that data, which is where some sort of collection class to manage this is probably more suitable.

Invalidation often ceases to work well, even in a small replicated environment, if replication takes more than a very small amount of time, which happens quite easily if a server is overloaded. Imagine I change my friends, my process waits for replication and then invalidates, and some friend of mine views my page and re-caches my friend list from a MySQL server that hasn't yet replicated my newest data. It's fairly solvable, but even then you aren't sure that an update will go through on memcache, so you have to have some sanity check to know whether the data is actually what you want to be seeing.

I guess I don't really have a point, just more open-ended problems for consideration for those of you wishing to scale. I think it's all pretty common sense, but it's by no means simple. Best of luck.
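
To illustrate the race above and one possible sanity check: stamp each user's data with a version that bumps on every write, and refuse to fill the cache from a replica whose copy is older than the last version you know you wrote. A minimal sketch under those assumptions, not what Facebook actually does:

    last_written_version = {}  # user_id -> version we know we wrote ourselves

    def cache_friend_list(mc, user_id, rows, row_version):
        """Only cache replica reads that are at least as new as our own writes."""
        if row_version < last_written_version.get(user_id, 0):
            return False  # the replica is lagging; don't poison the cache
        mc.set(f"friends:{user_id}", rows)
        return True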


The reasons there is a ton of talk about scaling here: a) everybody wants to start their own Facebook someday, so they think they need to know this stuff, and b) scalability is actually kinda neat to study. Reorder them as you like, but I think they are both factors.

That being said we are rapidly moving towards a world with more and more data being stored and queried. Scalability will therefore be something you increasingly need to know about in order to build a significant application -- regardless of your entrepreneurial ambitions.


Yeah I was surprised at how simple their solution sounds.

But then I realized I was imagining a single MySQL DB at each datacenter. In reality they must have pretty big clusters at each.


There's another easier way built into memcached:

When you issue a delete you can specify "the amount of time the client wishes the server to refuse 'add' and 'replace' commands"

Maybe they couldn't afford the extra read load, but this seems easier than rewriting MySQL's replication engine.
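
For reference, that's the old text-protocol form, "delete <key> <time>" (later memcached releases dropped the hold-time argument). A raw-socket sketch:

    import socket

    def delete_with_hold(host, key, hold_seconds):
        """The server refuses add/replace on <key> for <hold_seconds> after the delete."""
        s = socket.create_connection((host, 11211))
        try:
            s.sendall(f"delete {key} {hold_seconds}\r\n".encode())
            return s.recv(1024).strip()  # b"DELETED" or b"NOT_FOUND"
        finally:
            s.close()

    # e.g. block re-adds of a just-invalidated key for the replication window:
    # delete_with_hold("cache1.example.com", "user:1234:friends", 20)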


That would not be guaranteed to be correct since you don't know the precise time required for the dbs to be consistent.


memcached != memcache


Here is a setup where you can scale with memcached:

http://golanzakai.blogspot.com/2008/11/memcached-replication...


Just seems hacky to me. Opening a whole data center to shave off 70ms? Now I see why they need 500+ million.


70ms can really affect the time people spend on a site.


You know what takes longer than 70 ms?

All the JS/CSS/other stuff Facebook includes on their homepage (one server round trip each).

Before they spent the millions it takes to bring up another data center they should have read this:

http://developer.yahoo.com/yslow/help/#guidelines

Also, I disagree that 70ms makes that big of a difference. People are more used to slow sites than we realize - anything less than a 200 ms improvement in the aggregate and I wouldn't spend a dime (not to mention millions of dollars).



