This is pretty interesting. I mean, it seems pretty intuitive and obvious once it's put out there like that. I currently deal with a few hundred million items that need to be delivered in real time, and we use a somewhat similar structure, although our algorithms for cache invalidation are much less sophisticated. I wonder how much effort it would take (and how much I can put forth) to make my own system more efficient.
In other words, I wonder how much of this efficiency boost is due to FB's abilities (both in people and technology) to scale. The paper seems to imply that it's relatively simple, now that the data has been gathered, but for a one-person team like mine, I wonder what benefits I can take away from this.
I work with the Facebook CDN team and (amongst other things) maintain the data pipelines that log requests, the tailers that fetch and annotate those requests, and the jobs that populate the Hive tables we used for this research and for other improvements to serving content.
There is certainly a lot of infrastructure that this is built on that many small teams at other companies don't immediately have access to: self-service hardware provisioning, Scribe logging infrastructure, tailer frameworks with checkpointing and retries (and job systems to schedule them), and large amounts of available space on Hive for experimentation. But most of the software pieces are available as open source, so it doesn't need to remain out of reach.
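To make the "checkpointing and retries" part concrete, the core idea is small; here's a rough sketch in plain Python (the file paths and checkpoint format are made up for illustration, and our actual tailer framework is far more involved):

    import json, os, time

    LOG_PATH = "/var/log/requests.log"        # hypothetical log file
    CHECKPOINT_PATH = "/var/tmp/tailer.ckpt"  # hypothetical checkpoint file

    def load_offset():
        # Resume from the last committed byte offset, or start from scratch.
        try:
            with open(CHECKPOINT_PATH) as f:
                return json.load(f)["offset"]
        except (IOError, ValueError, KeyError):
            return 0

    def save_offset(offset):
        # Write-then-rename so a crash mid-write can't corrupt the checkpoint.
        tmp = CHECKPOINT_PATH + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"offset": offset}, f)
        os.rename(tmp, CHECKPOINT_PATH)

    def tail_once(handle_line):
        if not os.path.exists(LOG_PATH):
            return
        with open(LOG_PATH) as f:
            f.seek(load_offset())
            for line in f:
                handle_line(line)        # annotate / forward the request record
            save_offset(f.tell())        # only commit after the batch succeeded

    if __name__ == "__main__":
        while True:
            tail_once(lambda line: None) # plug real annotation logic in here
            time.sleep(5)                # a real framework adds retries/backoff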
This is the first team at Facebook where I've been heavily involved in this scale of data capture and analysis, but it only took a few days to get up to speed through a combination of great tools and good documentation. Being able to drop a Python file in a code repo to ensure that some complex data warehousing task takes place every day after that is pretty powerful.
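The shape of such a file is roughly the sketch below. Everything in it (table names, the hive CLI call, the entry point) is a stand-in I've made up to illustrate the idea, not the real internal framework:

    import datetime, subprocess

    def run(ds):
        # ds is the partition date, e.g. "2013-10-21"; the tables and query
        # here are purely illustrative.
        query = """
            INSERT OVERWRITE TABLE cdn_request_summary PARTITION (ds='{ds}')
            SELECT url, COUNT(*) AS hits
            FROM cdn_request_log
            WHERE ds = '{ds}'
            GROUP BY url
        """.format(ds=ds)
        subprocess.check_call(["hive", "-e", query])

    if __name__ == "__main__":
        run(datetime.date.today().isoformat())

The point is that the scheduling, retrying, and dependency tracking live in the framework, so the file you write stays about this small.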
The short, uninteresting answer is that it's a work in progress. Initially we built FIFO into our caches because it was easy to build and didn't interact badly with flash disk craziness (write amplification specifically). You can read more about our old caching system mcdipper at https://www.facebook.com/notes/facebook-engineering/mcdipper...
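The appeal of FIFO on flash is that eviction happens in insertion order, so writes to the device stay sequential and write amplification stays low. Ignoring all the flash layout details, the eviction policy itself is tiny; here's a toy in-memory sketch (not mcdipper's actual implementation):

    from collections import OrderedDict

    class FIFOCache(object):
        # Evicts the oldest-inserted item, regardless of how recently it was read.
        def __init__(self, capacity):
            self.capacity = capacity
            self.items = OrderedDict()

        def get(self, key):
            # Unlike LRU, a hit does not reorder anything; only insertion order matters.
            return self.items.get(key)

        def put(self, key, value):
            if key in self.items:
                self.items[key] = value
                return
            if len(self.items) >= self.capacity:
                self.items.popitem(last=False)  # drop the oldest insertion
            self.items[key] = value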
Yes, mostly due to new content. You have to miss on the first request for any new content, and each of the 350 million or so photos uploaded each day will contribute a miss. Also, this only ran over a particular time period, so one-off requests for older content (someone revisiting their old photo albums) would be misses too.
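As a back-of-the-envelope: the 350 million uploads/day figure is real, but the total request count below is a placeholder I've invented, so this only shows the shape of the calculation, not our actual miss numbers:

    # Compulsory (cold) misses: every new photo costs at least one miss on its
    # first request.
    new_photos_per_day = 350e6   # from the comment above
    requests_per_day = 100e9     # hypothetical placeholder, NOT a real figure
    miss_floor = new_photos_per_day / requests_per_day
    print("cold-miss floor: %.2f%% of requests" % (100 * miss_floor))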
I'm not sure whether a "refresh" (ie, 304 not modified due to expiration) is counted as a miss in this data.
I'm not sure if Wyatt, Qi, et al. looked at pre-caching at all. I've not looked at the numbers yet, but it's on my list of things to look into once the next few months of more obvious wins are worked through.
Inside Facebook, the list of the top 1000 daily photos is known as the "Buddha List", because viewing them is a transcendent experience, after which your waking life is little more than a crude construct of colorless shapes.
And then you get fired for looking at private user photos.