Tumblr: Hashing Your Way to Handling 23,000 Blog Requests per Second (highscalability.com)
102 points by aespinoza on Aug 4, 2014 | 23 comments



So, it looks like they're using BGP to "program" the adjacent router a la "SDN." I'm curious about two things:

* Byzantine fault tolerance: How does this system handle a node that fails in a way that leaves its routes advertised? When a node's haproxy fails, how is BIRD informed of the failure? What if the failure is one that the internal fault detectors can't see?

* How is the ECMP hashing problem handled? ECMP hashing on most gear is a plain hash, which means that when a route is withdrawn, the remaining systems see their traffic rebalance. How does this not result in all connections being severed?


I'm not sure how Tumblr does it, but one can use a combination of ECMP and ipvs (with a consistent hash) to do the lower-layer L4 load balancing. This means that even if one of the L4 load balancers goes down and the connections originally routed to it by the switch/router get moved to another one, the consistent hash to the L7 load balancer handling the request means the connections will not be reset (except in some interesting and less-frequent cases); see the sketch below.

This two-stage process also allows for good health checking from the much-simpler Linux ipvs L4 load balancer servers to the more-complicated L7 load balancer servers.
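For illustration, a minimal consistent-hash sketch (this is not ipvs's actual scheduler; the node names and replica count are made up). Every L4 box maps a flow's 5-tuple onto the same ring of L7 balancers, so removing one L7 node only remaps the flows that hashed to it:

    # A minimal consistent-hash ring; illustrative only.
    import bisect
    import hashlib

    class Ring:
        def __init__(self, nodes, replicas=100):
            # Each L7 node gets many points on the ring to even out load.
            self.ring = sorted(
                (int(hashlib.md5(f"{n}:{i}".encode()).hexdigest(), 16), n)
                for n in nodes
                for i in range(replicas)
            )
            self.points = [p for p, _ in self.ring]

        def node_for(self, flow):
            # Hash the flow's 5-tuple and walk clockwise to the next node point.
            h = int(hashlib.md5(flow.encode()).hexdigest(), 16)
            i = bisect.bisect(self.points, h) % len(self.ring)
            return self.ring[i][1]

    ring = Ring(["l7-a", "l7-b", "l7-c", "l7-d"])
    flow = "tcp:10.0.0.1:54321->192.0.2.1:80"
    # Every L4 box computes the same ring, so a flow that ECMP moves to a
    # different L4 balancer still lands on the same L7 balancer.
    print(ring.node_for(flow))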

This was described in this Velocity 2013 talk - http://velocityconf.com/velocity2013/public/schedule/detail/...


Yeah, this is how Facebook does it, and how we did it at Microsoft / Yammer.


There was a related story about ECMP load balancing a while back, with good comments on some of its benefits and challenges: https://news.ycombinator.com/item?id=7811412

To your first point: use an external node, outside the data path, as your control plane.

To the second: the simplest approach is to manage layer 3 ECMP on top of mobile layer 2 address assignments. Think BGP to CARP'd next hops. Depending on your router implementation, you also have more choices than a simple 5-tuple for the ECMP hash inputs.


ECMP is used heavily in serving Anycast DNS worldwide; it's only the application to HTTP that's somewhat new. A high-end router can do stateless ECMP at 10s or 100s of millions of packets per second; it's hard to find load balancers that can compete with this.

BGP keepalives are sent at intervals of usually 15 to 45 seconds, depending on configuration. If BIRD stops responding, the router will withdraw its routes.

BIRD can be controlled via a Unix socket. Usually people build a health-check daemon that queries the local app and communicates its findings to BIRD. Working through all of the failure modes here is tricky, but doable.
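Something like this minimal sketch (the health URL and protocol name are hypothetical; a real daemon would also damp flapping and cover the failure modes mentioned above):

    # Probe the local app over HTTP and tell BIRD to advertise or withdraw
    # routes by enabling/disabling a BGP protocol via the birdc CLI.
    # The URL and protocol name ("uplink1") are made up.
    import subprocess
    import time
    import urllib.request

    BGP_PROTO = "uplink1"

    def app_healthy():
        try:
            with urllib.request.urlopen("http://127.0.0.1:8080/health", timeout=2) as r:
                return r.status == 200
        except OSError:
            return False

    while True:
        action = "enable" if app_healthy() else "disable"
        subprocess.run(["birdc", action, BGP_PROTO], check=False)
        time.sleep(5)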

The ECMP hash is often implemented as something like a CRC-16 of (protocol, source IP/port, dest IP/port) modulo the number of next-hops. I suspect the trick to keeping TCP happy is to try to keep the number of next-hops (shards) constant for each route.


At least on most chips today, they use modulo-N hashing, which can break existing connections when the next-hop set changes; see: http://tools.ietf.org/html/rfc2992

The reason this isn't really a problem on the internet is that you're typically not using ECMP, just plain anycast.
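A quick way to see the damage (the hash function and flow set here are arbitrary): with modulo-N, withdrawing one of 8 next-hops remaps most flows, not just the 1/8 that were on the dead hop.

    # Count how many flows change next-hop when one of 8 hops is withdrawn
    # under plain modulo-N hashing (the problem RFC 2992 analyzes).
    import zlib

    flows = [f"10.0.{i % 256}.{i // 256}:{40000 + i}->192.0.2.1:80"
             for i in range(10000)]

    def next_hop(flow, n):
        return zlib.crc32(flow.encode()) % n

    moved = sum(1 for f in flows if next_hop(f, 8) != next_hop(f, 7))
    # Prints roughly 87%; a consistent hash would move only ~12.5%.
    print(f"{moved / len(flows):.0%} of flows changed next-hop")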


Hi Michael, you wrote a whole post about varnish cache management without mentioning hit rates! How effective was all this? How much of that 23K req/s did the origin have to handle?


The cache effectiveness of this is actually quite good, but I'm deliberately being ambiguous about the ratio.

The reason for my ambiguity is that our cache-hit ratio is actually a result of our application. This architectural design afforded us the ability to maintain our (very high) cache-hit ratio, even when we outgrew the total slab size of a single varnish node.

That ability to maintain the same cache-hit ratio, the motivation for this effort, is the result of not needing to evict cached content prior to TTL.

So, if you have a low cache-hit ratio due to eviction (and you don't just have excessive TTLs), then your cache pool is probably too small. If so, you might want to give this design a shot - it's analogous to using ketama to grow memcache beyond a single node.


Random note on scale-out, caching, etc.: check out CloudFront's "whole site delivery". You can set min TTLs to zero on CloudFront, configure your cache headers correctly, and get many of the benefits outlined in the article. See:

   http://www.allthingsdistributed.com/2012/05/cloudfront-dynamic-content-support.html

   http://blogs.gartner.com/lydia_leong/2012/05/14/amazon-cloudfront-gets-whole-site-delivery-and-acceleration/
WARNING: CloudFront will dutifully cache non-2xx responses, so you can get a long-lived, but very fast, 500 response...
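The origin-side half of that is mostly getting Cache-Control right. A toy sketch (the port and TTLs are arbitrary):

    # Toy origin showing the cache headers a CDN like CloudFront honors:
    # s-maxage controls shared caches, max-age controls browsers. Per the
    # warning above, be deliberate about Cache-Control on error responses too.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Origin(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            # Edge caches keep this for an hour, browsers for five minutes.
            self.send_header("Cache-Control", "public, max-age=300, s-maxage=3600")
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html>hello</html>")

    HTTPServer(("", 8000), Origin).serve_forever()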


Or you could not pay Amazon per-byte fees, use CloudFlare, and get the same type of functionality through our caching options and Page Rules.


If I were to use Varnish, would that make my need for Redis obsolete? Should I switch to a non-memory-based key-value DB and use Varnish on top of that?

My current stack: nginx/php-fpm/redis. Nginx and Redis serve me well, but php-fpm makes my website rather slow with high volumes of traffic, so I believe the solution for that would be Varnish(?).


so @ tumblr, we're using varnish as a full page cache (we use it for parts of the API as well, for response caching), and we invalidate when a blog updates (or your page can just TTL out).

I definitely agree that I wouldn't use Redis (or memcache for that matter) for storing entire pages; it should be used as more of an object cache. Even then, we use memcache for "simple" data structures and, when we need more complex data structures, we'll use Redis.

Redis is great if you need some kind of persistence as well (and it's fairly tunable), whereas memcache and varnish are completely in memory (varnish 4.0, I believe, is introducing persistence). So you kick the process, and that's all she wrote for your cache until it gets warm again. (Which has its own challenges.)

Varnish also gives you a language called VCL to play around with, to shape your caching strategy and control how varnish stores things. It's got an easy API to purge content when you need to, and it supports compression for your pages out of the box without much tuning.
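For instance, a purge can be as simple as this rough sketch (assuming your VCL allows the PURGE method from this source address; the host and path are placeholders):

    # Evict one cached page from Varnish over HTTP. This only works if
    # vcl_recv contains a rule permitting PURGE from this client.
    import http.client

    conn = http.client.HTTPConnection("varnish.example.internal", 80)
    conn.request("PURGE", "/blog/example-post",
                 headers={"Host": "example.tumblr.com"})
    resp = conn.getresponse()
    print(resp.status, resp.reason)  # 200 if the VCL accepted the purge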

If you're having issues just speeding up static content, give varnish a whirl. Spend some time with it, and you won't be disappointed.

I believe you can also look at nginx as an alternative for caching responses, but I don't have too much experience with that. I've heard it used with some success, though.


varnish has supported disk-based caching for quite a long time; it's just that all the instances at tumblr are configured to use memory only.


No. Redis caches single objects, Varnish caches whole pages (unless you're using Redis as a full page cache).

If you're using Redis as your database, I'd suggest not doing that anyway, as you'll start running into problems as your dataset gets bigger than available memory and it has to start swapping. I've found it works much better if you use it like memcache with a richer set of data structures.
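Concretely, "use it like memcache" means capping memory and letting Redis evict rather than swap. A sketch using the third-party redis-py client (the values are arbitrary; these map to the maxmemory settings in redis.conf):

    # Configure a running Redis to behave like memcache: cap its memory
    # and evict least-recently-used keys instead of swapping.
    import redis  # third-party: pip install redis

    r = redis.Redis()
    r.config_set("maxmemory", "2gb")
    r.config_set("maxmemory-policy", "allkeys-lru")
    r.set("user:42:profile", "{...}", ex=300)  # cache with a 5-minute TTL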


Tough to quantify "obsolete." You can put caching layer upon caching layer upon caching layer, but diminishing returns kick in almost immediately.

Ideally you have one caching model to rule them all, unless you're doing a lot of module-specific caching (i.e., this element should be cached longer than that one, etc.)


Well, at the moment I would consider Redis more useful than Varnish because of expired events. To follow the "one caching model to rule them all" rule, how about using ngx_lua to dynamically fill a static .html file with data from Redis, thus cutting out php-fpm (unless the user is logged in, in which case requests go directly to php-fpm)?

Or use Varnish, keep Redis keys (with empty values) just for their expired events, and use keyspace notifications to automatically remove the data related to that key from the database and purge Varnish.
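A hedged sketch of that wiring (assumes redis-py, notify-keyspace-events "Ex" set in redis.conf, and a made-up key-to-URL mapping):

    # Listen for Redis key-expiry events and purge the matching page
    # from Varnish. Requires notify-keyspace-events "Ex" in redis.conf.
    import http.client
    import redis  # third-party: pip install redis

    r = redis.Redis()
    p = r.pubsub()
    p.psubscribe("__keyevent@0__:expired")

    for msg in p.listen():
        if msg["type"] != "pmessage":
            continue
        key = msg["data"].decode()  # e.g. "page:/blog/example-post"
        if key.startswith("page:"):
            conn = http.client.HTTPConnection("127.0.0.1", 80)
            conn.request("PURGE", key[len("page:"):],
                         headers={"Host": "example.tumblr.com"})
            conn.getresponse().read()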


>278 employees. That's a lot of people - I imagined they had far fewer.


I'm sorry about the confusion - 278 is the total number of Tumblr employees.

The team responsible for all of Tumblr's perimeter (Perimeter-SRE) consists of 6 people (including one manager).

This article describes the architecture of the portion of our perimeter responsible for blog serving, one of our more highly trafficked perimeter endpoints.


Keep in mind that not all of them are engineers. They appear to do sales in-house.


Our total engineering staff is around 120 and we're looking to hire more.


Also, some Yahoo employees might have transferred over after the acquisition.


Not really - especially on the tech side.


Apologies if this is not of interest.

If you found this interesting, please check out the jobs page [1] at Tumblr - we are constantly looking for new folks. Specifically, see [2] for positions on the teams that implemented everything described in the article.

[1] https://www.tumblr.com/jobs [2] http://boards.greenhouse.io/tumblr/jobs/17886




