It seems that the speaker is discussing small servers; 150 concurrent requests per server/process sounds very small to me (but maybe this is common in enterprise?); in this case, I can see why the variance introduced by the random algorithm is such a problem.
But if you have an efficient system where each server can handle 10K+ concurrent requests, then the variance introduced by the random algorithm becomes insignificant. Also, it doesn't matter that some requests take much longer than others; if you're dealing with a lot of users per server, the slow requests will be distributed evenly too and average out.
I implemented the balls-and-bins algorithm and I found that with few buckets and a large number of balls, the variance in load between buckets became small. I even added randomness to the 'weight' of each request (to account for slow vs fast requests) and the distribution was still even - the more requests each server can handle, the more even the distribution is with the random algorithm.
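For anyone who wants to reproduce this, here's a rough sketch (not the commenter's original code) of that experiment: assign randomly weighted requests to servers uniformly at random and watch the relative spread between the most and least loaded server shrink as the request count grows.

```go
// A rough sketch of the balls-and-bins experiment: assign n randomly
// weighted requests to m servers uniformly at random and report the
// spread in total load per server, relative to the average load.
package main

import (
	"fmt"
	"math/rand"
)

func spread(n, m int) float64 {
	load := make([]float64, m)
	for i := 0; i < n; i++ {
		weight := 0.1 + rand.Float64() // fast vs slow requests
		load[rand.Intn(m)] += weight
	}
	min, max, total := load[0], load[0], 0.0
	for _, l := range load {
		if l < min {
			min = l
		}
		if l > max {
			max = l
		}
		total += l
	}
	mean := total / float64(m)
	return (max - min) / mean // spread relative to the average load
}

func main() {
	for _, n := range []int{150, 1500, 15000, 150000} {
		fmt.Printf("%6d requests over 10 servers: relative spread %.2f\n",
			n, spread(n, 10))
	}
}
```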
10K+ concurrent requests to a single server sounds like a server that is mostly shuffling bits around. I assume the 150 concurrent is more realistic for a frontend server which actually does something.
Also note the speaker is the CTO of a CDN (fast.ly); I am guessing he has experience with highly concurrent workloads as well :)
10k requests/s is the standard I expect to achieve with an HTTP load balancer (HAProxy or nginx) without any trouble.
It may be less with HTTPS, depending on how efficient the CPU is at cryptographic algorithms and how many cores it has.
fastly is a CDN. I assume that they're talking about the performance of their own CDN servers, not the web servers of their clients that will actually process the request, which is indeed a much slower operation.
Yeah I guess I'm mostly thinking of a typical REST API workload where you just query a database, maybe do some small transformations to the data and then send it to clients.
Also, it'd have to be a server with good async support like Node.js, Go, Haskell, Scala, Tornado, nginx...
Considering that the speaker is CTO at a CDN company (which has to deal with a lot of different kinds of back-ends that are outside of their control), it makes sense that they would need to use an algorithm which can handle all possible scenarios - they can't force their customers to use bigger servers and more efficient systems.
> 10K+ concurrent requests to a single server sounds like a server that is mostly shuffling bits around.
Isn't that all a load balancer is supposed to do? I certainly don't want my load balancers performing computations or logic. I want it to pass that work off to another server.
If you actually intend to serve traffic back to the original requestor, you can't handle too many requests per server because you have a limited amount of bandwidth. 150 requests per server on a 1 Gbps pipe (which is what you get on, for instance, a single ELB instance) only gives you about 813 KB/s per request. For 10,000 requests you'd be down to about 12 KB/s per request. For comparison, the average web page size is now up to 2048 KB!
To be clear, you can do a lot better with a better pipe, smart caching, compression, etc. But people often have horribly unrealistic estimates about how much traffic their servers can handle because they don't take bandwidth into account, and load balancers are no exception.
Of course, when you break it down by individual web request, most responses are still below 800 KB, but you shouldn't plan capacity for only the average case. And clearly even the average case is well above 12 KB, especially for a CDN (which is responsible for serving the image, video, and large script content). I'm also pretty confident the page I linked already includes compression (which decreases size but can increase time quite a bit; many people expect software load balancers to be using the absolute fastest compression available, but that's often not the case in my experience).
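To make that arithmetic easy to check, here's the same back-of-the-envelope calculation as a tiny program (assuming a 1 Gbps pipe and binary KB, which reproduces the figures above):

```go
// Back-of-the-envelope bandwidth math: divide a 1 Gbps pipe evenly
// across concurrent requests and report KB/s per request.
package main

import "fmt"

func main() {
	const pipeBitsPerSec = 1_000_000_000.0 // 1 Gbps
	bytesPerSec := pipeBitsPerSec / 8
	for _, concurrent := range []float64{150, 10000} {
		kbPerReq := bytesPerSec / concurrent / 1024
		fmt.Printf("%6.0f concurrent requests: ~%.0f KB/s each\n", concurrent, kbPerReq)
	}
}
```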
Sure, if you actually set the headers correctly and the browser actually retains them (recall that browsers are free to discard their caches at any time). And, if they can't be cached on the user's computer, who caches them? That's right, the CDNs. Like fast.ly. Who wrote the article.
One of the approaches, and the last one Tyler, the speaker, introduced in the talk to improve the fat-tailed distribution of load balancing latency, was the Join-Idle-Queue algorithm [1], which came out of the UIUC Extreme Computing Group, Microsoft Research, and Azure in 2011.
And according to Tyler's data (fast.ly CDN data since he's the CTO there), it beats all the current load balancing algorithms including Join-Shortest-Queue. Six years later, I wonder:
a) have any other novel / better algorithms topped JIQ in LB performance?
b) does Microsoft Azure use JIQ internally in their LB offerings?
c) is any open source LB implementation of the JIQ algorithm on the horizon? (according to Tyler, he found none.)
Can someone with expertise shed some light on these? Thanks.
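For anyone curious what the core idea looks like in code, here's a minimal, hypothetical sketch of JIQ as I understand it from the paper (not any production implementation): idle servers report themselves to a dispatcher's idle queue; the dispatcher assigns to a reported-idle server when one exists and otherwise falls back to a random pick.

```go
// A minimal sketch of the Join-Idle-Queue idea: servers report themselves
// idle, and the dispatcher prefers idle servers over random assignment.
package main

import (
	"fmt"
	"math/rand"
)

type Dispatcher struct {
	idle    chan int // queue of server IDs that reported themselves idle
	servers int
}

func NewDispatcher(servers int) *Dispatcher {
	return &Dispatcher{idle: make(chan int, servers), servers: servers}
}

// ReportIdle is called by a server when its work queue drains.
func (d *Dispatcher) ReportIdle(serverID int) {
	select {
	case d.idle <- serverID:
	default: // idle queue full; drop the report
	}
}

// Assign picks an idle server if any has reported in, else a random one.
func (d *Dispatcher) Assign() int {
	select {
	case id := <-d.idle:
		return id
	default:
		return rand.Intn(d.servers)
	}
}

func main() {
	d := NewDispatcher(4)
	d.ReportIdle(2)
	fmt.Println(d.Assign()) // 2: the idle server gets the next request
	fmt.Println(d.Assign()) // no idle servers left, falls back to random
}
```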
The JIQ algorithm seems to have very nice overlap with the work being done on reactive streams and reactive network protocols, for instance http://reactivesocket.io/
The idea being that, on top of a general purpose message stream, you overlay a back-channel where the stream receiver can signal to the producer how many more messages it's ready to handle.
Effectively what this does is move the implicit back-pressure you get from network buffers up to the application level, where something like an app server can reason about how much load it's willing to take on.
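Here's a rough sketch of that credit-based idea using plain Go channels (not the ReactiveSocket protocol itself): the receiver grants credits over a back-channel and the producer only sends that many messages before waiting for more.

```go
// Application-level back-pressure sketch: the receiver requests n items
// over a credit channel; the producer emits exactly that many.
package main

import "fmt"

func producer(data chan<- int, credits <-chan int) {
	next := 0
	for n := range credits { // wait for the receiver to request n items
		for i := 0; i < n; i++ {
			data <- next
			next++
		}
	}
	close(data)
}

func main() {
	data := make(chan int)
	credits := make(chan int)

	go producer(data, credits)

	// The receiver decides how much load it is willing to take on.
	go func() {
		credits <- 3 // "I can handle 3 more messages"
		credits <- 2
		close(credits)
	}()

	for v := range data {
		fmt.Println("received", v)
	}
}
```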
Shameless plug: If you think that seems cool and are into Go, I wrote a client+server impl of ReactiveSocket that I'd love API feedback on from more experienced Go devs.. https://github.com/jakewins/reactivesocket-go
I am afraid that merely a citation reference doesn't tell the whole story. I guess someone has to dig into those papers and implement them and then use real world data to test and see, right?
If anyone does want to dig in to the papers, a friend and I made a discussion platform for research. I've added all the papers in Tyler's talk as well as two he recommended on Twitter to a list here: https://www.projectcredo.com/briankung/server-load-balancing
I have zero background in academic computer science, so it may take me a while to understand even one of these papers.
If a paper has something better than JIQ it should already contain benchmark results, so you could pick the paper with the best claimed performance and just implement that. The usual disclaimers about academic work apply.
I thought the point from that article was a bit of a stretch, as most of the requests are going to be async ad requests or CDN requests, and hopefully none of that blocks the user experience. I'd still grant that most sessions would experience it, which is nearly as bad. Overall though it was really informative and interesting.
That latency tip is so naive.... it is assuming that performance profiles are randomly distributed. In the real world, this simply isn't the case.
The end of the article has an edit saying how time-correlated events would throw off the math, but it also would be thrown off by the real-world case that all CLIENTs are not created equal, either. If there is a slow connection between a client and a server, their requests will be served slower.
Very nice. I was expecting this to ignore all the research on things like power of two choices and similar provable techniques, but the talk is actually really thorough in its research.
It's very interesting to see what experiences people have with using all of this in practice. The title seems a bit clickbaity though.
Good talk. I feel like I've traced exactly this same line of thinking and given a similar talk internally (I accidentally ended up as the LB SME), even used a Poisson distribution to model the requests and did some code demos.
So, no real conclusion, it's just interesting to peel back the layers of LB... so many people treat it as magic, when really it's a statistical lie, and perfect distribution is impossible to achieve (hence the title)!
Round robin is somewhat better than random here, because you get a more uniform distribution of requests to servers; but you do have to keep a counter, and if you have multiple load balancers with few requests, there's a chance they'll get synchronized and you'll get bursts of traffic to one server, then the next, etc.
You still end up with uneven load if the requests are not equal in the load they generate (essentially always), but trying to be smart (choosing the least loaded/queued server) doesn't really work either (too much work on the load balancer, too much latency between measuring load and picking a server, etc.).
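For concreteness, a minimal single-process round-robin picker with the counter mentioned above (a sketch only; nothing here prevents the multi-balancer synchronization problem):

```go
// Round-robin backend picker: an atomic counter spreads requests evenly
// across backends within a single load balancer process.
package main

import (
	"fmt"
	"sync/atomic"
)

type RoundRobin struct {
	counter  uint64
	backends []string
}

func (r *RoundRobin) Pick() string {
	n := atomic.AddUint64(&r.counter, 1)
	return r.backends[(n-1)%uint64(len(r.backends))]
}

func main() {
	rr := &RoundRobin{backends: []string{"a", "b", "c"}}
	for i := 0; i < 6; i++ {
		fmt.Println(rr.Pick()) // a b c a b c
	}
}
```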
Imagine a beach. A beach is made of particles of different sizes, aka length.
Now physics applies to this beach, as a line of "processors" that each grab one particle out of the array and release it to the future beach, taking a time equal to the length of the particle. There can be emptiness between particles.
First property of this is "entropy": the beach self-sorts through interaction by length, as the time taken to process is close to that of a different-sized particle.
Second property is the "light-speed" of this physical simulation. This is the maximum particle length multiplied by the number of processors.
Third property is causality: the range out to the furthest particle reachable within the light-speed length, again multiplied by the light-speed. Obviously causality can be reduced by reducing light-speed.
Fourth property is non-interaction: if you add traversable, unmoving particles into the scheduler queues, this warps "time"; large particles are forced to jump over them, while small particles take free particle-size time to traverse them and, being non-interacting, are invisible to the huge particles.
Sixth property is determinism. The whole physics simulation is precise, and from a deterministic input we can derive a deterministic output, as long as causality remains unviolated.
Now for scheduling, the data-physics are rather simplistic, as the processor front, or "time", is not repeatedly interacting with the same beach. We can also determine ahead of time which processor interacts with which particle, assuming light-speed is not altered.
Also note that a big particle being processed equals reduced light-speed.
Now what is optimal scheduling? Optimizing for throughput, as in shoving huge particles to dedicated processors, with huge fields of non-interacting particles?
Fair scheduling, as in having all particles take an average particle-size time going through?
Preventing a causality buildup beyond light-speed?
PS: I once wrote a little visualization of these data-physics and showed it around algorithm class; nobody but the prof ever cared for it. Nice to meet people who are fascinated by mind games.
PS^2: You can use similar properties to compress data, via Conway's Game of Life, into deterministic lock-step simulations. I never saw that used in the wild, as you need either pre-computed simulations to be fast, or some sort of relational operator on some pre-set of simple simulations.
PS^3: Sorry, got carried away. This is fascinating. Should upload the simulation to GitHub. Excuse any typos.
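Not the original visualization, but a toy sketch of the model as described above: the earliest-free processor in the line grabs the next particle and stays busy for a time equal to its length.

```go
// A toy simulation of the "beach" model: processors pull particles of
// varying length in order; each stays busy for the particle's length.
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const processors = 4
	beach := make([]int, 20) // particle lengths
	for i := range beach {
		beach[i] = 1 + rand.Intn(9)
	}

	free := make([]int, processors) // time at which each processor is free
	for _, length := range beach {
		// the earliest-free processor grabs the next particle
		p := 0
		for i := 1; i < processors; i++ {
			if free[i] < free[p] {
				p = i
			}
		}
		fmt.Printf("t=%2d processor %d takes particle of length %d\n", free[p], p, length)
		free[p] += length
	}
}
```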
That's AFAIK not done because of the performance overhead.
Most requests take only a fraction of a second to answer. To actually monitor threads at that frequency you'd need a significant amount of CPU not only for the monitoring (ever looked at the CPU usage while running top?) but also for the network communication to send that data over to the LB, which then needs to process it for load balancing.
Interesting talk, and I am wondering if the OP tried load-balancing algorithms like "leastresponsetime" and "leastconnection"? Today LBs are pretty advanced, and sometimes they aren't even used at all, replaced by technologies like ECMP.
For the case of load balancing as in the OP, a statement of the problem: over time, requests for work arrive at an Internet server farm. The requests differ in which computing resources at the server farm they need and, for each resource, how much work. E.g., a request may need 10 GB of main memory and 40 GB of virtual paged memory, 3 TB of I/O to rotating disk, and four processor cores, all for 50 minutes.
So, at each arrival, we have to assign the work to a server. When the assignment is made, we do know something about the state of each of the servers, say, what work has been assigned and how busy the server is. The goal is, in some sense, to get the best performance from the server farm for the loads. There are various candidate definitions/measures of performance, e.g., minimize the average elapsed time for the work of the requests to be done, that is, minimize the average response time as seen by the users/customers.
So, the work of the load balancing is just the assignments, one work request at a time, to the servers. This assignment is made under uncertainty, maybe about the details of each request and, if we want to plan ahead, also about what the next requests will be. E.g., if we had some idea that soon we would get a request that would need all of one server, then maybe we should think ahead and reserve an empty server for that request. If we don't reserve such a server, then to assign this request we might have to wait until a server becomes empty, and unless we make an effort to keep such an empty server, the request could go unassigned for a long time, thus hurting the goal.
So, we have to make decisions over time under uncertainty to achieve a goal that is an average of the results. Now likely we have a problem in the field of applied math called stochastic optimal control. In our case the control is the freedom we have when we make the assignments. Stochastic means varying with uncertainty over time; the stochastic part is the uncertainty about the next requests that arrive. The optimal part is that we want to do as well as possible, likely on some average measure, only an average because we are working with uncertainty. Sure, if we don't like the average, we could use the median, etc.
IIRC there was some work on this problem by some especially diligent researchers at the IBM Watson lab, based in part on work of H. Kushner at the Division of Applied Mathematics at Brown University. IIRC the good news is that those researchers made good progress. IIRC the bad news was that the computations for their solution were a bit too large to be practical! Also, from the references I saw them using, the math prerequisites for their work were a bit much! So, maybe good work for a solution is in the literature but too difficult to use. Maybe.
My suggestion: be practical. First cut, assign the next request for work to the least busy server. If the server farm is so busy that often all the servers are too busy for more work, then we have to leave requests in a FIFO queue, waiting until some server is not too busy for the first request in the queue. Collect empirical, production data on this control, the FIFO queue, the response times, any problems, etc. If we see no significant problems, then declare the problem solved for now. Else focus on where the problems are and look for a control that also handles those problems.
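A minimal sketch of that first cut, with hypothetical Server/Balancer types: assign to the least busy server with spare capacity, otherwise park the request in a FIFO queue until a slot frees up.

```go
// Least-busy assignment with a FIFO fallback queue, assuming each server
// exposes a current load count and a capacity.
package main

import "fmt"

type Server struct {
	Name     string
	Load     int // requests currently assigned
	Capacity int
}

type Balancer struct {
	Servers []*Server
	Queue   []string // FIFO of waiting request IDs
}

// Assign places the request on the least busy server with spare capacity,
// otherwise it joins the FIFO queue.
func (b *Balancer) Assign(req string) {
	var best *Server
	for _, s := range b.Servers {
		if s.Load < s.Capacity && (best == nil || s.Load < best.Load) {
			best = s
		}
	}
	if best == nil {
		b.Queue = append(b.Queue, req) // all servers too busy
		return
	}
	best.Load++
	fmt.Printf("%s -> %s\n", req, best.Name)
}

// OnFinish frees a slot and drains the head of the queue if anything waits.
func (b *Balancer) OnFinish(s *Server) {
	s.Load--
	if len(b.Queue) > 0 {
		req := b.Queue[0]
		b.Queue = b.Queue[1:]
		b.Assign(req)
	}
}

func main() {
	b := &Balancer{Servers: []*Server{
		{Name: "s1", Capacity: 2},
		{Name: "s2", Capacity: 2},
	}}
	for i := 1; i <= 6; i++ {
		b.Assign(fmt.Sprintf("req%d", i))
	}
	fmt.Println("queued:", b.Queue)
}
```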
This material about stochastic optimal control was heavily from R. Bellman. Other authors include W. Fleming, D. Bertsekas, S. Shreve, E. Dynkin, R. Rockafellar, and R. Wets. The field has been of interest to parts of applied math (e.g., Brown), operations research, and systems and electronic engineering.
This load balancing at a server farm contains as a special case some of the topics in queuing theory. Computer science has long been interested in the problem of job scheduling on batch computer systems, which, of course, is a relatively small special case of load balancing as in the OP. IIRC the usual result of the job scheduling work was that the problem was NP-complete and otherwise in practice more work to do than the work being scheduled! Maybe now computer science will also be interested in something like stochastic optimal control for load balancing.