It seems that the speaker is discussing small servers; 150 concurrent requests per server/process sounds very small to me (but maybe this is common in enterprise?); in this case, I can see why the variance introduced by the random algorithm is such a problem.
But if you have an efficient system where each server can handle 10K+ concurrent requests, then the variance introduced by the random algorithm becomes insignificant. Also, it doesn't matter that some requests take much longer than others; if you're dealing with a lot of users per server, the slow requests will be distributed evenly too and average out.
I implemented the balls-and-bins algorithm and I found that with few buckets and a large number of balls, the variance in load between buckets became small. I even added randomness to the 'weight' of each request (to account for slow vs fast requests) and the distribution was still even - the more requests each server can handle, the more even the distribution is with the random algorithm.
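For anyone who wants to reproduce this, here's a rough sketch (not the commenter's original code) of that experiment: assign randomly weighted requests to servers uniformly at random and watch the relative spread between the most and least loaded server shrink as the request count grows.

```go
// A rough sketch of the balls-and-bins experiment: assign n randomly
// weighted requests to m servers uniformly at random and report the
// spread in total load per server, relative to the average load.
package main

import (
	"fmt"
	"math/rand"
)

func spread(n, m int) float64 {
	load := make([]float64, m)
	for i := 0; i < n; i++ {
		weight := 0.1 + rand.Float64() // fast vs slow requests
		load[rand.Intn(m)] += weight
	}
	min, max, total := load[0], load[0], 0.0
	for _, l := range load {
		if l < min {
			min = l
		}
		if l > max {
			max = l
		}
		total += l
	}
	mean := total / float64(m)
	return (max - min) / mean // spread relative to the average load
}

func main() {
	for _, n := range []int{150, 1500, 15000, 150000} {
		fmt.Printf("%6d requests over 10 servers: relative spread %.2f\n",
			n, spread(n, 10))
	}
}
```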
10K+ concurrent requests to a single server sounds like a server that is mostly shuffling bits around. I assume the 150 concurrent is more realistic for a frontend server which actually does something.
Also note the speaker is the CTO of a CDN (fast.ly); I am guessing he has experience with highly concurrent workloads as well :)
10k requests/s is the standard I expect to achieve with an HTTP load balancer (HAProxy or nginx) without any trouble.
It may be less with HTTPS, depending on how efficient the CPU is at cryptographic algorithms and how many cores it has.
fastly is a CDN. I assume that they're talking about the performance of their own CDN servers, not the web servers of their clients that will actually process the request, which is indeed a much slower operation.
Yeah I guess I'm mostly thinking of a typical REST API workload where you just query a database, maybe do some small transformations to the data and then send it to clients.
Also, it'd have to be a server with good async support like Node.js, Go, Haskell, Scala, Tornado, nginx...
Considering that the speaker is CTO at a CDN company (which has to deal with a lot of different kinds of back-ends that are outside of their control), it makes sense that they would need to use an algorithm which can handle all possible scenarios - they can't force their customers to use bigger servers and more efficient systems.
> 10K+ concurrent requests to a single server sounds like a server that is mostly shuffling bits around.
Isn't that all a load balancer is supposed to do? I certainly don't want my load balancers performing computations or logic. I want it to pass that work off to another server.
If you actually intend to serve traffic back to the original requestor, you can't handle too many requests per server because you have a limited amount of bandwidth. 150 requests per server on a 1 Gbps pipe (which is what you get on, for instance, a single ELB instance) only gives you about 813 KB/s per request. For 10,000 requests you'd be down to about 12 KB/s per request. For comparison, the average web page size is now up to 2048 KB!
To be clear, you can do a lot better with a better pipe, smart caching, compression, etc. But people often have horribly unrealistic estimates about how much traffic their servers can handle because they don't take bandwidth into account, and load balancers are no exception.
Of course, when you break it down by individual web request, most responses are still below 800 KB, but you shouldn't plan capacity for only the average case. And clearly even the average case is well above 12 KB, especially for a CDN (which is responsible for serving the image, video, and large script content). I'm also pretty confident the page I linked already includes compression (which decreases size but can increase time quite a bit; many people expect software load balancers to be using the absolute fastest compression available, but that's often not the case in my experience).
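To make that arithmetic easy to check, here's the same back-of-the-envelope calculation as a tiny program (assuming a 1 Gbps pipe and binary KB, which reproduces the figures above):

```go
// Back-of-the-envelope bandwidth math: divide a 1 Gbps pipe evenly
// across concurrent requests and report KB/s per request.
package main

import "fmt"

func main() {
	const pipeBitsPerSec = 1_000_000_000.0 // 1 Gbps
	bytesPerSec := pipeBitsPerSec / 8
	for _, concurrent := range []float64{150, 10000} {
		kbPerReq := bytesPerSec / concurrent / 1024
		fmt.Printf("%6.0f concurrent requests: ~%.0f KB/s each\n", concurrent, kbPerReq)
	}
}
```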
Sure, if you actually set the headers correctly and the browser actually retains them (recall that browsers are free to discard their caches at any time). And, if they can't be cached on the user's computer, who caches them? That's right, the CDNs. Like fast.ly. Who wrote the article.
One of the approaches, and the last one Tyler, the speaker, introduced in the talk to improve the fat-tailed distribution of load balancing latency, was the Join-Idle-Queue algorithm [1], which came out of the UIUC Extreme Computing Group, Microsoft Research, and Azure in 2011.
And according to Tyler's data (fast.ly CDN data since he's the CTO there), it beats all the current load balancing algorithms including Join-Shortest-Queue. Six years later, I wonder:
a) have any other novel / better algorithms topped JIQ in LB performance?
b) does Microsoft Azure use JIQ internally in their LB offerings?
c) is any open source LB implementation of the JIQ algorithm on the horizon? (according to Tyler, he found none.)
Can someone with expertise shed some light on these? Thanks.
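For anyone curious what the core idea looks like in code, here's a minimal, hypothetical sketch of JIQ as I understand it from the paper (not any production implementation): idle servers report themselves to a dispatcher's idle queue; the dispatcher assigns to a reported-idle server when one exists and otherwise falls back to a random pick.

```go
// A minimal sketch of the Join-Idle-Queue idea: servers report themselves
// idle, and the dispatcher prefers idle servers over random assignment.
package main

import (
	"fmt"
	"math/rand"
)

type Dispatcher struct {
	idle    chan int // queue of server IDs that reported themselves idle
	servers int
}

func NewDispatcher(servers int) *Dispatcher {
	return &Dispatcher{idle: make(chan int, servers), servers: servers}
}

// ReportIdle is called by a server when its work queue drains.
func (d *Dispatcher) ReportIdle(serverID int) {
	select {
	case d.idle <- serverID:
	default: // idle queue full; drop the report
	}
}

// Assign picks an idle server if any has reported in, else a random one.
func (d *Dispatcher) Assign() int {
	select {
	case id := <-d.idle:
		return id
	default:
		return rand.Intn(d.servers)
	}
}

func main() {
	d := NewDispatcher(4)
	d.ReportIdle(2)
	fmt.Println(d.Assign()) // 2: the idle server gets the next request
	fmt.Println(d.Assign()) // no idle servers left, falls back to random
}
```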
The JIQ algorithm seems to have very nice overlap with the work being done on reactive streams and reactive network protocols, for instance http://reactivesocket.io/
The idea being that, on top of a general purpose message stream, you overlay a back-channel where the stream receiver can signal to the producer how many more messages it's ready to handle.
Effectively what this does is move the implicit back-pressure you get from network buffers up to the application level, where something like an app server can reason about how much load it's willing to take on.
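Here's a rough sketch of that credit-based idea using plain Go channels (not the ReactiveSocket protocol itself): the receiver grants credits over a back-channel and the producer only sends that many messages before waiting for more.

```go
// Application-level back-pressure sketch: the receiver requests n items
// over a credit channel; the producer emits exactly that many.
package main

import "fmt"

func producer(data chan<- int, credits <-chan int) {
	next := 0
	for n := range credits { // wait for the receiver to request n items
		for i := 0; i < n; i++ {
			data <- next
			next++
		}
	}
	close(data)
}

func main() {
	data := make(chan int)
	credits := make(chan int)

	go producer(data, credits)

	// The receiver decides how much load it is willing to take on.
	go func() {
		credits <- 3 // "I can handle 3 more messages"
		credits <- 2
		close(credits)
	}()

	for v := range data {
		fmt.Println("received", v)
	}
}
```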
Shameless plug: If you think that seems cool and are into Go, I wrote a client+server impl of ReactiveSocket that I'd love API feedback on from more experienced Go devs.. https://github.com/jakewins/reactivesocket-go
I am afraid that merely a citation reference doesn't tell the whole story. I guess someone has to dig into those papers and implement them and then use real world data to test and see, right?
If anyone does want to dig in to the papers, a friend and I made a discussion platform for research. I've added all the papers in Tyler's talk as well as two he recommended on Twitter to a list here: https://www.projectcredo.com/briankung/server-load-balancing
I have zero background in academic computer science, so it may take me a while to understand even one of these papers.
If a paper has something better than JIQ it should already contain benchmark results, so you could pick the paper with the best claimed performance and just implement that. The usual disclaimers about academic work apply.
I thought the point from that article was a bit of a stretch, as most of the requests are going to be async ad requests or CDN requests, and hopefully none of that blocks the user experience. I'd still grant that most sessions would experience it, which is nearly as bad. Overall though it was really informative and interesting.
That latency tip is so naive.... it is assuming that performance profiles are randomly distributed. In the real world, this simply isn't the case.
The end of the article has an edit saying how time-correlated events would throw off the math, but it also would be thrown off by the real-world case that all CLIENTs are not created equal, either. If there is a slow connection between a client and a server, their requests will be served slower.
Very nice. I was expecting this to ignore all the research on things like power of two choices and similar provable techniques, but the talk is actually really thorough in its research.
It's very interesting to see what experiences people have with using all of this in practice. The title seems a bit clickbaity though.
Good talk. I feel like I've traced exactly this same line of thinking and given a similar talk internally (I accidentally ended up as the LB SME), even used a Poisson distribution to model the requests and did some code demos.
So, no real conclusion, it's just interesting to peel back the layers of LB... so many people treat it as magic, when really it's a statistical lie, and perfect distribution is impossible to achieve (hence the title)!
Round robin is somewhat better than random here, because you get a more uniform distribution of requests to servers; but you do have to keep a counter, and if you have multiple load balancers with few requests, there's a chance they'll get synchronized and you'll get bursts of traffic to one server, then the next, etc.
You still end up with uneven load if the requests are not equal in the load they generate (essentially always), but trying to be smart (choosing the least loaded/queued server) doesn't really work either (too much work on the load balancer, too much latency between measuring load and picking a server, etc.).
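For concreteness, a minimal single-process round-robin picker with the counter mentioned above (a sketch only; nothing here prevents the multi-balancer synchronization problem):

```go
// Round-robin backend picker: an atomic counter spreads requests evenly
// across backends within a single load balancer process.
package main

import (
	"fmt"
	"sync/atomic"
)

type RoundRobin struct {
	counter  uint64
	backends []string
}

func (r *RoundRobin) Pick() string {
	n := atomic.AddUint64(&r.counter, 1)
	return r.backends[(n-1)%uint64(len(r.backends))]
}

func main() {
	rr := &RoundRobin{backends: []string{"a", "b", "c"}}
	for i := 0; i < 6; i++ {
		fmt.Println(rr.Pick()) // a b c a b c
	}
}
```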
Imagine a beach. A beach is made of particles of different sizes, aka length.
Now physics applies to this beach, as a line of "processors" that each grab one particle out of the array and release it to the future beach, taking a time equal to the length of the particle. There can be emptiness between particles.
First property of this is "entropy": the beach self-sorts through interaction by length, as the time taken to process is close to that of a different-sized particle.
Second property is the "light-speed" of this physical simulation. This is the maximum particle length multiplied by the number of processors.
Third property is causality: the range out to the furthest particle reachable within the light-speed length, again multiplied by the light-speed. Obviously causality can be reduced by reducing light-speed.
Fourth property is non-interaction: if you add traversable, unmoving particles into the scheduler queues, this warps "time"; large particles are forced to jump over them, while small particles take free particle-size time to traverse them and, being non-interacting, are invisible to the huge particles.
Sixth property is determinism. The whole physics simulation is precise, and from a deterministic input we can derive a deterministic output, as long as causality remains unviolated.
Now for scheduling, the data-physics are rather simplistic, as the processor front, or "time", is not repeatedly interacting with the same beach. We can also determine ahead of time which processor interacts with which particle, assuming light-speed is not altered.
Also note that a big particle being processed equals reduced light-speed.
Now what is optimal scheduling? Optimizing for throughput, as in shoving huge particles to dedicated processors, with huge fields of non-interacting particles?
Fair scheduling, as in having all particles take an average particle-size time going through?
Preventing a causality buildup beyond light-speed?
PS: I once wrote a little visualization of these data-physics and showed it around algorithm class; nobody but the prof ever cared for it. Nice to meet people who are fascinated by mind games.
PS^2: You can use similar properties to compress data, via Conway's Game of Life, into deterministic lock-step simulations. I never saw that used in the wild, as you need either pre-computed simulations to be fast, or some sort of relational operator on some pre-set of simple simulations.
PS^3: Sorry, got carried away. This is fascinating. Should upload the simulation to GitHub. Excuse any typos.
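Not the original visualization, but a toy sketch of the model as described above: the earliest-free processor in the line grabs the next particle and stays busy for a time equal to its length.

```go
// A toy simulation of the "beach" model: processors pull particles of
// varying length in order; each stays busy for the particle's length.
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const processors = 4
	beach := make([]int, 20) // particle lengths
	for i := range beach {
		beach[i] = 1 + rand.Intn(9)
	}

	free := make([]int, processors) // time at which each processor is free
	for _, length := range beach {
		// the earliest-free processor grabs the next particle
		p := 0
		for i := 1; i < processors; i++ {
			if free[i] < free[p] {
				p = i
			}
		}
		fmt.Printf("t=%2d processor %d takes particle of length %d\n", free[p], p, length)
		free[p] += length
	}
}
```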
That's AFAIK not done because of the performance overhead.
Most requests take only a fraction of a second to answer. To actually monitor threads at that frequency you'd need a significant amount of CPU not only for the monitoring (ever looked at the CPU usage while running top?) but also for the network communication to send that data over to the LB, which then needs to process it for load balancing.
Interesting talk, and I am wondering if the OP tried load-balancing algorithms like "leastresponsetime" and "leastconnection"? Today LBs are pretty advanced, and sometimes they aren't even used at all, replaced by technologies like ECMP.
For the case of load balancing as in the OP, a statement of the problem: over time, requests for work arrive at an Internet server farm. The requests differ in which computing resources at the server farm they need and, for each resource, how much work. E.g., a request may need 10 GB of main memory and 40 GB of virtual paged memory, 3 TB of I/O to rotating disk, and four processor cores, all for 50 minutes.
So, at each arrival, we have to assign the work to a server. When the assignment is made, we do know something about the state of each of the servers, say, what work has been assigned and how busy the server is. The goal is, in some sense, to get the best performance from the server farm for the loads. There are various candidate definitions/measures of performance, e.g., minimize the average elapsed time for the work of the requests to be done, that is, minimize the average response time as seen by the users/customers.
So, the work of the load balancing is just the assignments, one work request at a time, to the servers. This assignment is made under uncertainty, maybe about the details of each request and, if we want to plan ahead, also about what the next requests will be. E.g., if we had some idea that soon we would get a request that would need all of one server, then maybe we should think ahead and reserve an empty server for that request. If we don't reserve such a server, then to assign this request we might have to wait until a server becomes empty, and unless we make an effort to keep such an empty server, the request could go unassigned for a long time, thus hurting the goal.
So, we have to make decisions over time under uncertainty to achieve a goal that is an average of the results. Now likely we have a problem in the field of applied math called stochastic optimal control. In our case the control is the freedom we have when we make the assignments. Stochastic means varying with uncertainty over time; the stochastic part is the uncertainty about the next requests that arrive. The optimal part is that we want to do as well as possible, likely on some average measure, only an average because we are working with uncertainty. Sure, if we don't like the average, we could use the median, etc.
IIRC there was some work on this problem by some especially diligent researchers at the IBM Watson lab, based in part on work of H. Kushner at the Division of Applied Mathematics at Brown University. IIRC the good news is that those researchers made good progress. IIRC the bad news was that the computations for their solution were a bit too large to be practical! Also, from the references I saw them using, the math prerequisites for their work were a bit much! So, maybe good work for a solution is in the literature but too difficult to use. Maybe.
My suggestion: be practical. First cut, assign the next request for work to the least busy server. If the server farm is so busy that often all the servers are too busy for more work, then we have to leave requests in a FIFO queue, waiting until some server is not too busy for the first request in the queue. Collect empirical, production data on this control, the FIFO queue, the response times, any problems, etc. If we see no significant problems, then declare the problem solved for now. Else focus on where the problems are and look for a control that also handles those problems.
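A minimal sketch of that first cut, with hypothetical Server/Balancer types: assign to the least busy server with spare capacity, otherwise park the request in a FIFO queue until a slot frees up.

```go
// Least-busy assignment with a FIFO fallback queue, assuming each server
// exposes a current load count and a capacity.
package main

import "fmt"

type Server struct {
	Name     string
	Load     int // requests currently assigned
	Capacity int
}

type Balancer struct {
	Servers []*Server
	Queue   []string // FIFO of waiting request IDs
}

// Assign places the request on the least busy server with spare capacity,
// otherwise it joins the FIFO queue.
func (b *Balancer) Assign(req string) {
	var best *Server
	for _, s := range b.Servers {
		if s.Load < s.Capacity && (best == nil || s.Load < best.Load) {
			best = s
		}
	}
	if best == nil {
		b.Queue = append(b.Queue, req) // all servers too busy
		return
	}
	best.Load++
	fmt.Printf("%s -> %s\n", req, best.Name)
}

// OnFinish frees a slot and drains the head of the queue if anything waits.
func (b *Balancer) OnFinish(s *Server) {
	s.Load--
	if len(b.Queue) > 0 {
		req := b.Queue[0]
		b.Queue = b.Queue[1:]
		b.Assign(req)
	}
}

func main() {
	b := &Balancer{Servers: []*Server{
		{Name: "s1", Capacity: 2},
		{Name: "s2", Capacity: 2},
	}}
	for i := 1; i <= 6; i++ {
		b.Assign(fmt.Sprintf("req%d", i))
	}
	fmt.Println("queued:", b.Queue)
}
```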
This material about stochastic optimal control was heavily from R. Bellman. Other authors include W. Fleming, D. Bertsekas, S. Shreve, E. Dynkin, R. Rockafellar, and R. Wets. The field has been of interest to parts of applied math (e.g., Brown), operations research, and systems and electronic engineering.
This load balancing at a server farm contains as a special case some of the topics in queuing theory. Computer science has long been interested in the problem of job scheduling on batch computer systems, which, of course, is a relatively small special case of load balancing as in the OP. IIRC the usual result of the job scheduling work was that the problem was NP-complete and otherwise in practice more work to do than the work being scheduled! Maybe now computer science will also be interested in something like stochastic optimal control for load balancing.