"The percentile values we observed had quite unusual distribution"
It is pretty normal to have a long tail in the last few percentiles/tenths. Fortunately, most latency monitoring tools do account for this now. Another gotcha is that 99.9 isn't the end of your tail either. Sometimes the "100th percentile" request is such an outlier that looking at it isn't useful, but you should know it exists.
(regarding "Speculative task") Also called "Hedged Request" here in an article called "The Tail at Scale"[1]
Note that overhead that matters for Google doesn't matter as much for most other companies. Don't overcomplicate things just because Google does it like this.
I have found that in many distributed systems the latency of top-level requests follows a log-normal distribution, which sounds like what you are describing. But by tuning when you launch backup requests, you can trade off overhead for reduced tail latencies. After experimenting with it at Twitter I then found a Jeff Dean paper from Google describing it as well. I haven't been able to find that paper again, though. Here is his presentation where he discusses backup requests:
The most common case I can think of is when you have local caches and randomly distribute traffic amongst your servers. The hit rate on the cache improves as you reduce the number of machines. If the cache expires before the same machine is hit again, you get no benefit from the cache at all. Even in the cases where you are hitting the cache, you are doing more requests to the original source in total as you add more servers.
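A back-of-envelope sketch of that effect in Haskell (the uniform random routing assumption and the example numbers are mine, just to illustrate): with traffic spread evenly over n servers, a given key hits each server roughly n times less often, so the expected gap between hits grows with n and the local cache stops helping once that gap exceeds the TTL.

-- Rough model, assuming uniform random routing: the expected gap between
-- requests for one key on one server is n / rate, and the local cache only
-- helps while that gap stays below the cache TTL.
expectedGapSeconds :: Double -> Int -> Double
expectedGapSeconds ratePerSecond n = fromIntegral n / ratePerSecond

cacheStillHelps :: Double -> Int -> Double -> Bool
cacheStillHelps ratePerSecond n ttlSeconds = expectedGapSeconds ratePerSecond n < ttlSeconds
-- e.g. cacheStillHelps 50 10 1 == True, but cacheStillHelps 50 100 1 == False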
Removing capacity probably doesn't include removing capacity at all levels of the stack - perhaps it only means reducing the number of frontends, not db servers. That'd reduce load on the db server, allowing it to serve faster.
I've seen this approach appear without a name very often in highly concurrent Haskell, Erlang and Elixir code. Often a "safe" race combinator appears in good libraries that handles graceful termination. So much so that you can often find local libraries that look like this Haskell code:
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (race)
speculatively :: Int -> IO a -> IO (Either a a)
speculatively ms action = race action (threadDelay (ms * 1000) >> action) -- threadDelay takes microseconds
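For example, speculatively 50 (fetchValue key), where fetchValue is a hypothetical lookup against your backend, fires off a second identical request if the first one hasn't answered within 50 milliseconds, and race cancels whichever copy loses.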
Wonder if for certain requests that are guaranteed to return smallish data sizes, having Redis support UDP would make sense and remove the TCP issues ...
But wouldn't this solution mean that once there is a small glitch in the system, say due to higher load than normal, there are 2x the requests and the system goes down completely?
I don't mean to be negative about the solution, I am just curious.
You are still bounded by just 2x the number of requests, so no, this cannot take the system down, only slow it down a bit at worst. In practice not even that, since you should always have enough capacity to absorb more than 2x the load anyway.
However, in my experience latencies are not static; they depend on how far away the request is sent, the type of resource requested, the size of the resource, the current network load in that direction, and other factors. Which gets tricky and complicated. At some point you need to store recent latency history per size group, per resource type, per node, and dynamically calculate the 90th-percentile latency. But then things like size may not be predictable, so you may need to cap response sizes to a sufficiently small value. And so on.
If your responses are small, it's easier to just always send two requests in parallel to different servers and choose the fastest one.
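A minimal sketch of that "always send both" variant, again using race from the async library (the two fetch actions here are placeholders I'm assuming, e.g. the same query against two replicas):

import Control.Concurrent.Async (race)

-- Fire the same small request at two replicas at once and take whichever
-- answers first; race cancels the slower copy once the faster one finishes.
fastestOfTwo :: IO a -> IO a -> IO a
fastestOfTwo fetchFromReplicaA fetchFromReplicaB =
  either id id <$> race fetchFromReplicaA fetchFromReplicaB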
There is no point in having your service try to speculate what happened when that isn't its job.
As long as your service doesn't need to ensure things happen only once, or you build in a recovery mechanism (uniquely identifying events and throwing them out if your system has already seen them), being able to toss work to another instance is great in my mind.
I think the OP is saying the people involved should investigate the root cause instead of working around the problem. Not that the _service_ should try to do it when needed somehow.
> If your backend is Redis, how is starting more speculative hits to Redis going to help, since it's single threaded?
I agree; really understanding this problem is a lot better than just blindly retrying because it seems to work. Is there something wrong with the network that'll affect every service?
Retrying could be the solution, but it should be so with the acknowledgement that it's incurring technical debt.
I think it's a false dichotomy if you're thinking of finding the root cause instead of speculative requests.
You can do none, one, or both of them. You may have the team skills to tackle none, one, or both. You may or may not control, or have access to, one of the sides. You may have a timeline for solving this problem where you decide which task is faster to finish. Finally, you may have a recurring problem which loses you a specific amount of money every time, and a speculative request stops that now, while putting people on a performance-debugging mission only has a potential payoff sometime in the future.
Sure, it's technical debt. It's not illegal - you just need a really good reason to take it on.
> I think the OP is saying the people involved should investigate the root cause instead of working around the problem. Not that the _service_ should try to do it when needed somehow.
Maybe. But again, this condition can occur with no flaws at all. It may just be that you're on the edge of an autoscale event or on a box with a bad Xen peer. There may not be a problem. It doesn't matter.
So sure, folks should keep to their SLA, but it doesn't have a lot of bearing on this. It's just best practice.
As an aside, quite frankly if you're using Redis you may deserve a slow service. Redis is a very tricky service to correctly integrate into a cloud. It's massively overused by folks who prefer it on grounds of "simplicity" to more appropriate (and ultimately, forgiving) tools.
Oh also, because so many people use languages where even a basic production-quality binary tree is not easy to write.
A week ago I might have thought this would be good for our microservices, as they were already using short timeouts with a retry but abandoning the original request. This is better. However, it has limitations and isn't the whole solution. It suffers from request amplification if the called service also depends on other services and applies the same retry policy. It is still bound by request latency. The main takeaway I've learned for bounding microservice latency is that the inflow of dependent data should be asynchronous, so that request latency depends only on the service's own data.
This seems like a useful practice. Do you log how often this happens and which task is the winner? It seems like without that, you may not notice actual issues.
Since the tasks are identical, the only real value such a metric would have is in considering memory allocation for green threads and other issues around memory pressure and resource pooling.
Usually, it doesn't matter why the consumer finds it slow; it could be anything from an unusual I/O issue (AWS is notorious for brief but painful networking issues) to a VM peer suddenly becoming ill behaved before being throttled.
If you find that something shifts and you are almost always starting the second task, but the first task is still winning, then you are in a state where the solution has become extra load for no benefit.
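If you are using something like the speculatively combinator from the Haskell snippet earlier in the thread, the Either it returns already tells you which copy won, so a minimal sketch of tracking that (the IORef counters are just an assumption; a real service would feed its metrics library instead):

import Data.IORef (IORef, modifyIORef')

-- Bump a counter for whichever side of the race won, so you can spot the
-- "almost always hedging but the original still wins" pattern described above.
recordWinner :: IORef Int -> IORef Int -> Either a a -> IO ()
recordWinner originalWins hedgeWins result = case result of
  Left  _ -> modifyIORef' originalWins (+ 1)
  Right _ -> modifyIORef' hedgeWins    (+ 1)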
(regarding "Speculative task") Also called "Hedged Request" here in an article called "The Tail at Scale"[1]
[1] http://www-inst.eecs.berkeley.edu/~cs252/sp17/papers/TheTail...