"The percentile values we observed had quite unusual distribution"
It is pretty normal to have a long tail in the last few percentiles/tenths. Fortunately, most latency monitoring tools do account for this now. Another gotcha is that 99.9 isn't the end of your tail either. Sometimes the "100th percentile" request is such an outlier that looking at it isn't useful, but you should know it exists.
(regarding "Speculative task") Also called "Hedged Request" here in an article called "The Tail at Scale"[1]
Note that overhead that matters for Google doesn't matter as much for most other companies. Don't overcomplicate things just because Google does it like this.
I have found that in many distributed systems the latency of top-level requests follows a log-normal distribution, which sounds like what you are describing. But by tuning when you launch backup requests, you can trade off overhead for reduced tail latencies. After experimenting with it at Twitter I then found a Jeff Dean paper from Google describing it as well. I haven't been able to find that paper again, though. Here is his presentation where he discusses backup requests:
The most common case I can think of is when you have local caches and randomly distribute traffic amongst your servers. The hit rate on the cache improves as you reduce the number of machines. If the cache expires before the same machine is hit again, you get no benefit from the cache at all. Even in the cases where you are hitting the cache, you are doing more requests to the original source in total as you add more servers.
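A back-of-envelope sketch of that effect in Haskell (the uniform random routing assumption and the example numbers are mine, just to illustrate): with traffic spread evenly over n servers, a given key hits each server roughly n times less often, so the expected gap between hits grows with n and the local cache stops helping once that gap exceeds the TTL.

-- Rough model, assuming uniform random routing: the expected gap between
-- requests for one key on one server is n / rate, and the local cache only
-- helps while that gap stays below the cache TTL.
expectedGapSeconds :: Double -> Int -> Double
expectedGapSeconds ratePerSecond n = fromIntegral n / ratePerSecond

cacheStillHelps :: Double -> Int -> Double -> Bool
cacheStillHelps ratePerSecond n ttlSeconds = expectedGapSeconds ratePerSecond n < ttlSeconds
-- e.g. cacheStillHelps 50 10 1 == True, but cacheStillHelps 50 100 1 == False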
Removing capacity probably doesn't include removing capacity at all levels of the stack - perhaps it only means reducing the number of frontends, not db servers. That'd reduce load on the db server, allowing it to serve faster.
I've seen this approach appear without a name very often in highly concurrent Haskell, Erlang and Elixir code. Often a "safe" race combinator appears in good libraries that handles graceful termination. So much so that you can often find local libraries that look like this Haskell code:
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (race)
speculatively :: Int -> IO a -> IO (Either a a)
speculatively ms action = race action (threadDelay (ms * 1000) >> action) -- threadDelay takes microseconds
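For example, speculatively 50 (fetchValue key), where fetchValue is a hypothetical lookup against your backend, fires off a second identical request if the first one hasn't answered within 50 milliseconds, and race cancels whichever copy loses.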
Wonder if for certain requests that are guaranteed to return smallish data sizes, having Redis support UDP would make sense and remove the TCP issues ...
But wouldn't this solution mean that once there is a small glitch in the system, say due to higher load than normal, there are 2x the requests and the system goes down completely?
I don't mean to be negative about the solution, I am just curious.
You are still bounded by just 2x the number of requests, so no, this cannot take the system down, only slow it down a bit at worst. In practice not even that, since you should always have enough capacity to absorb more than 2x the load anyway.
However, in my experience latencies are not static; they depend on how far away the request is sent, the type of resource requested, the size of the resource, the current network load in that direction, and other factors. Which gets tricky and complicated. At some point you need to store recent latency history per size group, per resource type, per node, and dynamically calculate the 90th-percentile latency. But then things like size may not be predictable, so you may need to cap response sizes to a sufficiently small value. And so on.
If your responses are small, it's easier to just always send two requests in parallel to different servers and choose the fastest one.
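A minimal sketch of that "always send both" variant, again using race from the async library (the two fetch actions here are placeholders I'm assuming, e.g. the same query against two replicas):

import Control.Concurrent.Async (race)

-- Fire the same small request at two replicas at once and take whichever
-- answers first; race cancels the slower copy once the faster one finishes.
fastestOfTwo :: IO a -> IO a -> IO a
fastestOfTwo fetchFromReplicaA fetchFromReplicaB =
  either id id <$> race fetchFromReplicaA fetchFromReplicaB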
There is no point in having your service try to speculate what happened when that isn't its job.
As long as your service doesn't need to ensure things happen only once, or you build in a recovery mechanism (uniquely identifying events and throwing them out if your system has already seen them), being able to toss work to another instance is great in my mind.
I think the OP is saying the people involved should investigate the root cause instead of working around the problem. Not that the _service_ should try to do it when needed somehow.
> If your backend is Redis, how is starting more speculative hits to Redis going to help, since it's single threaded?
I agree; really understanding this problem is a lot better than just blindly retrying because it seems to work. Is there something wrong with the network that'll affect every service?
Retrying could be the solution, but it should be so with the acknowledgement that it's incurring technical debt.
I think it's a false dichotomy if you're thinking of finding the root cause instead of speculative requests.
You can do none, one, or both of them. You may have the team skills to tackle none, one, or both. You may or may not control, or have access to, one of the sides. You may have a timeline for solving this problem where you decide which task is faster to finish. Finally, you may have a recurring problem which loses you a specific amount of money every time, and a speculative request stops that now, while putting people on a performance-debugging mission only has a potential payoff sometime in the future.
Sure, it's technical debt. It's not illegal - you just need a really good reason to take it on.
> I think the OP is saying the people involved should investigate the root cause instead of working around the problem. Not that the _service_ should try to do it when needed somehow.
Maybe. But again, this condition can occur with no flaws at all. It may just be that you're on the edge of an autoscale event or on a box with a bad Xen peer. There may not be a problem. It doesn't matter.
So sure, folks should keep to their SLA, but it doesn't have a lot of bearing on this. It's just best practice.
As an aside, quite frankly if you're using Redis you may deserve a slow service. Redis is a very tricky service to correctly integrate into a cloud. It's massively overused by folks who prefer it on grounds of "simplicity" to more appropriate (and ultimately, forgiving) tools.
Oh also, because so many people use languages where even a basic production-quality binary tree is not easy to write.
A week ago I might have thought this would be good for our microservices, as they were already using short timeouts with a retry but abandoning the original request. This is better. However, it has limitations and isn't the whole solution. It suffers from request amplification if the called service also depends on other services and applies the same retry policy. It is still bound by request latency. The main takeaway I've learned for bounding microservice latency is that the inflow of dependent data should be asynchronous, so that request latency depends only on the service's own data.
This seems like a useful practice. Do you log how often this happens and which task is the winner? It seems like without that, you may not notice actual issues.
Since the tasks are identical, the only real value such a metric would have is in considering memory allocation for green threads and other issues around memory pressure and resource pooling.
Usually, it doesn't matter why the consumer finds it slow; it could be anything from an unusual I/O issue (AWS is notorious for brief but painful networking issues) to a VM peer suddenly becoming ill behaved before being throttled.
If you find that something shifts and you are almost always starting the second task, but the first task is still winning, then you are in a state where the solution has become extra load for no benefit.
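If you are using something like the speculatively combinator from the Haskell snippet earlier in the thread, the Either it returns already tells you which copy won, so a minimal sketch of tracking that (the IORef counters are just an assumption; a real service would feed its metrics library instead):

import Data.IORef (IORef, modifyIORef')

-- Bump a counter for whichever side of the race won, so you can spot the
-- "almost always hedging but the original still wins" pattern described above.
recordWinner :: IORef Int -> IORef Int -> Either a a -> IO ()
recordWinner originalWins hedgeWins result = case result of
  Left  _ -> modifyIORef' originalWins (+ 1)
  Right _ -> modifyIORef' hedgeWins    (+ 1)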
(regarding "Speculative task") Also called "Hedged Request" here in an article called "The Tail at Scale"[1]
[1] http://www-inst.eecs.berkeley.edu/~cs252/sp17/papers/TheTail...