isn't it just masking the root cause of whatever's causing the delay in the ...

KirinDave · on April 13, 2018

Why does the consumer of a remote resource care about an ephemeral root cause? It's a massive scope creep for your API, which has SLAs to keep.

For all you know you're in the middle of a horizontal scaling event and hit an overloaded box, and there is no problem.
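For readers unfamiliar with the technique under discussion, here is a minimal sketch of a speculative (hedged) request, assuming a blocking client function and two interchangeable backends. The names `hedged_call`, `fake_backend`, `slow-box`, and `healthy-box` and the 50 ms hedge delay are all made up for illustration; none of them come from the thread or from a real library.

```python
import concurrent.futures as cf
import time


def hedged_call(fn, primary, backup, hedge_after=0.05):
    """Call fn(primary); if it has not finished within hedge_after seconds,
    also call fn(backup) and return whichever succeeds first."""
    pool = cf.ThreadPoolExecutor(max_workers=2)
    try:
        futures = [pool.submit(fn, primary)]
        done, _ = cf.wait(futures, timeout=hedge_after)
        if not done:
            # Primary is slow: fire the speculative second request.
            futures.append(pool.submit(fn, backup))
        pending = set(futures)
        last_error = None
        while pending:
            done, pending = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
            for fut in done:
                try:
                    return fut.result()
                except Exception as exc:
                    # Remember the failure and keep waiting on the other call.
                    last_error = exc
        raise last_error
    finally:
        # Do not block on the loser; it finishes in the background.
        pool.shutdown(wait=False)


def fake_backend(name):
    # Hypothetical backends: one overloaded box, one healthy box.
    time.sleep(0.5 if name == "slow-box" else 0.01)
    return name


winner = hedged_call(fake_backend, "slow-box", "healthy-box")
print(winner)
```

Note that this only hedges against slowness: if the primary fails quickly, the error is raised rather than retried, and the losing request still runs to completion on its backend.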

madamelic · on April 13, 2018

I agree.

There is no point in having your service try to speculate what happened when that isn't its job.

As long as your service doesn't need to ensure things happen only once or you build in that recovery mechanism (identifying unique events and throwing them out if your system has already seen them), being able to toss work to another instance is great in my mind.
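The recovery mechanism described above (identify unique events and discard the ones already seen) might look roughly like this. It is a single-process sketch with an in-memory set; `DedupingHandler` and the event IDs are hypothetical, and a real deployment would presumably need a shared store with expiry instead.

```python
class DedupingHandler:
    """Wrap a handler so each unique event ID is processed at most once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: in-memory, single-threaded

    def handle(self, event_id, payload):
        if event_id in self._seen:
            return False  # duplicate: throw it out
        self._handler(payload)
        # Mark as seen only after success so a failed attempt can be retried.
        self._seen.add(event_id)
        return True


processed = []
dedup = DedupingHandler(processed.append)
first = dedup.handle("evt-1", "charge card")
second = dedup.handle("evt-1", "charge card")  # speculative duplicate
```

With something like this in front of the real handler, a duplicate delivered by a speculative request is dropped instead of being applied twice.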

bajsejohannes · on April 13, 2018

I think the OP is saying the people involved should investigate the root cause instead of working around the problem. Not that the _service_ should try to do it when needed somehow.

As philsnow says in a different comment (https://news.ycombinator.com/item?id=16832566):

> If your backend is Redis, how is starting more speculative hits to Redis going to help, since it's single threaded?

I agree; really understanding this problem is a lot better than just blindly retrying because it seems to work. Is there something wrong with the network that'll affect every service?

Retrying could be the solution, but it should be adopted with the acknowledgement that it's incurring technical debt.

viraptor · on April 14, 2018

I think it's a false dichotomy if you're thinking of finding the root cause instead of speculative requests.

You can do none, one of them, or both. You may have team skills to tackle none, one, or both. You may or may not control / have access to one of the sides. You may have a timeline for solving this problem where you decide which task is faster to finish. Finally, you may have a recurring problem which makes you lose a specific amount of money every time and a speculative request stops that now, while putting people on a performance debugging mission may only potentially have a positive return in the future.

Sure, it's a technical debt. It's not illegal - you just need a really good reason to commit to it.

KirinDave · on April 13, 2018

> I think the OP is saying the people involved should investigate the root cause instead of working around the problem. Not that the _service_ should try to do it when needed somehow.

Maybe. But again, this condition can occur with no flaws. It may just be that you're on the edge of an autoscale event or on a box with a bad Xen peer. There may not be a problem. It doesn't matter.

So sure, folks should keep to their SLA, but it doesn't have a lot of bearing on this. It's just best practice.

KirinDave · on April 14, 2018

As an aside, quite frankly if you're using Redis you may deserve a slow service. Redis is a very tricky service to correctly integrate into a cloud. It's massively overused by folks who prefer it on grounds of "simplicity" to more appropriate (and ultimately, forgiving) tools.

Oh also, because so many people use languages where even a basic production-quality binary tree is not easy to write.