Hacker News

Sometimes you want the exact opposite, though. Consider an endpoint that makes 100 behind-the-scenes requests (say, to S3). You absolutely want to retry at the lowest level, not the highest level. You could fail on the 99th request. If you kick it up, the caller will retry and you'll do those 99 requests again, instead of just retrying the one that failed. With enough requests, there's a point where you're unlikely ever to succeed without one of the calls failing if you restart from the beginning every time. I don't think you can "one size fits all" this, and that's one of the reasons retries are hard.

I use S3 as an example because it has a comparatively high random failure rate. You MUST implement timeouts and retries if you're using S3 as a backend.
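A minimal sketch of what retrying at the lowest level looks like, assuming a hypothetical `fetch_object` call that can raise a transient error (the names and the `TransientError` type are illustrative, not any real S3 client's API):

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable failure (e.g. an S3 500 or a timeout)."""

def with_retries(fn, *, attempts=5, base_delay=0.1):
    """Retry fn with exponential backoff and jitter; re-raise when exhausted."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

def fetch_all(keys, fetch_object):
    # Retry each sub-request independently: a failure on key 99
    # re-runs only that one fetch, not the previous 98.
    return [with_retries(lambda k=k: fetch_object(k)) for k in keys]
```

The point is that the retry loop wraps the individual call, so one flaky sub-request never forces the whole batch to start over.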



That's why I said "except if your SLA requires it." If you've agreed--with other teams, with outside clients, or just as a project goal--that your service should work on 99.9% of calls, and you find in practice that you need to retry S3 calls to meet that target, then adding retries is reasonable.

If the problem is just that the 99 calls could overload downstream services on retries, the ideal solution is to add rate-limiting, though admittedly that is an imprecise science.
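Rate-limiting the downstream calls can be sketched with a simple token bucket (entirely illustrative, not any particular library's API):

```python
import time

class TokenBucket:
    """Allow up to `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)
```

Calling `bucket.acquire()` before each downstream request smooths retries into a bounded rate instead of a thundering herd.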


There's never a reason that a human needs to read an error message caused merely by Kubernetes restarting a pod in the normal course of operations.

Passing every error up the stack causes people to be confused by 500 status errors for trivially retry-able calls. Browsers (and TCP) natively implement retries despite there being no SLAs in place.


> Consider an endpoint that makes 100 behind-the-scenes requests

Don't do that; don't ever make middleware retry.

An API endpoint should be quick, and if you can't figure out how to make it quick, use batch processing metaphors (submit task/query task) and a task-runner instead. The task-runner can retry indefinitely, but a user needs to be given feedback or they will push the refresh button.
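The submit-task/query-task pattern described above can be sketched roughly like this (the function names and the in-memory task store are illustrative; a real system would persist tasks and run them in workers):

```python
import threading
import uuid

tasks = {}  # task_id -> {"status": ..., "result": ...}

def submit_task(work, *args):
    """Start work in the background and return a task id immediately."""
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"status": "running", "result": None}

    def run():
        try:
            tasks[task_id] = {"status": "done", "result": work(*args)}
        except Exception as e:
            tasks[task_id] = {"status": "failed", "result": str(e)}

    threading.Thread(target=run, daemon=True).start()
    return task_id

def query_task(task_id):
    """Return status so the client can poll instead of hitting refresh."""
    return tasks[task_id]
```

The endpoint returns instantly with an id; the task-runner behind it is free to retry as long as it likes, and the user gets feedback by polling.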


Lisp's conditions and restarts are great for this, so you can make the policy decision at a high level while not fully unrolling the stack and allowing retries to be done in the original context if that's the verdict.
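Lisp's condition system doesn't map directly onto most languages, but a rough analogue of "policy decided at the top, retry performed in the original context" can be sketched in Python by passing a handler down the stack (all names here are illustrative):

```python
# The high-level caller installs a policy; the low-level code consults it
# at the point of failure, so the retry happens in the original context
# instead of after unwinding the stack back up to the caller.

RETRY, ABORT = "retry", "abort"

def fetch_with_restarts(key, store, handler, max_attempts=3):
    """Low-level operation: on failure, ask the high-level handler what to do."""
    for attempt in range(max_attempts):
        try:
            return store(key)
        except IOError as e:
            if handler(e, attempt) != RETRY:
                raise
    raise IOError(f"gave up on {key}")

# High-level policy decision, made without losing the failure context:
def policy(error, attempt):
    return RETRY if attempt < 2 else ABORT
```

Unlike real restarts this only covers one pre-declared recovery, but it shows the shape: the decision lives at the top while the work stays put.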


If a client retries, it's not necessary for all 100 requests to be attempted again. The system could be designed to only repeat the unsuccessful operations.
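One way to sketch that design: track which sub-operations succeeded and resubmit only the failures on each retry round (the function names are illustrative):

```python
def process_batch(items, do_one):
    """Attempt each item once; return successes plus the items that failed."""
    results, failed = {}, []
    for item in items:
        try:
            results[item] = do_one(item)
        except Exception:
            failed.append(item)
    return results, failed

def process_until_done(items, do_one, max_rounds=5):
    # Each retry round resubmits only the items that failed last time,
    # so earlier successes are never repeated.
    results = {}
    pending = list(items)
    for _ in range(max_rounds):
        done, pending = process_batch(pending, do_one)
        results.update(done)
        if not pending:
            return results
    raise RuntimeError(f"still failing after {max_rounds} rounds: {pending}")
```

This only works cleanly when the sub-operations are independent and idempotent, which is why the thread keeps circling back to where the retry loop belongs.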


You mean like retrying at the lowest level?


You do want the retry policy to live and be evaluated at the outermost level where your business logic lives. If that level sits across an RPC boundary, you're stuck making this weird trade-off, and that's where I think this back and forth is happening: people have different mental models shaped by the specific service they're familiar with, which they use to test the recommendation against, and because there's a trade-off to be made, the advice can be correct in some contexts and wrong in others. If you can encode your policy and its evaluation generically so it flows through the stack, that's not terrible, although it becomes hard to manage certain other kinds of SLAs (e.g. bounding the latency of your overall operation, or of a specific suboperation).



