Hacker News

Sometimes you want the exact opposite, though. Consider an endpoint that makes 100 behind-the-scenes requests (say, to S3). You absolutely want to retry at the lowest level, not the highest level. You could fail on the 99th request. If you kick it up, the caller will retry and you'll do those 99 requests again, instead of just retrying the one that failed. With enough requests, there's a point where you're unlikely ever to succeed without one of the calls failing if you restart from the beginning every time. I don't think you can "one size fits all" this, and that's one of the reasons retries are hard.

I use S3 as an example because it has a comparatively high random failure rate. You MUST implement timeouts and retries if you're using S3 as a backend.
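A minimal sketch of what retrying at the lowest level looks like, assuming a hypothetical `fetch_object` call that can raise a transient error (the names and the `TransientError` type are illustrative, not any real S3 client's API):

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable failure (e.g. an S3 500 or a timeout)."""

def with_retries(fn, *, attempts=5, base_delay=0.1):
    """Retry fn with exponential backoff and jitter; re-raise when exhausted."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

def fetch_all(keys, fetch_object):
    # Retry each sub-request independently: a failure on key 99
    # re-runs only that one fetch, not the previous 98.
    return [with_retries(lambda k=k: fetch_object(k)) for k in keys]
```

The point is that the retry loop wraps the individual call, so one flaky sub-request never forces the whole batch to start over.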



That's why I said "except if your SLA requires it." If you've agreed--with other teams, with outside clients, or just as a project goal--that your service should work on 99.9% of calls, and you find in practice that you need to retry S3 calls to meet that target, then adding retries is reasonable.

If the problem is just that the 99 calls could overload downstream services on retries, the ideal solution is to add rate-limiting, though admittedly that is an imprecise science.
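Rate-limiting the downstream calls can be sketched with a simple token bucket (entirely illustrative, not any particular library's API):

```python
import time

class TokenBucket:
    """Allow up to `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)
```

Calling `bucket.acquire()` before each downstream request smooths retries into a bounded rate instead of a thundering herd.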


There's never a reason that a human needs to read an error message caused merely by Kubernetes restarting a pod in the normal course of operations.

Passing every error up the stack causes people to be confused by 500 status errors for trivially retry-able calls. Browsers (and TCP) natively implement retries despite there being no SLAs in place.


> Consider an endpoint that makes 100 behind-the-scenes requests

Don't do that; don't ever make middleware retry.

An API endpoint should be quick, and if you can't figure out how to make it quick, use batch processing metaphors (submit task/query task) and a task-runner instead. The task-runner can retry indefinitely, but a user needs to be given feedback or they will push the refresh button.
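The submit-task/query-task pattern described above can be sketched roughly like this (the function names and the in-memory task store are illustrative; a real system would persist tasks and run them in workers):

```python
import threading
import uuid

tasks = {}  # task_id -> {"status": ..., "result": ...}

def submit_task(work, *args):
    """Start work in the background and return a task id immediately."""
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"status": "running", "result": None}

    def run():
        try:
            tasks[task_id] = {"status": "done", "result": work(*args)}
        except Exception as e:
            tasks[task_id] = {"status": "failed", "result": str(e)}

    threading.Thread(target=run, daemon=True).start()
    return task_id

def query_task(task_id):
    """Return status so the client can poll instead of hitting refresh."""
    return tasks[task_id]
```

The endpoint returns instantly with an id; the task-runner behind it is free to retry as long as it likes, and the user gets feedback by polling.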


Lisp's conditions and restarts are great for this, so you can make the policy decision at a high level while not fully unrolling the stack and allowing retries to be done in the original context if that's the verdict.
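Lisp's condition system doesn't map directly onto most languages, but a rough analogue of "policy decided at the top, retry performed in the original context" can be sketched in Python by passing a handler down the stack (all names here are illustrative):

```python
# The high-level caller installs a policy; the low-level code consults it
# at the point of failure, so the retry happens in the original context
# instead of after unwinding the stack back up to the caller.

RETRY, ABORT = "retry", "abort"

def fetch_with_restarts(key, store, handler, max_attempts=3):
    """Low-level operation: on failure, ask the high-level handler what to do."""
    for attempt in range(max_attempts):
        try:
            return store(key)
        except IOError as e:
            if handler(e, attempt) != RETRY:
                raise
    raise IOError(f"gave up on {key}")

# High-level policy decision, made without losing the failure context:
def policy(error, attempt):
    return RETRY if attempt < 2 else ABORT
```

Unlike real restarts this only covers one pre-declared recovery, but it shows the shape: the decision lives at the top while the work stays put.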


If a client retries, it's not necessary for all 100 requests to be attempted again. The system could be designed to only repeat the unsuccessful operations.
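One way to sketch that design: track which sub-operations succeeded and resubmit only the failures on each retry round (the function names are illustrative):

```python
def process_batch(items, do_one):
    """Attempt each item once; return successes plus the items that failed."""
    results, failed = {}, []
    for item in items:
        try:
            results[item] = do_one(item)
        except Exception:
            failed.append(item)
    return results, failed

def process_until_done(items, do_one, max_rounds=5):
    # Each retry round resubmits only the items that failed last time,
    # so earlier successes are never repeated.
    results = {}
    pending = list(items)
    for _ in range(max_rounds):
        done, pending = process_batch(pending, do_one)
        results.update(done)
        if not pending:
            return results
    raise RuntimeError(f"still failing after {max_rounds} rounds: {pending}")
```

This only works cleanly when the sub-operations are independent and idempotent, which is why the thread keeps circling back to where the retry loop belongs.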


You mean like retrying at the lowest level?


You do want the retry policy to live and be evaluated at the outermost level where your business logic lives. If that level sits across an RPC boundary, you're stuck making this weird trade-off, and that's where I think this back and forth is happening: people have different mental models shaped by the specific service they're familiar with, which they use to test the recommendation against, and because there's a trade-off to be made, the advice can be correct in some contexts and wrong in others. If you can encode your policy and its evaluation generically so it flows through the stack, that's not terrible, although it becomes hard to manage certain other kinds of SLAs (e.g. bounding the latency of your overall operation, or of a specific suboperation).



