In practice, due to the way network and system failures tend to work at scale, failure of a first retry is strongly correlated with failure of a second retry. A second retry can therefore be more problematic than the first, especially if each retry adds load. From that you can infer that a single retry at the highest level is usually the right approach (as always, YMMV). It's worth measuring this for your own services in production with real workloads, by including a metric that captures how often a first and a second retry actually succeed.
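A minimal sketch of what that measurement can look like (the metric and function names here are made up, not from any particular library): wrap the call with exactly one retry and count which attempt succeeded, so production data can tell you whether the retry is earning its keep.

    METRICS = {"success_first_try": 0, "success_after_retry": 0, "failure": 0}

    def call_with_single_retry(fn, *args, **kwargs):
        try:
            result = fn(*args, **kwargs)
            METRICS["success_first_try"] += 1
            return result
        except Exception:
            pass  # fall through to the single retry
        try:
            result = fn(*args, **kwargs)
            METRICS["success_after_retry"] += 1
            return result
        except Exception:
            METRICS["failure"] += 1
            raise

If success_after_retry is tiny relative to failure, the retry is mostly amplifying load rather than recovering requests.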

When you don't choose zero or one as your retry count, there's a strong risk of implementing a retry strategy whose effect is multiplicative across layers. E.g. given 3 layers, each doing one try plus 3 retries, a failure at the lowest level can be amplified up to 64x (4x4x4). Retries are an easy way to overload a service that would otherwise recover from a problematic situation.
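The arithmetic, spelled out as a toy illustration:

    # With `attempts` tries per layer (1 try + retries), a single failure
    # at the bottom can be driven attempts ** layers times in the worst case.
    def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
        return attempts_per_layer ** layers

    print(worst_case_calls(4, 3))  # 64, the 4x4x4 case above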

Adaptive retries using a token bucket or circuit breaker approach are a reasonable alternative to zero/one.
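Roughly what a retry token bucket looks like (all names and parameters here are assumptions for the sketch, not any specific library's API): retries spend tokens and successful first attempts slowly refill them, so retries dry up during a sustained outage instead of amplifying the load.

    class RetryTokenBucket:
        def __init__(self, capacity=100.0, retry_cost=5.0, refill_per_success=1.0):
            self.capacity = capacity
            self.tokens = capacity
            self.retry_cost = retry_cost
            self.refill_per_success = refill_per_success

        def acquire_retry(self) -> bool:
            # Allow a retry only while enough tokens remain.
            if self.tokens >= self.retry_cost:
                self.tokens -= self.retry_cost
                return True
            return False

        def record_success(self) -> None:
            # Earn capacity back gradually as the dependency recovers.
            self.tokens = min(self.capacity, self.tokens + self.refill_per_success)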

In practice, for resilient systems, you can go even further than zero retries when you have shared knowledge of an outage in the downstream service (e.g. from concurrent calls made from the same source). You can choose not to make the call at all, and only send a small number of calls through so the service can recover sanely. Obviously this is only useful for calls that are optional parts of a call chain. An example implementation skips a percentage of calls based on the percentage of failing calls (e.g. if 50% of calls are failing because the downstream service is overloaded, backing off to roughly 50% of the call volume is directionally appropriate).
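One way to sketch that (parameters assumed, window size picked arbitrarily): keep a sliding window of recent call outcomes and skip an optional downstream call with probability roughly equal to the observed failure rate, so a 50%-failing dependency sees about 50% of the usual volume while it recovers.

    import collections
    import random

    class FailureRateShedder:
        def __init__(self, window=200):
            self.outcomes = collections.deque(maxlen=window)  # True means failure

        def record(self, failed: bool) -> None:
            self.outcomes.append(failed)

        def should_skip(self) -> bool:
            if not self.outcomes:
                return False
            failure_rate = sum(self.outcomes) / len(self.outcomes)
            return random.random() < failure_rate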

Better logging is always appreciated regardless of situation ;)


