I had a tech lead who vehemently agreed with the parent commenter (retry from the top), but I ended up learning different lessons.
* Differentiate retryable and non-retryable errors. If the service can't return success because the DB it queries is borked, it should send a non-retryable error. Then it won't get overwhelmed by retries from upstream.
* Retry configuration should have sane defaults. Even "retry once" is too aggressive; many services aren't overprovisioned for a 100% increase in traffic. What ended up working for us was having the retry module collect req/sec statistics, then only allowing 20% of that number in retries/sec. An individual request can be retried at most twice. That was small enough not to tip over any service, but enough retries to compensate for garden-variety unavailability.
* Services shouldn't serve requests first-come-first-served under load. When SHTF, fairness means everyone suffers long delays, often much longer than the RPC timeout. Instead, serve requests in an unfair order: the most recent arrivals are the most likely to have a caller still waiting on them. Answer those!
* Use headless services in Kubernetes. By exposing the ReplicaSet's pods to the client, the client can load-balance itself more intelligently. Retries should go to a different replica than the one that failed, and you can hedge a request to a different replica than the one that's lagging.
* Define a degraded form of your in-memory object graph. If a feature is optional to the core business flow, it shouldn't take down your whole product. This one is a lot more involved: we needed custom monitoring for degraded responses, in-memory collection and storage of "guesses" to substitute for degraded portions of the object graph, and some other work I can't think of right now. It does let an organization compartmentalize better, with faster, less fearful deploys of newer initiatives.
It seems like knowing the difference between a retryable and non-retryable error is itself difficult (perhaps impossible in practice).
DB is unreachable? Is it the DB or my host? If it’s the DB, maybe it’s not retryable, but if it’s me (and I’m load-balanced), it is probably retryable.
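One pragmatic compromise is to classify by the *scope* of the failure: shared-dependency errors are non-retryable (every replica hits the same broken DB), replica-local errors are retryable (a retry lands somewhere healthier). A rough sketch, with invented exception names:

```python
RETRYABLE = "retryable"
NON_RETRYABLE = "non_retryable"

class DbUnavailable(Exception):
    """The backing database itself is down or overloaded (shared)."""

class LocalResourceError(Exception):
    """This replica is unhealthy: disk full, OOM, bad deploy (local)."""

class RequestInvalid(Exception):
    """The request can never succeed: bad input, missing entity."""

def classify(exc):
    # Shared-dependency failures: retrying elsewhere hits the same
    # broken DB, so tell upstream not to pile on.
    if isinstance(exc, (DbUnavailable, RequestInvalid)):
        return NON_RETRYABLE
    # Replica-local failures: a retry load-balanced to a different
    # replica has a real chance of succeeding.
    if isinstance(exc, LocalResourceError):
        return RETRYABLE
    # Unknown causes: default to non-retryable to fail safe under load.
    return NON_RETRYABLE
```

It doesn't resolve the genuinely ambiguous cases ("is it the DB or me?"), but defaulting the unknowns to non-retryable at least keeps a confused fleet from DDoSing itself.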