
Sorry for hijacking your expertise, but why no mention of memory leaks? In my experience they can cause really weird bugs that aren't obvious at first and are difficult to reproduce, e.g. triggered by edge cases that happen infrequently. Or are you assuming services automatically restart when memory is depleted?



It depends on how well the service was "operationalized":

1) Best case: Monitoring checks for service degradation over a sliding window, in this case more than X percent of responses not being 2xx or 3xx. After a given period of sustained degradation (say, 30 minutes of this), the service is restarted automatically. This lets you auto-heal the service for any degradation originating in the service itself (see the sliding-window sketch after this list). It does not detect upstream degradation, of course, so everything upstream needs its own monitoring and auto-healing. That is harder to figure out, because the right signal may be specific to this one service; the development/product team needs to put more thought into detecting it properly, or use something like chaos engineering to surface the problem and design a solution.

2) If you have a health check that actually queries the service (not just a static /healthcheck endpoint that always returns 200 OK), and a memory leak has caused the service to stop responding (but not die), the failed health check can trigger an automatic restart (see the health-check sketch at the end).

3) The memory leak makes the process run out of memory and die, and the service is automatically restarted.

4) Ghetto engineering: Restart the service every few days or every N requests. This extremely dumb method works very well, until you get so much traffic that the service starts dying well before the scheduled restart, and you notice that it just happens to go down at regular intervals for no apparent reason.

5) The failed health check (if it exists) is not set up to trigger a restart, so when the service stops responding due to a memory leak (but doesn't exit), it just sits there broken.

6) Worst case: Nothing is configured to restart the service at all, so it just sits there broken.
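
For concreteness, here is a minimal sketch of the sliding-window check from point 1, in Python. None of this is from the original comment: the error-rate source is left as a placeholder (in practice you'd query your metrics system), and the window size, threshold, and "myservice" systemctl unit are all illustrative assumptions.

    import collections
    import subprocess
    import time

    # Window and thresholds are illustrative assumptions.
    WINDOW_SECONDS = 30 * 60      # "say, 30 minutes of this"
    CHECK_INTERVAL = 60           # sample once a minute
    ERROR_RATE_THRESHOLD = 0.05   # "more than X percent" of responses not 2xx/3xx

    def current_error_rate():
        # Placeholder: fraction of recent responses that were NOT 2xx/3xx.
        # In practice this would query your metrics system (Prometheus, CloudWatch, ...).
        raise NotImplementedError

    def restart_service():
        # Assumes a systemd-managed unit called "myservice".
        subprocess.run(["systemctl", "restart", "myservice"], check=True)

    samples = collections.deque()  # (timestamp, error_rate) pairs

    while True:
        now = time.time()
        samples.append((now, current_error_rate()))
        while samples and samples[0][0] < now - WINDOW_SECONDS:
            samples.popleft()      # drop samples that slid out of the window
        window_covered = now - samples[0][0] >= WINDOW_SECONDS - CHECK_INTERVAL
        # Restart only when the whole window is degraded, not on a brief spike.
        if window_covered and all(r > ERROR_RATE_THRESHOLD for _, r in samples):
            restart_service()
            samples.clear()
        time.sleep(CHECK_INTERVAL)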

If you follow best practice and put dynamic monitoring, a health check, and automatic restarts in place, the service will self-heal in the face of memory leaks.
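
And a minimal sketch of points 2 and 3 combined: a health check that exercises a real code path and restarts the service after repeated failures. Again, the URL, thresholds, and service name are assumptions for illustration; in a real deployment you would more likely express this as a liveness probe in your orchestrator or a watchdog in your init system rather than a hand-rolled loop.

    import subprocess
    import time
    import urllib.request

    HEALTH_URL = "http://localhost:8080/orders/12345"  # a real query, not a static /healthcheck
    TIMEOUT_SECONDS = 5
    MAX_CONSECUTIVE_FAILURES = 3

    def probe():
        # True only if the service answers a real request with a 2xx/3xx status.
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=TIMEOUT_SECONDS) as resp:
                return 200 <= resp.status < 400
        except Exception:
            # A hung process (e.g. one thrashing after leaking memory) shows up as a timeout.
            return False

    failures = 0
    while True:
        failures = 0 if probe() else failures + 1
        if failures >= MAX_CONSECUTIVE_FAILURES:
            # Assumes a systemd-managed unit called "myservice".
            subprocess.run(["systemctl", "restart", "myservice"], check=True)
            failures = 0
        time.sleep(30)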



