There are many reasons a process can die that are outside of its control, including signals from outside the process, handled (but uncorrectable) memory errors, and the OOM killer (on Linux). Besides that, it seems like a major design shortcoming if fatal errors in any particular program (however critical and however simple that program may be) can be unrecoverable for the whole system.
It's definitely possible to solve this problem rigorously and completely, though I don't know of a way to do it without support from the kernel. On illumos systems, the service restarter ("svc.startd") provides a complex restart policy for user-defined services. I believe the restarter itself is restarted blindly by init, and init is restarted blindly by the kernel. If the kernel dies, the whole system is rebooted. In this way, if any software component in the chain of restarters fails, the system still converges to the correct state.
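The "blindly restarted" part doesn't need any cleverness at all. A minimal sketch of that pattern in C (not illumos's actual code; the restarter path is just illustrative):

```c
/* Minimal sketch of the "blind restart" idea: a parent (think init) that
 * re-execs its child restarter whenever it exits, with no policy beyond a
 * short delay.  Not illumos code; the path below is just illustrative. */
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define RESTARTER "/lib/svc/bin/svc.startd"   /* illustrative path */

int main(void)
{
    for (;;) {
        pid_t pid = fork();
        if (pid == 0) {
            execl(RESTARTER, RESTARTER, (char *)NULL);
            _exit(127);                       /* exec failed */
        }
        if (pid > 0) {
            int status;
            while (waitpid(pid, &status, 0) == -1 && errno == EINTR)
                ;                             /* interrupted; keep waiting */
        }
        sleep(1);                             /* child died (or fork failed): restart */
    }
}
```

Each layer only needs to do this much; the interesting policy all lives in the restarter it keeps alive.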
If you're having hardware memory issues, your system is already in an undefined and unstable state.
If you send it a SIGTERM, it runs your shutdown scripts and reboots. If you send it a SIGKILL, your kernel will panic. As far as I remember, this isn't any different from init.
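For the curious, the SIGTERM side of that is only a handful of lines in a minimal PID 1. A rough sketch (the shutdown script path is a placeholder, and a real init obviously does more in its main loop):

```c
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/reboot.h>
#include <unistd.h>

static volatile sig_atomic_t got_term;

static void on_term(int sig)
{
    (void)sig;
    got_term = 1;                    /* handler only sets a flag (async-signal-safe) */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_term;
    sigaction(SIGTERM, &sa, NULL);

    while (!got_term)
        pause();                     /* a real init supervises its stages here */

    system("/etc/rc.shutdown");      /* placeholder for "your shutdown scripts" */
    sync();                          /* flush filesystems */
    reboot(RB_AUTOBOOT);             /* then reset the machine */
    return 0;
}
```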
The OOM killer will _never_ take PID 1.
In runit, all PID 1 has to do is run the service scanner, and all the scanner has to do is open directories and fork/exec the individual service managers. If it fails, it will try again in 1 second, forever. No complex logic needed. Just keep trying. In practice, it works surprisingly well.
If an individual service manager fails to run the startup script, it will try again in 1 second, forever. That works very well for most situations, but you _can_ customize this behavior easily. This simplicity is really helpful in an actual emergency because you don't get emergent behavior, like init deciding that your service is flapping and holding it down for 5 minutes.
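To give a feel for how little logic that is, here's a rough sketch of the scanner loop in C, in the spirit of runit's runsvdir but not its actual source; the /service path and the ensure_supervisor() helper are placeholders. The per-service supervisor has the same shape, except it fork/execs the service's ./run script instead of scanning a directory:

```c
/* Rough sketch of the scanner loop described above, in the spirit of
 * runit's runsvdir (not its actual source).  The /service path and the
 * ensure_supervisor() helper are placeholders. */
#include <dirent.h>
#include <unistd.h>

static void ensure_supervisor(const char *name)
{
    /* placeholder: fork/exec a "runsv <name>"-style supervisor if one
     * isn't already running for this service directory */
    (void)name;
}

int main(void)
{
    for (;;) {
        DIR *d = opendir("/service");          /* service directory; location varies */
        if (d != NULL) {
            struct dirent *e;
            while ((e = readdir(d)) != NULL)
                if (e->d_name[0] != '.')       /* skip ".", "..", hidden entries */
                    ensure_supervisor(e->d_name);
            closedir(d);
        }
        sleep(1);                              /* success or failure: rescan in 1 second, forever */
    }
}
```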
Anecdotally, I've been using runit exclusively on all my systems (around 25 physical systems and 20-100 virtual ones depending) for at least 8 years now and I've never had a single issue.
The biggest problem I have with the design is that it puts your log services in a second level "behind" your main services, so you can sometimes miss that your log service failed to start up for some reason. This can be a real pain if your service uses blocking IO for its stdout/stderr logging, as it can cause the service to hang seemingly without explanation.
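If you want to see that hang for yourself, here's a tiny standalone demo (not runit code): nobody ever reads the pipe, so once the kernel's pipe buffer fills up, write() blocks and the "service" freezes mid-log-line with no error and no signal.

```c
/* Standalone demo of the hang: the read end of the pipe stays open but is
 * never read -- like a log service that failed to start while the
 * supervisor still holds the pipe -- so the writer eventually blocks. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) == -1)
        return 1;

    const char line[] = "just another log line\n";
    for (long n = 0; ; n++) {
        if (n % 1000 == 0)
            fprintf(stderr, "still alive after %ld lines\n", n);
        if (write(fds[1], line, strlen(line)) == -1)
            break;                  /* never reached; the write simply blocks forever */
    }
    return 0;
}
```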
This is one of the reasons it's common to run a watchdog in HA critical systems. If something fails in supervision or at a low level and nothing is responding, nobody is around to tickle the watchdog and the entire system reboots.
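On Linux that usually means the /dev/watchdog interface. Roughly (timeout and device path depend on the driver/hardware; softdog works for testing):

```c
/* Sketch of the watchdog pattern: opening /dev/watchdog arms it, each
 * write "tickles" it, and if the writes ever stop -- because supervision
 * is wedged, userspace is livelocked, whatever -- the timer expires and
 * the machine resets. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);   /* arming starts the countdown */
    if (fd == -1)
        return 1;

    for (;;) {
        if (write(fd, "\0", 1) != 1)            /* tickle the watchdog */
            break;                              /* if we stop, the system reboots */
        sleep(10);                              /* must be well under the watchdog timeout */
    }
    return 0;
}
```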
What if init doesn't exit, and just hangs? What if it just goes crazy and starts erroneously restarting your processes? There are more failure modes than simply crashing.
At some point, you just have to assume that some critical components are working correctly. Adding complexity just makes it harder to reason about it, or, depending on how paranoid you are, prove it.
That's a slippery slope argument: because we can't solve the halting problem or verify program correctness, we shouldn't try to handle crashes, either?
Agreed on minimizing complexity. The only part of the chain I described that's very complex is svc.startd, and that's largely to support rich configuration.
Also don't mistake my position for saying that quality isn't important. Rather, just that perfection is not a reasonable constraint.
> Also don't mistake my position for saying that quality isn't important. Rather, just that perfection is not a reasonable constraint.
At some point, for some component or set of components, perfection is your only choice, regardless of the rest of your design. At least when you consider a single node with a single point of failure; this is less true for a distributed system where you have redundancy.
At some point, you have to assume that either init is perfect, or that the code in the kernel to detect init failures is perfect, or that the watchdog monitoring the kernel is perfect, or whatever other layering you choose to put in place is perfect.
In a system with a finite number of components, there is always going to be a point at which you just say "this bit is going to have to be correct, and there's no other way around it".
I think you've misunderstood my point. The design I described does not require any component to be perfect. If any of these components (including the kernel) crashes, the system _still_ converges to the correct state.
You missed my point. At the moment, you're assuming that the kernel's behavior will reliably, buglessly fall into one of two outcomes: either a lossless full-system reset, or detecting that init has failed and restarting it. You're ignoring the possibility of deadlocks, failures to detect that init has crashed, bugs in the special-casing of init to restart it, etc. You are assuming that there are components that do certain things perfectly reliably.
You haven't gotten rid of a correctness assumption, you've just shuffled it around a bit.
At least on Linux, PID1's death is an instant kernel panic. As such, it's wise to keep any service management logic out of it.
If the svscan process dies, then your system is still chugging along and you can intervene to restore the supervision tree (otherwise svscan inspects supervise processes at a regular 5s interval). If you have some really critical process, then you could integrate a checkpointer into the run script chain so that you can just pick up from the last image of the process state with minimal interruption.
> we can't solve the halting problem or verify program correctness
We can't solve the halting problem or verify program correctness for all programs. For a large subset of programs, you can do both. This is a very important distinction.
We even have verified C compilers (CompCert). Writing a verified service manager should be easier in comparison.