
That's a slippery slope argument: because we can't solve the halting problem or verify program correctness, we shouldn't try to handle crashes, either?

Agreed on minimizing complexity. The only part of the chain I described that's very complex is svc.startd, and that's largely to support rich configuration.

Also don't mistake my position for saying that quality isn't important. Rather, just that perfection is not a reasonable constraint.




> Also don't mistake my position for saying that quality isn't important. Rather, just that perfection is not a reasonable constraint.

At some point, for some component or set of components, perfection is your only choice, regardless of the rest of your design. At least when you consider a single node with a single point of failure; this is less true for a distributed system where you have redundancy.

At some point, you have to assume that either init is perfect, or that the code in the kernel to detect init failures is perfect, or that the watchdog monitoring the kernel is perfect, or that whatever other layer you choose to put in place is perfect.

In a system with a finite number of components, there is always going to be a point at which you just say "this bit is going to have to be correct, and there's no other way around it".


I think you've misunderstood my point. The design I described does not require any component to be perfect. If any of these components (including the kernel) crashes, the system _still_ converges to the correct state.


You missed my point. At the moment, you're assuming that the kernel's behavior will reliably, buglessly fall into one of two outcomes: either a lossless full-system reset, or detecting that init has failed and restarting it. You're ignoring the possibility of deadlocks, failures to detect that init has crashed, bugs in the special-casing of init to restart it, etc. You are assuming that there are components that do certain things perfectly reliably.

You haven't gotten rid of a correctness assumption, you've just shuffled it around a bit.


At least on Linux, PID1's death is an instant kernel panic. As such, it's wise to keep any service management logic out of it.

If the svscan process dies, your system is still chugging along and you can intervene to restore the supervision tree (otherwise, svscan scans its supervise processes at a regular 5s interval). If you have some really critical process, you could integrate a checkpointer into the run script chain so that you can just pick up from the last image of the process state with minimal interruption.
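
To make that concrete, here's a minimal sketch in C of the supervise-style loop, assuming a placeholder run script at /service/foo/run; this is not the actual daemontools source, just the shape of the idea: the supervisor's whole job is to restart its child when it exits, and none of it needs to live in PID 1.

    /* Illustrative sketch only, not svscan/supervise source: fork the
     * service, wait for it to exit, restart it. "/service/foo/run" and
     * the 5-second pause are placeholders. */
    #include <errno.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        for (;;) {
            pid_t pid = fork();
            if (pid == 0) {
                /* Child: become the supervised service. */
                execl("/service/foo/run", "run", (char *)NULL);
                _exit(111);                      /* exec failed */
            }
            if (pid > 0) {
                int status;
                while (waitpid(pid, &status, 0) < 0 && errno == EINTR)
                    ;                            /* retry if interrupted */
                fprintf(stderr, "service exited, restarting\n");
            }
            sleep(5);                            /* crude restart throttle */
        }
    }

The point of keeping it this dumb is that the part you have to trust stays small; anything interesting goes in the run script, which is exactly where a checkpoint-restore step would slot in.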


> we can't solve the halting problem or verify program correctness

We can't solve the halting problem or verify program correctness for all programs. For a large subset of programs, we can do both. This is a very important distinction.

We even have verified C compilers (CompCert). Writing a verified service manager should be easier in comparison.
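
As a toy illustration of that subset (nothing to do with any particular service manager): termination of the loop below is mechanically checkable because its counter strictly decreases, even though no tool can decide halting for arbitrary programs.

    /* Illustrative only: a program whose halting is easy to establish.
     * The loop counter strictly decreases toward zero, so the function
     * terminates for every input, and the result (the sum 1..n, modulo
     * unsigned wraparound) can be checked by induction. */
    #include <assert.h>

    static unsigned sum_to(unsigned n) {
        unsigned total = 0;
        for (unsigned i = n; i > 0; i--)   /* i strictly decreases */
            total += i;
        return total;
    }

    int main(void) {
        assert(sum_to(4) == 10);           /* 4 + 3 + 2 + 1 */
        return 0;
    }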



