Yep, I've run into the problems with "expect fork"/"expect daemon" in Upstart first hand. Forking is a terrible readiness protocol, and Upstart is terrible at keeping track of a process that doesn't do exactly what it expects. If you choose the wrong one, Upstart doesn't even notice that a process has gone away, and there's no way to reset its knowledge of the world short of spawning programs in a loop until the PID namespace wraps around and you get back to the original PID it's expecting to track, so that Upstart can finally see it die and properly clean up after it. This is seriously the only recommended workaround for a bug that went unfixed for years: https://bugs.launchpad.net/upstart/+bug/406397
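For anyone who hasn't used Upstart: the whole "readiness protocol" is a stanza in the job file telling Upstart how many times your process will fork. A hypothetical job (mydaemon and its path are placeholders) looks something like this:

    # /etc/init/mydaemon.conf -- hypothetical job, names made up
    description "illustration of Upstart's fork-counting readiness protocol"

    # "expect fork" means the process forks exactly once; "expect daemon"
    # means exactly twice. Guess wrong and Upstart ends up tracking a PID
    # that no longer belongs to your service -- the bug described above.
    expect daemon

    exec /usr/sbin/mydaemon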
I'm glad that Debian and Ubuntu have finally decided to go with an actually maintained init system, one that properly tracks processes even after arbitrarily many forks and supports a reasonable readiness protocol.
I have also had a little mismatch with upstart for a particular application - the app itself spawns new processes in response to user input, but upstart can only track the main app, not the children.
I am curious - precisely knowing when an app is running and when it's not seems inherently dependent on the app and your definition of "running" for that app. Tracking apps from the system level in terms of processes, etc. can only go so far. How do the init systems you mentioned tackle this?
> precisely knowing when an app is running and when its not seems inherently dependent on the app [...] How do the init systems you mentioned tackle this?
In systemd's case, it provides an API so that the app can explicitly say "I am now ready to receive requests", among other things.
(I've also found the related watchdog functionality very useful -- once you've told the init system that your daemon is ready to serve requests, you can also ping it every time a request is completed; if you then go for, say, 10 seconds without serving a request, init assumes your service has hung and kills/restarts it. I've seen a small number of daemons do this for themselves, but that number is nowhere near 100%, and even those that do implement it don't do it as well as systemd does.)
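To make that concrete, here's a minimal sketch of the notify protocol in C. It assumes libsystemd, a unit with Type=notify and WatchdogSec=10s, and a placeholder request loop:

    /* build: cc notify-demo.c $(pkg-config --cflags --libs libsystemd) */
    #include <stdint.h>
    #include <systemd/sd-daemon.h>

    int main(void)
    {
        uint64_t usec = 0;
        /* returns >0 iff the unit sets WatchdogSec= */
        int watchdog = sd_watchdog_enabled(0, &usec);

        /* ... bind sockets, load config, etc. (placeholder) ... */

        sd_notify(0, "READY=1");  /* "I am now ready to receive requests" */

        for (;;) {
            /* ... serve one request (placeholder) ... */
            if (watchdog > 0)
                sd_notify(0, "WATCHDOG=1");  /* pet the watchdog; stop pinging
                                                and systemd kills/restarts us */
        }
    }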
The problem is actually solved, see http://welz.org.za/notes/on-starting-daemons.html - it is just that people these days write flaky daemons which crash, and want the init system to blindly restart services. Rather leave that to the monitoring tool which checks the service actually offered, so that we also notice if it is stuck in an infinite loop. Or even better, make crashes something to be eliminated, not papered over.
I'm wondering what's wrong with the way OpenRC's start-stop-daemon works. It either:
* forks a child to run a foreground task
or
* starts a self-daemonizing service
then waits for a user-configured period of time to make sure that the child or daemon is still around before declaring success.
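That seems reasonable to me. A sketch of a service script using it (paths and names are invented; see openrc-run(8) and OpenRC's start-stop-daemon(8) for the details):

    #!/sbin/openrc-run
    # Hypothetical service script; /usr/sbin/mydaemon is a placeholder.
    command="/usr/sbin/mydaemon"
    command_args="--config /etc/mydaemon.conf"
    # For a foreground program, have start-stop-daemon background it
    # (for a self-daemonizing one, drop command_background and just
    # point pidfile at the file the daemon writes):
    command_background="yes"
    pidfile="/run/mydaemon.pid"
    # --wait: confirm the process is still alive this many milliseconds
    # after spawning it, before declaring success
    start_stop_daemon_args="--wait 1000"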
OpenRC also provides programs to mark a service as started-but-inactive, and -I think [0]- was-inactive-but-is-now-fully-started [1]. This lets you avoid blocking system start with a daemon that might take a while to initialize, or handle a daemon that might flap in and out of availability, or handle "Is it actually started?" logic for daemons for which the answer to that question is pretty complex.
Daemons that require a given service to be started will wait to start until that service is no longer in the "inactive" state.
Edit: OpenRC can be configured to create a cgroup for a given service, and -presumably- kill all tasks in that cgroup on exit, mooting the problem of incorrectly marking a service as "failed".
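If memory serves, that's a pair of rc.conf knobs; I'm writing these names from memory, so double-check rc.conf(5) on your system before trusting them:

    # /etc/rc.conf -- option names from memory, verify against rc.conf(5)
    rc_controller_cgroups="YES"   # give each service its own cgroup
    rc_cgroup_cleanup="yes"       # on stop, kill anything still in the cgroup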
[0] The documentation for this state is at odds with the state name. Regardless, the state is just a label for a particular service, so it seems reasonable to believe the state name rather than the manpage.
[1] Among several other states. See the BUILTINS section of openrc-run(8) for more info.
In the Apache Tomcat Systemd House of Horror story, we find this:
> And let us mention another horror in catalina.sh: the logging. The dæmon process has its standard output and standard error redirected to a log file. This is a known-bad way to go about logging. One cannot size-cap the log file; one cannot rotate the log file; one cannot shrink the log file. It grows, as one file, forever until the dæmon is shut down.
I was under the distinct impression that logrotate can handle cat-std*-to-file style logfiles: to rotate, copy the logfile, then truncate the original.
Indeed, from logrotate(8):
       copytruncate
              Truncate the original log file to zero size in place after creating a copy, instead of moving the old log file and optionally creating a new one. It can be used when some program cannot be told to close its logfile and thus might continue writing (appending) to the previous log file forever. Note that there is a very small time slice between copying the file and truncating it, so some logging data might be lost. When this option is used, the create option will have no effect, as the old log file stays in place.
Why has the author never read the logrotate man page? :(
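A hypothetical stanza for exactly the catalina.out case (the path and limits are invented for illustration):

    # /etc/logrotate.d/tomcat -- illustrative only; path and sizes made up
    /var/log/tomcat/catalina.out {
        size 100M         # size-cap: rotate once the file exceeds 100M
        rotate 7          # keep seven old copies
        compress
        missingok
        copytruncate      # copy, then truncate in place, since Tomcat
                          # can't be told to reopen its log file
    }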
Doing that is inherently racy (as acknowledged in your man page quote), and you will lose some log data if the file is being written to quickly enough while the rotation happens. If you need all your log data, you can't rotate or shrink it with this style of logging, hence 'known-bad'.
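For the avoidance of doubt, here's a toy sketch of where the window is -- not logrotate's actual code, and "app.log" is a placeholder:

    /* copy-then-truncate, reduced to its essentials */
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        ssize_t n;
        int src = open("app.log", O_RDONLY);
        int dst = open("app.log.1", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        while ((n = read(src, buf, sizeof buf)) > 0)
            write(dst, buf, n);       /* copy everything that exists right now */

        /* <-- the daemon may append more lines at this exact moment... */

        truncate("app.log", 0);       /* ...and those lines are silently discarded */
        return 0;
    }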
It is inherently racy. No one is disputing that. But, remember that the author said:
> One cannot size-cap the log file; one cannot rotate the log file; one cannot shrink the log file. It grows, as one file, forever until the dæmon is shut down. [0]
Anyone with even the most basic understanding of how star-nix [1] systems handle files understands this to be an untrue statement.
Take particular care to note that the author's statement was unqualified. If he had mentioned anywhere that you cannot do those things without some risk of data loss, there would be no controversy. Instead, he said -flat out- that rotating and size-capping a cat-std-whatever-to-log-file style log file is impossible.