Hi, author of the post here. The thing that really surprised me, when writing this, was how difficult it was to find the edge of the topic: where does process supervision end, and server monitoring begin? If most of these programs include something for handling logs, are they then also logging software? Where's the boundary between watching processes and gathering metrics on them? Good software is canonically supposed to do one thing well, but it seems like nobody can agree what handles what here. I even started writing sections on logging and server monitoring, before sanity prevailed and I chopped them out.
It's maddening that we have no clear separation of responsibilities for this stuff. And so we end up with things like Upstart's half-hearted logging, because it's not clear where Upstart's job stops.
The tangled world of service monitoring, process monitoring, process launching, configuration management etc etc et bloody c is waiting for a ZFS-style collapsing of multiple fighting tools into a single layer.
I noticed this most of all when I was writing puppet manifests that ... create Upstart scripts.
Why do I have one tool that configures and supervises the bits at rest and a completely different tool that configures and supervises the same bits in flight?
I still prefer Bernstein's daemontools. I first used them when I had to deploy qmail, and now I use them to manage Python web applications. The default logging option is a bit weird, but the simplicity is a real winner for me.
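For reference, the whole setup is a pair of tiny shell scripts. A sketch with made-up names ("myapp", "appuser", "loguser" and the binary path are placeholders), roughly what I use for a Python web app:

    #!/bin/sh
    # /service/myapp/run -- supervise runs this; the service stays in the foreground
    exec 2>&1
    exec setuidgid appuser /usr/local/bin/myapp --foreground

    #!/bin/sh
    # /service/myapp/log/run -- multilog timestamps stdin and rotates logs in ./main
    exec setuidgid loguser multilog t ./main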
I don't like Upstart, and I don't agree that it's "admirably simple to use"; it's needlessly complex. For what I do, I don't need or want to care about runlevels, I always want respawning, and I don't like the syntax of the configuration.
systemd seems like a really complex solution to a very simple problem, but I've never used it.
Supervisord, while well documented and popular, again seems to complicate what I want to do. Again, I dislike the configuration; I fail to see why a shell script isn't better.
Runit I've never used, because why would I replace daemontools with what was, when I first read about it, just a daemontools clone made by people who didn't like Bernstein's licensing? I guess it's what I would most likely go for if daemontools were not available.
Runit is very similar in design to Daemontools, but it seems generally more featureful and better maintained. If you like one, chances are you'll like the other. (Also, daemontools' licensing is no longer an issue; it's been public domain since 2007.)
I do take issue with your characterization of Upstart and Systemd as needlessly complex; almost all of that complexity is for their role as init.d replacements, and doesn't need to affect you if you just want to get a few programs running. Their big selling point, for me, is that one of them is usually the default and the config files are quite easy to write if you're not doing anything elaborate.
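For the simple case, the job file really is only a few lines. A sketch with a made-up program ("myapp" and its path are placeholders), here for Upstart:

    # /etc/init/myapp.conf
    description "my app"
    start on runlevel [2345]
    stop on runlevel [016]
    respawn
    exec /usr/local/bin/myapp --foreground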
As a diehard daemontools user this comment made me take another look at runit. Just to be clear, runit is one of several forks of daemontools. Another one is daemontools-encore [1].
There are a couple of nice runit features that stock daemontools lacks.
The ability to run each service in a separate process session with "runsvdir -P" is a big one. If your ./run file consists of a pipeline, bouncing the service with TERM will produce orphans otherwise. (I have some patches to daemontools-encore that do this on a per-service basis.)
The svlogd program from runit also looks nicer than multilog. I've tried to like tai64, I really have, but I'd rather just have human AND machine readable timestamps from the get go.
I like that service dependencies aren't statically declared in runit, but rather something you can block on in your ./run file.
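Roughly like this -- a sketch, with the dependency ("postgresql"), user, and paths all made up:

    #!/bin/sh
    # ./run -- block until the dependency is up, then exec the service
    exec 2>&1
    sv start postgresql || exit 1
    exec chpst -u appuser /usr/local/bin/myapp --foreground

    #!/bin/sh
    # ./log/run -- svlogd with human-readable UTC timestamps, rotating in ./main
    exec svlogd -tt ./main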
Daemontools has never let me down. It plays nicely with the OS in that I don't have to muck around with essential system daemons and init scripts, and it corrects all sorts of bad behavior. I run everything with it: python, java (including long-running servers), ruby, perl... Really a fantastic package.
I think it's a question of which tool you learn first. You learn one, it solves your problems and the rest does it differently, so they're "wrong" or difficult in your eyes.
accidental downvote (arrows are teeny on mobile!).
I agree with liking the simplicity of daemontools. I have used it for a while, and it has served me well. I like that logging is split; I often run the log script as just a logger instance to get stdout to syslog.
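Concretely, that log/run is about one line (the tag "myapp" is made up):

    #!/bin/sh
    # ./log/run -- forward the service's stdout to syslog instead of multilog
    exec logger -t myapp -p daemon.info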
Upstart is.... Not my favorite.
runit is nice. Like daemontools with man pages and a few extra flags/features.
If you do what that guy says and use runit, you'll be happy. I use runit, and I am happy with it.
However, I also think you'll be happy with Upstart or Systemd, even if you find them over-engineered and inelegant. They'll do what they need to with minimal configuration, and you're probably already using one of them behind the scenes. Why not use one of them, if you have it sitting right there, already installed and configured?
But don't do this in your code ;) Do use a service manager to do this for you. Programs which lack a foreground option are harder to test and interact with due to the action-at-a-distance -- one of my chief objections to init.d-style service management compared to what I'll call daemontools-style service management. Nearly all popular daemons have an option to stay in the foreground for this reason. The only exceptional case that comes to mind is nginx, which does have a foreground option, but you'll lose zero-downtime upgrades if you use it (due to some extreme cleverness).
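For the record, the foreground switch is usually a one-liner at the end of a run script. A few examples of the sort of thing I mean (check your daemon's docs; the exact flags vary):

    # each of these would be the final exec line of its own run script
    exec nginx -g 'daemon off;'
    exec redis-server /etc/redis/redis.conf --daemonize no
    exec memcached -u nobody -m 64    # memcached only backgrounds itself if you pass -d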
Like others have already said, monit also deserves a mention. While it can start the process for you, it probably doesn't really fall under the same category: it's more about making sure things are running, don't consume too many resources, etc.
However, the flexibility of monit beyond process monitoring is really great. Anything from filesystem usage, CPU, and memory to monitoring TCP ports and file checksums, you name it. I've yet to find something you can't do with this little toy, and it's rock solid. I even use it to pull graphite stats[1] and report when certain thresholds are reached, like the percentage of 500s compared to other response codes in my nginx logs.
I also personally like its configuration syntax better than most yaml-like DSLs. It feels almost like writing natural text.
+1 to monit. I have an app that does a lot of scraping via Selenium, headless using xvfb and Firefox. If the Selenium scripts fail, Firefox never closes. On an EC2 micro instance, it doesn't take long before 40MB+ FF processes kill other things, like Tomcat (which runs my queue processor). I use monit to watch the root URL, and when it fails, it fires off a handful of scripts that restart the various pieces of the application.
I highly recommend runit + monit. The beauty of runit is that the ./run script is easy to run by hand to test. The beauty of monit is that it can watch things like HTTP URLs and ports, and if something goes wrong, monit executes sv down myservice; sv up myservice. (sv is the actual launcher-thingy of runit.)
If you use monit without runit, you end up in this weird world where monit starts and stops things in a very odd environment and it's flakier, environment variables are missing, etc. Also monit takes a long time to notice things are down, and doesn't start things right away on boot.
Finally, if you don't boot your services with monit, you booted them with something else, meaning that when a restart occurs, you're starting a service in a different environment than you booted in. Weird.
So: let monit check the health of things, and let runit start and stop things.
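The glue is a few lines of monit config. A sketch -- the service name, port, and sv path are all made up, so adjust to taste:

    check host myservice with address 127.0.0.1
        if failed port 8080 protocol http for 3 cycles
            then exec "/usr/bin/sv restart myservice"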
This didn't really seem like a comparison, more of a description. I was hoping for a bit more.
We're currently using supervisord but we've had to hack it to support more than 200 workers, and I'd like something with the ability to change the number of workers dynamically.
Systemd has socket activation, where it opens sockets and passes them to applications.
But the maintainers seem to have sub-zero interest in passing said listening sockets to multiple applications, which I'm sorry to see isn't available. I'm thankful for this article, because it discusses Supervisord's and Circus's willingness to do some pooled program handling.
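For anyone curious, the single-socket case really is just a few lines; a sketch with placeholder names:

    # myapp.socket
    [Socket]
    ListenStream=8080

    [Install]
    WantedBy=sockets.target

    # myapp.service -- the app is handed the listening socket as fd 3 (see sd_listen_fds(3))
    [Service]
    ExecStart=/usr/local/bin/myapp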
Amazed to see so many simple responses in this thread, given that the single-system paradigm is arguably heading out of date (manual Unix systems administration is a dying art). For any serious (i.e. highly available) system, a major option that should be under discussion is Pacemaker/Corosync. http://clusterlabs.org/
It's a pain to learn, easy to screw up, non-trivial to test exhaustively, but provides highly flexible daemon migration and monitoring functionality you aren't likely to find anywhere else, for free, and for any imaginable service. (Actually you can monitor anything with it, including hardware. Check it out.)
These projects seem to do a very poor job of presenting themselves. I spent a good deal of time reading about each and still don't have a really good grasp of what they are for. They seem to be about maintaining a cluster of hardware servers to run arbitrary services, with configuration and failover.
Do you know of some decent article, blog post, video, whatever that presents these products, what they are good for and a typical use case?
I agree wholeheartedly about the failure to communicate clearly. At least one contributing factor to the present situation is the lucrative consulting that exists around these solution types and helps to fund their development. Simply put, if it was that easy, everyone would be doing it (and we'd be out of a job). However, to be fair, things change fast and there does exist a lot of good documentation - just not necessarily perfectly up to date for your scenario. Your assumption is perfectly correct. Have a look at http://www.linbit.com/en/downloads/tech-guides or http://clusterlabs.org/doc/ or try #linux-ha or #linux-cluster on freenode.
Is pacemaker/corosync not more of a replacement for things like keepalived/heartbeatd (often used in conjunction with stonith, drbd), and as a way to run clusters of services?
You still need to launch and run the services themselves with something. (sysV init scripts, etc)
Even a cursory review of the docs seems to imply the same.
> Version 1 of Heartbeat came with its own style of resource
> agents and it is highly likely that many people have
> written their own agents based on its conventions.
> Although deprecated with the release of Heartbeat v2,
> they were supported by Pacemaker up until the release
> of 1.1.8 to enable administrators to continue to use
> these agents.
Usually there is a resource agent script that replaces the init script. If not, you just write one. This is a barrier for less experienced users, but it's not an issue for experienced admins/programmers/devops. The primary community library of such agents lives here: https://github.com/ClusterLabs/resource-agents/
Resource agent scripts can support master/slave style services (including promotion/demotion), in addition to enabling the cluster to self-manage nontrivial overall system state transitions on a multi-host basis. You define the target running state with a declarative syntax that is replicated across all nodes.
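The agents themselves are essentially shell scripts speaking the OCF action interface (start/stop/monitor/meta-data plus well-known exit codes). A stripped-down sketch -- a real agent would also source ocf-shellfuncs and emit proper XML meta-data, and "myapp" is a placeholder:

    #!/bin/sh
    # minimal OCF-style resource agent skeleton
    case "$1" in
        start)     /usr/local/bin/myapp --daemon ;;
        stop)      pkill -f myapp; exit 0 ;;
        monitor)   pgrep -f myapp >/dev/null && exit 0 || exit 7 ;;  # 7 = OCF_NOT_RUNNING
        meta-data) echo '<resource-agent name="myapp"/>' ;;          # normally a full XML description
        *)         exit 3 ;;                                         # 3 = OCF_ERR_UNIMPLEMENTED
    esac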
Heartbeat is an earlier platform that has now been functionally replaced by corosync.
I wouldn't use god. It's had issues crashing or misbehaving in the past. I've found monit very stable if a little awkward to configure correctly.
Today I would use systemd, with monit for additional monitoring. systemd ensures that the same environment is used regardless of how a service is invoked, and the units are extremely simple. Monit can ensure a running server is behaving (memory limits, CPU limits, testing HTTP) and can rely on systemctl for restarting processes.
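Roughly like this -- a sketch where the unit name, port, and thresholds are all invented:

    check process myapp matching "myapp"
        start program = "/bin/systemctl start myapp"
        stop program  = "/bin/systemctl stop myapp"
        if totalmem > 500 MB for 5 cycles then restart
        if cpu > 80% for 5 cycles then restart
        if failed port 8080 protocol http then restart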
One thing I always found annoying is, say I am on a shared host like WebFaction, and I use supervisord to manage my uWSGI (or whatever) instances. What do I do to ensure supervisord itself starts? My current approach is a script run by cron that greps ps for supervisord and, if it's not found, launches it. But that seems rather flaky and unreliable.
Interesting hack, sure, but "poor man" solutions are only interesting when the "rich man" solutions cost too much, and that is not the case here. For someone who can understand this, runit will take less than 10 minutes to learn.
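On a shared host without root, a cron @reboot entry (if the host's cron supports it) is usually enough to bootstrap either one; paths here are made up:

    # crontab -e
    @reboot /usr/bin/runsvdir /home/me/service
    # or, if you stick with supervisord:
    @reboot /usr/bin/supervisord -c /home/me/supervisord.conf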
Tried it, and I can tell you it's not production ready. It would randomly go rogue on our production servers, falsely detect worker crashes, and start launching more workers.
It took us a long time to track it down, and we quickly switched to runit and supervisor on a few servers. All of our problems went away.
I am interested in any form of feedback on the issues you had. Falsely detecting a worker crash sounds very weird and improbable, because Circus uses the system PID list to check on processes -- so I wonder what happened in your case.
Monit is another standard program, but I think the things being compared here are really not similar. Aren't Upstart and systemd replacements for init.d runlevel scripts, while God and Monit make sure programs are running? I suppose there is overlap.
There's no clear delineation here. Upstart and Systemd are init.d replacements, both of which can make sure that programs are up and running, and incorporate some basic process monitoring. Runit is similar, and can replace init.d, but will happily run as just another process. God can't replace init.d, but it can easily handle the daemonize-and-keep-running functionality that's at the core of what we want from all this stuff. In addition, God can do some health checking, for things like memory and CPU consumption. IIRC Monit doesn't actually start other programs under itself, but it can be configured to watch running programs, and kick them if necessary.
If you have a clean and useful taxonomy for this stuff, I'd be interested in hearing it.
Monit will start a program if it's not running according to its pid file. In other words, if the pid file doesn't exist, it will start the program. I use it in deploys by killing a program and letting Monit start it again with the new version.
systemd encompasses process monitoring functionality by design, which is a strength imho, though it is subject to the "all your eggs in one basket" criticism.
1. Keep services running: Runit, even with daemontools differences, is hard to beat.
2. Hard(-ish) resource limits and accounting: LXC w/ cgroups. Almost as good as full paravirtualization (Xen). There are still some issues with limiting resource contention impact between cgroups.
3. Softer resource limits, rogue app restarter: We've heavily modified bluepill because it seemed to lack insight into the needs and challenges of large-scale production ops. Specifically, we've added optional total child process limits (we reported a few issues, fixed some, and even submitted a pull request). It might be useful to add max number of processes, NIC bandwidth, IOPS, and a couple of other checks.
It was designed to be simple (creating a new daemon conf file is trivial), low over-head (it's very init-ish in function so it needs to not take a lot of memory or processor time), stable (can't crash or it loses track of everything that it launched), and secure (lets you launch your daemons as different users--usually at a lower security level).
I've been using it myself on almost every server I administer (I wrote it to scratch my own itch) for a couple of years now, and it's been very stable.
runit user here. For someone without a strong understanding of UNIX, the runit documentation will come across as a little low-level. That being said, the learning curve is steep but short. What's missing is a blog post that explains UNIX from the point of view of runit: starting, restarting, and logging. It's actually a good place to start learning more about UNIX.
Zed Shaw has also started a project of his own to tackle these things. I'm not sure how it handles multiple unrelated things, but it seems like a good idea to document, in understandable Python code, what the requirements are to daemonize a process correctly.
When I saw the title "vs. God", I thought it was going to include something like "Tup vs. the Eye of Mordor", a whimsical performance comparison between the Tup build system and "The Eye of Mordor" in the guise of a simple script that does no dependency checking because it already knows what has changed: http://gittup.org/tup/tup_vs_mordor.html
"[Upstart] currently has no way to adjust the maximum number of file descriptors, or limit memory usage"
That's weird, because that's one of the things I like about Upstart: the ability to set PAM limits and other settings in the same file. Upstart seems to support the same options as runit memory-wise: stack, data, memlock...
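e.g., a few extra stanzas in the same job file (values made up):

    limit nofile 65536 65536
    limit memlock unlimited unlimited
    limit stack 8388608 8388608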
My favorite remains BSD-style (rc.conf + rc.d/) init, which for some reason receives no mention here. Arch Linux used to have it too, but then inexplicably dumped it.
Inexplicably? They didn't want to fight against upstream. You'd have to maintain your own scripts for every daemon you supported if you didn't use systemd. Also other software like udev is getting sucked into systemd as well. Arch is a small distro and doesn't have the development power to customize everything.
Because having long bash scripts, and every process implementing its own daemonisation in weird and wonderful ways, is redundant and inconsistent, and makes turning something into a daemon frustrating.
I have been using supervisor with my django apps for the past 2 years. I have no real complaints, and I actually like how you can easily add new processes with a single conf file and then type supervisorctl update to start the new daemon without having to restart the whole supervisor process over again...
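For anyone who hasn't seen it, the per-process conf file is tiny -- a sketch with made-up names and paths:

    ; /etc/supervisor/conf.d/myapp.conf
    [program:myapp]
    command=/usr/local/bin/myapp --foreground
    autostart=true
    autorestart=true
    stdout_logfile=/var/log/myapp.log

    $ supervisorctl update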
Just FYI, that feature is shared by all the alternatives mentioned in the article. (The exception would be runit, which has one supervisor process per service, thus making the issue moot.)
We're using Bluepill, which has a friendly DSL to define CPU and RAM usage limitations. This can be quite handy if you need to occasionally restart a process with memory leaks or the like.
I believe God and Monit have similar capabilities, although I haven't used them personally.
Angel author here. It works very well for us and some others, but it must be said it is not as featureful as most of these. For example, it provides no assistance for log rotation (you'd need to do this via a shell pipeline with multilog or similar) and does not have a "control interface" (zmq socket etc) of any kind.
I think that logging should be separate here, so I'm actually okay with built-in logging support being minimal to nonexistent. If you use multilog (from daemontools) it will handle logging to rotated files; if you use svlogd (from runit), it can do the same, and also includes syslog support. If you just log to syslog in the first place, and your syslogd is at all modern, it can also handle writing things to log files, here or on a remote machine, which is cool too if you're okay with configuring it.
Or you can do something even more radical, like using fluentd[1], which looks really useful and well designed. I like having this decision decoupled from the process manager, so I would not count that as a strike against Angel.
Yeah, I agree. To this end, I just added support for specifying a logger process, and angel will now pipe stdout/stderr to it à la daemontools. I definitely don't want angel in the business of being a log rotator.
I'm amazed at how far the linux world has fallen in the last decade. It is now "normal" for people to run an extra layer of buggy crap software to monitor their buggy crap software and restart it when it crashes. Rather than the long standing unix solution of not using buggy crap software that crashes constantly in the first place.
All that unix software was so great! Apache 1! Woo! bind! Yeah! sendmail! Nice! That's pretty much it! It never crashed and was so secure! Happy times!