Runit is amazing. I've used it on several large-scale websites with great success. Runit follows the unix philosophy of being stupid simple and doing one thing incredibly well. If you're starting up a new project, consider using Runit.
Not only is it simple and does the do-the-one-thing-well thing well, which is true, it's also _correct_. It's hard to overstate how valuable runit is in production because of that. It does the right thing with regard to clearing the environment, detaching from controlling terminal, logging, and many other subtle aspects of operating a service. I never worry about runit.
Back when I changed from supervisord to runit my life was substantially improved. That correctness means way fewer emergency maintenance ops issues in production.
First of all, and mostly, we could never exactly figure out what was going wrong.
This was a few years ago and I don't remember all the details. We just had occasional issues with stopping/starting and especially restarting processes when pushing a new version out or when a process crashed. I do remember it could occasionally report a successful restart and still leave the old process(es) running.
All my developers became very familiar with supervisord, and it was number 2 on the troubleshooting list (1. Did we introduce a bug in a recent commit? 2. Did supervisord do something weird again?).
After we switched to runit, only devs that touched ops knew about it at all. And we all forgot it was there. That's what you want in an ops tool.
There are many reasons a process can die that are outside of its control, including signals from outside the process, handled (but uncorrectable) memory errors, and the OOM killer (on Linux). Besides that, it seems like a major design shortcoming if fatal errors in any particular program (however critical and however simple that program may be) can be unrecoverable for the whole system.
It's definitely possible to solve this problem rigorously and completely, though I don't know of a way to do it without support from the kernel. On illumos systems, the service restarter ("svc.startd") provides a complex restart policy for user-defined services. I believe the restarter itself is restarted blindly by init, and init is restarted blindly by the kernel. If the kernel dies, the whole system is rebooted. In this way, if any software component in the chain of restarters fails, the system still converges to the correct state.
If you're having hardware memory issues, your system is already in an undefined and unstable state.
If you send it a SIGTERM, it runs your shutdown scripts and reboots. If you send it a SIGKILL, your kernel will panic. As far as I remember, this isn't any different from init.
The OOM killer will _never_ take PID 1.
In runit, all PID 1 has to do is run the service scanner. All the scanner has to do is open directories and fork/exec the individual service managers. If it fails, it will try again in 1 second, forever. No complex logic needed. Just keep trying. In practice, it works surprisingly well.
If an individual service manager fails to run the startup script, it will try again in 1 second, forever. That works very well for most situations, but you _can_ customize this behavior easily. This simplicity is really helpful in an actual emergency because you don't get emergent behavior, like init deciding that your service is flapping and holding it down for 5 minutes.
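One way to customize it is the optional `finish' script that sits next to `run': runsv runs it after `run' exits and waits for it to return before starting `run' again, so the crudest slow-down is just a sleep. A minimal sketch (the 5 seconds is an arbitrary value, not anything runit mandates):

    #!/bin/sh
    # ./finish -- runsv runs this after ./run exits and waits for it to return
    # before starting ./run again, so sleeping here slows the restart loop down.
    exec sleep 5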
Anecdotally, I've been using runit exclusively on all my systems (around 25 physical systems and 20-100 virtual ones depending) for at least 8 years now and I've never had a single issue.
The biggest problem I have with the design is that it puts your log services in a second level "behind" your main services, so you can sometimes miss that your log service failed to start up for some reason. This can be a real pain if your service uses blocking IO for its stdout/stderr logging, as it can cause the service to hang seemingly without explanation.
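For anyone who hasn't seen it: the log service is just another run script one level down, in ./log/, and runsv connects the main service's stdout to its stdin with a pipe. A minimal sketch (the log directory is an example and has to exist already):

    #!/bin/sh
    # ./log/run -- runsv feeds the main service's stdout to this process on stdin.
    # svlogd writes and rotates it under the given directory; -tt adds timestamps.
    exec svlogd -tt /var/log/myservice

If that process never comes up (or stops reading), the pipe eventually fills and the main service's writes to stdout block, which is exactly the mysterious hang described above.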
This is one of the reasons it's common to run a watchdog in HA critical systems. If something fails in supervision or at a low level and nothing is responding, nobody is around to tickle the watchdog and the entire system reboots.
What if init doesn't exit, and just hangs? What if it just goes crazy and starts erroneously restarting your processes? There are more failure modes than simply crashing.
At some point, you just have to assume that some critical components are working correctly. Adding complexity just makes it harder to reason about it, or, depending on how paranoid you are, prove it.
That's a slippery slope argument: because we can't solve the halting problem or verify program correctness, we shouldn't try to handle crashes, either?
Agreed on minimizing complexity. The only part of the chain I described that's very complex is svc.startd, and that's largely to support rich configuration.
Also don't mistake my position for saying that quality isn't important. Rather, just that perfection is not a reasonable constraint.
> Also don't mistake my position for saying that quality isn't important. Rather, just that perfection is not a reasonable constraint.
At some point, for some component or set of components, perfection is your only choice, regardless of the rest of your design. At least when you consider a single node with a single point of failure; this is less true for a distributed system where you have redundancy.
At some point, you have to assume that either init is perfect, or that the code in the kernel to detect init failures is perfect, or that the watchdog monitoring the kernel is perfect, or whatever other layering you choose to put in place is perfect.
In a system with a finite number of components, there is always going to be a point at which you just say "this bit is going to have to be correct, and there's no other way around it".
I think you've misunderstood my point. The design I described does not require any component to be perfect. If any of these components (including the kernel) crashes, the system _still_ converges to the correct state.
You missed my point. At the moment, you're assuming that the kernel behavior will reliably, buglessly fall into one of two outcomes: either a lossless full-system reset, or detecting that init has failed and restarting it. You're ignoring the possibility of deadlocks, failures in detecting that init has crashed, bugs in the special-casing of init to restart it, etc. You are assuming that there are components that do certain things perfectly reliably.
You haven't gotten rid of a correctness assumption, you've just shuffled it around a bit.
At least on Linux, PID 1's death is an instant kernel panic. As such, it's wise to keep any service management logic out of it.
If the svscan process dies, then your system is still chugging along and you can intervene to restore the supervision tree (otherwise svscan inspects supervise processes at a regular 5s interval). If you have some really critical process, then you could integrate a checkpointer into the run script chain so that you can just pick up from the last image of the process state with minimal interruption.
> we can't solve the halting problem or verify program correctness
We can't solve the halting problem or verify program correctness for all programs. For a large subset of programs, you can do both. This is a very important distinction.
We even have verified C compilers (CompCert). Writing a verified service manager should be easier in comparison.
I use GNU dmd instead. Simple and very extensible with Scheme. I use it as PID 1 on 2 of my machines, but I use another instance of it as a user service manager on all of my machines. I've also been meaning to replace runit with dmd in Phusion's passenger-docker image to get something more hackable.
There's nothing wrong with dynamic allocation in a kernel. It is for example better than having fixed size process tables and all the crap that comes with that.
"holy crap I've got to recompile my kernel to get more processes" is so 1995...
I've used s6 a lot, though not as an init replacement in any systems that matter. As a supervisor I think it's great. For basic stuff (root process supervising supervisors supervising daemons) the two are basically identical with some superficial differences. For wider-ranging stuff s6 is really good since it has a bunch of ancillary programs that solve a lot of serious issues in full-system supervision (readiness notification vs polling in a run script, ucspi socket handling, etc).
I missed actually answering the question about the difference between the two (plus, that comment is in dire need of editing and the edit window appears to have expired).
The superficial differences are things like: service-to-logger pipe holding happens at different levels of the supervision tree (the root `s6-svscan' holds them in s6, the per-service `runsv' holds them in runit), `s6-svc -CMD' behavior cannot currently be overridden whereas `sv CMD' can be, and `s6-svscan' will immediately re-scan its directory on SIGALRM whereas `runsvdir' only polls for changes on a 5-second timer.
For basic supervision tasks both are great, with runit being the simpler of the two in terms of understanding what it gives you out of the box. For larger tasks (full-system supervision, inter-service ordering dependencies, etc) s6 has the tools to make that easy, whereas with runit you're going to find yourself playing stupid tricks in run scripts to get similar behavior.
A genuine question, as it's not quite clear to me: can someone explain like I'm 5, what can I use runit for exactly? Is it a replacement for init.d or something more like monit? I always use monit to keep my Rails-related processes running and I've read somewhere that using monit combined with runit may be a better solution, but isn't it a bit redundant to have the two at the same time?
Essentially it's an alternative method for starting long running services (and restarting them if they fail).
The things all daemontools-inspired process supervisors do:
Runs a scanner against a service directory containing a directory for each program you want supervised (classically these are symlinks to another part of the system).
The scanner spawns a supervisor program for every directory it finds which then looks for a program called `run' inside that directory and runs it as a foreground child.
If the supervisor's child stops, it runs an optional script in its supervision directory called `finish', and then runs `run' again.
If one of the scanner's child supervisors stops, the scanner spawns it again.
In runit's case, the scanner is called `runsvdir', the supervisors are called `runsv', and it also comes with a program called `runit' that can act as a replacement for your PID1 init whose sole job outside of boot and shutdown time is to resurrect runsvdir if it exits. Note that runsvdir is perfectly happy to be run via an entry in your /etc/inittab if you're running under sysvinit.
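For example, an inittab entry along these lines does the job (the runsvdir path and the service directory location vary between distributions, so treat the exact paths here as an assumption):

    SV:123456:respawn:/usr/bin/runsvdir -P /etc/service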
At their most simplistic, that's all process supervision is - a bomb-proof way of keeping services running. All supervisors in the daemontools family come with a control program to interact with the process supervisor, as well as a stdin-based logger that doesn't rely on syslog. The main benefits it brings over the classic init.d/ model are simplicity (the most complex run script I have is 14 lines of dead simple shell, most are 4-5) and automatic restartability which init.d/ daemonization doesn't have.
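To make that concrete, here's roughly what one of those short run scripts looks like; `myapp' and its options are made up, and chpst is runit's helper for dropping privileges and tweaking the environment:

    #!/bin/sh
    # Merge stderr into stdout so the log service captures both streams.
    exec 2>&1
    # Replace the shell with the daemon itself, running in the foreground.
    exec chpst -u myapp /usr/local/bin/myapp --foreground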
From what I understand from monit's manpage, it's a full-bore rule-based system monitor. Generally speaking, process supervision is tackling one problem (keep daemons running), whereas monit is tackling another (system state monitoring). Yes, monit can act as a process supervisor, but it does so by polling the system state and hooking into the existing daemonization infrastructure.
Now that I think of it, the chef cookbook for Runit is what made it so easy to deploy to production. The 'runit_service' resource was absolutely invaluable. Forget the complicated upstart stanzas or dealing with supervisord, just write a shell script to run your program in the proper environment and bam! You've got a service!
since we're plugging init cookbooks, i'll plug a systemd cookbook[0] we're working on. would love some feedback from chef users using systemd-based systems!
This was how I got introduced to runit, and I couldn't agree more. It's really amazing at what it does, and quickly has become my favorite way of keeping processes running. It's just so easy.
Nice to see daemontools-inspired work on HN in the past few days. I've been using runit for years, and as other comments have said, it's just not something I ever have to worry about. When I look around at other solutions it seems like I'd be taking a huge step back (making things more complex with arguably less usefulness).
I've been meaning to do some blog posts about runit, as it seems like it sits on the back-burner in general. Does anyone know how actively it's maintained, or has it just reached such stability that it doesn't need much maintenance? It would be neat to see it on github with some real docs and such.
runit is what the Phusion Docker base image http://phusion.github.io/baseimage-docker/ is using, and it's the perfect tool to start and supervise containerized apps.
I also love the fact that it can wait for a service to run (in order to wait for dependencies), and that stuck services can be restarted in a more radical way than a single TERM signal.
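A sketch of the dependency trick in a run script (`postgres' and `myapp' are made-up names here); since runsv re-runs a failed run script after about a second, bailing out until the dependency is up amounts to polling for it:

    #!/bin/sh
    # Bring the dependency up first; if it isn't running yet, exit non-zero and
    # let runsv re-run this script in about a second (i.e. poll until ready).
    sv start postgres || exit 1
    exec /usr/local/bin/myapp --foreground

For the stuck-service case, something like `sv -w 60 force-restart myapp' sends TERM (plus CONT), waits up to the given number of seconds, and kills the process if it still hasn't gone down before starting it again.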
I'm using this too (after suffering all the other supervisors), and I have no clue why runit isn't the standard tool in Linux. Can any Linux expert explain that?
Linux is a kernel. It has no standards for userspace, though some have been attempted for GNU/Linux in particular and there are de facto conventions, but that's it.
Given the other glowing comments, maybe this would be a place to ask: should I bother with Upstart, or Systemd? I see Shuttleworth announced the move to systemd, but it's not available on Ubuntu 14.04 servers right now. I'm writing provisioning for our production fleet, what should I use?
The vagaries of the OS wars make something like Runit tempting.
Depends heavily on what you need to provision, and whether you can select your provisioned environment based on what you want to support. Upstart is, at this point, effectively dead for any future distribution. However, if you want to support existing Ubuntu LTS distributions for the remainder of their lifetime, you'll need to handle it. Similarly, if you want to support the current rounds of enterprise distributions, you still need to support sysvinit. And if you have any non-Linux systems, you'll need to handle whatever they use as well.
On the other hand, if you can ensure that your production systems all run relatively recent Linux distributions, you can safely assume systemd. Which specific distributions you have in production will determine the oldest version of systemd you have to support; since new features get added regularly, you'll want to know the oldest version you can assume.
What are you trying to provision, and what requirements do you have?
I appreciate the detailed reply, and the others as well. To address your questions and some of the others:
We can generally pin our entire fleet against whatever we want -- until we can't. One particular (very large, very important) dependency currently requires CentOS 6.5. Go figure.
As much as possible, I want to treat infrastructure as a commodity, so catering to specific versions (even LTS versions) breaks that goal. As another reply notes, if we lock in at Ubuntu 12.04 and it's predictable, that's excellent.
As another mentions, Upstart was easy to get a handle on quickly. As an aside, it bugs me that the (otherwise excellent and detailed) Upstart docs don't seem to mention that all of Upstart is now deprecated. Hence the original confusion! In general I don't want flexibility (not first and foremost), I want simple and consistent.
> One particular (very large, very important) dependency currently requires CentOS 6.5.
What dependency is that? And can you run it in a single-purpose virtual machine, rather than directly on real hardware, to make it easier to manage?
> As much as possible, I want to treat infrastructure as a commodity, so catering to specific versions (even LTS versions) breaks that goal. As another reply notes, if we lock in at Ubuntu 12.04 and it's predictable, that's excellent.
You definitely don't want to cater to specific versions any more than you have to; ideally you want as few versions across your entire fleet as you can. If you could get it down to just CentOS and the latest version of some up-to-date distribution, that would help; if you could make CentOS a virtual machine under that same up-to-date distribution, that's even better, insofar as you can then manage all the physical machines identically.
> As another mentions, Upstart was easy to get a handle on quickly. As an aside, it bugs me that the (otherwise excellent and detailed) Upstart docs don't seem to mention that all of Upstart is now deprecated. Hence the original confusion!
The Upstart docs aren't going to say that Upstart is deprecated. It's more a view of the Linux distribution ecosystem as a whole: the one distribution that primarily drove upstart usage and adoption is switching to systemd, so there won't be further momentum or resources behind upstart in the future. That doesn't make it instantly obsolete, but it does mean that starting a new project today around upstart is a bad idea.
(In fairness, there's one other notable distribution using upstart as well, namely Chrome OS; however, that's a bit of a special case in several ways, and in any case I hope to change that in the future.)
I'm wary of naming the dependency, because it's a vendor's solution and I'm not trying to call them out, but it's their ecosystem of tools and it has to run as the host OS, not virtualized. But all your comments are spot-on, and I appreciate it.
The Upstart thing is just somewhat odd, because it's a Canonical project, and Ubuntu is also Canonical's. At any rate, you're right that it's not immediately lost. But trying to pin against even the long-term releases of Ubuntu seems hopeless. I've been working on upgrading a 12.04 deployment to 14.04. There aren't any features that we're missing in the OS, so it's really only a concern from a security or support perspective. The shop I'm in was already using Ubuntu when I came on, but maybe the lesson here is to migrate to Debian, rather than bother upgrading to the newer Ubuntu release. (Migrating to CentOS would be somewhat more work.)
> The Upstart thing is just somewhat odd, because it's a Canonical project, and Ubuntu is also Canonical's.
That's why they pushed so hard for Debian to use Upstart. Once Debian switched to systemd instead, Canonical announced that they would too.
> The shop I'm in was already using Ubuntu when I came on, but maybe the lesson here is to migrate to Debian, rather than bother upgrading to the newer Ubuntu release.
I'd certainly recommend that myself. Unless you have a hard requirement for an "Enterprise" distribution (RHEL or SLES), I tend to advocate Debian stable rather than Ubuntu LTS, especially on a server.
Use CentOS 7 with systemd. I have far fewer problems and annoyances with CentOS on my own servers. The problems I hit on Ubuntu are _small_ (large problems are always addressed fast) but there are a huge number of them. I know a solution for every problem I face on Ubuntu, because I started using Linux in 1998, but those problems are already _fixed_ in CentOS. Working on Ubuntu, I feel like I've gone back in time a few years.
Unless you have particular needs, I'd say go with Upstart. You can learn how to write a service config file in half an hour, and in our experience (running a dozen Ubuntu 12.04 for three years), it's been very stable.
There's no way of answering that question without knowing your exact requirements.
However, if you want to preserve the extreme flexibility of the daemontools approach, but with a workflow that is systemd-like (even converting systemd unit files to native service bundles), check out the nosh project: http://homepage.ntlworld.com/jonathan.deboynepollard/Softwar...
Runit is great for wrapping software that runs well in the foreground, but maybe doesn't handle being a daemon very well. This often goes for software written at / by a particular company, and software not written with large-scale use in mind. For system-level services, I usually stick with whatever the system I'm on does.
I use runit. It's great. (Daemontools didn't have the ability to sleep for 30 seconds after my daemons crashed at startup because, like, some filesystem wasn't mounted or something.)
How did you manage that? We currently have a problem with crashing services using up all system resources by constantly trying to restart. Or better yet, is there a way to get exponential backoff?
In the recent past I worked at a place that used daemontools extensively. We used a wrapper shell script which would do the restart loop detection, then exec the actual process to run. If it detected X restarts in Y seconds (by echoing the unix timestamp to a restart log and checking it) it would "svc -d $(dirname $0)" (or something like that) and exit.
We also had a monitoring service (nagios) that would check if any services "auto-downed" on any servers and alert us.
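A hypothetical sketch of the wrapper described above (thresholds, paths, and the daemon name are all made up); it logs a timestamp per start and downs the service when the restart rate crosses a threshold:

    #!/bin/sh
    # Restart-loop guard: if we've been started more than MAX times in the last
    # WINDOW seconds, tell the supervisor to stop restarting us instead of spinning.
    MAX=5
    WINDOW=60
    LOG=./restarts.log

    date +%s >> "$LOG"
    now=$(date +%s)
    recent=$(awk -v cutoff=$((now - WINDOW)) '$1 >= cutoff' "$LOG" | wc -l)

    if [ "$recent" -gt "$MAX" ]; then
        echo "too many restarts, downing service" >&2
        svc -d "$(dirname "$0")"    # daemontools; under runit this would be: sv down .
        exit 0
    fi

    # Otherwise hand off to the real daemon (name is made up).
    exec /usr/local/bin/myapp --foreground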
Many people still think that it's a 3-way horse race between sysvinit, Upstart, and systemd (which systemd has all but won). The Debian technical committee certainly acted like that was the case last year.
A lot of people further neglect all the prior art in init replacements before those. The simpleinit dependency mechanism, depinit, initng, minit, daemond, Seth Nickell's GNOME experiments, eINIT, cinit and so forth.
Instead, the way it was presented in public is that the Linux distros had been battling with brittle sysvinit scripts (true, but also largely self-inflicted) for so long until systemd came in to heroically save the day.
It was pretty shocking to watch how the scene evolved from apathy to having an urgent problem that must be solved now.
OpenRC isn't an init daemon. It's only a process management framework, hence the name. It's usually used in conjunction with sysvinit as PID1. OpenRC was originally motivated by replacing the older baselayout scripts, from what I recall.
Upstart didn't come about until later than many of the alternatives I listed, and its origins were mostly a response to launchd. It was quite rudimentary initially. [1]
I never said or meant that Upstart wasn't younger than those alternatives. What I disagree with was the idea that there was much apathy and then a sudden crisis when systemd appeared. The development and adoption of Upstart and OpenRC (even if the latter isn't an init daemon, it's still an alternative to the init system) by some of the biggest distros contradicts that claim.
And I don't think many would have noticed the existence of systemd, as was the case with upstart, if the project's devs had not started lumping all manner of non-init projects into the systemd bucket.
My first encounter with the existence of systemd was while keeping half an eye on the whole consolekit+polkit+udisk rigamarole I apparently needed to get thunar (or more correctly gvfs) to automount stuff.
This was when I learned that consolekit was to be replaced with logind, and logind required systemd as init.
That, to me, was a very WTF moment. It basically made a file manager dependent on a specific init being used.