Runit is amazing. I've used it on several large-scale websites with great success. Runit follows the unix philosophy of being stupid simple and doing one thing incredibly well. If you're starting up a new project, consider using Runit.
Not only is it simple and does the do-the-one-thing-well thing well, which is true, it's also _correct_. It's hard to overstate how valuable runit is in production because of that. It does the right thing with regard to clearing the environment, detaching from controlling terminal, logging, and many other subtle aspects of operating a service. I never worry about runit.
Back when I changed from supervisord to runit my life was substantially improved. That correctness means way fewer emergency maintenance ops issues in production.
First of all, and mostly, we could never exactly figure out what was going wrong.
This was a few years ago and I don't remember all the details. We just had occasional issues with stopping/starting and especially restarting processes when pushing a new version out or when a process crashed. I do remember it could occasionally report a successful restart and still leave the old process(es) running.
All my developers became very familiar with supervisord, and it was number 2 on the troubleshooting list (1. Did we introduce a bug in a recent commit? 2. Did supervisord do something weird again?).
After we switched to runit, only devs that touched ops knew about it at all. And we all forgot it was there. That's what you want in an ops tool.
There are many reasons a process can die that are outside of its control, including signals from outside the process, handled (but uncorrectable) memory errors, and the OOM killer (on Linux). Besides that, it seems like a major design shortcoming if fatal errors in any particular program (however critical and however simple that program may be) can be unrecoverable for the whole system.
It's definitely possible to solve this problem rigorously and completely, though I don't know of a way to do it without support from the kernel. On illumos systems, the service restarter ("svc.startd") provides a complex restart policy for user-defined services. I believe the restarter itself is restarted blindly by init, and init is restarted blindly by the kernel. If the kernel dies, the whole system is rebooted. In this way, if any software component in the chain of restarters fails, the system still converges to the correct state.
If you're having hardware memory issues, your system is already in an undefined and unstable state.
If you send it a SIGTERM, it runs your shutdown scripts and reboots. If you send it a SIGKILL, your kernel will panic. As far as I remember, this isn't any different from init.
The OOM killer will _never_ take PID 1.
In runit, all PID 1 has to do is run the service scanner. All the scanner has to do is open directories and fork/exec the individual service managers. If it fails, it will try again in 1 second, forever. No complex logic needed. Just keep trying. In practice, it works surprisingly well.
If an individual service manager fails to run the startup script, it will try again in 1 second, forever. That works very well for most situations, but you _can_ customize this behavior easily. This simplicity is really helpful in an actual emergency because you don't get emergent behavior, like init deciding that your service is flapping and holding it down for 5 minutes.
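One way to customize it is the optional `finish' script that sits next to `run': runsv runs it after `run' exits and waits for it to return before starting `run' again, so the crudest slow-down is just a sleep. A minimal sketch (the 5 seconds is an arbitrary value, not anything runit mandates):

    #!/bin/sh
    # ./finish -- runsv runs this after ./run exits and waits for it to return
    # before starting ./run again, so sleeping here slows the restart loop down.
    exec sleep 5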
Anecdotally, I've been using runit exclusively on all my systems (around 25 physical systems and 20-100 virtual ones depending) for at least 8 years now and I've never had a single issue.
The biggest problem I have with the design is that it puts your log services in a second level "behind" your main services, so you can sometimes miss that your log service failed to start up for some reason. This can be a real pain if your service uses blocking IO for its stdout/stderr logging, as it can cause the service to hang seemingly without explanation.
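For anyone who hasn't seen it: the log service is just another run script one level down, in ./log/, and runsv connects the main service's stdout to its stdin with a pipe. A minimal sketch (the log directory is an example and has to exist already):

    #!/bin/sh
    # ./log/run -- runsv feeds the main service's stdout to this process on stdin.
    # svlogd writes and rotates it under the given directory; -tt adds timestamps.
    exec svlogd -tt /var/log/myservice

If that process never comes up (or stops reading), the pipe eventually fills and the main service's writes to stdout block, which is exactly the mysterious hang described above.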
This is one of the reasons it's common to run a watchdog in HA critical systems. If something fails in supervision or at a low level and nothing is responding, nobody is around to tickle the watchdog and the entire system reboots.
What if init doesn't exit, and just hangs? What if it just goes crazy and starts erroneously restarting your processes? There are more failure modes than simply crashing.
At some point, you just have to assume that some critical components are working correctly. Adding complexity just makes it harder to reason about it, or, depending on how paranoid you are, prove it.
That's a slippery slope argument: because we can't solve the halting problem or verify program correctness, we shouldn't try to handle crashes, either?
Agreed on minimizing complexity. The only part of the chain I described that's very complex is svc.startd, and that's largely to support rich configuration.
Also don't mistake my position for saying that quality isn't important. Rather, just that perfection is not a reasonable constraint.
> Also don't mistake my position for saying that quality isn't important. Rather, just that perfection is not a reasonable constraint.
At some point, for some component or set of components, perfection is your only choice, regardless of the rest of your design. At least when you consider a single node with a single point of failure; this is less true for a distributed system where you have redundancy.
At some point, you have to assume that either init is perfect, or that the code in the kernel to detect init failures is perfect, or that the watchdog monitoring the kernel is perfect, or whatever other layering you choose to put in place is perfect.
In a system with a finite number of components, there is always going to be a point at which you just say "this bit is going to have to be correct, and there's no other way around it".
I think you've misunderstood my point. The design I described does not require any component to be perfect. If any of these components (including the kernel) crashes, the system _still_ converges to the correct state.
You missed my point. At the moment, you're assuming that the kernel behavior will reliably, buglessly fall into one of two outcomes: either a lossless full-system reset, or detecting that init has failed and restarting it. You're ignoring the possibility of deadlocks, failures in detecting that init has crashed, bugs in the special-casing of init to restart it, etc. You are assuming that there are components that do certain things perfectly reliably.
You haven't gotten rid of a correctness assumption, you've just shuffled it around a bit.
At least on Linux, PID 1's death is an instant kernel panic. As such, it's wise to keep any service management logic out of it.
If the svscan process dies, then your system is still chugging along and you can intervene to restore the supervision tree (otherwise svscan inspects supervise processes at a regular 5s interval). If you have some really critical process, then you could integrate a checkpointer into the run script chain so that you can just pick up from the last image of the process state with minimal interruption.
> we can't solve the halting problem or verify program correctness
We can't solve the halting problem or verify program correctness for all programs. For a large subset of programs, you can do both. This is a very important distinction.
We even have verified C compilers (CompCert). Writing a verified service manager should be easier in comparison.
I use GNU dmd instead. Simple and very extensible with Scheme. I use it as PID 1 on 2 of my machines, but I use another instance of it as a user service manager on all of my machines. I've also been meaning to replace runit with dmd in Phusion's passenger-docker image to get something more hackable.
There's nothing wrong with dynamic allocation in a kernel. It is for example better than having fixed size process tables and all the crap that comes with that.
"holy crap I've got to recompile my kernel to get more processes" is so 1995...
I've used s6 a lot, though not as an init replacement in any systems that matter. As a supervisor I think it's great. For basic stuff (root process supervising supervisors supervising daemons) the two are basically identical with some superficial differences. For wider-ranging stuff s6 is really good since it has a bunch of ancillary programs that solve a lot of serious issues in full-system supervision (readiness notification vs polling in a run script, ucspi socket handling, etc).
I missed actually answering the question about the difference between the two (plus, that comment is in dire need of editing and the edit window appears to have expired).
The superficial differences are things like: service-to-logger pipe holding happens at different levels of the supervision tree (the root `s6-svscan' holds them in s6, the per-service `runsv' holds them in runit), `s6-svc -CMD' behavior cannot currently be overridden whereas `sv CMD' can be, and `s6-svscan' will immediately re-scan its directory on SIGALRM whereas `runsvdir' only polls for changes on a 5-second timer.
For basic supervision tasks both are great, with runit being the simpler of the two in terms of understanding what it gives you out of the box. For larger tasks (full-system supervision, inter-service ordering dependencies, etc) s6 has the tools to make that easy, whereas with runit you're going to find yourself playing stupid tricks in run scripts to get similar behavior.
A genuine question, as it's not quite clear to me: can someone explain like I'm 5, what can I use runit for exactly? Is it a replacement for init.d or something more like monit? I always use monit to keep my Rails-related processes running and I've read somewhere that using monit combined with runit may be a better solution, but isn't it a bit redundant to have the two at the same time?
Essentially it's an alternative method for starting long running services (and restarting them if they fail).
The things all daemontools-inspired process supervisors do:
Runs a scanner against a service directory containing a directory for each program you want supervised (classically these are symlinks to another part of the system).
The scanner spawns a supervisor program for every directory it finds which then looks for a program called `run' inside that directory and runs it as a foreground child.
If the supervisor's child stops, it runs an optional script in its supervision directory called `finish', and then runs `run' again.
If one of the scanner's child supervisors stops, the scanner spawns it again.
In runit's case, the scanner is called `runsvdir', the supervisors are called `runsv', and it also comes with a program called `runit' that can act as a replacement for your PID1 init whose sole job outside of boot and shutdown time is to resurrect runsvdir if it exits. Note that runsvdir is perfectly happy to be run via an entry in your /etc/inittab if you're running under sysvinit.
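For example, an inittab entry along these lines does the job (the runsvdir path and the service directory location vary between distributions, so treat the exact paths here as an assumption):

    SV:123456:respawn:/usr/bin/runsvdir -P /etc/service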
At their most simplistic, that's all process supervision is - a bomb-proof way of keeping services running. All supervisors in the daemontools family come with a control program to interact with the process supervisor, as well as a stdin-based logger that doesn't rely on syslog. The main benefits it brings over the classic init.d/ model are simplicity (the most complex run script I have is 14 lines of dead simple shell, most are 4-5) and automatic restartability which init.d/ daemonization doesn't have.
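To make that concrete, here's roughly what one of those short run scripts looks like; `myapp' and its options are made up, and chpst is runit's helper for dropping privileges and tweaking the environment:

    #!/bin/sh
    # Merge stderr into stdout so the log service captures both streams.
    exec 2>&1
    # Replace the shell with the daemon itself, running in the foreground.
    exec chpst -u myapp /usr/local/bin/myapp --foreground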
From what I understand from monit's manpage, it's a full-bore rule-based system monitor. Generally speaking, process supervision is tackling one problem (keep daemons running), whereas monit is tackling another (system state monitoring). Yes, monit can act as a process supervisor, but it does so by polling the system state and hooking into the existing daemonization infrastructure.
Now that I think of it, the chef cookbook for Runit is what made it so easy to deploy to production. The 'runit_service' resource was absolutely invaluable. Forget the complicated upstart stanzas or dealing with supervisord, just write a shell script to run your program in the proper environment and bam! You've got a service!
since we're plugging init cookbooks, i'll plug a systemd cookbook[0] we're working on. would love some feedback from chef users using systemd-based systems!
This was how I got introduced to runit, and I couldn't agree more. It's really amazing at what it does, and quickly has become my favorite way of keeping processes running. It's just so easy.
Nice to see daemontools-inspired work on HN in the past few days. I've been using runit for years, and as other comments have said, it's just not something I ever have to worry about. When I look around at other solutions it seems like I'd be taking a huge step back (making things more complex with arguably less usefulness).
I've been meaning to do some blog posts about runit, as it seems like it sits on the back-burner in general. Does anyone know how actively it's maintained, or has it just reached such stability that it doesn't need much maintenance? It would be neat to see it on github with some real docs and such.
runit is what the Phusion Docker base image http://phusion.github.io/baseimage-docker/ is using, and it's the perfect tool to start and supervise containerized apps.
I also love the fact that it can wait for a service to run (in order to wait for dependencies), and that stuck services can be restarted in a more radical way than a single TERM signal.
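A sketch of the dependency trick in a run script (`postgres' and `myapp' are made-up names here); since runsv re-runs a failed run script after about a second, bailing out until the dependency is up amounts to polling for it:

    #!/bin/sh
    # Bring the dependency up first; if it isn't running yet, exit non-zero and
    # let runsv re-run this script in about a second (i.e. poll until ready).
    sv start postgres || exit 1
    exec /usr/local/bin/myapp --foreground

For the stuck-service case, something like `sv -w 60 force-restart myapp' sends TERM (plus CONT), waits up to the given number of seconds, and kills the process if it still hasn't gone down before starting it again.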
I'm using this too (after suffering all the other supervisors), and I have no clue why runit isn't the standard tool in Linux. Can any Linux expert explain that?
Linux is a kernel. It has no standards for userspace, though some have been attempted for GNU/Linux in particular and there are de facto conventions, but that's it.
Given the other glowing comments, maybe this would be a place to ask: should I bother with Upstart, or Systemd? I see Shuttleworth announced the move to systemd, but it's not available on Ubuntu 14.04 servers right now. I'm writing provisioning for our production fleet, what should I use?
The vagaries of the OS wars make something like Runit tempting.
Depends heavily on what you need to provision, and whether you can select your provisioned environment based on what you want to support. Upstart is, at this point, effectively dead for any future distribution. However, if you want to support existing Ubuntu LTS distributions for the remainder of their lifetime, you'll need to handle it. Similarly, if you want to support the current rounds of enterprise distributions, you still need to support sysvinit. And if you have any non-Linux systems, you'll need to handle whatever they use as well.
On the other hand, if you can ensure that your production systems all run relatively recent Linux distributions, you can safely assume systemd. Which specific distributions you have in production will determine the oldest version of systemd you have to support; since new features get added regularly, you'll want to know the oldest version you can assume.
What are you trying to provision, and what requirements do you have?
I appreciate the detailed reply, and the others as well. To address your questions and some of the others:
We can generally pin our entire fleet against whatever we want -- until we can't. One particular (very large, very important) dependency currently requires CentOS 6.5. Go figure.
As much as possible, I want to treat infrastructure as a commodity, so catering to specific versions (even LTS versions) breaks that goal. As another reply notes, if we lock in at Ubuntu 12.04 and it's predictable, that's excellent.
As another mentions, Upstart was easy to get a handle on quickly. As an aside, it bugs me that the (otherwise excellent and detailed) Upstart docs don't seem to mention that all of Upstart is now deprecated. Hence the original confusion! In general I don't want flexibility (not first and foremost), I want simple and consistent.
> One particular (very large, very important) dependency currently requires CentOS 6.5.
What dependency is that? And can you run it in a single-purpose virtual machine, rather than directly on real hardware, to make it easier to manage?
> As much as possible, I want to treat infrastructure as a commodity, so catering to specific versions (even LTS versions) breaks that goal. As another reply notes, if we lock in at Ubuntu 12.04 and it's predictable, that's excellent.
You definitely don't want to cater to specific versions any more than you have to; ideally you want as few versions across your entire fleet as you can. If you could get it down to just CentOS and the latest version of some up-to-date distribution, that would help; if you could make CentOS a virtual machine under that same up-to-date distribution, that's even better, insofar as you can then manage all the physical machines identically.
> As another mentions, Upstart was easy to get a handle on quickly. As an aside, it bugs me that the (otherwise excellent and detailed) Upstart docs don't seem to mention that all of Upstart is now deprecated. Hence the original confusion!
The Upstart docs aren't going to say that Upstart is deprecated. It's more a view of the Linux distribution ecosystem as a whole: the one distribution that primarily drove upstart usage and adoption is switching to systemd, so there won't be further momentum or resources behind upstart in the future. That doesn't make it instantly obsolete, but it does mean that starting a new project today around upstart is a bad idea.
(In fairness, there's one other notable distribution using upstart as well, namely Chrome OS; however, that's a bit of a special case in several ways, and in any case I hope to change that in the future.)
I'm wary of naming the dependency, because it's a vendor's solution and I'm not trying to call them out, but it's their ecosystem of tools and it has to run as the host OS, not virtualized. But all your comments are spot-on, and I appreciate it.
The Upstart thing is just somewhat odd, because it's a Canonical project, and Ubuntu is also Canonical's. At any rate, you're right that it's not immediately lost. But trying to pin against even the long-term releases of Ubuntu seems hopeless. I've been working on upgrading a 12.04 deployment to 14.04. There aren't any features that we're missing in the OS, so it's really only a concern from a security or support perspective. The shop I'm in was already using Ubuntu when I came on, but maybe the lesson here is to migrate to Debian, rather than bother upgrading to the newer Ubuntu release. (Migrating to CentOS would be somewhat more work.)
> The Upstart thing is just somewhat odd, because it's a Canonical project, and Ubuntu is also Canonical's.
That's why they pushed so hard for Debian to use Upstart. Once Debian switched to systemd instead, Canonical announced that they would too.
> The shop I'm in was already using Ubuntu when I came on, but maybe the lesson here is to migrate to Debian, rather than bother upgrading to the newer Ubuntu release.
I'd certainly recommend that myself. Unless you have a hard requirement for an "Enterprise" distribution (RHEL or SLES), I tend to advocate Debian stable rather than Ubuntu LTS, especially on a server.
Use CentOS 7 with systemd. I have far fewer problems and annoyances with CentOS on my own servers. The problems I hit on Ubuntu are _small_ (large problems are always addressed fast) but there are a huge number of them. I know a solution for every problem I face on Ubuntu, because I started using Linux in 1998, but those problems are already _fixed_ in CentOS. Working on Ubuntu, I feel like I've gone back in time a few years.
Unless you have particular needs, I'd say go with Upstart. You can learn how to write a service config file in half an hour, and in our experience (running a dozen Ubuntu 12.04 for three years), it's been very stable.
There's no way of answering that question without knowing your exact requirements.
However, if you want to preserve the extreme flexibility of the daemontools approach, but with a workflow that is systemd-like (even converting systemd unit files to native service bundles), check out the nosh project: http://homepage.ntlworld.com/jonathan.deboynepollard/Softwar...
Runit is great for wrapping software that runs well in the foreground, but maybe doesn't handle being a daemon very well. This often goes for software written at / by a particular company, and software not written with large-scale use in mind. For system-level services, I usually stick with whatever the system I'm on does.
I use runit. It's great. (Daemontools didn't have the ability to sleep for 30 seconds after my daemons crashed at startup because, like, some filesystem wasn't mounted or something.)
How did you manage that? We currently have a problem with crashing services using up all system resources by constantly trying to restart. Or better yet, is there a way to get exponential backoff?
In the recent past I worked at a place that used daemontools extensively. We used a wrapper shell script which would do the restart loop detection, then exec the actual process to run. If it detected X restarts in Y seconds (by echoing the unix timestamp to a restart log and checking it) it would "svc -d $(dirname $0)" (or something like that) and exit.
We also had a monitoring service (nagios) that would check if any services "auto-downed" on any servers and alert us.
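A hypothetical sketch of the wrapper described above (thresholds, paths, and the daemon name are all made up); it logs a timestamp per start and downs the service when the restart rate crosses a threshold:

    #!/bin/sh
    # Restart-loop guard: if we've been started more than MAX times in the last
    # WINDOW seconds, tell the supervisor to stop restarting us instead of spinning.
    MAX=5
    WINDOW=60
    LOG=./restarts.log

    date +%s >> "$LOG"
    now=$(date +%s)
    recent=$(awk -v cutoff=$((now - WINDOW)) '$1 >= cutoff' "$LOG" | wc -l)

    if [ "$recent" -gt "$MAX" ]; then
        echo "too many restarts, downing service" >&2
        svc -d "$(dirname "$0")"    # daemontools; under runit this would be: sv down .
        exit 0
    fi

    # Otherwise hand off to the real daemon (name is made up).
    exec /usr/local/bin/myapp --foreground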
Many people still think that it's a 3-way horse race between sysvinit, Upstart, and systemd (which systemd has all but won). The Debian technical committee certainly acted like that was the case last year.
A lot of people further neglect all the prior art in init replacements before those. The simpleinit dependency mechanism, depinit, initng, minit, daemond, Seth Nickell's GNOME experiments, eINIT, cinit and so forth.
Instead, the way it was presented in public is that the Linux distros had been battling with brittle sysvinit scripts (true, but also largely self-inflicted) for so long until systemd came in to heroically save the day.
It was pretty shocking to watch how the scene evolved from apathy to having an urgent problem that must be solved now.
OpenRC isn't an init daemon. It's only a process management framework, hence the name. It's usually used in conjunction with sysvinit as PID1. OpenRC was originally motivated by replacing the older baselayout scripts, from what I recall.
Upstart didn't come about until later than many of the alternatives I listed, and its origins were mostly a response to launchd. It was quite rudimentary initially. [1]
I never said or meant that Upstart wasn't younger than those alternatives. What I disagree with was the idea that there was much apathy and then a sudden crisis when systemd appeared. The development and adoption of Upstart and OpenRC (even if the latter isn't an init daemon, it's still an alternative to the init system) by some of the biggest distros contradicts that claim.
And I don't think many would have noticed the existence of systemd, as was the case with upstart, if the project's devs had not started lumping all manner of non-init projects into the systemd bucket.
My first encounter with the existence of systemd was while keeping half an eye on the whole consolekit+polkit+udisk rigamarole I apparently needed to get thunar (or more correctly gvfs) to automount stuff.
This was when I learned that consolekit was to be replaced with logind, and logind required systemd as init.
That, to me, was a very WTF moment. It basically made a file manager dependent on a specific init being used.