> The resulting system is often more robust and reliable because crash recovery is a first-class citizen in the development process, rather than an afterthought.
Because of that I usually make all my services and systems crash-only. I end up using things like atomic file moves, opening files append-only, using kill -9 to stop services, and so on. To make your system crash-only, you have to go down to the base system calls.
Some observed effects so far (many are covered in the article):
* Faster restarts (if your regular operation involves restarting lots of processes).
* Less code (don't have to handle both the clean shutdown and dirty shutdown).
* Recovery/cleanup code, if it is needed at all, often ends up moving to startup instead of shutdown (you might have to recover corrupt files when you start up again, for example by re-truncating them to a known offset based on some index; see the sketch after this list).
* Something else might need to manage external resources (OS IPC resources, shared memory, IPC message queues, etc.). This could be a supervisor process.
* If you do a lot of socket operations on localhost, your sockets can get stuck in the TIME_WAIT state and you'll eventually run out of ephemeral ports if you do a lot of restarts (say, during testing). SIGTERM signals are often caught, and processes (libraries) then perform a cleaner shutdown.
* Think carefully about the database you use and check whether it can support crash-only operation. Some do, some don't (I won't name names here).
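For concreteness, here is a minimal Python sketch of the file side of this; the helper names and the index-file scheme are my own illustrative assumptions, not something from the article or the comment above. Whole-file writes go through a temp-file-plus-rename so they are atomic, the log is only ever appended to, and recovery happens at startup by truncating to the last known-good offset.

```python
import os

def atomic_write(path, data):
    """Replace a file atomically: readers see either the old or the new
    contents, never a half-written file, even after a kill -9 mid-write."""
    tmp = path + ".tmp"                 # hypothetical temp-file naming scheme
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())            # make sure the bytes are on disk
    os.rename(tmp, path)                # atomic replace on the same filesystem

def recover_log(log_path, index_path):
    """Startup-time recovery instead of shutdown-time cleanup: truncate the
    append-only log back to the last offset recorded in the index file."""
    if not os.path.exists(index_path) or not os.path.exists(log_path):
        return
    with open(index_path) as f:
        good_offset = int(f.read().strip())
    if os.path.getsize(log_path) > good_offset:
        with open(log_path, "r+b") as log:
            log.truncate(good_offset)   # drop the torn tail left by the crash

# Normal operation then only appends, e.g.:
# with open(log_path, "ab") as log: log.write(record)
```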
Crash-only software is a great concept, and this article is a very interesting summary of what it is (and what it's not). If you read only one section of Candea and Fox's paper, I would recommend section 3 "Properties of Crash-Only Software". It lays out some basic properties of proper crash-only software, which work as guidelines even for software that doesn't go all the way to the crash-only ideal.
My favorite one of the principles is "All important non-volatile state is managed by dedicated state stores". Being both crash-only (or even just tolerating crashes) and keeping state is a very difficult combination, and you don't want every one of your services needing to solve that problem over and over. Dedicated state stores let you hand this problem off, which turns many systems stateless (or at least without hard state). Tolerating crashes in soft-state-only services is much easier, perhaps even trivial if you follow the other rules.
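To make the soft-state/hard-state split concrete, here is a small Python sketch; StateStore and Counter are hypothetical names, and the dict-backed store is just a stand-in for whatever dedicated state store (database, Redis, etc.) actually owns durability. The point is that the service process keeps nothing it can't afford to lose.

```python
class StateStore:
    """Stand-in for a dedicated state store whose job is to own durability
    and crash recovery so the service doesn't. Backed by a dict here only so
    the sketch runs; a real one would be an external system."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


class Counter:
    """A crash-tolerant service: all hard state lives in the store; anything
    kept in the process is soft state that can be rebuilt after a kill -9."""
    def __init__(self, store):
        self.store = store
        self.cache = {}                  # soft state only; safe to lose

    def increment(self, key):
        value = self.store.get(key, 0) + 1
        self.store.put(key, value)       # hard state goes straight to the store
        self.cache[key] = value          # cache is an optimization, not truth
        return value


store = StateStore()
svc = Counter(store)
svc.increment("requests")   # the process can die at any point; a fresh
                            # instance rebuilds its soft state lazily
```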
I've got a billion tabs open in Firefox (plus a bunch of extensions), which seems to expose some O(n^2) algorithm in the internals, because it becomes unusably slow after running for 24 hours. I can either quit it normally, which takes 7 minutes -- or just kill the process and restart it.
You'd have to know the source code quite intimately to answer that. My hunch is that when it slows down, what you'd see under the covers is a tangled mess of locks and some kind of leaking resource(s). If the tabs were more isolated, you'd be able to at least kill the ones most responsible and the others would survive. Chrome consciously chose to isolate tabs in separate processes; having more explicit separation probably discourages tangled dependency chains and that might be the best explanation for why, even though you could, you don't have to resort to actually killing processes that often.
There are many threads, but threads are not isolated by the OS to the same extent as processes; hence, their fates are all intertwined. The user can't kill one misbehaving thread, and even if you could, you couldn't expect the program to be stable afterward.
Heh, I have attached to the process, gone to the stuck thread, gone up a few frames, and told it to return. It never did recover, as expected :) But it was worth a shot at saving the state before having to restore all your tabs.
and when you kill the process and restart it, it will gladly restore your session and the memory will be in a better state. It will still degrade, but for a short time it's going to be awesome with all of those tabs until the leaky memory finds its way back into your browser.
...unless you're using (structure-sharing) persistent data structures throughout, in which case it is practically trivial. Mind you, this has other issues.
(Although, if you want to preserve undo/redo over (unintended) shutdowns, it becomes much more difficult.)
I would argue that a persistent data structure allows you to go back in time to a snapshot; it doesn't actually attempt to mutate back to a previous state. But yes and yes.
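A small sketch of that distinction, assuming nothing beyond plain immutable tuples (the names are hypothetical): every update produces a new version that shares structure with the old one, so a snapshot is just a reference, and "undo" means pointing back at an old version rather than mutating anything back.

```python
# Persistent (structure-sharing) stack built from immutable cons cells.
# Every "update" returns a new version; old versions remain valid snapshots.

def push(stack, item):
    return (item, stack)          # new head shares the entire old tail

def pop(stack):
    head, tail = stack
    return head, tail

history = []                      # undo stack: just references to old versions
state = None                      # empty stack
for word in ["a", "b", "c"]:
    history.append(state)
    state = push(state, word)

# "Undo" is going back to a snapshot, not mutating back to a previous state.
state = history.pop()             # drops "c"; the old version was never touched
```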
Surprising that no one has mentioned Erlang in this context: the view that you don't need to program defensively at all. Programs can terminate for a variety of reasons, and as long as a monitoring process/program can take corrective action, it's all good.
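Since the thread is language-agnostic, here is the same shape sketched in Python rather than Erlang (the function name and the worker command are hypothetical): the worker has no shutdown handling at all, and a monitoring process restarts it whenever it exits, for whatever reason.

```python
import subprocess
import time

def supervise(cmd, backoff=1.0):
    """Minimal 'let it crash' supervisor sketch: the worker never handles its
    own shutdown; the supervisor restarts it whenever it exits. A fuller
    version would also reclaim external resources (shared memory, IPC queues)
    here before restarting, as mentioned in the list above."""
    while True:
        proc = subprocess.Popen(cmd)
        code = proc.wait()                  # block until the worker dies
        print(f"worker exited with {code}, restarting in {backoff}s")
        time.sleep(backoff)                 # avoid a tight crash loop

# supervise(["python", "worker.py"])        # hypothetical worker script
```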