Why not conditions and restarts as Lisp does it? Lisp bakes them into the language, but they're not impossible to implement in Java (or the Blub of your choice) according to Peter Seibel (see his Google talk).
The one thing that makes working with exceptions difficult is that when one is encountered the stack is unwound up to the first matching handler (and possibly just passed off to the next one, etc.). You can lose a lot of state this way, and it makes restarting computations difficult (or even deciding what to do in some cases).
Using exceptions just requires careful planning and I think conditions and restarts are a much more elegant way of solving the problem.
To answer the titular question: Because (1) most software, and most frameworks, are designed optimistically, thinking almost exclusively about when everything goes well. And (2) because of positive bias we test cases we expect to go well.†
For example, in the Tornado web stack, it’s certainly possible to gracefully handle an exception you threw in the middle of responding to a method, but you have to basically augment the framework to do so. It’s practically unfinished in that sense. Would that have been the case if it were designed with failure modes occupying its creators’ minds even half as much as successful ones?
> Because (1) most software, and most frameworks, are designed optimistically, thinking almost exclusively about when everything goes well. And (2) because of positive bias we test cases we expect to go well
Yes, this realization finally hit me a few years ago and made me rein in my use of exceptions. I realized that exceptions are all about giving priority to the optimistic case in the code - making it as clear and simple, with as little branching, as possible. However, after seeing projects through a full lifecycle (from initial conception through to maintenance in production), I now realize that the really important code - the code we spend most of our time trying to figure out during maintenance, and whose state is hardest to understand and deal with - is in fact the error-handling code, and that exceptions focus on hiding that code, making it "disappear" and seem implicit. What seems like a win initially can be a huge loss when you're trying to write a robust system where every possible behavior is explicit and well understood.
Many common problems, especially in network servers, can be solved quite elegantly with an "exception driven design". In those scenarios the exceptions serve as a vehicle to bubble state through multiple layers. Pretty much like an event-framework, except many people don't realize they have that baked right into their language.
A simple example would be a network-server that detects a protocol phase change at the lowest protocol level. By raising a "PhaseChange"-Exception you can quite nicely propagate such an event to the higher layers that need to know about it, without resorting to duplicated/shared state or awkward call-chains that introduce nasty dependencies and then need their own exception handling, and without running into potential synchronization issues.
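A rough sketch of what that can look like (the names here - PhaseChange, read_frame, handle_connection - are invented for illustration, not from any particular framework):

```python
# Sketch of "exception driven design" for a protocol phase change.
# All names here are made up for illustration.

class PhaseChange(Exception):
    """Raised by the lowest protocol layer when the peer switches phases."""
    def __init__(self, new_phase):
        super().__init__(new_phase)
        self.new_phase = new_phase

def read_frame(data):
    # Lowest layer: notices the phase-change marker and raises.
    if data == b"STARTTLS":
        raise PhaseChange("tls")
    return data

def handle_connection(frames):
    phase = "plaintext"
    for frame in frames:
        try:
            payload = read_frame(frame)
        except PhaseChange as e:
            # The event bubbles straight up to here; no shared flag,
            # no return-code plumbing through intermediate layers.
            phase = e.new_phase
            continue
        # ... process payload in the current phase ...
    return phase

print(handle_connection([b"hello", b"STARTTLS", b"encrypted-bytes"]))  # tls
```

The intermediate layers never see the event at all, which is exactly the point.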
Your analysis of Tornado is unfair. It offers easy ways to throw specific HTTPErrors, and those are handled exactly as they should be. If you throw an exception that Tornado doesn't know about, then it simply by default notifies the user that an internal server error occurred. It is extremely easy to design your code to throw the right http exceptions, and it is just as easy to write your own custom python exceptions that cooperate with their stack. You most certainly don't have to modify the framework itself, and tornado is designed specifically to be easy to extend.
You're right, but falling back to another handler, or doing anything else meaningful, is hard. That is, top-notch experiences in the face of failure don't come easily.
I don't really like the article, but I will grant the author the benefit of the doubt as he's the creator of CouchDB. Maybe his ideas are right, but worded badly.
I lost my interest at "PHP to the Rescue". I understand the point he tried to make, but he's wrong.
PHP (or CGI/FastCGI for that matter) only works because the computation done on each request is very light, and because it maps well onto HTTP, which is at its core a stateless protocol. But it only works because the real work of indexing, searching, and retrieving documents gets pushed to the database server, a piece of software that does serious gymnastics to serve the results you want. The database server gives you ACID precisely because it can UNDO, and once the database server crashes, all hell breaks loose.
A better example in this context would be Chrome, which sandboxes each tab inside its own process, such that the crashing of a tab doesn't affect the whole browser. But tabs themselves have to be long-running and stable, and those tabs are still crashing and valuable work may still be lost, not to mention the whole browser still freezes because of plugins that haven't been fully sandboxed.
Also, it doesn't warm my heart when I lose an hour's work, even if the other tabs haven't crashed.
Don't Undo Your Actions, Just Forget Them
... Use a Functional Language
This doesn't address the bigger and more important issue - some resources are inherently mutable.
Once a file is changed, it stays changed (sorry, you'll have to replace the OS to avoid it and what can I say, good luck with that). Once an email is sent, it stays sent. Once a phone call is made, you can't really pretend that it isn't. Once a bank transaction is made, you have to undo it if anything went wrong, otherwise you're facing serious penalties.
Not dealing with mutable data only works insofar as you're dealing with dumb logic, and not everything is as simple as an HTTP request that returns rows of comments on a simple blog.
Also, I find an article bitching about OOP and about UNDO weird at best if it doesn't reference THE OOP recipe for undoing whatever you changed -- http://en.wikipedia.org/wiki/Command_pattern
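For reference, a bare-bones sketch of the Command pattern's undo half (names invented for illustration):

```python
# Minimal Command-pattern sketch (names invented for illustration):
# each command knows how to apply itself and how to undo itself.

class AppendCommand:
    def __init__(self, doc, text):
        self.doc, self.text = doc, text
    def execute(self):
        self.doc.append(self.text)
    def undo(self):
        self.doc.pop()

doc, history = [], []
for cmd in (AppendCommand(doc, "hello"), AppendCommand(doc, "world")):
    cmd.execute()
    history.append(cmd)   # keep executed commands so they can be reversed

history.pop().undo()      # roll back the most recent change
print(doc)                # ['hello']
```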
Boy, you missed the point. Undo here isn't about user-level undo. It's about the fact that when the program can't continue, you need to undo all incomplete state changes and return to a previous state. Really this article is about the lack of transactions in programming languages, and how error conditions cause problems for software with long-lived state.
The point about PHP is that it successfully does that precisely because of the way it's architected. The state is kept in the DB server, where the transaction problem is solved.
When using long-lived server processes, things tend to get into inconsistent states on error conditions unless you are very careful about how you program and how you deal with state mutation and error conditions. As ugly as it seems, PHP-style development frees you from that.
I like your point about Chrome; it's a great architecture, and its process isolation is actually quite similar to the original Apache+PHP combo.
Anyway, user-level undo is still an important concept, but it's orthogonal to creating reliable software systems.
You could just tl;dr the article as "Use transactions". Transactions don't have to be tied to a database, and anyone with a CS background should know what they are.
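To make that concrete, here's a toy snapshot-and-rollback "transaction" that isn't tied to any database - just a sketch, assuming the state is cheap to deep-copy (real systems need real write-ahead logs):

```python
# A toy in-memory "transaction": snapshot the state up front and
# restore it if anything inside the block raises.

import copy
from contextlib import contextmanager

@contextmanager
def transaction(state):
    snapshot = copy.deepcopy(state)
    try:
        yield state
    except Exception:
        state.clear()
        state.update(snapshot)   # roll back every partial change
        raise

account = {"balance": 100}
try:
    with transaction(account) as acct:
        acct["balance"] -= 150   # partial change...
        if acct["balance"] < 0:
            raise ValueError("insufficient funds")
except ValueError:
    pass

print(account)  # {'balance': 100} - the partial debit was undone
```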
God, I never get tired of the condescending "I lost my interest at" or "I stopped reading at". Always strongly reinforces in my mind how irrefutably confused and misguided the original article is, and how thankful I am to be shepherded into the light by so wise a poster. I myself am not typically able to dismiss everything someone has to say by the reading of a single point, but I've noticed a lot of other people seem to be able to.
Look, this is a very long article, it really wasn't worth reading and to be honest bad_user summed up a fundamental flaw.
And I'm saying that having read it. Three times. I wish I'd taken bad_user's comment at face value, but your comment and the author's comments made me read it.
Whole chapters were squirming before my eyes that seemed to make a single tiny point. Was I missing some gem of knowledge to unlock the shockingly badly written conclusion (otherwise known in writer circles as 'Um, what was I saying?')? Whole rambling diatribes could have been cut down to single sentences ('Building a Deck' and 'The Miracle Deck', I'm looking at you.). A little of my soul died inside. I thought I was missing something because he waffled so much.
I wasn't.
And 'I lost my interest at' perfectly sums it up.
This is an old article by someone who at the time wasn't a very good writer.
The core points of the article are sound - the author is right about programs being too brittle - but he took a lot of words to say it, his conclusion, buried unconventionally in the middle of the article, is way, way, way, way, way, way, way off the mark, and the article is a massive chore to read. This is from 2006 and it shows.
Not really. I didn't see the author ever claim you could magically control external mutable state, indeed he explicitly acknowledged that:
> Unfortunately, adding the toolbar bar to the window may not truly be an atomic operation down deep, but from your perspective it is, since you can't make the mutation operation any smaller. You may not have completely eliminated the chance of things going into a bad state, but you've minimized it as far as you can.
He just says "do the best you can with your own software, and here are some ideas about it". It's an old article, appropriate for its time (detailed because many people were new to functional concepts) and yet surprisingly prescient (the rise of functional languages did occur). Nobody forced you to read it if you happen to be some whiz-functional programmer to whom all the points are intuitively obvious. Honestly, I don't understand the negative tone in many of the comments.
I believe this is a solved problem, although it's not used much.
EOF and Apache Cayenne are ORMs that have a notion of a hierarchy of 'editing contexts'. If you save to the top-level one, it commits to the database. But you can create a 'child editing context' and interact with that object graph instead. When you commit it, the changes get pushed up to the parent editing context. Of course, if you want, you can then commit again to persist them.
This allows you to build and throw out a deck.
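I don't remember the exact EOF/Cayenne API, but the nesting idea can be sketched roughly like this (all names hypothetical, not the real API):

```python
# Hypothetical sketch of nested "editing contexts": a child context
# buffers changes; committing pushes them one level up; only the
# top-level commit actually persists anything.

class EditingContext:
    def __init__(self, parent=None):
        self.parent = parent
        self.changes = {}
    def child(self):
        return EditingContext(parent=self)
    def set(self, key, value):
        self.changes[key] = value
    def commit(self):
        if self.parent is not None:
            self.parent.changes.update(self.changes)  # push up one level
        else:
            print("persisting:", self.changes)        # would hit the DB here
        self.changes = {}

top = EditingContext()
draft = top.child()          # the "deck" you can throw away
draft.set("order", "o-42")
draft.commit()               # pushed into top; nothing persisted yet
top.commit()                 # only now does anything reach the store
```

Throwing the deck out is just dropping `draft` without committing it.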
This creates a new problem, a human one. People are used to interacting with web applications, and having everything saved somewhere after each screen change. With this arrangement, stuff doesn't necessarily get saved. I built an application that allowed you to create an order, and then create a person associated with it, and all this in memory. Users were annoyed that when they didn't save the order, the people they'd created associated with it also got thrown out.
STM is great, but let's face it, it's not getting that email back once it went off to the user, or "uncharging" the user's credit card (sure you can refund, but you've still made a mess on their statement).
Finite state machines make exception processing easier. They also inspire horror in those used to if() then ... else() trees, but them's the breaks. AFAIK, FSMs are the difference between high-reliability and ... not-so-reliable systems.
Finite state machines are limited in their ability to compute. That is why they are reliable. If your problem's solution is beyond the ability of a FSM, then you are in the same boat.
Each failure event becomes a state transition instead of an exception. You can then go a long way towards proving that nothing in the FSM object's state was trashed. You can also build a test harness that provides good coverage (up to 100%) and documentation of that level of coverage.
That's one way to do high-reliability processing in C... it all makes the choice of language less of an issue. And if you have adequate logging of test-site installs (meaning all state-transition data is logged), then you can reproduce 100% of failures in a controlled environment.
It's not magic, but somehow, making the error cases explicit events makes them (for me at least) easier to reason about. It's just another event.
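A tiny illustration of the idea - failures as ordinary transitions rather than exceptions (the states and events are invented):

```python
# Toy FSM where a failure is just another transition, not an exception.

TRANSITIONS = {
    ("idle",       "start"):   "connecting",
    ("connecting", "ok"):      "online",
    ("connecting", "timeout"): "backoff",    # failure = ordinary edge
    ("backoff",    "retry"):   "connecting",
    ("online",     "close"):   "idle",
}

def step(state, event):
    # Unknown (state, event) pairs are rejected up front, so the
    # machine's state can never be silently trashed mid-update.
    return TRANSITIONS[(state, event)]

state = "idle"
for event in ("start", "timeout", "retry", "ok"):
    state = step(state, event)
print(state)  # online
```

Exhaustive testing is just enumerating the keys of that table, which is where the "up to 100% coverage" claim comes from.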
Indeed. This article is me trying to explain what's been missing from most server-side languages; it became so much more obvious once I learned Erlang. But at the time, almost no one had heard of Erlang, so it mostly just gets a passing mention.
I'm not sure his example for "exceptions are too complicated" is very
good. When delivering mail, the operation isn't "first try the primary
server, then try the backup server". It's "try the servers in
ascending order of priority, randomly selecting if the priorities are
the same". From that, the code becomes much simpler:
def servers := ( sort possible servers by ascending priority,
                 choosing randomly among equal priorities );
def sent_to := null;
send_attempt: for server in servers {
    try {
        server->send_mail_to( message );
        sent_to := server;
        last send_attempt;
    }
    catch TemporaryFailure {
        logger->note_temporary_failure( ... );
    }
}
if (!sent_to)
    throw PermanentFailure("EPIC FAIL");
The key is to use exceptions for what they're good for: unwinding the
stack under certain conditions. We want the stack to unwind if there
is some reason to abort mail sending completely ("network cable not
plugged in"), but we only want to unwind to the point of trying the
next server if the server is simply unavailable.
Using error codes alone would be complicated ("if result_code ==
TRY_AGAIN_LATER then { try the other server } otherwise { return
TOTALLY_FAILED_DUDE }"), and using exceptions without the "sent_to"
state would also be quite complicated (I'm not really sure how to even
do that).
The key is to use the features for what they're good at, rather than
to treat them like an ideology.
From an ideology perspective, your language's designer already decided
what he wants you to do. If you can accidentally ignore an error,
you're using the wrong one.
That means in Java, you tend to favor exceptions, because you get some
compile-time type checking to make sure you're handling them. And if
you don't handle them, your program exits at runtime, which is a good
thing. Return codes, on the other hand, are easy to ignore, and your
program produces undefined results if you forget to check them (and no
compiler is going to tell you where that is: you'd better have good
tests, tolerant users, and no deadline).
In Haskell, on the other hand, the ideology is reversed. If you throw
exceptions in IO, they basically work like exceptions in untyped
languages -- your program exits unless you were lucky enough to have a
runtime handler. (If you throw them outside of IO, like with "error",
you're double fucked. You can't even catch them outside of IO, which
basically means Never Do That.)
But the good news is, there are type-safe return codes. You define a
type like:
data Either a b = Left a | Right b
and then you make your functions return Either the Right answer or an
error. If you have a function that returns an error code, you can't
do anything with the result until you intentionally handle the error.
If you forget, your program simply does not compile -- there's no way
to ignore errors except by explicitly writing code to ignore the
error.
And, of course, this is generally hidden from the programmer with
monads! You don't even have to think about handling the return codes:
the language does it for you.
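For readers without Haskell handy, here's a rough Python analogue of the Either idea - what the monad automates is exactly the repetitive short-circuiting that `bind` does here (names invented; this is a sketch, not a real library):

```python
# Rough Python analogue of Haskell's Either: a function returns
# ("right", value) on success or ("left", error) on failure, and
# `bind` only applies the next step to successful results.

def bind(result, fn):
    tag, payload = result
    return fn(payload) if tag == "right" else result  # errors short-circuit

def parse_int(s):
    try:
        return ("right", int(s))
    except ValueError:
        return ("left", f"not a number: {s!r}")

def reciprocal(n):
    return ("left", "division by zero") if n == 0 else ("right", 1 / n)

print(bind(parse_int("4"), reciprocal))     # ('right', 0.25)
print(bind(parse_int("zero"), reciprocal))  # ('left', "not a number: 'zero'")
```

You can't touch the payload without first checking the tag, which is the untyped cousin of "your program simply does not compile".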
(C is a weird case where both exceptions and error codes work poorly.
For that reason, I make sure to write C in very tiny pieces that can
be composed with a safer high-level language. In that case, there are
only a few places where errors can occur, and they tend to be simple
like "tried to allocate memory, it didn't work, so i deallocated
everything else" or "failed to send to the socket: EAGAIN". The
complex logic like "failed to send to primary MX" should be handled
higher up in your stack.)