Probabilistic Assertions: Crashing When Something Feels Wrong

blinks · on April 14, 2008

Yay for automated monitoring software. Nagios (http://www.nagios.org/) does this for networks (and is extendable for some other things). At my old job we used Hobbit (http://hobbitmon.sourceforge.net/) to watch our Java server instances (memory usage, etc.). There’s no reason why these monitoring programs couldn’t be used to monitor internal program statistics, as long as those stats were made available.

Generally you monitor from your internal network, and then provide some hook for the monitor to get information that’s only accessible from there. (SSH or a limited-access URL, etc.)

Monitoring programs are super-powerful and generally complex. Check them out — it’s a good skill to have when working with production software.

(I also posted this on the article.)

angstrom · on April 15, 2008

I've worked with threshold logic like that for collecting and analyzing traffic on telephone switches where an alarm or notification would be generated if the threshold was broken.

Personally I would never want to debug something like that using a statistical probability that something might have gone wrong. Better to fail gracefully with something like multiple chains so that when a request chain goes down it gets logged, cleaned up, and recreated.

Worst case scenario they get a request timeout warning.

pmorici · on April 14, 2008

This sounds like circuit breakers for software. Instead of an over current condition you've got excessive busynesses.

Tichy · on April 14, 2008

Maybe frequent backups would be a better solution?

palish · on April 14, 2008

The "delete all of a user's files" example was just one of many. It is sometimes valid to force a crash when the code detects an impossible case (or when you choose not to recover from that case for simplicity). If each component either functions correctly or the program crashes, that can significantly reduce debugging time. There aren't any silent failures from components that simply ignore invalid input.

That said, with procedural languages I try to write code that recovers from all reasonable cases and assert the rest.

Tichy · on April 14, 2008

What if the validation code itself contains the errors. It might be a fallacy to assume problems can be solved in that way.

palish · on April 14, 2008

"Solved" certainly shouldn't mean "avoided entirely". One way to write good software is to quickly recover from mistakes. So instead of staring at your code while you triple-check each corner case, you approach the problem from the other direction by testing each code path until you're confident that the code runs as it should for the inputs that matter. (Be sure to assert for all other inputs).

In that light, it is easy to see why it can be good to crash for invalid cases. Since you don't handle a bunch of failure cases, you end up writing less code. And since less time is spent in the debugger, more time is focused on the correct task: achieving architectural goals rather than solving structural problems.

After you acquire a deeper understanding of the architecture, it is best to redesign (throw away) your previous attempt. After the second and especially the third iteration you will have written a solid and elegant program in a relatively small amount of time. Programming is about trusting yourself to make decent decisions based on your current knowledge.

As with anything, there is only a finite amount of time to solve a problem. So if you don't have a lot of time then don't fret if your code isn't perfect (or reusable) as long as it works.

pmjordan · on April 14, 2008

Backups are damage limitation, and yes, they are always a good idea. I would consider this kind of strategy not only damage limitation but also prevention, which is interesting.