
While this is pretty much the Ur-example of faulty software design causing human injury, the fact is that the entire system failed. Had the Therac-25 retained the hardware interlocks of the Therac-20, the accidents would have been much less likely to occur.
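For context on what "faulty software design" meant here: Leveson and Turner's published investigation describes, among other failures, a one-byte shared counter that the software incremented to mean "collimator position check required"; every 256th pass it wrapped to zero and the check was silently skipped. A minimal sketch of that failure mode, with simplified names (this is illustrative, not the actual code):

```python
# Sketch of the Therac-25 "Class3" counter bug as described in Leveson
# and Turner's report. Names are simplified; this is not the real code.
# The software incremented a one-byte shared variable to mean
# "collimator check required"; a value of zero meant "no check needed".

def increment_class3(class3: int) -> int:
    # One-byte counter: wraps back to 0 on every 256th increment.
    return (class3 + 1) & 0xFF

def collimator_check_required(class3: int) -> bool:
    # The bug: zero is indistinguishable from "no check needed",
    # so every 256th pass silently skips the safety check.
    return class3 != 0

# Simulate repeated passes through the setup routine.
class3 = 0
skipped = 0
for _ in range(512):
    class3 = increment_class3(class3)
    if not collimator_check_required(class3):
        skipped += 1  # safety check silently skipped this pass

print(skipped)  # the check is skipped twice in 512 passes
```

Per the report, the documented fix was to set the flag to a fixed nonzero value instead of incrementing it, so it could never wrap to zero.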

I also think we should be careful not to draw too much caution from this accident--the majority of software being developed in the medical field today (EHR systems, apps, etc.) would not be well served by the sort of scrutiny that would've prevented this accident.

In fact, one could (and I will) make the argument that simply having faster release cycles and better customer interfacing (instead of, say, custom consulting work *cough*Epic*cough*) would improve quality more than some insanely rigorous pile of paperwork.




A thorough review of a software product is not an insanely rigorous pile of paperwork. I'm going to have to disagree with you about the kind of caution we can draw from this incident; in fact, I think cases like these should be mandatory study material for anybody who makes, or moves into making, software for critical applications.

I've built some stuff controlling machinery that would amputate your arm in a split second and 'faster release cycles' would have caused accidents, not better quality.

Exhaustive testing, thorough review and extensive documentation of not only the code but also the reasoning behind the code saved my ass more than once from releasing something in production that would have likely caused at a minimum a serious accident.

One of my rules for writing machinery-controlling software is that I determine when a new piece of software can be taken out of my hands and passed up the chain. The one time someone violated that rule, this happened:

It was around 6 pm when we finished working on the control software of a large lathe, a Reiden machine with a 16' bed and a 2' chuck. I put the disks with the new version on the edge of my desk for 'air' (machine otherwise not powered up), 'wood' and 'aluminum' testing the next day. In simulation it all looked good, but it's easy to make mistakes.

When I walked back onto the shop floor the next morning it was deadly quiet. My boss was sitting in his office upstairs and I asked him what was up. He'd taken those disks to do a 'quick demonstration' for a prospect before I arrived, to show them a new feature (thread cutting, iirc). A subtle bug caused the machine to start cutting with a feed of 10mm instead of 1mm; the stainless steel he used for the demo got cut up into serrated carving knives spinning out of the machine at very high speed.

Amazingly, nobody had been wounded or killed, mostly due to the power of the Reiden (it never even stalled) and the holding force of the chuck (which had to keep hold of the workpiece through all that violence). The machine had actually completed its cycle and the customer had left 'most impressed' (and probably a few shades paler than they arrived...). They actually bought the machine on the strength of the demo and some showmanship by my boss, cheeky bastard; it could just as easily have been a couple of ambulances in front of the building that day.

After that, nobody ever tried to use any of the binaries until I had signed off on them as 'safe for production'.

That mistake would have definitely been caught in the 'wood' testing phase and a 'faster release cycle' would have missed it entirely since it looked very good right up to the moment where the cutting bit hit the metal.

Test protocols exist for a reason; skip them and you're playing with fire. Faster release cycles are great for non-critical software.


That's an excellent story, and something to remember when working on automated systems, especially industrial ones.

For something like, say, an automated surgery robot such as the da Vinci Surgical System, or the Therac here, or an implantable insulin pump, it absolutely makes sense to be super rigorous in testing.

For something that's basically just a big document database, though? Or a glorified calculator? Or a graphing and charting app? Or a messaging app?

Hardly necessary.

In fact, the sort of testing and software rigor that makes sense for embedded systems (like your lathe, or the Therac machine here) is pretty much the worst possible way to release one of the aforementioned systems on time, under budget, and useful enough to actually make people productive.

Adding more "rigor" to these applications would only serve as a barrier to entry for folks trying to improve the industry. It wouldn't save lives; it would only entrench the monopolies of the existing players.


Agreed. In fact, back at MIT, this was required reading in 6.033, one of the required classes for CS. We spent 1-2 weeks discussing it and its issues at great length. One of my favorite courses by far.

In fact, that whole course was brilliant. It was almost like a seminar where we read a ton of seminal CS papers (X Windows was one of my other favorites) and discussed them / studied them. Really one of my most memorable courses.



