Worst Computer Bugs in History: Therac-25 (2017) (bugsnag.com)
301 points by dangom on Aug 11, 2018 | 110 comments



As terrible as it was, the fact that Therac-25 remains one of the most frequently cited examples of software engineering flaws hurting people is somewhat encouraging for the profession. Three deaths is a tragedy, but the Hyatt bridge collapse a few years earlier was a couple of orders of magnitude worse (114 people, https://en.m.wikipedia.org/wiki/Hyatt_Regency_walkway_collap...) from what was also a fairly subtle engineering failure.

IMO, harm from software bugs has (so far) been vastly surpassed by harm from explicit choices in system design. The various emissions-cheating scandals have almost certainly taken a real toll on human life, likely numbering in the hundreds of lives. More subtly, the choice to retain data inappropriately at Ashley Madison (probably) led directly to suicides and serious emotional harm. Those are just two recent examples that spring to mind as a practicing developer, not an ethicist.

To oversimplify somewhat: when discussing engineering ethics, the harm from software developers building things wrong is swamped by the harm from building the wrong things.


I think that's because for most applications where bodily harm is a possibility you generally (in my experience) have hardware protections that will prevent the software from doing anything stupid. Take an elevator, for instance: even if the software controller is buggy (or hacked) and decides to drop the cabin from the top floor to ground level at full speed, there are hardware protections (safety brakes, limitations on the motor itself, etc.) that will take over and make sure nobody gets hurt. Therefore, for something to go completely wrong you need both a software and a hardware failure. The main flaw in the Therac-25 was arguably that no such protection was present; the hardware should have been designed to make the bogus configuration impossible to achieve solely in software.

I think unfortunately this is going to change with the advent of "AI" and related technologies, such as autonomous driving (we've already had a few cases related to self-driving cars, after all). When the total enumerable set of possible configurations becomes too great to exhaustively "whitelist", we won't be able to have foolproof hardware designs anymore. In these situations software bugs can be absolutely devastating.


> I think unfortunately this is going to change with the advent of "AI" and related technologies, such as autonomous driving

Yes, the potential cost of software bugs is increasing as software does things that no hardware interlock can stop. And worse than that, as a society we largely haven’t realized we need to optimize for worst case (not average) performance of algorithms, because they WILL be attacked. If you’re lucky, they won’t be attacked by sophisticated, well resourced nation-state attackers. But sometimes that will happen.

The rise of complex algorithms to control complex processes is the real difficulty. Facebook’s banning algorithm is an example of something that has been exploited by attackers.

Let’s hope voting software is not the next target where bugs can be exploited. Because changing political decisions can and does produce life-changing effects.


Your point rings true even in this case. There was another Therac (50? 100? It’s been a while since I read about it) machine which had the same bug, but where no one got hurt due to hardware safeguards.


In my opinion, one of the most tragic aspects of these horrific incidents is that the predecessors of the Therac-25 actually had independent protective circuits and other measures to ensure safe operations, which the Therac-25 lacked.

Here is a quote from http://sunnyday.mit.edu/papers/therac.pdf:

"In addition, the Therac-25 software has more responsibility for maintaining safety than the software in the previous machines. The Therac-20 has independent protective circuits for monitoring the electron-beam scanning plus mechanical interlocks for policing the machine and ensuring safe operation. The Therac-25 relies more on software for these functions. AECL took advantage of the computer's abilities to control and monitor the hardware and decided not to duplicate all the existing hardware safety mechanisms and interlocks."

So, regarding these important safety aspects, even the Therac-20 was better than the Therac-25!

The linked post also mentions this:

"Preceding models used separate circuits to monitor radiation intensity, and hardware interlocks to ensure that spreading magnets were correctly positioned."

And indeed, the Therac-20 also had the same software error as the Therac-25! However, quoting again from the paper:

"The software error is just a nuisance on the Therac-20 because this machine has independent hardware protective circuits for monitoring the electron beam scanning. The protective circuits do not allow the beam to turn on, so there is no danger of radiation exposure to a patient."


I have a friend with 40 years of programming experience who is building a computer-controlled milling machine in his basement.

When I asked him about the limit switches, it turned out they are read by software only, and the software will turn off power to the motor controllers if a limit switch is activated.

I asked why he does not wire the switches to cut power directly to be on the safe side.

His answer: "It's too much bother to add the extra circuits."

We are talking less than $20 in parts and a day of his time. If the software sends the controller a message to start moving the head at a certain speed and then crashes, there is nothing to stop the machine from wrecking itself.
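
To make the failure mode concrete, here is a minimal sketch in C (made-up names and stubbed I/O, not my friend's actual firmware) of what a software-only limit check looks like: the motor controller executes its last command autonomously, and the only thing that can ever cut power is a loop that has to still be running.

    #include <stdbool.h>
    #include <stdio.h>

    static int polls = 0;

    /* Stubs standing in for real hardware I/O. */
    static void send_move(int axis, int feed_mm_min) {
        printf("axis %d moving at %d mm/min\n", axis, feed_mm_min);
    }
    static bool limit_switch_hit(void) { return ++polls > 5; }  /* pretend it trips on the 6th poll */
    static void cut_motor_power(void)  { puts("power cut (by software)"); }

    int main(void) {
        send_move(0, 500);              /* controller now runs this command on its own */
        while (!limit_switch_hit()) {
            /* Every safety decision happens here, in software. If the program
             * crashes or hangs anywhere between send_move() and this check,
             * nothing ever calls cut_motor_power() and the axis keeps driving. */
        }
        cut_motor_power();
        return 0;
    }

A switch wired in series with the motor supply has no such dependency on the software staying alive.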

E.C.P.


While it's not terribly uncommon for small hobby machines to depend on software limits, in my years of maintaining "real" production CNC machines I can't think of a single one that didn't also include hard limits.

Limit switches typically include two trip points. The first is monitored by the control system; when it is tripped, the control halts execution and stops the machine. The second limit is wired directly to the servo amplifier so that if, for whatever reason, the control fails to halt the machine when the soft limit is tripped, power is removed and motion is halted. Both limits are fail-safe, such that if they were to become disconnected it would result in a limit-exceeded condition.


In a situation like that, I wouldn't blame him. Consider how many of these situations he will come across while building his milling machine. If he had to make sure there was a hardware failsafe for every one, it would simply not scale.

3D printers are like this too. They have mechanical limit switches [0] that are read only by software. So if there is a bug in the software, nothing is stopping it from pushing past the hardware limits and breaking. The same goes the other way around: if the switch is broken, the same might happen.

[0] https://i.ebayimg.com/images/g/EYAAAOSwbopZguz4/s-l300.jpg


Most 3D printers don't have massive printing heads. If they drive into the end-stops, the motors will likely just skip steps and be stuck. They are not designed to apply much force.

I'm much more worried about the heating element. Its temperature is usually controlled by the same CPU that also does motion control and g-code parsing. If anything locks up the CPU, the heat might not be turned off in time, and (because you also want fast startup) there is enough power available to melt something. At the very least you would get nasty fumes from overheated plastics, and maybe even Teflon tape, which is often part of the print head. At worst it could start a fire.
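
For what it's worth, this is exactly the situation a hardware watchdog is meant to cover. A minimal sketch of the pattern (illustrative stubs and a made-up 210 °C setpoint, not any particular printer's firmware):

    #include <stdio.h>

    static int  read_temperature(void) { return 200; }  /* stub: thermistor read */
    static void set_heater(int on)     { printf("heater %s\n", on ? "on" : "off"); }
    static void kick_watchdog(void)    { puts("watchdog kicked"); }
    /* In real firmware the kick would feed an independent hardware timer that
     * resets the MCU (dropping the heater output) if the kicks ever stop. */

    static void control_pass(void) {
        int temp_c = read_temperature();
        set_heater(temp_c < 210);   /* crude bang-bang control */
        kick_watchdog();            /* last step: prove the whole loop is alive,
                                       so a lockup in g-code parsing or motion
                                       planning also trips the reset */
    }

    int main(void) {
        for (int i = 0; i < 3; i++)  /* real firmware would loop forever */
            control_pass();
        return 0;
    }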


As a 3D printing enthusiast, I can confirm your fears. It's all in software, and while there are good control systems, nothing's perfect. I had the hotbed fail and it was smoking when I found it.


Note that (depending on how the 3d printer is wired) a defective switch will result in an "endstop hit" condition. A dislodged switch however will happily keep reporting it's not being hit.


Probably not that big a deal. An operator/program can easily damage a CNC mill by jogging a substantial tool (or the spindle itself) into solid material, which is a much more likely scenario.


...but it's not "much bother" to run the control loop through the complexity of software? This seems to be the equivalent of overengineering in software, where something straightforward is instead performed through many layers of abstraction and indirection.

The straightforward way of implementing this with a bidirectional motor is to wire normally-closed limit switches in series with their appropriate direction signals, such that when the switch is actuated it prevents the motor from going in that direction, but it can still move away from the switch.


Try that with stepper motors used in 3d printers. Try it with an off the shelf driver. Make it so the limit switch only stops motion in one direction.


So that's where the YouTube videos of CNC milling machine failures come from! TFA noted that major causes of the disaster were the culture and the failure to independently unit test the machine.


This seems like a good time to mention that God gave us hardware interrupt inputs and watchdog timers.


> It’s been a while since I read about it

It's in the article you're commenting on.


>the Hyatt bridge collapse a few years earlier was a couple of orders of magnitude worse (114 people, https://en.m.wikipedia.org/wiki/Hyatt_Regency_walkway_collap...) from what was also a fairly subtle engineering failure.

It wasn't subtle at all. The entire design was substandard to begin with and didn't meet code. Suspending a walkway from a piece of square tubing made by welding two pieces of C-channel together was ALREADY pants-on-head stupid, and undersized to boot. Deciding it's OK to hang the lower span off the upper span's substandard tubing was just the last step in a long chain of gross engineering negligence.

Calling it a "subtle" failure is like calling Challenger subtle because it was "just a leaky O-ring", or the Apollo I fire subtle because it was "just a tiny spark". There was a completely avoidable cascade of multi-level failure leading up to all of them.


Mechanical engineer here. I don't think the Hyatt Regency bridge collapse was caused by a subtle problem. The design change should have been obviously bad to any practicing civil engineer. Unfortunately, far too many engineers don't perform even basic sanity checks. I'd say a better engineering culture would have caught the problem. Things like this are why I am becoming more and more interested in testing.

Of course, as you have said, building the wrong thing swamps other harms. Unintended consequences are hard to predict, unfortunately, but I am interested in ways to improve this situation. Standards design also interests me, particularly standards which are hard to cheat.


The number one way to prevent building the wrong thing is a professional code of ethics, which software engineers (at least in the US) do not yet have.


May I point you to the ACM/IEEE-CS Software Engineering Code of Ethics: https://ethics.acm.org/code-of-ethics/software-engineering-c... This was a major topic in my professional ethics course in college.


It's a nice gesture. Unfortunately, it is not required for software engineers to pass any test on this or subscribe to it or otherwise be held to the standard before they are allowed to write and publish software, so unless and until someone forfeits a lot of money and/or their liberty, it will likely not be widely adopted in any practical sense.


I've never seen this before, this is really interesting. Thank you for the link!


I am personally more concerned with software engineers and network engineers aiding and abetting the imprisonment, torture and execution of people by repressive regimes, by enabling surveillance technology and fucking with internet traffic analysis. Way more people are going to be hurt in the near term by that than by therac-25 type mistakes.

For example if you're a Chinese network engineer, and you can avoid it, don't take a job setting up tracking and database of Uyghur people. That is an ethical issue just as important as the therac-25 type problem.


You don't have to look that far. ICE detaining children and violating them is already a crime against humanity. I bet there are IT people working for that agency.

They killed at least one child and are drugging them against their will, while they are forcibly taken from their parents and held in worse conditions than terrorists.


In addition to the ACM and IEEE, the National Society of Professional Engineers also has a code of ethics. However, relatively few engineers get licensed in the US--it's mostly needed for signing off on drawings for regulators and that sort of thing--and, in fact, the Software Engineering exam is being phased out.

(I took the Engineer-in-Training exam once upon a time but then stopped practicing engineering, so I never got the PE.)


In Portugal we do have one, and while joining is optional if you don't legally sign off on projects in the company's name, at least they certify that university degrees are actually teaching proper software engineering.


Do you mean that poor software engineering should have legal consequences?


Not the poster, but I think there are multiple paths of action encouraged by a code similar to other engineering disciplines[1].

More involvement of the legal and insurance industries is one of them. Another is to give software engineers something solid to brace themselves on when pushing back at management: completely aside from consequences for the company, if you're bonded or worried about a license, there are some things you won't let your manager sweet-talk you into. Another is to provide a model of behavior for engineers, like it says on the tin. It doesn't mean everyone will follow it, or even that the model is always absolutely correct. But giving folks a way to think about things when they feel something's off is a good thing.

Yet another is theoretically providing a baseline of competence. I think that depends more on improving informal culture than any formal mechanism, though.

[1] Note: not arguing in favor of one; I haven't made up my mind on what I think about the topic.


In Canada, that's the legal definition of engineering. You may not call yourself an engineer without accreditation and such accreditation will be rescinded if you make severe enough engineering mistakes.


I don't disagree with you, but there are a LOT of people in Canada calling themselves software engineers or network engineers who don't have a degree that qualifies them to wear the iron ring.


That's about being a Professional Engineer, which legally entitles you to certain actions (e.g. sign off on design docs). Anyone can call themselves engineers of any kind as long as they don't pretend to be P.Engs. I am not sure though that being P.Engs in software means anything in practice, unlike in e.g. structural engineering.


> which software engineers (at least in the US) do not yet have

"It is difficult to get a man to understand something, when his salary depends on his not understanding it." -- Upton Sinclair


Seems like now we could have CAD software perform static analysis on the design, as well as physics-simulation "unit tests", in order to augment testing.


Yes, that would be ideal. My impression is that CAD software usually integrates with separate software to do stress analysis, e.g., FEA software, so it's more complicated than you've described, but very possible.


How far advanced are methods to automatically set up FEA simulations on arbitrary inputs? The parameter space for FEA methods is pretty large, and things like meshing can go terribly wrong and lead to utterly wrong results. Trying to run simulations in the background while a building is being designed seems like a goal to strive for, but to be successful, the software needs to perform equivalently to an experienced engineer without guidance. That's a tall order.


Sorry, I wouldn't really know, as I work in fluid mechanics, not solid mechanics. In fluids I'm getting the impression that automated meshing is becoming fairly robust in some circumstances, to the point where I believe it is sometimes better than an experienced engineer. Solids seems easier in this respect, but as I said, I don't really know.


I mean, even if the software could do something like flag the Hyatt bridge redesign as failing to meet weight specs, that would be a win. I could be wrong, but it seems like certain civil engineering projects (such as indoor bridges) would be pretty simple to check the math on compared to a building subject to wind, etc.


Damage-by-software usually isn't spectacular (and therefore not likely to get noticed) or not necessarily very directly costly in terms of human lives, but I'd argue it's actually more significant in the long term and in the grand scheme of things. Software rules everything and even slight errors or inefficiencies have absolutely incredible incidental cost.


We’re already seeing bugs kill or maim with autonomous cars. You can be sure there is far worse data associated with military system flaws that we don’t know about.

In any case, it’s dangerous to explain away software failure by dumping blame on systems. The lack of professional standards in software makes it easy for people to do bad things well.


The difference here is that you treat one person at a time, and the bug doesn't get triggered every time. A bridge just happens to be used by a lot of people at the same time. The bridge didn't have to collapse three times before people figured out something was wrong.

This applies to the vast majority of fields where software is in use, except maybe planes, trains, and nuclear power plants (where I sure as hell hope there are a bunch of hardware safeguards in place). So in a sense, we software developers just got lucky that our mistakes only kill one or very few at a time in most use cases (if at all).

It's still insane how according to the article they apparently just had that thing developed using emulated hardware with no proper security audits, safety guidelines or formal verification.


Software is honestly cheaper and easier to test; other kinds of engineering tests run far more expensive in time and materials, and simulations aren't perfect and can't replicate all real-world conditions.

Definitely agree on the explicitly bad choices though, and since software's impact is often very subtle it might really be impossible to gauge exactly how bad some of those choices end up being.


I read this article, and many years ago the full report, and one of the omissions on the list of causes that stood out to me was overcomplexity --- if you read about the possible functions of the machine, they really don't require multiple threads, much less a full multitasking OS. None of these race conditions would have occurred if it had been a simple single-threaded embedded controller.

To paraphrase an old Hoare quote, software can either be so simple it obviously contains no bugs, or so complex that it contains no obvious bugs.


One of my favorite software horror stories is the $32 _billion_ overdraft at the Bank of New York.

From "Computer-Related Risks" by Peter G. Neumann, published 1994 (REALLY recommended reading)

"One of the most dramatic examples was the $32 billion overdraft experienced by the Bank of New York (BoNY) as the result of the overflow of a 16-bit counter that went unchecked. (Most of the other counters were 32-bits wide.) BoNY was unable to process the incoming credits from security transfers, while the New York Federal Reserve automatically debited BoNY's cash account. BoNY had to borrow $24 billion to cover itself for 1 day (until the software was fixed), the interest on which was about $5 million. Many customers were also affected by the delayed transaction completions."

Additional reference: https://www.washingtonpost.com/archive/business/1985/12/13/c...

Granted, no one died because of this, but ... wow ... that was a bad day for some developers somewhere.
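
For anyone who hasn't run into it, the failure class is easy to demonstrate. A small sketch in C (the values are purely illustrative, nothing to do with BoNY's actual data structures):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint16_t narrow = 65530;   /* the one counter left at 16 bits */
        uint32_t wide   = 65530;   /* its 32-bit siblings */

        for (int i = 0; i < 10; i++) {
            narrow++;              /* silently wraps from 65535 back to 0 */
            wide++;
        }
        printf("16-bit counter: %u\n", (unsigned)narrow);  /* prints 4     */
        printf("32-bit counter: %u\n", (unsigned)wide);    /* prints 65540 */
        return 0;
    }

The two views of "how many transactions have we processed" quietly diverge, and unless something checks for the wrap, everything downstream trusts the wrong number.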


Imagine being called in. OK guys, we don't know what the problem is, but it's costing the company $3500 per minute in interest alone for as long as the bug stays unfixed. No pressure.


I was on a conference call with a bank that, due to a number of miscommunications and general idiocy, was under the impression that they were possibly in violation of some banking laws (no comment on what laws) because they misinterpreted what exactly was happening. Their lawyers supposedly told them they were at risk for an unbelievable sum of fines and criminal stuff.

In short they saw some data they didn't understand that seemed to indicate things at the bank were very much not how they were thought to be, and indicated something very specific was happening. In reality that was not the case but a lot of assumptions by morons somehow were believed and everything snowballed into lots of corroborating "evidence" that seemed to indicate bad things. This was a bank that rarely had this level of stupid information and assumptions rise to the top so nobody actually questioned it no matter how absurd it seemed.

My presence on the call was purely because the bank wanted every vendor they had looking to see if they saw problems, so I was basically just poking around telling them what i saw from my end.

At one point on the call there was a dude in an unmanned data center literally just flipping off power switches and cutting the cables of various equipment as he was given the locations for it. It was frantic and very unlike an American bank (at least the ones I worked with were pretty cool customers normally). I turned my speakerphone on to let my coworkers listen to the chaos.

In the end, it was a stupid SQL server related virus thing that created some good old fashioned network disruption and when that was solved and nothing looked like the sky was falling, everyone came to their senses. Once the dude pulled the power to enough of them everything calmed down and I checked my bank account and it was all there... but no excess either :(


Also, the follow-up question should be asked: was the engineer who stopped the leak and fixed the bug rewarded to some reasonable degree?


No, because the bug was also caused by an engineer, so rewarding fixes could create perverse incentives.


How is this different from rewarding a salesman who rescues a sale that a different salesman had botched?


Because you can't purposefully botch a sale in order to later recover it. Also because you can't avoid botching sales by being more conservative or adding more process. In short, sales and engineering have basically nothing in common.


I don't think GP was talking about a scenario where the same engineer who created the bug fixed it and gets rewarded, rather one where a different engineer fixes it. Of course it wouldn't make sense as you describe it.


How do you decide objectively who is responsible for every single bug? The whole thing is ripe for abuse from all sides. You need a blameless culture to have good engineering, not a bounty-based one.


Getting paid a premium to go in and fix other people’s mess is just a regular consulting gig.


Why should they be? The difficulty and quality of the work isn't dependent on the severity of the bug. Basically you would be creating a lottery.


Wow. Any idea what would've happened if they hadn't gotten it fixed overnight?


Therac is one of the reasons I get nervous about "health hacking." Yes, people can verifiably benefit from some of the advancements made in this movement, like the DIY diabetic insulin pump, and yes, I prefer to see such advancements be open source than locked up in proprietary designs and trade secrets. And there probably is room in health regulation for trimming the red tape anyway even for innovations originating from the commercial sector.

On the other hand, when corners are cut (no hardware interlocks, for example) and edge cases aren't considered, even innocently, then you get things like this. Doing the extra engineering makes products more expensive to design and more costly to buy and maintain. It is certainly a barrier to entry, too. But do we want another case like this because people said "this is good enough"?


I'd say Brad Kuhn got it right in http://ebb.org/bkuhn/blog/2016/08/13/does-not-kill.html including "[Y]our proprietary software has killed people, both quickly and slowly, and your attacks on GPLv3 and software freedom are not only unwarranted, they are clearly part of a political strategy to divert attention from your own industry's bad behavior and graft unfair blame onto FLOSS".


I posted a link about an insulin pump that could be hacked remotely. Even when it isn't 'health hacking', because you've bought a product from what you thought was a reputable vendor, there are no guarantees that it will be secure and bug-free.


https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3262727/

> An unauthorized third party can interfere with pump communication and undermine patient safety

> we confirmed this through laboratory experiments by sending commands to an insulin pump using an unauthorized remote programmer at a distance of 100 ft

> Thus, the specifically identified issues are a security breach that could result in:

> (1) changing already-issued wireless pump commands;

> (2) generating unauthorized wireless pump commands;

> (3) remotely changing the software or settings on the device;

> (4) denying communication with the pump device.

People can also attack the blood glucose monitors and the data they report to the pump system.

Scary.


The crazy thing about this classic story is that the industry has learned nothing from it: The lethal bugs were all in the frontend UI code.

Today, companies build equally important UI logic in JS frameworks that target rapid prototyping and consumer-focused startups.


I don't think you can say the lethal bugs were all in the frontend code. First, machines of that age didn't have as clear a distinction between frontend and backend. Second, any good backend has bulletproof safeguards against bad frontend input. It makes more design sense to safeguard the backend against bad input than to rely on the frontend, because the frontend is more likely to require redesign, and multiple frontends can interact with the same backend.

More than anything else, this accident shows the importance of fuzz testing your critical logic, the importance of hardware interlocks, and the importance of multiple independent layers of interlocks.
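
To make the fuzz-testing point concrete, here's a toy sketch (invented example code, not the Therac logic): hammer a tiny "edit mode, confirm, fire" model with random keystrokes and check a safety invariant after every sequence. The planted bug, editing the mode after confirmation without clearing the confirmed flag, is exactly the kind of thing scripted happy-path tests miss and random input finds almost immediately.

    #include <assert.h>
    #include <stdlib.h>

    typedef struct { char mode; int confirmed; } Console;

    static void press(Console *c, char key) {
        if (key == 'x' || key == 'e')
            c->mode = key;          /* BUG: should also clear c->confirmed */
        if (key == '\n')
            c->confirmed = 1;
    }

    int main(void) {
        const char keys[] = { 'x', 'e', '\n' };
        for (int trial = 0; trial < 100000; trial++) {
            Console c = { 'e', 0 };
            char confirmed_mode = 0;
            for (int i = 0; i < 8; i++) {
                char k = keys[rand() % 3];
                press(&c, k);
                if (k == '\n')
                    confirmed_mode = c.mode;
            }
            /* Invariant: if the console claims "confirmed", the mode it would
             * fire with must be the mode on screen at confirmation time. */
            if (c.confirmed)
                assert(c.mode == confirmed_mode);   /* random input trips this */
        }
        return 0;
    }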


Lawful punishment of bad quality software needs to be a thing, just like in other industries.

Only then will most companies actually start to care about software quality in their development processes.


Other industries are also free to make terrible products, they're just not allowed to hurt people. In that regard, software isn't that different.

Crappy software just doesn't physically injure people very often (compared to like, lawnmowers), and that's where the most serious legal liability for products comes from. Monetary damage from software gets worked out the same way any contract dispute gets worked out, or the same way a physical product that doesn't work but doesn't hurt anyone would get worked out.

Liability for bad software is also complicated by the fact that there's a million apps out there that are free to use. If they are broken for however long, it's hard to say that it cause monetary damage to anyone. (If anything, people are saving time... to paraphrase Mitch Hedberg, FB is broken, sorry for the convenience.)


Most people surely wouldn't buy a physical product that doesn't work as expected, or in the worst case they will return it and expect to be fully refunded.

If everyone did the same thing for broken software, instead of being conditioned that broken software is unavoidable, the quality across the industry would be much better.


I'm pretty sure this incident killed the company. I'm not sure what more "lawful punishment" you want - hold the individual developers legally liable?


If your company's motto is "move fast and break things" then you can't punish the developers for bugs.


The company is the one facing the courts, not the employees.


I don't think it works, at least not within the current legal system. It becomes mostly about the legal bureaucracy of avoiding responsibility, rather than truly focusing on reliability.


Sure it does; it is no different than when a company delivers spoiled goods, or when one returns a product at a shop because it does not work as described on the box.

The root problem is that society got used to turning it off and on and hoping for the best, instead of going back to the shop and asking for their money back.

Also, every time a bunch of black-hat hackers exposes internal company data, if the security breach can be mapped to a CVE database entry, a good law firm could probably make something out of it.

Not all jurisdictions are alike, but one needs to start somewhere.


It's a huge case of Stockholm syndrome: end users have been conditioned over years to accept these things as normal and have become engaged in an abusive relationship with their captors, who will withhold the little help they are prepared to give if the users dare to complain.

No other industry has ever gotten away with this. But with 'software eating the world' change is just around the corner; the first software bug that kills a few thousand people will be a very rude wake-up call that something needs to be done.

The only industry that really gets it is aviation, medical tries hard but is still a mess, with the exception of devices, in general those are engineered reasonably well.

In a way all these SaaS products are setting the stage for some real liability, after all, if the end user doesn't have even a modicum of control over what happens with their data then the other party should assume liability, even if they try real hard to disclaim that.

Open source might get exempted, if not then I suspect that a lot of open source projects will fold.


I am betting on IoT as the final trigger.


Self driving vehicles, controlled by some griefer would be a pretty harsh demonstration target as well.

There is no way I'll drive an internet connected car, unfortunately I still have to share the road with people that do drive internet connected cars.


> a good law firm could probably make something out of it

But that's what I'm saying. If there is a possibility of legal action there will be enough legal bureaucracy to make sure there is something to show in court and avoid responsibility, but not to actually address the problem.


Probably not. As a company you just disclaim liability in your terms of service.

Jurisdictions that try and override this, simply get excluded from the customer base.

The market is still the ultimate decider for quality; if you build a crappy product, expect to get innovated out.


> As a company you just disclaim liability in your terms of service.

Judges might disagree.

> Jurisdictions that try and override this, simply get excluded from the customer base.

Until the customer base is the EU or the US.

> The market is still the ultimate decider for quality; if you build a crappy product, expect to get innovated out.

The market has utterly failed to decide for quality; it is mostly interested in price and marketing power. Quality has never been a very large factor, though in a mature market it might allow some manufacturers to charge a premium for their products.


>Probably not. As a company you just disclaim liability in your terms of service.

Thankfully EULAs are void in Europe.

It is all a matter of how big the customer base gets; I am hoping we eventually get something like that EU-wide.

> The market is still the ultimate decider for quality; if you build a crappy product, expect to get innovated out.

If that were true, 1 € shops wouldn't exist, yet even those products get more testing than most software out there.


> Thankfully EULAs are void in Europe.

It's not so clear cut that you should be thankful. The ability of companies to dictate the terms under which users can use their software affects their risk calculation for producing the product in the first place. It is very likely that very useful but imperfect software will not be written because the risk/reward balance is tilted.

Remember, you always have the ability to reject an EULA; simply don't use the product.

> If that was true 1 € shops wouldn't exist, but even those products have more testing than most software out there.

Consumers can make value choices on quality vs cost. This is a basic market function.


As a customer, I've found just ignoring all EULAs to be effective on the flip side. They are meaningless in my opinion and I don't give a crap about what they say. I'll use the software as I want.


Then let it be so; software shouldn't be a special snowflake with special rules regarding commercial obligations.


In 1973 I programmed for the oncology department at L.A. County/USC Medical Center. We had a Varian Clinac linear accelerator with computer-readable and -drivable motors. The clinicians would manually position the Clinac (with the patient on it) for the first treatment, and the position would be saved in the patient's computerized file and restored on subsequent treatments.

For some treatments, a metal wedge would be placed within the beam to attenuate it more at the thick end of the wedge. Because of the non-linear attenuation along the length of the physical metal wedge, dosages were difficult to calculate.

Someone got the bright idea of creating a software wedge by slowly moving the treatment couch at the same time as closing the beam aperture, so that there would be 100% exposure at one end of the "wedge" and 0% at the other, with a linearly decreasing distribution across the whole wedge.

I was the programmer for this project, and we had just started testing it with a sheet of X-ray film on the couch when I received an offer I couldn't refuse to go work elsewhere.

I'm glad that I departed before they started using this on live patients.


> users will ignore cryptic error messages, particularly if they occur often

It's not just cryptic error messages. Pretty much anything that requires the attention of people will end up being ignored eventually. For example:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4894506/

I've also read about an anesthesiologist who turned off the alarms because they annoyed him. One day he failed to secure the endotracheal tube during a surgery, it came off and nobody noticed. The result was cardiac arrest, brain damage, multiple organ failure, sepsis and death.

Monitoring hardware is very sensitive so it will fire off alarms if anything changes, no matter how small. The more sensitive a test is, the more false positives you get. This is extremely demanding of a health care professional's attention, which in practice is multiplexed between countless patients.

Up to 99% of these alarms and messages will do nothing but get in the way of people. These represent false positives, disconnected cables, and other minor failures that don't represent a real danger and can be easily fixed. People will get used to the alarms, and will learn to ignore them.
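
The arithmetic behind that is worth spelling out. With made-up but not unrealistic numbers, even a very sensitive alarm ends up being wrong most of the time simply because genuine emergencies are rare:

    #include <stdio.h>

    int main(void) {
        double prevalence  = 0.01;   /* 1% of monitored intervals are true events  */
        double sensitivity = 0.99;   /* alarm fires on 99% of true events          */
        double specificity = 0.95;   /* stays quiet on 95% of uneventful intervals */

        double true_alarms  = prevalence * sensitivity;
        double false_alarms = (1.0 - prevalence) * (1.0 - specificity);

        /* With these numbers only about 17% of alarms are real; make the
         * threshold more sensitive (lower specificity) and it gets worse. */
        printf("fraction of alarms that are real: %.2f\n",
               true_alarms / (true_alarms + false_alarms));
        return 0;
    }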


> Pretty much anything that requires the attention of people will end up being ignored eventually.

Perhaps, but your example is not supporting evidence. The PACU alarms were muted, precisely because they were so hard to ignore.

> Monitoring hardware is very sensitive so it will fire off alarms if anything changes, no matter how small. The more sensitive a test is, the more false positives you get.

This is fixable. The problem is the same as the one in the Therac-25 case ... severe and inconsequential alerts are indistinguishable.

Here's a particularly enjoyable piece of literature crafted around an instance of alarm fatigue: https://gutenberg.ca/ebooks/smithcordwainer-deadladyofclownt...


According to the Wikipedia entry on the Therac-25, it was in response to incidents like these that the IEC 62304 standard was created, "which introduces development life cycle standards for medical device software and specific guidance on using software of unknown pedigree".

For those working in safety and quality control of medical systems, how much does compliance with those standards actually diminish the chances of another Therac-25 incident?

Considering that automation continues to increase, from automatic patient-table positioning up to AI-assisted diagnosis, are there new challenges when it comes to designing medical systems in order to keep them safe and maintainable? How likely is it for the FDA or the equivalent agencies around the globe to authorize the use of open source systems?


> How likely is it for the FDA or the equivalent agencies around the globe to authorize the use of open source systems?

Actually they already authorize stuff like Qt.

Computer systems where human lives are put in risk belong to what is called High Integrity Computing.

There are very strict coding standards, where even C looks more like Ada than proper C.

https://ldra.com/medical/

https://www.qt.io/qt-in-medical/

https://www.vectorcast.com/testing-solutions/software-testin...

Source code availability is not an issue, because it is part of the certification process to provide it.

The problem is having the money to pay for a certification, which becomes invalid the moment anything gets changed, namely compiler being used, source code, or if any of the third party dependencies gets updated.


How horrible it must have been for the operator, to realize they had killed two patients, through no fault of their own.


Honestly, I disagree slightly. Reading the article, as well as the original report years ago, I wasn't left with the feeling that the operator bore "no fault of their own". Are they to blame? No, but the operator certainly made mistakes. For example, assuming an error is innocuous when you are intentionally delivering radiation to a person is careless at best. Again, the machine is solely at fault, but that doesn't mean the operator didn't have a role in the death.


Yeah, I took away the same thing. As an example, in the aviation industry something like this would simply not be tolerated. When you are operating a potentially dangerous device, you have to do so with the utmost care. This isn't to say the technician should be punished, but one of the results of this investigation should have been a focus on making technicians aware of how disastrous the consequences could be if they don't respond appropriately to an error.


Really?

I don't think it's reasonable to expect nurses to wait an undocumented 8 seconds after changing modes to avoid a race condition. That goes far past "utmost care". Are pilots expected to never overlap command inputs? Are they allowed to engage the flaps and then activate the spoilers before the flaps are fully deployed?

I'm basing my account on this report as well as the OP: https://hackaday.com/2015/10/26/killed-by-a-machine-the-ther...


I'm pretty sure the parent only meant that the "Malfunction 54" error should not have been ignored, not that the operator should have somehow avoided the race condition in the first place.


The operators had become conditioned to ignore those error/warning statements due to their pervasiveness and apparent lack of consequence. This is why, as a designer, you should use such warnings sparingly so that the operator/user doesn’t become “blind” to them.


To add to this, the more specific and informative an error message is, the more authoritative it appears. "Malfunction 54" is nowhere near as good as "Unable to set therapy mode".


Yep. A good book on this sort of thing is Tragic Design [0]. I was in a pilot "Software Safety" course last year and this book was published just after, but was an excellent companion text for the course. I've been meaning to follow up with them to see what became of that work (the pilot course, my employers at the time were considering making it a mandatory/highly-recommended course for most of their software engineers and designers).

[0] https://www.tragicdesign.com


Still doesn’t negate the fact the technician should’ve still checked. Again, not saying the technician is at fault at all, but they aren’t free of being involved in the death either.

I’ll use a terrible analogy to make the point: if someone tells you “that’s bad” if you pull the trigger on a revolver playing Russian roulette and you pull the trigger five times without any apparent consequence despite being informed “don’t do that” each time, are you completely not responsible if you pull the trigger the sixth time and die?


"if someone tells you “that’s bad” if you pull the trigger on a revolver playing Russian roulette and you pull the trigger five times without any apparent consequence despite being informed “don’t do that” each time"

I'm not sure that's a valid analogy. Isn't it the case rather that the revolver said "that's bad" but a person said "it's fine, ignore that, it always does that"? Which, in my experience, happens all the time in the workplace. If you didn't trust other people you work with, it would defeat the purpose of being in an organization at all.

Not to mention, if pulling the trigger repeatedly was required for normal operations, you really can't blame the operator at all.


> Which, in my experience, happens all the time in the workplace.

If you are doing work that could immediately cause someone’s death, I feel you have a duty to double check and not just go with the flow. Ymmv and I’m not claiming my POV is more right, just that it’s my POV.


Exactly. I don’t blame the technician in the slightest for the race condition, I just called out that ignoring the error state notice was at best careless.


The pilot example probably isn't the best one. If a plane with that design was nonetheless deemed airworthy by the FAA, pilots would absolutely be expected to know the correct operating procedures for that plane.

Look at the "unusual" choice of averaging inputs from the pilot and co-pilot that helped lead to AF 447. It's very reasonable to argue it's bad design, but it was the responsibility of the pilots to know how it worked.


Again, this was a completely undocumented (and unintentional) race condition. There was nothing in the manual stating that "the machine will do bad things if you try to change the mode twice in less than 8 seconds". Averaging of control inputs was a documented, intentional design choice. Being "expected to know the correct operating procedures" doesn't extend to "know how the machine works better than its designers" or "be omniscient".


Right, I meant that the issue was not responding safely to a malfunction. If it was undocumented, the operator should have paused the operation and attempted to contact the manufacturer. I don't blame them for their response necessarily (in the same way I don't think the programmers were actually guilty of a crime), but I think that should have been the main takeaway from this.


Yes. There is a huge difference between undocumented and documented.

And also between pilots flying a plane they are certified on and nurses who are probably less specialized.


> Due to the frequency at which other malfunctions occurred, and that “treatment pause” typically indicated a low-priority issue, the technician resumed treatment.

I found this the most telling sentence...here the UX was so terrible the operator routinely chose to break the rules... but on some occasions it killed people


A lot of comments here seem to think having a hardware failsafe to back up the software failsafes is the key thing. In fact, it doesn't matter that the backup is hardware; hardware can fail too. The keys are 1) redundancy and 2) having multiple different failure modes. Adding a hardware failsafe gives you both of these, but you get the same level of safety by introducing any second failsafe that uses a different method, including a separate software-based technique, as long as its failure mode is dissimilar to and uncorrelated with the first. The best approach is to add multiple different methods based on completely different technologies.


> The software consisted of several routines running concurrently. Both the Data Entry and Keyboard Handler routines shared a single variable, which recorded whether the technician had completed entering commands.

Raise your hand if you've made this type of mistake many times in the past. Most of us have the luxury of not having our software bugs affect human lives.

It's a shame they've removed the hardware safety controls. I don't think I'd even feel comfortable programming such a powerful tool without such circuit breakers.
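
For anyone who hasn't been bitten by it yet, here is a stripped-down analogue of that pattern in C (this is not the actual Therac-25 code, just the same shape of bug): two concurrent routines share one unprotected flag, so an edit that lands after the flag has been set is silently ignored, and what the second routine sees depends entirely on timing.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile int  entry_complete = 0;  /* shared flag: no lock, no versioning */
    static volatile char mode = '?';          /* 'x' = X-ray, 'e' = electron */

    static void *keyboard_handler(void *arg) {
        (void)arg;
        mode = 'x';
        entry_complete = 1;   /* marks data entry as done ...                  */
        usleep(1000);         /* ... then the operator quickly edits the field */
        mode = 'e';           /* the edit never re-clears entry_complete       */
        return NULL;
    }

    static void *treatment_task(void *arg) {
        (void)arg;
        while (!entry_complete)                   /* waits only on the flag ... */
            usleep(100);
        printf("treating with mode %c\n", mode);  /* ... so it may act on a value
                                                     the operator has already
                                                     changed, or is about to */
        return NULL;
    }

    int main(void) {
        pthread_t kb, tx;
        pthread_create(&kb, NULL, keyboard_handler, NULL);
        pthread_create(&tx, NULL, treatment_task, NULL);
        pthread_join(kb, NULL);
        pthread_join(tx, NULL);
        return 0;    /* build with: cc race.c -pthread */
    }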


I would guess the other infamous one is the Patriot missile timing bug.


A floating point bug [0] in the Patriot Missile Defense System killed 28 Americans and injured nearly 100 more when an Iraqi SCUD wasn't successfully countered during the Gulf War.

[0] https://www.cs.drexel.edu/~introcs/Fa10/notes/07.1_FloatingP...
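
The mechanism is easy to reproduce numerically. A minimal sketch (constants taken from the standard accounts of the incident, e.g. the GAO report; this is obviously not the Patriot code): the system counted time in 0.1 s ticks, but 0.1 has no exact binary representation, and the truncated value held in the 24-bit register made every tick about 0.000000095 s short.

    #include <stdio.h>

    int main(void) {
        double tick  = 209715.0 / 2097152.0;  /* 0.1 as stored in the 24-bit
                                                 register: 0.0999999046... */
        long   ticks = 100L * 3600 * 10;      /* 100 hours of 0.1 s ticks */
        double clock = 0.0;

        for (long i = 0; i < ticks; i++)
            clock += tick;

        /* ~0.34 s of drift; at Scud speeds that is several hundred metres,
         * enough for the tracking gate to miss the incoming missile. */
        printf("drift after 100 h: %.4f s\n", 100.0 * 3600.0 - clock);
        return 0;
    }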


The Therac-25 is part of the core curriculum in computer engineering, but I wonder if it's actually (in the grand scheme of things) that bad of an incident. Compared with Facebook fomenting ethnic cleansing in Asia, the number of people who were hurt or died was very limited. Are there any newer examples which can show the dangers of a widely distributed, connected horror?


I'm not familiar with the Facebook incident you are referring to, but it doesn't sound like something caused by a bug? The article is about bugs; i.e. unintentional tragedies. If what you're referring to is not the result of a bug then it's off-topic (though perhaps not unimportant).


A descriptive video can be found here: https://www.youtube.com/watch?v=uEvu2PlDhO0


That video is terrible, read the article instead if possible.


Here's the link of (an updated version of) the original accident report: http://sunnyday.mit.edu/papers/therac.pdf



