Infamous Software Bugs (bbvaopenmind.com)
129 points by doppp on Nov 18, 2015 | 57 comments



Every software engineer should be aware of the RISKS digest (aka USENET's comp.risks).

The mailing list and digest just celebrated its 30th birthday in August. There's an incredible wealth of history here regarding bugs, related issues, and the overall risks of our dependency on computers in society.

RISKS broke most of the major stories of the day including the Morris Worm, the AT&T Long Distance collapse, the Mars Climate Orbiter unit error, the Mars Pathfinder priority lockup, the Therac-25, the first email spam, the Pentium FDIV bug, and thousands of other interesting, amusing, and/or scary bugs.

http://catless.ncl.ac.uk/Risks/


I had never heard of that; thank you so much for many cozy winter reading sessions.


This is one of my favorite nuggets in the digest; it's from November 1988. It's Clifford Stoll writing at 3:45 am about his discovery of the Morris Worm, the first internet worm captured in the wild.

http://catless.ncl.ac.uk/Risks/7.69.html#subj1

Warning - this one email will lead you to multiple rabbit holes (Stoll, The Cuckoo's Egg, Robert Morris, etc).

https://en.wikipedia.org/wiki/Clifford_Stoll

https://en.wikipedia.org/wiki/Morris_worm


The Cuckoo's Egg is a really fun read!


Well... I'd say that https://en.wikipedia.org/wiki/Therac-25 has become much more notorious than the Ariane launch; but it's also a UX fault, not a straight-out software error.


An unhandled race condition disfigured and killed human beings. That earns it a spot at the top of the list IMO.


Agreed, and I also would have thought they would have a sensor on the patient that measures dosage independently of the system giving the dosage. But from what I can tell, they do not.


How would you sense that without risking either deflecting the beam (which could cause harm/cancers in other body parts) or reducing the beam's intensity (which could cause treatment to be less effective)?

This is one of those ideas that sounds a lot easier than it likely is.


Putting things in the way of the beam to measure dose is actually quite common.

You can do real time dose monitoring using diodes (see [1]). The diodes are quite small and don't generate huge amounts of scatter.

It is quite common to use TLDs[2] to monitor patient dose, which is then analyzed offline to give dose estimates after the fact.

edit: that said, I don't think anybody was doing real-time in-vivo measurements during the era of the Therac incident.

[1] https://www.aapm.org/pubs/reports/RPT_87.pdf

[2] TLD = thermoluminescent dosimeter: little hunks of material whose electrons get stuck in an excited state when exposed to radiation; you then heat them up and measure the light they give off as they relax back to the ground state. The amount of light they give off can then be correlated with the dose the patient received.


My assumption is that you'd put the sensor behind the patient. After the beam has been through the body, you don't care if it's deflected or reduced.

Obviously, it's too late at that point to prevent the beam from hitting the patient, but you'll know that something went wrong and can lock the system until the problem is found.


Many (most?) modern linear accelerators have imaging panels behind the beam path (see image [1]) which could be used for dose monitoring although they are primarily used for imaging purposes. I'm not sure whether any of these are currently used for dose monitoring.

[1] http://tommytoy.typepad.com/.a/6a0133f3a4072c970b0147e2ed7f8...


Yes, EPID panels are indeed used for dose monitoring and quality assurance of the dose delivery to the patient [1][2]. However, this is relatively new in the industry and is not universal. The fact that EPIDs have become so common (since Varian and Elekta like to offer them as standard) means that this will be used more as time goes on.

As an aside, TLDs and IVD systems are falling out of favor for patient dose monitoring because AAPM TG 62 (referenced in the link in your other post) is not directly applicable to IMRT and VMAT modalities, which are pretty common these days.

[1]http://www.sunnuclear.com/medphys/patientqa/epidose/epidose....

[2]http://www.sunnuclear.com/snc_site/solutions/patientqa/perfr...


Thanks for clarifying :)

p.s. If you are employed in Med Phys allow me to make a quick plug for my open source Med Phys QA database project: http://qatrackplus.com/


Well, as most dosage is focused, you just increase that area to include the sensor. With such focused exposures you have shielding for the patient over the parts you do not wish to irradiate, so you could just place the sensor on that shielding without impacting the dose to the patient at all.

That's how I'd do it from my understanding of the usage practices today.


Shielding a patient like this isn't really feasible. The half-value layer of lead in a 6 MV clinical beam is 17 mm [1]. That is, even with nearly an inch of lead shielding in place you'd still be delivering a large dose of radiation to healthy tissue.

[1] https://quizlet.com/4622164/hvl-layers-flash-cards/
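
To make that concrete, here's a quick back-of-the-envelope estimate using the simple half-value-layer model (plain exponential attenuation; real clinical beam physics is messier, so treat this as illustrative only):

    # Transmitted fraction through a shield, simple HVL model: 0.5 ** (thickness / HVL)
    hvl_mm = 17.0        # half-value layer of lead for a 6 MV beam, figure quoted above
    thickness_mm = 25.4  # "nearly an inch" of lead
    transmitted = 0.5 ** (thickness_mm / hvl_mm)
    print(f"Fraction of the beam getting through: {transmitted:.1%}")  # ~35%

So roughly a third of the beam still makes it through an inch of lead.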


Yep, also I would say that Heartbleed is definitely deserving of a place.


Many more-interesting bugs than Heartbleed have existed. If Heartbleed is notable, it is largely as a study in the success of branding and marketing campaigns for security bugs. Or being the straw to break the camel's back on openssl cleanup efforts.


I agree, Heartbleed is a big miss. Before reading the article, the two that came to mind were Y2K and Heartbleed, notwithstanding all the recent security breaches (e.g., Ashley Madison, Sony) that could be attributed to "bugs".


If you're limiting it to the "The 5 Most Infamous Software Bugs in History," Heartbleed is definitely out. It just happened to be recent, so it has an outsized place in people's minds. Including it would sort of be like those foolish "Top 100 bands of all time" lists that have the Beatles at #1, followed by 85 bands from the last 15 years.

As others have noted, the big miss is the Therac-25 bug, which is pretty commonly taught in Software Engineering/Computer Science classes as the example of how intangible software can kill people.


Knight Trading's 10 million dollars per minute bug is a good one too:

http://dealbook.nytimes.com/2012/08/02/knight-capital-says-t...


I'd add the Morris Worm to the list, probably one of the most influential ones: https://en.wikipedia.org/wiki/Morris_worm


Which was written by one of the Y Combinator founders (Robert Tappan Morris)


Which happened to be targeted to UNIX systems.


> a 64 bits variable can have a value of −9.223.372.036.854.775.808 to 9.223.372.036.854.775.807 (that’s almost an infinity of options)

almost ;-)


9 quintillion rounds to infinity.


Just shy :)


I would add Therac-25[1] and the 500 mile email[2] to that list.

[1]: http://www.ccnr.org/fatal_dose.html

[2]: http://www.ibiblio.org/harris/500milemail.html


The 500-mile email was a configuration error. The software was working exactly as intended.


I would say that setting any missing configuration value to zero is a bug.
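
For what it's worth, the famous number in that write-up checks out as a back-of-the-envelope calculation: with the timeout effectively zeroed out, the connect window collapsed to roughly the few milliseconds of call overhead, and light only travels so far in that time (the ~3 ms figure is taken from the story itself, not something I measured):

    # Distance light covers in the ~3 ms effective connect window from the write-up
    c = 299_792_458          # speed of light, m/s
    window_s = 0.003         # ~3 ms of overhead once the timeout was zeroed (per the story)
    miles = c * window_s / 1609.344
    print(f"{miles:.0f} miles")  # ~559 miles, hence "a little over 500 miles"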


See also COMPUTER-RELATED HORROR STORIES, FOLKLORE, AND ANECDOTES:

http://wiretap.area.com/Gopher/Library/Techdoc/Lore/rumor.ne...


I came across this recently; it's a really interesting list with further links.

Software Horror Stories : http://www.cs.tau.ac.il/~nachumd/verify/horror.html


It's a shame the list ends in 2004. I imagine the last decade's worth of software-related horror would be quite interesting.


I would add the Knight Capital thingy on top: http://money.cnn.com/2012/08/09/technology/knight-expensive-...


Wait a second, it says they had to destroy the Mars Climate Orbiter because the development and underlying software used different unit systems?

It's a bit hard to digest. (Although I just checked Wikipedia, and it says so too.) How could a high-performance organisation like NASA make such a simple yet fatal mistake?

The Wikipedia page for the Mars Climate Orbiter says that NASA was informed about this discrepancy by two people, but the "concerns" were dismissed.

What am I getting wrong here? These are not the "concerns" you simply dismiss in a space mission. Could there be another story to this?


So I went ahead and read the MCO Mishap Phase 1 report (linked here: http://www.icics.ubc.ca/~cics525/handouts/handout_MCO_report...) and I'm having a hard time finding something that backs up the wiki summary of two navigators raising concerns and having them dismissed.

The report does go ahead and state all sorts of organizational (and otherwise 'soft') issues that contributed to the end failure.

The report notes that earlier deviations between measured and modeled results were noticed; however, the team was hampered by limited data (in the sense that they couldn't measure what they wanted). It is implied (though not stated) in the report that in the absence of appropriate data, the operations navigation team attempted to contain/mitigate the deviations instead of 'solving' them.

The report also notes substantial organizational issues. Different navigation teams were used in development and operations, and there was insufficient knowledge transfer during hand-off, which hampered the operations navigation team's ability to notice these issues. Communication between the main operations team and the ops nav team was not effective; they were apparently spatially separate teams. In addition, model-measurement conflicts that were brought up were resolved via e-mail instead of through formal processes. The report suggests that systematic use of formal processes might have allowed the teams to uncover the problem earlier.

And of course, the report also states that verification/validation of the supplied software was insufficient. The entire section on verification/validation (MCO Contributing Cause No. 8) is just a giant cringe fest.

The implication is that the MCO project was just... not run well.


First - Great job finding this report. Thank you for that.

So had a look at the report.

There was actually one more problem. This machine, the MCO, had asymmetrical solar panels, which caused solar pressure (force from sunlight) to create a very mild spin (angular momentum). This angular momentum had to be desaturated from time to time in order to keep the machine stable. One module, called SM_FORCES, calculated this adjustment and fed it to AMD (Angular Momentum Desaturation). SM_FORCES and AMD used different unit systems, which was ignored by whoever wrote the connecting piece of the program. Due to this error the desaturation was too little (or too much), and the error kept building over the period of 9 months.

Now, I notice that NASA had a separate team to navigate this machine to Mars. Their data showed this angular momentum adjustment event occurred 10-15 times more often than expected. It was like a man walking with one leg shorter than the other. It's a 9-month journey from Earth to Mars. They must have seen the first signs of the inconsistency within the first few weeks; just guessing, though.

In this report, out of 8 possible contributing causes, at least 3 are attributed to the navigation team. I think the success of such a mission depends not only on meticulous planning but also on the team's ability to think on its feet. (Any Apollo 13 fans? :) )
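
For reference, the mishap report's finding was that the ground software reported thruster impulse in pound-force seconds while the trajectory software expected newton-seconds, a factor of about 4.45. Here's a toy sketch of how that kind of silent mismatch compounds over a cruise; the values below are invented for illustration, not taken from the actual flight software:

    # Toy illustration of a pound-force-seconds vs newton-seconds mismatch;
    # all values here are hypothetical.
    LBF_S_TO_N_S = 4.44822    # 1 pound-force second = 4.44822 newton-seconds
    burns = 100               # pretend ~100 desaturation events during the cruise
    reported_lbf_s = 0.5      # made-up impulse reported per burn, in lbf*s

    unaccounted_n_s = 0.0
    for _ in range(burns):
        modeled = reported_lbf_s                # bug: value used as if it were N*s
        actual = reported_lbf_s * LBF_S_TO_N_S  # what the thrusters really imparted
        unaccounted_n_s += actual - modeled

    print(f"Impulse missing from the trajectory model: {unaccounted_n_s:.1f} N*s")

Small per-burn errors, but they all point the same way, so the modeled trajectory drifts further from reality with every desaturation event.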


"They" didn't destroy it; it was destroyed by the Martian atmosphere.

…At least, that's the official story. I recall there being a lot of conspiracy theory-like buzz at the time from people who also couldn't believe NASA could make such a stupid error like that. It does make you wonder.


Given that the performance of the Patriot missile was much over-hyped in the first Gulf war, should #3 be on there? Do we know that the missile would have intercepted the Scud if launched?


The Israeli army wrote a report about the error, and a patch was uploaded to the US systems the day after the attack. Bad timing, I guess. After that day, the Patriots never missed their target and didn't need to be rebooted.

"e. Two weeks before the incident, Army officials received Israeli data indicating some loss in accuracy after the system had been running for 8 consecutive hours. Consequently, Army officials modified the software to improve the system's accuracy. However, the modified software did not reach Dhahran until February 26, 1991--the day after the Scud incident."


The article does offer a suggestion as to why the Patriot was so inaccurate. Hitting a missile means you need to get within a few meters of the target, and the conversion error would exceed that after just a few hours of operation. It also gives a plausible explanation of how it escaped testing: the tests would likely have been run on freshly powered-up systems.
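
The commonly cited numbers (from the GAO report on the incident) give a feel for the scale; treat the figures below as approximate:

    # Patriot clock-drift arithmetic, using the commonly cited approximate figures.
    error_per_tick_s = 9.5e-8   # truncation error of 0.1 s stored in a 24-bit register
    ticks_per_second = 10       # the system clock counted tenths of a second
    uptime_hours = 100          # the battery had been up roughly 100 hours

    clock_error_s = uptime_hours * 3600 * ticks_per_second * error_per_tick_s
    scud_speed_m_s = 1676       # approximate Scud velocity
    tracking_error_m = clock_error_s * scud_speed_m_s

    print(f"clock error ~{clock_error_s:.2f} s -> tracking error ~{tracking_error_m:.0f} m")
    # ~0.34 s of drift and over half a kilometer of tracking error: far outside the range gate.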


Of the recent ones, the Debian OpenSSL bug with the not-so-"random" number generator.


Small copy/paste error in the PHP section:

There are two ways to run a PHP app on nanobox. You can either configure the generic ruby engine, or use a framework specific engine.


Goddamn Wordpress brought to its knees in a few minutes.


The 6 Most Infamous Software Bugs in History



Great article, but IMO it missed one of the most important bugs of all: the one with the Hubble Telescope and its mirror.


How does that qualify as a software bug? Perkin-Elmer's custom null corrector was misaligned so the mirror was figured into the wrong shape. Edit: if anything, it was an organizational failure—PE chose to ignore other measurements that showed the mirror was the wrong shape.

http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/1991000...


If I wasn't one myself, I would say that the human is the biggest bug of all.


Ha, I've never seen that BSOD clip before. They handled it really well.


Ping of death wasn't big enough?


No ping of death reference?


No mention of the Windows 3.1 calculator's 3.11 - 3.1 bug?
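
For context, the bug was that the calculator displayed 0.00 for 3.11 - 3.1. The usual explanation is plain binary floating point plus unlucky display handling; the floating-point half of it is easy to reproduce (this just shows IEEE-754 behaviour, not the actual calc.exe code):

    # 3.11 and 3.1 aren't exactly representable in binary floating point,
    # so the difference isn't exactly 0.01.
    diff = 3.11 - 3.1
    print(diff)            # 0.009999999999999787 with IEEE-754 doubles
    print(round(diff, 2))  # 0.01 -- but a display that truncates instead of rounding shows 0.00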


You yanks and your aversion to the metric system...


The scapegoat for what was clearly an organizational failure. Why was the system using multiple units of measure? Did it pass navigation tests prior to launch? Was the test flawed?


I think OP was making a subtle joke about titling it "5 bugs" but providing metric 5 (a.k.a. 6) in the article to cause a buffer overflow in the article itself, which would make the buffer overflow in the article the 7th bug. Personally I think a recursion fail would have made a funny additional article bug, but buffer overflows are funny in their own way too. Or a picket fence / off-by-one error would have been funny, like iterating from 1-5 to output the bug list where the bugs are enumerated beginning at zero... so why didn't we see bug 0? And the crash at bug 5 would have been pretty funny.

The story is normie clickbait anyway, and most of the bugs aren't mismatches between the source code and the (possibly non-existent) unit testing infrastructure; they're just cultural examples of blaming the lowest-social-status individual involved, that usually being a programmer. There was a programmer involved, someone in management screwed up and doesn't feel like taking the blame, therefore it's the programmer's fault. In the olden days they'd just have blamed the closest (insert ethnic group here) or (insert religious group here); nothing really to be proud of.


There are a lot of bugs more infamous than this...

this page is bullshit.


It was arguably a weak article for HN, but instead of commenting like this, you'd do much better to tell us about some of the more infamous bugs. Then we'd all learn something—or at least some of us would—which is why people come here.



