Some disasters caused by numerical errors (tudelft.nl)
102 points by davidbarker on Feb 1, 2015 | 22 comments



The Doolittle Raid is probably the biggest disaster caused by a wrong date calculation, but it is practically unknown in the hacker community.

It was the first bombing of Japan, in 1942. The bombers were supposed to be refueled in China, but they crossed the International Date Line and arrived a day earlier than expected. As a result, the airfields were not ready and the bombers crashed.

15 bombers crashed; 3 crewmen died and 8 were captured. The Soviets got American airplane technology.

> Planners in Washington, DC, also had made a ridiculous blunder, forgetting that the ships would lose a day (April 14) crossing the International Date Line, putting the planes in China a day earlier than anyone expected. Because of this mix-up, when some of the bombers flew over Chuchow Field, which was supposed to have been their main refueling base, an air raid alarm sounded and the lights turned out.

https://books.google.com/books?id=FkEeVAf-U7gC&lpg=PA43&ots=...

http://www.americainwwii.com/articles/the-impossible-raid/

http://en.wikipedia.org/wiki/Doolittle_Raid


The Wikipedia article doesn't mention a date miscalculation. Probably the main cause of their fuel difficulties was having to launch 10 hours earlier and 170 nautical miles farther out than originally planned, after being spotted by a Japanese patrol boat. All but one of the planes crashed, but only three men were killed in action, eight were captured, and one crew was interned in Russia.


Surely the date error would go the other way? If the Americans are in Hawaii and it's (say) 1 February, in the landing zone in China it would be 2 February, not 31 January. So if a mistake had been made and they had said "the attack is on 1 Feb, be prepared", the crews at the Chinese strips would have been ready too early, wondered where the Americans were, and then done a little math to figure out they'd be needed the next day instead.

Right?


The raiders ran out of fuel because they had to launch early after the task force was spotted by a Japanese patrol boat.

The Soviets didn't get US bomber technology from the Doolittle raid; they got it from emergency landings of B-29s in Soviet territory during the strategic bombing of Japan in 1944:

https://en.wikipedia.org/wiki/Tupolev_Tu-4


During the first Iraq war, the Patriot missiles were presented as a great success that prevented Israel and Saudi Arabia from being hit by Scud missiles. The facts were very different.

Not a single Patriot missile managed to hit a Scud during that war, partly because of the bug mentioned in the article and partly because anti-missile defence is really hard.

The reason the Scud missiles did not do more damage was that Iraq had modified them to extend their range; the modification made them unstable and their accuracy so poor that they would almost certainly miss the target. On the other hand, as mentioned on the web page, one of the Patriot missiles that missed a Scud did hit something else, while the Scud itself missed.

All in all, the missile defence caused more damage than if no defence had been installed at all. This fact was only revealed several years later, and did not reach the headlines.


For anyone interested in such matters (actually, every engineer should be to some degree), a seemingly never-ending (yet tenderly curated) stream of similar interesting observations and stories is posted to:

"The RISKS Digest, [or] Forum On Risks To The Public In Computers And Related Systems":

http://catless.ncl.ac.uk/Risks/

A summary from Wikipedia https://en.wikipedia.org/wiki/RISKS_Digest:

"It is a moderated forum concerned with the security and safety of computers, software, and technological systems. Security, and risk, here are taken broadly; RISKS is concerned not merely with so-called security holes in software, but with unintended consequences and hazards stemming from the design (or lack thereof) of automated systems. Other recurring subjects include cryptography and the effects of technically ill-considered public policies. RISKS also publishes announcements and Calls for Papers from various technical conferences, and technical book reviews"


To pre-empt someone bringing up THERAC-25: although it is a very famous software disaster, none of the root causes were numerical errors.


I can sympathise with the Patriot Missile floating point issue. A bug like that would be next to impossible to track down and address, especially when dealing with specialised embedded hardware.
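
For reference, the arithmetic behind that bug is small enough to reproduce. The system clock counted tenths of a second as an integer and converted to seconds by multiplying by 1/10 held in a 24-bit fixed-point register, and the chopping error grew with uptime. A rough sketch in C, using the commonly quoted figures for this bug (the chopped value of 1/10 and the ~1676 m/s Scud speed; nothing here is from the actual Patriot code):

    #include <stdio.h>

    int main(void)
    {
        /* 1/10 has no finite binary expansion.  Chopped to the system's
         * 24-bit fixed-point register it was stored as (binary)
         * 0.00011001100110011001100, i.e. about 9.5e-8 too small. */
        const double stored_tenth = 838860.0 / 8388608.0;
        const double chop_error   = 0.1 - stored_tenth;

        const double ticks = 100.0 * 60.0 * 60.0 * 10.0; /* tenths of a second in 100 hours */
        const double drift = chop_error * ticks;         /* accumulated clock error, ~0.34 s */

        printf("clock drift after 100 hours of uptime: %.2f s\n", drift);
        printf("tracking error for a ~1676 m/s Scud:   %.0f m\n", drift * 1676.0);
        return 0;
    }

That is enough to shift the predicted position of an incoming Scud by more than half a kilometre, so it fell outside the range gate and was never intercepted.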

The Ariane 5 issue, however, is pants-on-head stupid. Converting a 64-bit floating point number to a 16-bit int is something a first-year computer science student would be embarrassed about.


It is more subtle than that. The original code ran on the Ariane 4 rocket, and there the engineers had proved that the error could never happen within the first 80 seconds of that rocket's trajectory, which was the period this code ran. The management of the Ariane program decided that the unit would be used on Ariane 5 as well, without certifying it for the new rocket. The Ariane 5 rocket is much more powerful and gets much further along its trajectory than Ariane 4 in that time, so a horizontal velocity big enough to overflow the conversion does occur. This was never discovered, because the trajectory of the Ariane 5 was never tested against the code. Thus it can also be considered a management failure.

The code was also only for stabilizing the rocket on the launchpad, and made no sense after that, but it was not shut down until 80 seconds in.


Yes. Since it was an overflow check that caused the problem, it's a philosophical issue too.

Let's take a hypothetical: You're flying a rocket on a one-time mission. The rocket is not reusable and there are no redundant engines or any way to abort the mission in an intact way. You then detect an overflow in your control algorithm.

In practice, it almost never makes sense to do anything about these errors. If the error is spurious, the best course is to do nothing. If it is real, the mission will be lost anyway, so it doesn't make sense to spend effort reacting to it.

Your only abort criterion might be the rocket veering onto a path that would take it out of its designated safety zones.

However, if you have redundancy, then doing things like shutting down engines starts making sense (like on the Saturn V or the Space Shuttle).


The Ariane 5 failure is a little more nuanced than that, though. The conversion bug was in the Ariane 4 too; they reused the IRS (inertial reference system), so it was almost "proven in use", but Ariane 5's trajectory put it under much higher acceleration (enough to trigger the overflow).

It's an embarrassing failure for sure, but it's more nuanced than a "dumb mistake"; the management, testing methodology, and code all failed when it exploded.


A first-year CS student would have caught it. However, when you're dealing with the embedded world, sometimes you have to do things like convert 64-bit floats to 16-bit ints to get things to fit in the 2K or so of ROM you have to work with, or to run in a reasonable amount of time. The problem lies more in the blind reuse of code and the lack of documentation of the code's constraints. For something like that code, I'd have added a comment to the effect that it only works up to x meters per second, and/or put something in my own code so that if the value exceeds a certain amount, it returns MAXINT or a value like that. Again, yes, it's ugly, but given the environment, sometimes you have to hold your nose while writing the code.
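
To make that concrete, here is a minimal sketch of the "clip to MAXINT" idea in C (the function name and the NaN policy are invented for illustration; whether saturating is actually the right policy is exactly what gets debated below):

    #include <stdint.h>

    /* Saturating conversion: clamp out-of-range values to the 16-bit
     * limits instead of letting the cast overflow.  The range check has
     * to happen before the cast, because converting an out-of-range
     * double to an integer type is undefined behaviour in C. */
    static int16_t double_to_int16_saturating(double v)
    {
        if (v != v)                  return 0;          /* NaN: pick some policy */
        if (v >= (double)INT16_MAX)  return INT16_MAX;  /* clip high */
        if (v <= (double)INT16_MIN)  return INT16_MIN;  /* clip low  */
        return (int16_t)v;                              /* in range: truncates toward zero */
    }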


I think it is way more likely that they never read the documentation (or at least not all of it) than that there was no documentation.

Also, that 'clip to MAXINT' choice can be a very bad choice, too, so it would have to be documented and that documentation would have to be checked before any reuse of the code in environments with the constraint that the code cannot fail. Because of that, I cannot see how that choice would help to prevent such accidents.


> Converting a 64-bit floating point number to a 16-bit int is something a first year computer science student would be embarrassed about.

Try enabling compiler warnings on a legacy system sometime and let me know how many of those embarrassing errors you find :)


Probably something along the lines of...

    (...)
    writing legacysystem_firmware.hex (387123 bytes)

    Build finished successfully, with 71245 warnings.
Good luck finding the "unsigned short x = floatval;" line, which doesn't even trigger a warning in the first place ;-).
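
In fairness, that particular narrowing can be made to warn; it just isn't in the default warning sets. With GCC or Clang you have to opt into -Wconversion, which -Wall/-Wextra don't enable, as far as I know:

    /* narrow.c -- "cc -Wall -Wextra -c narrow.c" stays silent, while
     * "cc -Wconversion -c narrow.c" reports that the conversion from
     * 'double' to 'unsigned short' may change the value. */
    unsigned short narrow(double floatval)
    {
        unsigned short x = floatval;   /* implicit double -> unsigned short narrowing */
        return x;
    }

Of course, on a legacy codebase that flag tends to add another few thousand warnings to the 71245 already scrolling past.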


There are other examples provided by the delightfully named European Spreadsheet Risks Interest Group on their "Spreadsheet Horror Stories" page: http://www.eusprig.org/horror-stories.htm


> The number was larger than 32,768, the largest integer storeable in a 16 bit signed integer, and thus the conversion failed.

The largest 16-bit signed integer is actually 32767... although it probably would not have mattered in that case, it's a little ironic to find an off-by-one error on a page about numerical errors.
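
(Assuming the usual two's-complement representation, which the reply below questions, 16 bits give 2^16 = 65536 values, split as [-32768, 32767]; <stdint.h> spells this out.)

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Two's-complement 16-bit range: -2^15 .. 2^15 - 1. */
        printf("INT16_MIN = %d, INT16_MAX = %d\n", INT16_MIN, INT16_MAX);
        return 0;
    }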

One event that comes to mind, not really a disaster, but a rather costly mistake caused by units confusion, is this: http://en.wikipedia.org/wiki/Gimli_Glider


They may have used Excess-K representation... https://en.wikipedia.org/wiki/Signed_number_representations#...


Nuclear weapons and forgetting parens (in Perl)

http://www.foo.be/docs/tpj/issues/vol2_1/tpj0201-0004.html

Also, Castle Bravo was ~3x bigger than predicted because of a failure to correctly model the tritium production from lithium-7.

https://en.wikipedia.org/wiki/Castle_Bravo#Cause_of_high_yie...


This can't be real... 'PERL itself stood for "Precision Entry and Reentry Launchings"'



The system design failure in the A5 is interesting to read; the full report is at

http://www.ima.umn.edu/~arnold/disasters/ariane5rep.html

Basically they tried to re-use software from the A4 in the A5. The laser gyros have firmware that can get funky, and on the A4 the onboard computer would kind of "trust but verify" the laser gyros. Obviously the programmers of the laser gyros and the programmers of the nav system are not the same people, and they had conflicting views on what makes a decent variable type for horizontal velocity. The A5 doesn't technically need to verify the lasers' alignment, and it's of no use after liftoff anyway (what are you going to do if it gets misaligned, land, fix it, and take off again?). The A4 guys saw the pointlessness of it and stopped monitoring after 40 or so seconds (by then either it's working great or you've already crashed...).

So the A4, being somewhat anemic, never hit the limits of that size of int for horizontal velocity. But the A5, which was kind of like the A4 after extensive steroid use, was able to max out or roll over the conversion routine.

The nav computer wasn't very well designed at yet another level: instead of going all RTOS and recovering in a few milliseconds, it helpfully kernel panicked. And the kernel panic error message (probably something like "Oh oh" in French) was helpfully interpreted by the engine computer as if it were commands, slamming the engine nozzles hard over in one direction, which the airframe couldn't survive (really big rockets can't pinwheel in flight, at least not in one piece).

So the list of software design failures is epic and huge:

1) Not every minor math exception or NaN should kernel panic and fail the whole system (you shouldn't have to crash because one temp sensor got unplugged, etc.).

2) Never do stuff you don't have to. If you don't need to align your lasers thanks to technological advances, then stop doing it, because you can't crash the whole system trying to do something you never attempt. Or, if, 50 seconds into flight, you're commanded to slam your engine nozzles to 90 degrees perpendicular to the flight path and you know the hardware can't bend that far anyway, I think you can safely assume the nav computer has crashed and ignore it for a while (maybe the watchdog timer would eventually have saved them?). Rephrased: if you're commanded to do something that will certainly blow the ship up, then unless you're a destruct charge, maybe you should ignore it and just keep chugging along until you hear something saner (see the sketch after this list).

3) Most epic system failures are at the borders of subsystems, like where one machine talks 16-bit ints and the other 64-bit floats. So minimize your borders.

4) Simulation could have saved half a billion bucks. I'm actually kinda surprised nobody ever tried this. I guess test driven development wasn't so cool back then.
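
A toy illustration of point 2, as referenced above (nothing here is from the actual Ariane software; the names and the deflection limit are invented):

    #include <math.h>

    /* Hypothetical guard on the engine-controller side: sanity-check a
     * commanded nozzle deflection against what the hardware can
     * physically do before acting on it, instead of blindly executing
     * whatever arrives on the bus (including diagnostic garbage). */
    #define MAX_NOZZLE_DEFLECTION_DEG 6.0

    static int apply_nozzle_command(double commanded_deg, double *nozzle_deg)
    {
        if (isnan(commanded_deg) ||
            fabs(commanded_deg) > MAX_NOZZLE_DEFLECTION_DEG) {
            /* Implausible command: hold the last sane setting and report
             * the fault upstream rather than slamming the nozzle over. */
            return -1;
        }
        *nozzle_deg = commanded_deg;   /* plausible: apply it */
        return 0;
    }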

One way to look at the design failure is that, numerous times, they failed when handling protocol failures. I've seen this in big bureaucracies time and time again: who cares if the overall mission fails, as long as our little group did right by our own code and someone else can be blamed. The nav guys never should have asked us laser gyro guys for our horizontal speed if it was over 15 bits signed, so it's all their fault for asking. The nav guys never should have asked us engine computer guys to point the engine nozzle sideways, snapping off the back of the rocket, so it's all their fault. The OS guys never should have asked us math coprocessor guys to convert -40000.123 m/s into a 16-bit int, so it's all their fault. The math guys should never have thrown a kernel panic on a mere numerical conversion, crashing the OS, so it's all their fault, not ours in OS land. The OS guys should never have run our nav routines on a non-recoverable, non-RTOS system, so it's all their fault. Management never let us software devs run a ground test, to save money, so it's all their fault. The blamestorming could go on for pages.



