There's always someone who brings this up whenever Ada is mentioned, as if it se...

nwallin · on Sept 11, 2019

> The actual 'bug' was caused by an integer overflow occurring due to the velocity variable not being wide enough to handle the higher horizontal velocities of the newer rocket. This caused the rocket to go off course and require termination.

This is not entirely correct. The code which overflowed was unnecessary for flight. People have this idea that the sensor sensed it was going 32745 m/s to the east one moment, and -32723 m/s the next, and then tried to compensate for the sudden gust of wind it must have encountered. This is incorrect. The routine in question was only supposed to run when it was on the ground, and served no purpose in flight. It was scheduled to be disabled 40 seconds after launch. (The flight failed after 37 seconds)

The problem is specifically the fact that the code was written in Ada. Ada does not fail the way other languages fail when you overflow. Instead of giving an incorrect value (which never would have impacted the flight), it throws an exception. This exception was uncaught. This uncaught exception brought down the entire flight control system, despite the fact that the failing code was not controlling flight.

Rust improves upon this situation by panicking in debug mode and wrapping in release mode, which is better than the Ada behavior in every possible way. Normal "unsafe" languages improve this by giving an incorrect value as the result, which in the case of Ariane V would have saved the flight.
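
To make the overflow policies in this comment concrete: in Rust, plain `+` follows the build's overflow-check setting (panic when checks are on, as in a default debug build; two's-complement wrap when they are off, as in a default release build), while the integer types also let you pick a policy explicitly at each call site. A minimal sketch, where `add_policies` is a hypothetical helper name and the 16-bit width is chosen only for illustration:

```rust
/// Adds two 16-bit readings under three explicit overflow policies.
/// Returns (checked, wrapping, saturating). Hypothetical helper for illustration.
fn add_policies(a: i16, b: i16) -> (Option<i16>, i16, i16) {
    (
        a.checked_add(b),    // None on overflow: the error becomes a value
        a.wrapping_add(b),   // two's-complement wraparound, never fails
        a.saturating_add(b), // clamps to i16::MIN / i16::MAX
    )
}
```

For example, `add_policies(i16::MAX, 1)` yields `(None, i16::MIN, i16::MAX)`: the checked form reports the overflow as a value, the wrapping form silently flips sign, and the saturating form clamps.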

ajxs · on Sept 11, 2019

You could have condensed your entire post to just "This exception was uncaught". Your last point is just outright ridiculous:

> "Rust improves upon this situation by panicking in debug mode and wrapping in release mode, which is better than the Ada behavior in every possible way"

So... What you're trying to tell us is that Ada recognising an exceptional situation has occurred is bad, and that Rust just ignoring it is a good thing? Righto... An integer overflow occurring in this scenario is an exceptional condition. In this case it was one that the developers did not anticipate. This issue arose because they mistakenly thought it was not possible for this scenario to occur. The mistake here was not anticipating the exception and handling it in a meaningful way, not the raising of an exception. The fact that not raising an exception would have avoided the disaster in this case is entirely incidental. It was still an exceptional circumstance that the developers, testers, designers did not anticipate. The silent wrapping behaviour you describe would have bitten someone eventually, perhaps here, perhaps somewhere else. Possibly with enormous repercussions.

> "Normal "unsafe" languages improve this by giving an incorrect value as the result, which in the case of Ariane V would have saved the flight."

The normal "unsafe" languages you're referring to (presumably C) have a history of very poor error handling, requiring an extremely high degree of conscientiousness when programming for mission-critical applications. Can you imagine actually programming for what you're describing here? Not only does the computer have to guard against outright failure, but now it has to infer the possibility of an error state from a stream of in-band data in real-time. Righto...

Also, just so you know: Signed integer overflow is UNDEFINED BEHAVIOUR in C.

kragen · on Sept 11, 2019

Tainting the dataflow rather than the control flow, like a quiet NaN, would be a better solution in cases like these. It happens that the solution that was chosen was the worst choice possible, but in other cases it wouldn't have been.

Probably the fact that the engineers got the case analysis wrong suggests that the methods of reasoning they were using weren't very effective. Modern formal methods might or might not have helped; guaranteeing that a piece of code is statically free of exceptions is not a terribly advanced thing to do, but really you'd like to guarantee that the variable was actually big enough. Having less code to reason about certainly would have helped.

C’s choice is clearly unconscionable.

blub · on Sept 11, 2019

In hindsight it would have been perhaps better, but I don't think it can be said that having essentially random values in memory is better than having a controlled error handling action being executed. Is there any reference that such a course of action is preferred or beneficial?

kragen · on Sept 11, 2019

We have a lot of experience with floating-point qNaN values since IEEE 754 introduced them in 1985, and more rigorous languages than Ada, such as OCaml, Haskell, and Rust, use a generalized version of the concept for a lot of error handling.

I don't think there's an open-and-shut case that error values are "preferred or beneficial" in general with respect to exception handling through nonlocal control-flow transfers; there are reasonable arguments on both sides, and it seems likely that both are viable options in most circumstances, and which is better depends on those circumstances. (For example, lazy evaluation or catastrophic consequences attached to software crashes would seem to me to weigh heavily on the side of using error values, while interactive REPL-style usage would seem to me to weigh on the side of nonlocal control flow, perhaps with Common-Lisp-style restarts.)

My point was narrower: in this particular case, using error values analogous to NaNs, rather than shutting down the SRI, would have saved the rocket, without posing the risk of using garbage data that C's semantics pose. In this particular case, they would have been better. It's possible that, in other cases, perhaps even in other places in the SRI firmware, using error values would have been worse than unwinding the stack or shutting down the SRI. But in this case they would have been better.
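
A rough sketch of the dataflow-tainting idea in Rust, with `Option` standing in for a quiet NaN. All names here (`to_i16`, `scaled_bias`, `flight_step`) are hypothetical and have nothing to do with the actual SRI code; the point is only that an unrepresentable value taints what depends on it and leaves everything else running:

```rust
/// Converts to i16, yielding None (the "taint") when the value does not fit.
fn to_i16(x: f64) -> Option<i16> {
    if x.is_finite() && x >= f64::from(i16::MIN) && x <= f64::from(i16::MAX) {
        Some(x as i16) // in range, so the cast only truncates the fraction
    } else {
        None
    }
}

/// A consumer that needs the value: the taint propagates through it.
fn scaled_bias(raw: f64) -> Option<i16> {
    to_i16(raw).and_then(|b| b.checked_mul(2))
}

/// A consumer that does not need the value: a tainted input changes nothing.
fn flight_step(attitude: f64, _bias: Option<i16>) -> f64 {
    attitude + 0.5
}
```

Here `scaled_bias(1.0e9)` is `None`, while `flight_step(1.0, to_i16(1.0e9))` still returns `1.5`, because the tainted value is never consulted.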

(If you aren't convinced of that narrow point, you should probably read the report from the Lions board of inquiry.)

blub · on Sept 14, 2019

I think I see what you mean. I'm personally interested in generally applicable rules, so the fact that it could have worked in this case doesn't help me decide what to do when designing another system.

In the general case, the Ada behavior of detecting and signaling numeric errors seems to be the only robust choice to me. The alternative could only be acceptable if it's proven that the algorithms fail safe in the presence of tainted data.

kragen · on Sept 15, 2019

I'm interested in generally applicable rules, too; in this case, the generally applicable rule is that there is no generally applicable default robust behavior—the behavior of detecting numeric errors and crashing the system was the worst possible behavior in this context, but in other contexts it would be the best possible behavior. As I explained in https://news.ycombinator.com/item?id=20935662, there was no acceptable way to handle such an error if it was detected at runtime. The only acceptable solution is to rigorously verify before runtime that the error is not going to happen, as Rust does with, for example, multithreaded data races. (Well, you can't rule out hardware malfunction, but if you can make it sufficiently improbable you can use the strategy that they did in fact use.)

However, default-ignore vs. default-explode isn't a difference between handling errors with IEEE-754-qNaN-style dataflow-tainting and CLU-style nonlocal control flow transfers, as you seem to think. In either case, algorithms that need data that could not be computed successfully will fail to get that data; in either case, you can write the code to either propagate that failure or to attempt to recover from it somehow. In either case, you will not be able to control the spacecraft if you needed that data to control it. The difference I was mentioning, which is not generally applicable, is that in unusual cases like the Ariane 5 disaster, where the erroneous data shouldn't have been computed in the first place because it wasn't needed, qNaN-style error handling clearly results in safer behavior.

But, in general, real-time control is not a place where there exists a "robust choice" when your software is broken. If your jet engine control software or your antilock braking software is broken, you're likely to die, and no error handling strategy can prevent that, at least not at the level we're talking about. Biological systems manage this situation by having a melange of somewhat-independent control loops using different algorithms that are constantly fighting for control of the organism; while this does seem to produce systems that are more resilient than engineered systems, it's less efficient and enormously more difficult to develop or understand, and it's the opposite direction from the rigorous certainty you and I favor.

carlmr · on Sept 11, 2019

> Rust just ignoring it is a good thing?

Rust tries to give you zero-cost abstractions where possible. Overflow checks aren't zero-cost, therefore they're disabled in release mode. Otherwise when safety can be done at zero-cost (at runtime) Rust will usually do that.

I agree with you though that the exception might have helped in many other situations, especially if a simulation catches it early on. I would still say from gut feeling that Ada makes the safer decision on average.

Fabien_C · on Sept 11, 2019

I don't see the point with "zero-cost abstractions", overflow check is not an abstraction.

And of course the run-time checks can also be disabled for Ada, and usually are for release builds.

johnisgood · on Sept 11, 2019

And this is worth noting:

> These runtime checks[1] are costly, both in terms of program size and execution time. It may be appropriate to remove them if we can statically ensure they aren't needed at runtime, in other words if we can prove that the condition tested for can never occur.

> This is where the analysis done by GNATprove comes in. It can be used to demonstrate statically that none of these errors can ever occur at runtime. Specifically, GNATprove logically interprets the meaning of every instruction in the program. Using this interpretation, GNATprove generates a logical formula called a verification condition for each check that would otherwise be required by the Ada (and hence SPARK) language.

[1] overflow check, index check, range check, divide by zero

---

So again, I do not see why Rust would shine (as some people suggested) here, Ada/SPARK can statically ensure that your program is correct and eliminate all sorts of runtime errors, including overflow.

steveklabnik · on Sept 11, 2019

The rust compiler also removes bounds checks and such if it can statically prove that they won't occur. You don't have as much tooling to communicate it to the compiler as you do in SPARK just yet.

When I learned Ada (blog post somewhere in this thread) I was pretty shocked by how many more runtime checks it had than Rust does, overall. Rust usually checks things at compile time.

OneWingedShark · on Sept 16, 2019

>When I learned Ada (blog post somewhere in this thread) I was pretty shocked by how many more runtime checks it had than Rust does, overall. Rust usually checks things at compile time.

That's perhaps an oversimplification; elsewhere it's been said of Ada's mentality that "Incorrect Is Simply Not Allowed." — but there's ALWAYS been a preference for pushing checks from dynamic to static, and from runtime to compile-time.

As a trivial example, the following code is typically generated without any sort of index check: because the loop-control variable takes its range from the array, it's obvious that it CANNOT be an invalid index. Omitting the check is allowed by the language reference manual (and encouraged by the annotated reference manual):

    For Index in Input'Range loop
        Input(Index):= 0; -- Zero the array.
    End loop;

Conversely, there are places where you cannot statically determine the validity:

    Value : Positive := Positive'Value( Get_Line );
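
For comparison, a rough Rust counterpart to the two Ada fragments above. The function names are hypothetical, and `NonZeroU32` is only an approximate stand-in for Ada's `Positive` (which is a signed subtype starting at 1):

```rust
use std::num::NonZeroU32;

/// Zeroes every element. The iterator only ever yields valid slots,
/// so there is no index that would need a bounds check.
fn zero_all(input: &mut [i32]) {
    for slot in input.iter_mut() {
        *slot = 0;
    }
}

/// Text input cannot be validated statically, so the check stays at
/// run time and its failure is returned as a value (None).
fn parse_positive(line: &str) -> Option<NonZeroU32> {
    line.trim().parse::<NonZeroU32>().ok()
}
```

The first case mirrors the `'Range` loop; the second mirrors `Positive'Value`, except that a bad input produces `None` rather than raising `Constraint_Error`.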

johnisgood · on Sept 19, 2019

More information about Get_Line: https://blog.adacore.com/formal-verification-of-legacy-code

kragen · on Sept 11, 2019

Mostly I agree, but the engineers and managers weren't just accepting the Ada language defaults here; they had consciously decided, after clearly not enough testing, that the best thing to do if there was an unhandled error was to shut down the SRI, because it was probably a hardware failure. So using Rust, which I do agree is much better than Ada and provides safety guarantees Ada doesn't even dream of, wouldn't have saved them; they would have chosen to panic.

Also Ada does have wrapping arithmetic available. It just isn't the default.

Ono-Sendai · on Sept 11, 2019

Wow, that's quite shocking. A similar situation lurks in many C++ programs, with uncaught exceptions threatening to crash the program at any time.

kragen · on Sept 11, 2019

> It's not some software bug caused by Ada.

This is not correct. It was, in fact, "some software bug caused by Ada", in the sense that in most other programming languages, the error that did in fact happen would have been harmless.

For those who are interested, the full report is at http://www-users.math.umn.edu/~arnold//disasters/ariane5rep..... There is no need to "guess" when a fairly detailed public analysis is available, particularly when the issue is one that comes up repeatedly. It is true that the problem was that a horizontal velocity variable ("horizontal bias", BH in the code) overflowed, although it was not the ordinary kind of integer overflow, but rather a floating-point-to-integer-conversion overflow, which does not produce an exception in any commonly-used programming language.
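
For comparison, and not taken from the report: in Rust (since 1.45) a float-to-integer `as` cast neither raises an error nor wraps. It truncates toward zero, saturates at the target type's bounds, and maps NaN to zero. `cast_bias` is a hypothetical name used only for this sketch:

```rust
/// Float-to-integer conversion with `as`: truncates toward zero,
/// saturates at the i16 bounds, and maps NaN to 0. It never panics.
fn cast_bias(x: f64) -> i16 {
    x as i16
}
```

So `cast_bias(40000.0)` gives `32767` and `cast_bias(-40000.0)` gives `-32768`: the out-of-range input passes silently, as a clamped value rather than as an error.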

The missing piece of the explanation is that the variable in question was no longer being used at that point in the flight, so if the failure had been allowed to pass silently rather than shutting down the SRI ("inertial reference system"), the rocket would not have been destroyed.

Of course, under some circumstances, allowing variables to have grossly incorrect values in your inertial reference system could have been equally disastrous — that's what you assumed happened. The system was designed in the way it was because the assumption was that such errors would almost surely stem from hardware failures rather than software bugs; perhaps that is a "domain issue caused by a failure in management and testing".

But in some sense, you can argue that any organizational failure is a "failure in management", since management's job is to look to windward and steer the organization clear of failures by any means necessary. But in fact the particular failure in management that happened was for the software to be inadequately thought out, such that one of Ada's hidden pitfalls caused Ariane 5's maiden flight to fail.

ajxs · on Sept 11, 2019

Forgive the use of "from memory" in my original post. I was on the move and didn't have time to reference the original material in-depth. You're correct. You've given a great explanation of the accident. Whenever this is brought up, I feel the need to rebut the insinuation that this accident was explicitly caused by the use of Ada. You are correct in that the accident was the result of an exception caused by an overflow, something that might not have triggered an exception condition in another language. However, I don't think the fact that a language like C might never have realised there's a problem at all is damning of Ada. No matter what language this was developed in, better formal analysis of the problem domain should have highlighted the possibility that this could occur, especially considering the different hardware involved.

ternaryoperator · on Sept 11, 2019

This conversation is everything I love about HN. Thoughtful, informative conversation, correction without offense--and so we all benefit. Thank you guys and thanks to dang for keeping this such a great forum!

kragen · on Sept 11, 2019

Well, I probably could have corrected more politely, too, but I appreciate that ajxs suffered fools gladly in this case, the fool being me.

Also as long as we're buttering up the mods, don't forget sctb!

kragen · on Sept 11, 2019

I agree.

phkahler · on Sept 11, 2019

>> The actual 'bug' was caused by an integer overflow occurring due to the velocity variable not being wide enough to handle the higher horizontal velocities of the newer rocket.

Something all that fancy type system should have prevented. I found it interesting in the Hackaday piece that the C code example called out the size of an integer while the Ada version did not. The Ada version caught an implicit conversion, but apparently does nothing for overflow.

Every language has its good points and bad. I still want to learn more Rust. In fact I recall reading recently that some Ada developers want to adopt something akin to Rust's borrow checker. All I know is that C++ is not safe. How well does Ada work with C libraries? Rust is pretty OK at it.

thesuperbigfrog · on Sept 11, 2019

Ada / C interop is very easy to do.

https://learn.adacore.com/courses/intro-to-ada/chapters/inte... shows how Ada's Interfaces.C package and the Import and Export aspects are used to share data and make function calls between Ada and C.

ajxs · on Sept 11, 2019

Ada is miles ahead of Rust when it comes to C interoperability. I posted this before in one of the other Rust threads, but you can't get more than two paragraphs through the official Rust FFI documentation ( https://doc.rust-lang.org/nomicon/ffi.html ) before being asked to install a third-party dependency. The 'Interfaces.C' package is defined in the Ada language specification. And, as others have mentioned, Ada's language defined pragma directives, or Ada2012's aspects provide a way to implement FFI.

roca · on Sept 11, 2019

There's nothing wrong with pulling in the `libc` dependency. It literally means adding two lines of code and then `cargo` manages everything for you. It isn't really third-party; the owner is "rust-lang:libs".

I understand that people are conditioned to think APIs are easier to consume when they're in the standard library, because that's true in many languages, especially those that lack a modern package management system (e.g. Ada). But it's not true in Rust and there are advantages to having libraries like `libc` outside the standard library. For example `libc` is updated more frequently than the Rust standard library (multiple times per month vs every six weeks).
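
For a sense of scale, a minimal sketch of a C call that uses only the standard library: the `libc` crate mainly supplies ready-made declarations, and a single function can be declared by hand. `c_strlen` is a hypothetical wrapper name; the sketch assumes a target where C's `size_t` matches Rust's `usize`, and the `unsafe extern` spelling needs Rust 1.82 or later (older toolchains write plain `extern "C"`):

```rust
use std::ffi::CString;
use std::os::raw::c_char;

unsafe extern "C" {
    // Hand-written declaration of C's strlen (assumes size_t == usize).
    fn strlen(s: *const c_char) -> usize;
}

/// Safe wrapper: byte length of `text` as measured by C's strlen.
/// Returns None if `text` contains an interior NUL byte.
fn c_strlen(text: &str) -> Option<usize> {
    let c_text = CString::new(text).ok()?;
    // SAFETY: `c_text` is a valid NUL-terminated string that outlives the call.
    Some(unsafe { strlen(c_text.as_ptr()) })
}
```

The trade-off is that a hand-written declaration is unchecked: if the signature is wrong, nothing catches it, which is part of what a maintained bindings crate is for.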

kragen · on Sept 11, 2019

Having libc updated multiple times per month sounds like a fucking nightmare to me. What am I missing about this situation?

ajxs · on Sept 11, 2019

I actually laughed at this comment. I was wondering the same thing. Although this is a cheap shot, how could I resist mentioning that Ada's FFI functionality is updated every 20 years on average and is still ahead of the curve.

OneWingedShark · on Sept 16, 2019

> There's nothing wrong with pulling in the `libc` dependency.

Yes, there is. Dependency is transitive: you now depend on everything libc depends on. On top of that, you're now depending on the correctness AND properties of the dependency, including its security.