[dupe] Rules for Developing Safety Critical Code [pdf] (pixelscommander.com)
171 points by rrampage on Jan 16, 2015 | 76 comments




The most impressive thing I've read about NASA's software development isn't mentioned here. It was the system where bugs and defects were (at least in principle) not treated as the failing of an individual but rather as a failing of the entire development process. So when a problem is found, the question isn't "Who did this and how can we punish them?" it's "What can we as a team do so that this never happens again?"

I don't know how this works in practice, but it sounds like a great system even when dealing with code in a lower-risk setting.


As a software person at NASA, one caveat to note is that this is in part due to the intricacies of the federal civil servant system. Frankly, it is very hard to punish or fire government employees or some prime contractors, so most often (even in non-software contexts) accountability is applied to a process rather than to an individual because it's a path of less resistance for management. Even when an individual approach would be more appropriate.


The other nuance here is focusing on process gives you a place to put less capable people that you're stuck with. You can create tightly defined QA tasks that are easily verified, give people a checklist, and keep them away from places where they can break things.


This is the standard in manufacturing techniques. A failure in quality control is considered a failure of the system for allowing it to happen in the first place, not of the person making the mistake.

See: http://en.wikipedia.org/wiki/Poka-yoke

If a bug got through and you don't test, then obviously it's because you don't test, not because the programmer isn't a superhuman who never makes mistakes.

Everybody makes mistakes, design the system to handle it.


Related: root cause analysis - http://en.wikipedia.org/wiki/5_Whys



This ... is how it should be? Is this not obvious to everyone?

When push comes to shove, it doesn't matter who made a mistake, all that matters is that things didn't work. And that's all you really care about. The result.

People make mistakes. But systems let mistakes into production.


Bizarrely, I've found that most people agree this is obviously the better solution. And I have never seen it in practice (manufacturing/industry - not the shiny robotic kind, either). People often claim to want to fix the process, but it gets political fast, and because processes/systems are hard to engineer, blaming someone is just faster, easier, and cheaper.

Which is awful.


The stakes are different. If pushing an error to production means that an email doesn't go out or a $500 sale goes through, you have limited motivation, as the benefit of pushing out code fast is greater than pushing it out right.

If you have a billion dollar program where a failure will kill people and create a political storm that will get your management chain fired or sidelined, you can justify spending $100M on the process (ie lots of people + less ability) to push quality.


This is just an attribute of good leadership. It's applicable to teams in any context, and can't be enforced except by leaders who themselves are led by strong leaders.


The technique is basically an after action debriefing. In settings where people die from mistakes such as military operations and space flight, a political process which seeks to assign blame to individuals is often of little utility.

The actions of the deceased often contribute to their death. The only lesson from assigning blame to the pilot for pilot-error is "Try not to make mistakes". Placing the system at the center of the investigation makes for better checklists that reduce the probability of pilot error for many pilots.

"Don't wear red and march in a straight line" is a better tactic than "Don't get shot."


Actually, the use for that doesn't seem restricted to software in general. It seems like you could apply that to any team and get better results - for example, if a big sales deal falls through, don't punish the sales person, use the situation as a learning experience.

I try to do this with my team as much as possible. I am limited by my superiors, who don't hold the same beliefs about maintaining morale and intrinsic motivation.


It is also known as kaizen, with the Toyota Production System[1] as the best known example.

[1] https://en.wikipedia.org/wiki/Toyota_Production_System


I really can't think of a situation where it wouldn't work. Maybe if the codebase is extremely modular and each developer works mostly alone? Even then they should probably still have things like code reviews.


This, in theory, is what "QA" is. What everyone actually does, and calls QA, is usually just QC. Hardly anyone does QA right. Except NASA.


egoless programming


Blameless postmortems [1] work phenomenally at Etsy, which is a pretty low-risk setting (after reading Sidney Dekker's book The Field Guide to Understanding Human Error [2] [highly recommended] I would say that blameless post-mortems are even more important in a high-risk setting). Except failure isn't the correct word - the book makes the case that these are natural artefacts of complex systems.

1: https://codeascraft.com/2012/05/22/blameless-postmortems/

2: http://www.amazon.com/Field-Guide-Understanding-Human-Error/...


Nice article. Something it doesn't point out, which can become one of the elements in the cycle of name/blame/shame, is the creation of 'heroes'. This is the figure who throws themselves at the issue and brings it back from the brink of disaster. If the culture isn't fixed and failure easy to talk about, then this can play out repeatedly. Since it's usually a celebrated event ('we survived!'), it creates poor incentives for fixing the culture. One can also play this game by noticing a potential failure mode, figuring out how to fix it, but only leaping in once it's struck. I'd rather find another employer though.


We do this to pretty good success at the Big 4 I work at as well. The process by which these post-mortems get assigned can occasionally get political -- e.g. manager A dislikes or is trying to make a move on manager B so he pushes for a detailed post-mortem for every little issue that B's team has -- but the post-mortems themselves tend to work very well in exposing and correcting process failures without placing individual blame.


> All code must be compiled, from the first day of development, with all compiler warnings enabled at the compiler’s most pedantic setting. All code must compile with these setting without any warnings. All code must be checked daily with at least one, but preferably more than one, state-of-the-art static source code analyzer and should pass the analyses with zero warnings.

Wow, am I ever frustrated by the summary ignoring of warnings I see going on around me. I spend my spare time at work making warnings go away and deleting uncommented @SuppressWarnings from our codebase.

In a word, KISS. Another engineer designing critical systems for the US government coined that one, apparently [1].

[1] http://en.wikipedia.org/wiki/KISS_principle


KISS needs to be taught, I really feel there should be a political movement around this principle.

Maybe one good exercise would be to ask programmers to write some program, and then let people vote on which code is the simplest to read and to understand.


Then there's Clang's struct alignment warnings using -Weverything. "Oh no, your struct will have 3 bytes of padding!" Made me turn off -Werror. Some warnings can be ignored.


It's shocking to see how often people write code like: memcpy(somePtr, &myStructObj, 12);

So while that warning can certainly be useless information, it does definitely catch bugs. Unfortunately, the people who write code like the above are also more likely to be people who ignore warnings. ;-)
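For illustration, a minimal sketch of the safer form of that call: let sizeof supply the byte count so it follows the struct layout, padding included. The names struct Sample and copy_sample are hypothetical, not taken from any codebase mentioned here.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical struct for illustration only. */
struct Sample {
    int  i;
    char c;   /* the compiler typically inserts padding after this member */
    int  j;
};

/* Copy a whole struct, letting the compiler compute the size
 * instead of relying on a hard-coded byte count such as 9 or 12. */
void copy_sample(struct Sample *dst, const struct Sample *src)
{
    assert(dst != NULL);
    assert(src != NULL);
    memcpy(dst, src, sizeof *dst);
}
```

If the struct later gains or loses a member, the copy size changes with it and no call site has to be revisited.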


Maybe you are being too specific in your example, and thus I am being nitpicky rather than helpful, but wouldn't that be caught by a static analyzer?


I just tried the following simple example:

  #include <string.h>

  struct S {
    int i;
    char c;
    int j;
  };

  void f(void) {
    struct S s1 = { 0, 0, 0 }, s2 = { 0, 0, 0 };
    memcpy(&s1, &s2, 9);
  }

  int main(void) {
    f();
    return 0;
  }
MSVC, Clang, and PC-Lint are all silent on the code (which is reasonable behavior since it's impossible to glean programmer intent from that snippet -- maybe the programmer really only wants the first nine bytes!). (Btw, yes, I am using the analyzer features for MSVC and Clang, not just relying on high warning levels.)


In some contexts that's actually an extremely useful warning.


Sure, but my point is that -Werror is not always a good thing. You have a struct with 3 ints and then you add a bool and the compiler decides to pad the struct up to the size of 4 ints. When you have twelve of those structs obeying the -Werror is basically saying "stop the train! There's 36 bytes of wasted memory on the tracks", which is kinda ridiculous. I think it should be a note, not a warning, but until then -Werror is kinda broken for me.
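To make the padding concrete, here is a small sketch that measures it rather than guessing. struct Padded and tail_padding are made-up names; the exact number is implementation-defined, though 3 bytes is typical where int is 4 bytes and bool is 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical struct: three ints followed by a bool, as described above. */
struct Padded {
    int  a;
    int  b;
    int  c;
    bool flag;
};

/* Number of bytes the compiler added after the last member. */
size_t tail_padding(void)
{
    size_t used = offsetof(struct Padded, flag) + sizeof(bool);
    assert(sizeof(struct Padded) >= used);
    return sizeof(struct Padded) - used;
}
```

Multiplying the result by the number of live instances gives the actual cost, which is the figure to weigh against the noise of the warning.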


You can selectively disable that particular warning with '-Wno-xxxx'. That way you get the ease of -Werror and you can get rid of those warnings that get in your way.


Yes, that is one approach that may be useful for some more esoteric warnings.

All current compilers also have mechanisms to selectively ignore warnings via pragmas, like '#pragma GCC diagnostic ignored "-Wfoo"' in clang and GCC, which can be used to suppress instances of false positives of otherwise useful warnings.

https://gcc.gnu.org/onlinedocs/gcc/Diagnostic-Pragmas.html

http://msdn.microsoft.com/en-us/library/aa273936.aspx

IMHO not using -Werror is just inexcusable.
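As a sketch of the pragma approach (GCC and Clang only), the push/pop pair below limits the suppression to a single function so that -Werror stays in force everywhere else. The function on_event and its fixed signature are hypothetical, standing in for a callback whose parameter list you cannot change.

```c
#include <assert.h>

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-parameter"
/* Hypothetical callback: the signature is dictated by some API,
 * and 'context' is deliberately unused in this implementation. */
int on_event(int code, void *context)
{
    assert(code >= 0);   /* this sketch only handles non-negative codes */
    return code * 2;
}
#pragma GCC diagnostic pop
/* From here on, -Wunused-parameter is reported again. */
```

Putting a comment next to the pragma that says why the warning is a false positive keeps the suppression reviewable.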


All warnings are an indicator that things are not going as planned - which means some aspect of the code is either unplanned, misplanned, or misunderstood. A warning is a clue that things are not as they seem.


Of course, this is not as simple as it seems.

1 - Some tools and libraries (like GTK - yeah) like to throw unfixable warnings. "Yeah, this platform is missing this feature so let's throw a warning"

2 - Sometimes fixing a warning results in something much more complicated than the original code. (see for example half of PyLint warnings)

So, for that to work it's important the tool warnings are good (and at least with GCC on C they usually are)


When I did mission critical work we basically did not use libraries if at all possible. You need to know how everything works, and you need to be able to fix it. I'm pretty sure the NASA paper is written within that context.

But yes, that is a huge pet peeve of mine. I have all my Qt header includes wrapped with pragmas to turn off warnings because otherwise they spit out endless, useless warnings.


3 - a compiler adds a new warning that triggers on existing code.

So when you go to do a minor fix, do you also touch other code that now triggers a warning?


True. Tell me about it.

Or worse, they break existing code (I've seen this happen with C++ on GCC)


A related class of problems is code coverage with unit tests; I have a regular argument here as to why we "aim for 80% code coverage but in practice accept 50-60%"


Any recommendation of a good static source code analyzer for C++ ?


PVS Studio has always seemed to catch the most errors for me. It is by far worth the price if your s/w is at all important to you. But no tool is perfect. I also run cppcheck and Microsoft's VS (never liked that one much).

Haven't tried Clang because I am on Windows for my C++ stuff.


Updatable List of Open-Source Projects Checked with PVS-Studio: http://www.viva64.com/en/a/0084/


// Of course this is only an example, but except in rare cases when the value of a boolean is significant (e.g., interfacing code in different languages), don't compare to booleans! Not only is it redundant, it's another potential source of error.

if (!c_assert(p >= 0) == true) { return ERROR; }

// vs

if (!c_assert(p >= 0)) { return ERROR; }

// The following makes the error condition clearer, but it goes against the convention of having a hopefully true condition as the argument to assert.

if (c_assert(p < 0)) { return ERROR; }


I agree with you in principle, but in practice I work on a lot of C code that does not have a built-in bool type (C89).

In your example using > or < there is not an issue because it's certain the return value is either 0 or 1. However, when checking a specific variable used as a boolean I prefer to see an explicit comparison.

if(variable)

versus

if(variable==TRUE)

I developed this preference after spending weeks on a particularly nasty bug. The bug was triggered by a corrupted int that was used as a boolean variable that passed a check because if(146134613) { kill_me_now(); } will run, even though the value had been corrupted. (of course, tracking down the source of the corruption was the real fix for that case, but there's no point in having safety checks that don't work)
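A minimal sketch of the difference, using the corrupted value from above. The function names and the three-way classify_flag helper are hypothetical additions for illustration; TRUE and FALSE are defined C89-style as 1 and 0.

```c
#include <assert.h>

/* C89-style boolean constants, guarded in case a platform header defines them. */
#ifndef TRUE
#define TRUE  1
#define FALSE 0
#endif

/* Mirrors `if (variable)`: any nonzero value passes. */
int passes_truthy_check(int variable)
{
    return variable ? 1 : 0;
}

/* Mirrors `if (variable == TRUE)`: only the exact value 1 passes. */
int passes_explicit_check(int variable)
{
    return variable == TRUE;
}

/* Hypothetical three-way check that also reports out-of-range values. */
enum flag_state { FLAG_FALSE, FLAG_TRUE, FLAG_CORRUPT };

enum flag_state classify_flag(int variable)
{
    if (variable == TRUE)  { return FLAG_TRUE; }
    if (variable == FALSE) { return FLAG_FALSE; }
    return FLAG_CORRUPT;
}

/* Shows how each check treats the corrupted value. */
void demo_corrupted_flag(void)
{
    int corrupted = 146134613;
    assert(passes_truthy_check(corrupted) == 1);
    assert(passes_explicit_check(corrupted) == 0);
    assert(classify_flag(corrupted) == FLAG_CORRUPT);
}
```

The explicit comparison silently treats the corrupted value as false; only the three-way form makes the corruption itself visible to the caller.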


Interesting example, but then you're at risk of creating code paths that vary depending on whether boolean values are true, false, or other, which is contrary to the definition of a boolean value.


You're right, but that is true because there is not a boolean type, not because of the form of the check.

Either form has the same problem since an int is used in place of a true boolean (true | false | other), but I prefer the form that makes it more explicit and noticeable.


This paper also explains each rule. They use code analysis tools a lot, and do not like stuff which makes analysis harder (recursion, unbounded loops...).


Recursion and unbounded loops are far too risky on their own, even ignoring the problems they cause for static analysis. The code that goes on NASA spacecraft is bombarded by cosmic rays on a daily basis, something a server sitting in a building at the bottom of the atmosphere doesn't have to worry about nearly as much. Even if the code is perfect, if a ray hits the right spot it's possible for it to flip a bit and then your perfect recursive function will do who knows what. [1]

They use error-correcting memory to combat this, but it's still better to be safe than sorry when your software controls a multimillion dollar space oasis with human beings inside who have no means of escape.

[1] http://www.statemaster.com/encyclopedia/Single_event-upset


So basically keep code simple, concise, and predictable.


I'm interested by the rule that says that loops must either be statically verified to terminate or statically verified to never terminate. Combined with the forbidding of recursion, it's an interesting approximation of a non-Turing-complete language. Or are guaranteed-to-be-infinite loops admissible in non-Turing-complete languages?


You are correct. Total languages are not Turing complete.

However, in the case of safety-critical software, we don't actually follow that rule perfectly. We usually have one or more main loops that never terminates, but everything called from there does terminate.

In languages that enforce totality, like Agda (which is not suitable for most safety-critical programming), you can either selectively disable termination checking on the main function, or you can recurse over a large number. I'm partial to 99999999999999999.
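In C, the "everything called from the main loop terminates" part is usually achieved by giving each loop an explicit upper bound. A rough sketch, assuming a hypothetical MAX_NODES limit and node type (neither comes from the paper):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 64   /* hypothetical upper bound chosen for this sketch */

struct node {
    int          value;
    struct node *next;
};

/* Count the nodes in a list. The loop runs at most MAX_NODES times,
 * so it terminates even if the list is corrupted into a cycle.
 * Returns the length, or -1 if the bound was exceeded. */
int list_length_bounded(const struct node *head)
{
    int i;

    for (i = 0; i < MAX_NODES; i++) {
        if (head == NULL) {
            return i;
        }
        head = head->next;
    }
    assert(i == MAX_NODES);   /* loop exit condition */
    /* Either the list has exactly MAX_NODES nodes or it is too long. */
    return (head == NULL) ? MAX_NODES : -1;
}
```

A static analyzer can see the constant bound without understanding anything about the data, which is the point of the rule.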


The rule about pointers is pretty limiting (no function pointers, no more than one level of dereferencing).

The reason is also pretty sad: "Pointers are easily misused, even by experienced programmers". I don't see how an experienced programmer has more chance to misuse pointers than anything else.


I believe this is because of the type of code they write. When writing real time systems you care about predictability of execution. When you allow function pointers, you (as a programmer / static analysis tool) suddenly have no idea what code will be executed, thus you can't reason about the time it takes. You could scour the codebase for all instances of code that set a particular function pointer, but then again it might be done indirectly, or - heaven forbid - based directly on external input. It makes code much harder to reason about.
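One common alternative, sketched here with made-up command names and handlers, is to dispatch through a switch over an enum. Every possible callee is then spelled out at the call site, so both a reader and an analysis tool can enumerate the call graph.

```c
#include <assert.h>

/* Hypothetical command set for illustration. */
enum command { CMD_OPEN_VALVE, CMD_CLOSE_VALVE };

/* Bit 0 of 'state' represents the valve in this sketch. */
static int open_valve(int state)  { return state | 1; }
static int close_valve(int state) { return state & ~1; }

/* Static dispatch: no function pointer, so the set of functions
 * that can run here is fixed and visible at compile time. */
int dispatch(enum command cmd, int state)
{
    switch (cmd) {
    case CMD_OPEN_VALVE:
        return open_valve(state);
    case CMD_CLOSE_VALVE:
        return close_valve(state);
    default:
        assert(!"unknown command");
        return state;   /* leave the state unchanged */
    }
}
```

The cost is that adding a command means editing the switch, which is arguably a feature when the set of commands is supposed to be fixed.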


This. There are categories of embedded system that don't allow reentrancy; it must be possible to represent the program's callgraph as a DAG with each function appearing exactly once. In some systems (Ada?) this allows for the preallocation of all local variables, too.

Some hardware (e.g. the smaller PICs) has poor support for pointers which can make indirection cost quite a lot of instructions.

Then there's the general principle of cutting your coat according to your cloth. You don't use a generalised pointer-to-rocket-engine system because you aren't going to add more engines at runtime.


"You don't use a generalised pointer-to-rocket-engine system because you aren't going to add more engines at runtime"

Well, we might. Particularly if the vehicle you're modeling for a hardware in the loop sim isn't entirely specified at compile time.


The problem is really the redirection. The guiding principle of this document is wrapping every single functional block in a self-contained, statically-analyzable function.

If there is anything that pointers must be used for, then they must be used. But it is hardly controversial to say that pointers are one of the least-well understood and frequently misused aspects of low-level programming.


I wouldn't say pointers themselves are not well understood. The concept itself is painfully simple. The problem is that code that abuses pointer tricks can very quickly become poorly understood. For the type of software JPL writes, they'd rather sacrifice a bit of speed for easily verifiable correctness. My reading of this document is that they treat pointers like Java (and others, of course) treats references, conceptually.


Imho, it's less to do with "misusing" pointers at the micro level and more to do with the macro level effects of pointer dense code. Of course, if you're not allocating or freeing any memory after initialization, then you avoid many of the risks... but structural spaghetti that can hide bugs can still ensue.


Is it sadder than people dying?


> Typically, this means no more than about 60 lines of code per function.

Uncle Bob would be very upset by this statement. He advocates a maximum of 4-5 lines per function to ensure readable code in his Clean Coders video series. Anything more, and you need to refactor into smaller functions.


> He advocates a maximum of 4-5 lines per function to ensure readable code

Yeah, and you end up with a myriad of functions which are used once and only once, and it becomes impossible to actually find the code which does something useful.


... and good luck naming all those functions.


See sibling comment about domain-specific language; there's nothing wrong with long function names if your autocomplete is working. Codebase I'm working with has names like "auditNullRecipeCursorAtEndOfOrderLineAdditionsIfRequired" and "getUnselectedSiblingWithSametabIndexOrCreateRepeatedChildIfLessThanMax"


It's hard to notice typos in names that long. Can you immediately tell where the typo is in auditNullRecipeCursorAtStartOfOrderLineAdditionsIfRequired ?


I can't tell at all. Fortunately, as long as someone hasn't declared another function with the typoed name, either Intellisense or the compiler will tell me.

I have a much harder time with "_tcscpy_s" and similar suites of functions where a single-letter typo is likely to be a different function with the same signature.


I changed ...End... to ...Start..., which could be a plausible real world bug.


I don't have a tool that relieves me of needing to read all those names, so not having to type them isn't that helpful.

Domain-specific languages are productive; their parts combine in ways giving novel results. A list of a million verbs you can only ever use for one purpose is not a language in that sense.


I've tried to apply this to my coding, maybe not 4-5 lines, but definitely keeping to short functions/methods.

You're right that you end up with a myriad of functions, many of which you only use once, but I was surprised how often I have actually been finding myself reusing something. I have much more code reuse than I originally figured.

One thing that might be an issue, depending on your temperament, is that your code almost becomes a DSL. Programs are pretty much strung together from this "myriad" of custom functions that might not be useful outside this one program.


Perhaps Uncle Bob is engineering for the real world and applying 10 as the factor of safety. Space flight engineering runs more thinly along the envelope. With the JPL Standard's requirement that all functions validate every argument, one line per declaration, and two assertions on average per function, six lines will disappear pretty fast.

Uncle Bob is preaching software engineering as engineering to the heathen masses. The JPL Standards are a sermon for the choir. Uncle Bob doesn't, so far as I know, advocate C for all critical code.


Yeah, he can be "upset" as much as he wants, he doesn't build robots that go to other planets.

How about we stop blindly accepting whatever UB says, most of what this man says is crap, really


That was a loose statement in the book, 60 lines might be long, but in extreme cases, it's ok.


A bit surprised it doesn't mention the use of curly braces. Perhaps they assume their "state of the art" static code analysis tools will find potential issues (like Apple's goto fail failure).


This document is a summary. There is much more detail in the JPL Institutional Coding Standard for the C Programming Language: http://lars-lab.jpl.nasa.gov/JPL_Coding_Standard_C.pdf


Perhaps due to the context of this link, I read "Apple" as "Apollo"


The PDF file seems to be timestamped 2014-12-26. Does anyone happen to know the exact authoring date? Sadly, there is no date given in the document itself.


The PDF is based on the article ''The Power of Ten -- Rules for Developing Safety Critical Code,'' in IEEE Computer magazine, June 2006, pp. 93-95

Source: author's website http://spinroot.com/p10/


atsaloli already gave a good answer, but I'll add that the date is in the PDF's metadata. For example, poppler provides pdfinfo, which shows the creation date of 2007-01-15. Your viewer probably has an option to show it too.


Indeed, you're right. Even Apple's Preview app on my box can do that :)


I would gladly print that PDF and stick it in every office. Maybe an alternative would be to make it work, and then to make it readable.



