
>This all has to happen without any human intervention, so the central computer software has been programmed and extensively tested to make sure all corrections can be made on the fly.

I'd love to get some deeper insight into how NASA writes and tests software; I can only guess it's a million miles from how most of us work. Anyone know of any good talks or articles from engineers there?




There was a discussion on HN a while back about NASA's software safety guidelines. Here is the link to the discussion: https://news.ycombinator.com/item?id=12014271

The PDF linked to in the discussion is no longer there, but I found it on standards.nasa.gov here: https://standards.nasa.gov/standard/nasa/nasa-gb-871913

There are also some interesting product-management-related guidelines from NASA, like this one from 2014: https://snebulos.mit.edu/projects/reference/NASA-Generic/NPR...


Hello! I’m an FSW dev at NASA Langley. As others have said, the talks from the FSW workshop are a great start. If you want to see a well-used framework, check out CFS (https://cfs.gsfc.nasa.gov).


Off topic, but I've always been interested in the way that government agencies almost exclusively choose acronyms for their software. Meanwhile, private companies (especially in the last decade or two) almost always choose unrelated single words.

It initially seems kind of ridiculous to me that everything has an acronym, but I suppose it's no more ridiculous than choosing a name that sounds like a Pokemon. Maybe less so.

In any case, thanks for sharing that.


https://ntrs.nasa.gov/search.jsp?R=20100024508

> The development and verification of the Charring Ablating Thermal Protection Implicit System Solver (CATPISS) is presented. [...]

Not sure industry would try this one either, though it is very memorable.


There are a lot of really good links, but to be honest 99% of the secret to writing bulletproof code is “write the simplest, boringest program you can”.

Which is not to say that what NASA and its contractors do isn’t cool, or that they don’t spend ungodly amounts of time and money on testing and verification, but you also don’t load one line of code more than is absolutely necessary onto a machine that absolutely must work at all times.

It’s an important lesson to learn and a good skill to exercise from time to time, but honestly it’s also something that doesn’t apply to most of our work as software engineers. For most software, people are willing to knock a couple of nines off reliability in exchange for higher-quality output, lower costs, and more features. If my data analysis pipeline fails one time in ten because some edge case eats all the memory in the world or some unexpected malformed input crashes the thing, but it yields more useful output than if I had kept it simple and hand-verified every possible input, well, that can be a fine trade-off. If your machine learning model for when to retract the solar panel occasionally bricks and leaves the panel out to be destroyed, that’s less acceptable.


> you also don’t load one line of code more than is absolutely necessary

Coincidentally, I spent the weekend banging around with an old TRS-80 Model 100, and it's been very interesting to see what workarounds and compromises were made to conserve space.

For example, the machine ships with no DOS at all, so if you're working with cassettes or modem only, you don't have that overhead.

If you do add a floppy drive, you first flip some DIP switches on the drive so it acts like an RS-232 modem. Then you can download a BASIC program from the drive into the computer that, when run, generates a machine-language DOS program and loads it out of the way into high memory.

I don't have one of those sewing machine drives, so I went with a third-party DOS, which weighs in at... wait for it... 747 BYTES.† An entire disk controller with command line interface in 2½ tweets.

http://bitchin100.com/wiki/index.php?title=TEENY.CO_MANUAL


The part that I find the most intriguing is "corrections can be made on the fly".

I can see how you would ensure reliability through proper requirements specification, a good software development process, separate independent implementations and extensive verification.

However, every time I read a popsci article about space flight software, they talk about this capability to push new code to the spacecraft while it is in flight.

I'm really curious to learn what this looks like in practice (technical details). Do they really have the ability to do an "ad-hoc" upload and execute arbitrary code on these systems? If so, how are the ad-hoc programs tested and verified?


There is usually a piece of software running on the machine which does basically just this: it allows you to command an image upload to the SSD, checksum the file, then install it if all goes well. There is also usually a simpler version of the software on a redundant SSD or partition, which the onboard computer will install if it detects that the currently installed software is malfunctioning.
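
A minimal sketch of that verify-before-install flow in C (every name here, and the CRC choice, is my own illustrative assumption, not any mission's actual interface):

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative CRC-32 (reflected, poly 0xEDB88320); real missions
       pick their own integrity check. */
    static uint32_t crc32(const uint8_t *buf, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= buf[i];
            for (int b = 0; b < 8; b++)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

    /* Hypothetical install step: reject the upload unless its checksum
       matches what ground said it should be; otherwise keep running
       the current (known-good) image. */
    int try_install(const uint8_t *img, size_t len, uint32_t expected)
    {
        if (crc32(img, len) != expected)
            return -1;  /* corrupted upload: do nothing, stay safe */
        /* ...write img to the boot partition and mark it active... */
        return 0;
    }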

My understanding is that some spacecraft launch with beta/alpha equivalent software. Correct me if I'm wrong, but I believe that the rovers do this, with simple software installed first, then more complicated versions installed once they know everything is working.

It's somewhat similar to updating your iPhone, except that you use a huge dish to do the transmission and the bitrate is pretty horrendous.

I'm going to need a definition of "ad-hoc" here; no one "deploys straight to production" on a spacecraft. Any patches have to be thoroughly tested on simulators and models of the spacecraft on Earth before they are transmitted.


Thanks for the reply! So what you're saying is that it's just a "normal" over-the-air software update. I.e. you add some new functionality and then do a full system test of all functions of the software before replacing the entire image?

That makes sense, but is almost a bit disappointing. After all, that is exactly how it works for the boring systems here on Earth. From various Wired-and-co articles I had the impression that there was possibly something more: a mechanism that would allow users to send elaborate "commands" to the spacecraft to perform "ad-hoc" tasks at runtime. (What I mean by "ad-hoc" tasks are tasks that are unknown at the time of validation/testing of the software.)


Yes, we can send commands to the spacecraft once it's up there to do things like modify memory or hard drive contents directly, or turn on/off and command payloads and equipment. The full list is pretty exhaustive - anything you could want to be able to do, you can command manually. These things aren't 100% autonomous (though they have autonomous elements in the software).

There is also a way to send pre-programmed task lists to them which are executed sequentially, with delays if necessary.
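
Conceptually, such a stored sequence is just a time-ordered table of opcodes. A toy sketch in C (the layout and every name are invented; real sequence formats are mission-specific):

    #include <stdint.h>

    /* One time-tagged command in an uploaded sequence. */
    struct seq_cmd {
        uint32_t exec_time;   /* spacecraft clock, seconds */
        uint16_t opcode;      /* which action to perform */
        uint8_t  args[8];     /* opcode-specific parameters */
    };

    static uint32_t spacecraft_time(void)
    {
        static uint32_t t;
        return ++t;  /* stub: stand-in for the onboard clock */
    }

    static void dispatch(uint16_t opcode, const uint8_t *args)
    {
        (void)opcode; (void)args;  /* stub: invoke the handler */
    }

    /* Execute entries in order, waiting out any programmed delays. */
    static void run_sequence(const struct seq_cmd *seq, int n)
    {
        for (int i = 0; i < n; i++) {
            while (spacecraft_time() < seq[i].exec_time)
                ;  /* real code would yield to the RTOS, not spin */
            dispatch(seq[i].opcode, seq[i].args);
        }
    }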

That kind of thing is in the hands of operations, so it's not usually the job of the software team to test in the normal manner.


Very interesting! Is this kind of command capability (e.g. ability to modify memory contents) something that is usually only available on "non-critical" subsystems, or would you generally expect to also find it on critical components, like the communication or navigation modules?


The ability to modify memory contents is pretty much universal; you can modify things like EEPROM contents, RAM, the hard drive, etc. There is no differentiation between critical and non-critical; it's all just fairly critical.

Ground won't send telecommands to a spacecraft to modify a piece of memory without knowing exactly what they're doing first.
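
A toy sketch of what such a memory-patch telecommand handler might look like (all names invented; real implementations add authentication, an allowed-regions table, and readback verification):

    #include <stdint.h>
    #include <string.h>

    struct poke_cmd {
        uint32_t addr;      /* target address */
        uint8_t  len;       /* number of bytes, kept small */
        uint8_t  data[16];  /* bytes to write */
    };

    static int addr_is_patchable(uint32_t addr, uint8_t len)
    {
        (void)addr; (void)len;
        return 1;  /* stub: check against a table of patchable regions */
    }

    int handle_poke(const struct poke_cmd *c)
    {
        if (c->len > sizeof(c->data) || !addr_is_patchable(c->addr, c->len))
            return -1;  /* reject the telecommand outright */
        memcpy((void *)(uintptr_t)c->addr, c->data, c->len);
        return 0;       /* ground then verifies via a readback command */
    }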


It makes sense when you think about it. These missions last months, even years, before a specific piece of software becomes useful, because it’s for a specific part of the mission.

Wouldn’t you want the benefit of those extra months to perfect the software?


I think that in some cases it's simply due to the launch schedules being more optimistic than reality; the hardware has to be done, but the software development doesn't necessarily have to stop once you launch the thing.


In this case, the "corrections on the fly" refer to all of the real-time responses that the software makes without ground involvement. In the case of a solar limb sensor detecting the sun, the probe will abandon its data collection for that near approach, and go into an emergency response that has been made as straightforward and deterministic as possible, to maximize the chances of recovery for all single-fault and some double-fault scenarios.

To answer your question about software upload, the PSP has 3 redundant CPUs (primary, hot spare, backup spare), and each has multiple boot images. To upload software, the team uploads it to an inactive image of the backup spare CPU, promotes it to hot spare for long enough to collect the data it needs, reboots it into the new image, and then rotates it into the primary role, which is a seamless transition unless something goes wrong, and then the new hot spare takes over again within a second. Once they're sure the software is working, they can update the other CPUs. Before any of this, new software is tested on identical hardware set up on the ground with physics simulations.

See also "Solar Probe Plus Flight Software - An Overview" from http://flightsoftware.jhuapl.edu/files/_site/workshops/2015/
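
A toy model of that three-CPU role rotation in C (purely my reading of the description above, not APL's actual design):

    /* Roles rotate one step at a time: the backup spare (rebooted
       into the new image) becomes hot spare, then primary. If
       telemetry looks bad while it shadows as hot spare, you rotate
       back instead of promoting it. */
    enum role { PRIMARY, HOT_SPARE, BACKUP_SPARE };

    static enum role rotate(enum role r)
    {
        switch (r) {
        case BACKUP_SPARE: return HOT_SPARE;    /* new image shadows */
        case HOT_SPARE:    return PRIMARY;      /* seamless takeover */
        default:           return BACKUP_SPARE; /* old primary steps down */
        }
    }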



Thanks - that was exactly the kind of info I was looking for!

Amazing that they had the ability to just run ad-hoc LISP on the spacecraft. It appears their method to ensure safety in the face of arbitrary code execution was to divide up the spacecraft into isolation zones and run the parts that have a REPL on a non-essential CPU. From [1]:

> To protect the main DS-1 mission from possible misbehaviors of RA, the design included a “safety net” that allowed the RA experiment to be completely disabled with a single command, issued either from the ground or by on-board fault protection.

[1] https://ti.arc.nasa.gov/m/pub-archive/176h/0176%20(Havelund)...
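
That "safety net" pattern is easy to picture: one flag, settable by ground telecommand or by onboard fault protection, gates the whole experiment. A toy sketch (names invented):

    #include <stdbool.h>

    static volatile bool ra_enabled = true;

    /* Invoked by a ground telecommand or by onboard fault protection. */
    void cmd_disable_ra(void)
    {
        ra_enabled = false;
    }

    /* Called periodically by the scheduler. */
    void ra_task_step(void)
    {
        if (!ra_enabled)
            return;  /* experiment is inert; core mission unaffected */
        /* ...run one step of the Remote Agent experiment... */
    }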


From previous articles, remote updates seem to be a core part of spacecraft software/operating systems. I even recall one situation where a spacecraft had a REPL built in that was used to fix a problem (slowly) remotely! They also have multiple levels of operation and watchdog functionality. I have no direct experience with that beyond following news about spacecraft.


Remote updates -- where you replace a full (sub)system -- are one thing, since you can always run the normal software validation procedure on the new version of the software. So an OTA update of a system (even in flight) does not sound like rocket science (yet)...

But: Once you include a REPL or another mechanism to push and execute arbitrary code "ad-hoc", I wonder how that could possibly be tested and validated? Surely as soon as you add the ability to run arbitrary code, there is no way of testing for all possible states of the system as part of the validation process?

In other words, how do you allow the user to push arbitrary code, but prevent them from putting the spacecraft into a condition from which it cannot be recovered? The only way I could naively think of would be to only allow the user to push code to a completely isolated CPU that has remote-reset functionality controlled from the main/comms CPU.
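
A toy sketch of that isolation idea, as seen from the main/comms CPU (all names invented): the experiment CPU runs the arbitrary code and must emit periodic heartbeats; miss too many and it gets power-cycled back to a known-good image.

    #include <stdint.h>

    #define HEARTBEAT_TIMEOUT_S 30

    static uint32_t last_heartbeat;  /* spacecraft clock, seconds */

    static void experiment_cpu_reset(void)
    {
        /* stub: pulse the reset line / power-cycle the experiment CPU */
    }

    /* Called whenever a heartbeat message arrives from the
       experiment CPU. */
    void on_experiment_heartbeat(uint32_t now)
    {
        last_heartbeat = now;
    }

    /* Polled periodically by the main CPU's scheduler. */
    void watchdog_poll(uint32_t now)
    {
        if (now - last_heartbeat > HEARTBEAT_TIMEOUT_S)
            experiment_cpu_reset();
    }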

Still, the popsci articles I read made it sound like there might be more to it. It would be excellent to find some first-hand accounts/sources on what this looks like in reality.


One of the outer solar system probes had a radio with enough built-in logic to accept software updates and maneuvering commands independently of the two redundant on-board science computers.

Lights-out management indeed.


Here's an article about it I read a while back, interesting read: https://www.fastcompany.com/28121/they-write-right-stuff


This redundant software and hardware setup typically isn't necessary when humans aren't involved. The Space Shuttle's system is similar to what you will find on a Boeing or Airbus aircraft: redundant software, written on purpose by different people in different countries, with completely different cultures, in different languages, running on multiple machines with different hardware and voting on the decisions to be made.
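
The voting part is the easy bit to sketch; the value comes from the independent implementations being unlikely to share the same bug, so a faulty channel gets outvoted. A toy 2-of-3 voter in C (illustrative only):

    #include <stdint.h>
    #include <stdbool.h>

    /* Compare the outputs of three independently developed channels.
       Returns false if no two agree, which a real system would
       escalate to fault protection. */
    bool vote3(int32_t a, int32_t b, int32_t c, int32_t *out)
    {
        if (a == b || a == c) { *out = a; return true; }
        if (b == c)           { *out = b; return true; }
        return false;
    }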

It is complete overkill when "all" you're going to lose is a robot and some pride: with a space probe you want to have lots of features, and this level of safety is very restrictive on development effort.

More than likely, the spacecraft in question is written in C or C++ with the help of RTEMS or VxWorks. It is probably running on a radiation-hardened, very slow processor.


They don't do 3x calculations and voting, but they do often have redundant computers they can switch over to in case of failure. Curiosity had to switch to its 'B-side' computer back in 2013 when the A-side had a memory issue. Even when not carrying humans, it's still a million- or billion-dollar mission that probably wouldn't be replicated for a while if ever (within the researchers' lifetimes, at least) and that could be scuttled by a software bug.

If anyone is interested, JPL publishes their coding standards doc for C: https://lars-lab.jpl.nasa.gov/JPL_Coding_Standard_C.pdf
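
For flavor, rules in that vein (think Holzmann's "Power of Ten": fixed loop bounds, no heap allocation after initialization, and so on) push you toward code like this toy example, which is mine and not taken from the doc:

    #include <stdint.h>

    #define MAX_SAMPLES 64  /* statically known upper bound */

    int32_t sum_samples(const int32_t samples[], uint32_t n)
    {
        int32_t total = 0;
        if (n > MAX_SAMPLES)  /* defend against out-of-range callers */
            n = MAX_SAMPLES;
        for (uint32_t i = 0; i < n; i++)  /* provably bounded loop */
            total += samples[i];
        return total;
    }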


Most spacecraft have some form of redundancy to guard against single-point failures. It's a waste of money to send up failure-prone hardware. Amateurs building CubeSats, probably not, but the big players aren't going to take that sort of risk.


You are right, they have redundancy in all cases - but it isn't usually software written by multiple teams on different hardware.


Talk on JPL's software for the Curiosity rover: https://www.usenix.org/conference/hotdep12/workshop-program/...


Their cost per line of code is also, pardon me here, astronomical. That quality has a cost most shops cannot stomach.


There are many good videos from the yearly Flight Software workshop http://flightsoftware.jhuapl.edu/


Also, I would imagine that there would be a strong bias towards reuse... which leads you to long-term standardization of not just language but also CPU architecture.


Hello! FSW dev from NASA Langley here. We do try to reuse as much as possible, but small satellites (CubeSats) are starting to change that. There are so many new pieces of hardware and so much experimentation going on to see what’s feasible in space. There are new flight software frameworks being developed by both commercial and government groups (CFS, F-prime). If you’re interested in this in particular, there is a conference called SmallSat which hosts the talks from previous years: https://smallsat.org



