Software Update Destroys $286M Japanese Satellite

walrus01 · on May 9, 2016

I know this was a science and LEO (low earth orbit) satellite, but here's something to consider when thinking about the difficulties of engineering hardware & software to work together in a satellite.

Geostationary telecom (and weather, and SIGINT, etc) satellites: In the entire history of manned spaceflight no human has ever visited geostationary orbit. Once placed in a geostationary transfer orbit (about 450 x 36,000 km) and then onwards to final geostationary orbit, nobody will ever see a satellite again. The largest ones weigh 6000 kilograms and are the size of greyhound buses. They're out there right now operating with multiply-redundant everything and cost $150 million to build and launch. It is highly unlikely that any time in the next 25 years any human will ever visit one in person or touch one. When a satellite is encapsulated in its fairing/shroud for launch that is the last time anyone will ever see it until its ultimate end of life in hopefully 15 years. Every one of its control systems needs to be so thoroughly debugged and multiply redundant that it can operate out there with absolutely zero chance of repair or parts replacement.

colemickens · on May 10, 2016

Presumably these satellites are transmitting data back to Earth? Is that channel not bidirectional? If those satellites can be remote reprogrammed, are they really much different than satellites in geosynchronous orbit? Or are you suggesting that satellites in geosync orbit could be manually updated via EVA? I would think that would still be extremely cost-prohibitive (not to mention a huge risk to human life for a weather satellite! Has that ever actually happened?

walrus01 · on May 10, 2016

I'm saying sort of the opposite - there have been very, very rare cases where satellites in LEO were visited by humans and repaired or retrieved, but it's extremely uneconomical (Hubble, LDEF, etc). Actually cheaper to build and launch a new one. But at least it is technologically possible. Whereas nobody has ever gone to geostationary orbit.

There have been proposals for ion engined orbital tugs to grab onto old, out-of-stationkeeping-propellant satellites that still have good electronics, for the purpose of extending their life, or moving them to a new orbital position. But nothing has actually flown.

cmcginty · on May 10, 2016

The moon is beyond geostationary orbit, and I think we've been there ;-)

ptaipale · on May 10, 2016

Particularly, if you don't want to land on the Moon, just go round it and return, then it's easier - you don't need that much propulsion to get back.

But to get to geostationary orbit and stop there, you need rocket power. To get back, you would also need power.

walrus01 · on May 10, 2016

We've been to the moon but humans have never put a manned craft in geostationary orbit, or returned from it - the delta-V requirements are different.

misnome · on May 10, 2016

Right, OP wasn't saying that we had never been _beyond_ GEO, but never been to the orbit - going through is a lot easier than stopping off partway.

ams6110 · on May 9, 2016

A certain well-known electric car maker (maybe all of them, though) can push software updates to systems that have lives in their control not to mention costly machines. A sobering reminder of the need for very careful testing and control over this sort of thing.

marssaxman · on May 9, 2016

If the programmers of a multimillion dollar satellite can't guarantee that their patches won't break anything, I'm sure as hell not willing to risk my life on the possibility that an automaker's programmers can. No remote updates for me; if it is my car then it is my car, and I will decide if, when, and how I will upgrade it. If a carmaker won't accept that, I won't drive their car.

danielrhodes · on May 9, 2016

It stands to reason it will eventually happen, but compared to a satellite there is a much lower liklihood that as an individual you will experience harm because of such an update.

Some mitigating factors:

- they can do slow releases. So if a serious anomaly does occur it is unlikely to be with your car.

- the conditons of space make errors much more catosptrophic. Although I can imagine similarly catosptrophic scenarios with a car, there are also more opportunities to safely rectify the situation (like the car pulling over and refusing to drive if things are not ok). In space, the satellite cannot just stop and wait for a tow truck to arrive.

- there is only one satellite to test the update on. If they screw it up, it's gone. With cars, there are many to test on, so errors will turn up and be fixed faster.

But back to the issue of risk:

Just like with drunk driving, you are at risk of somebody else's car going haywire as well, so it's a risk regardless of whether you yourself updated or not. At that point, the difference between your existing risk and the risk you take on from the update is probably not that great, so might as well just do the update.

marssaxman · on May 10, 2016

It's the loss of control that bugs me. When I buy a car I am not buying an ongoing relationship with the car's maker; I'm buying a physical object, which is then mine, and I'll choose when and how it will be serviced. The over-the-air auto-update thing is as though... well, it's like some mechanic from the automaker shows up at your house unannounced a year after you took your car off the lot; without a word, they haul your car away on a trailer for an hour, then bring it back and park it. Ta-da, hope you like what we did to "your" car, because it's done now whether you wanted it or not!

danielrhodes · on May 10, 2016

When that car is a computer on wheels, you kind of do want that relationship. I imagine an open source car will one day be a thing, but like the personal computer, it's going to be a niche market.

marssaxman · on May 10, 2016

Well, perhaps, but I'd rather just... not have a car that is a computer on wheels!

Tycho · on May 9, 2016

You know this could give the FOSS movement a new lease of life. Realistically only nerds care about whether they can install/configure exactly what they want on their computer hardware, but when it comes to safety things like this many more people are going to care deeply.

rconti · on May 10, 2016

2016 is the year of Linux on the dashboard!

hyperbovine · on May 9, 2016

Is it possible that satellites are more complex than cars?

IMTDb · on May 9, 2016

Or that car maker actually have the ability to test their software in real-life situation before pushing the update to the intended client

ganeshkrishnan · on May 10, 2016

If it's a self driven car then an controlled environment with hundreds of cars with new patch will test for couple of weeks before the patch is released. Chances are mitigated and the cars can be observed physically unlike satellites.

Also every technological advancement we are reducing fatality. We are not bringing it to complete zero but we are definitely reducing it.

mathattack · on May 9, 2016

There's also many more cars to test software on.

mathattack · on May 9, 2016

By this logic - you wouldn't want to drive any car since no satellite programmer could guarantee no bugs in their initial software either.

I think it's a tradeoff. If self-driving cars can save tens of thousands of lives in the US every year, how many should we be willing to risk to enable this?

Eventually your insurance company may force this on you due to higher premiums, or government may make injuring someone in a manual car accident a crime like assault.

marssaxman · on May 10, 2016

Perhaps. I'll worry about such things after self-driving cars actually exist. Right now they are an engineer's fantasy, or a salesperson's exaggerated way of describing semi-automatic assistance features; who knows what the reality will eventually be.

nsgi · on May 10, 2016

It's not only self-driving cars that use software.

marssaxman · on May 10, 2016

Self-driving cars only exist as research prototypes. They are as speculative as flying cars.

Tesla is the only offender I know about when it comes to real cars that actually exist and have obnoxious over-the-air update systems, because normal car companies are constrained by their contracts with dealer networks, so any software updates they might want to push on us would have to be done through the normal vehicle service program. This can easily be avoided by choosing a different mechanic. One can't very well choose a different mechanic if you are stuck a Tesla-like system that pushes diffs at you via radio, and that's what concerns me.

elevensies · on May 9, 2016

I think they already have the ability to push upgrades during routine service, so the only difference here is the frequency. I don't know what their testing regimes look like, but it may not even be different.

mmaunder · on May 9, 2016

Which certainly makes vulnerabilities more interesting.

askafriend · on May 9, 2016

"Oh you discovered a life threatening vulnerability that affects every car in the deployed fleet of electric vehicles?

Thanks, here's $5k for your hard work as part of our bounty program."

beeboop · on May 9, 2016

I know this is a joke, but I am unsure if you're pointing it directly at Tesla. To clarify, Tesla does not have a bug bounty program that I was able to find for their car software.

askafriend · on May 9, 2016

Haha no, I was just poking fun at high profile bug bounty programs like Facebook/Google's which routinely offer pennies for very very serious vulnerabilities.

I'm not in a position to discuss bug bounty ethics or judge what the right thing to do is given the market circumstances, I'm only in a position to make an opportune joke.

But you gotta admit...it is amusing seeing very critical vulnerabilities like ubiquitous user login being reported by researchers whom get 5k or 10k for doing the right thing. These holes could instantly cripple the trust of a company's userbase and yet the people reporting them get pennies.

bpchaps · on May 9, 2016

Yeah, I can attest to that. Found a github page with ssh keys, home directories, infrastructure scripts, configs, etc for Comcast that could've been used to do just about anything within their infrastructure. I raised the issue as ethically as I could, particularly since someone with an "infosec background" already forked it a bit before I found it. The engineers who I got in contact with were hugely appreciative, got the issue fixed almost immediately (seriously, kudos again), and brought up the possibility of starting a bug bounty program with me as a pilot of sorts. Cool, right?

Well, after talking to their CISO, it sounds like no such thing will happen. Her reason was, "It's not actually a 'bug', so it's not going to be included in a bug bounty program". So there goes that idea.

I really don't get it. I could have just as easily sold everything in there for big monies, or could have personally done havoc of my own. There's a certain point where "thanks" just doesn't feel enough, especially after the bug bounty comment.

melvinmt · on May 9, 2016

Look at it this way, you could also have been "rewarded" with a 30-year jail sentence. Please don't mess with big corps.

bpchaps · on May 9, 2016

Your attitude is so amazingly fear based - fuck that sauce.

What makes you think I'm actually messing with big corps? I'm not. I disclosed the issue properly and safely. I have a good lawyer who would step in to help if something bad happened. But hell, if something bad did happen, where I can't get out of a sentence like that, then fine. It is what it is.

But letting that fear of 30 years of prison prevent me from disclosing something that could have long term effects for millions? Fuck. That.

I wasn't even doing any of this for money. I found it, I wanted to correct it. That's it. But for their CISO to drop a giant deuce on me like that without as much as a "Thank you" from her? Heh. It's a little annoying.

melvinmt · on May 9, 2016

Have you ever heard of a thing called the "Computer Fraud and Abuse Act"? Even the act of disclosing a vulnerability to the company itself can be misconstrued by paranoid big corps as a "security breach", hence the possibility of a 30 years sentence. I'm just saying it's not worth the risk. At least not in the US.

bpchaps · on May 9, 2016

I'm well aware of what the risks are. It was something from github, so it's not like I was doing anything crazy, anyway. During my disclosure, I told them the kinds of port scans I did, and the types of individuals I shared the information to. As full disclosure as I possibly could have been.

I'd rather not fall into this trap: https://theintercept.com/2016/04/28/new-study-shows-mass-sur...

lawnchair_larry · on May 10, 2016

That isn't possible under the CFAA.

sangnoir · on May 10, 2016

> Your attitude is so amazingly fear based - fuck that sauce.

I really think this part:

> > I really don't get it. I could have just as easily sold everything in there for big monies, or could have personally done havoc of my own.

would have earned you a prison sentence (if caught). The most obvious ways to 'profit' from security bugs for corps with no bug bounties are illegal.

bpchaps · on May 10, 2016

Right. I'm just sharing my experience at this point, though. Can't get caught if I'm not doing anything to get caught. And again, I went through the proper, ethical channels to get this raised.

TeMPOraL · on May 9, 2016

If bpchaps would mess with Comcast directly then maybe; if he sold the knowledge somewhere else than not really. I don't recall hearing of any case where a hack was traced all the way back to the third party that found and sold the vulnerability.

67726e · on May 9, 2016

Well considering it's Comcast, I'd be more then happy to "accidentally" disclose those vulnerabilities.

TeMPOraL · on May 9, 2016

Yeah, it's one of those companies that are really asking for it. In some cases it's hard to do the ethical thing :).

bpchaps · on May 9, 2016

The thought crossed my mind. ;)

But being able to say "$GOODTHING happened because of me" felt so much better. It just means in the end that 50% of my meals are lentil based :P.

ars · on May 9, 2016

But small corps are OK?

lobotryas · on May 9, 2016

I'm not seeing what point you are trying to make.

What other option do these researchers have? Try to sell the vuln on the black market (illegal) or to a state actor (unethical and likely illegal)? Keep quiet about what they found and not get any money/recognition?

Companies are under no obligation to pay and researchers are under no obligation (except ethics, I guess) to turn over their findings. By having some non-trivial payment the companies are encouraging people to provide cheap sec audits for them.

TeMPOraL · on May 9, 2016

> What other option do these researchers have? Try to sell the vuln on the black market (illegal)

That's probably what GP is alluding to. They do have this option and I can imagine they could get much more money this way, with little or no way to trace the source of a 0day back to them.

> Companies are under no obligation to pay and researchers are under no obligation (except ethics, I guess) to turn over their findings. By having some non-trivial payment the companies are encouraging people to provide cheap sec audits for them.

Well, of course. The question is, whether the amount companies chose to pay is enough to get most people to report vulnerabilities instead of selling them elsewhere, or if those companies are just putting a lot of trust in the strong morals of security researchers?

hrehhf · on May 9, 2016

Is it actually illegal to sell information that a particular bug exists and can be exploited? What if it is sold to the company which owns the software? What if it is sold to one's own government?

sigmar · on May 9, 2016

Well it is "pennies" + public recognition. I am also of the opinion that small bounties are much better than no bounty (cough apple cough)

Titanous · on May 9, 2016

Tesla does have a bug bounty program: https://bugcrowd.com/tesla

beeboop · on May 10, 2016

I specified they didn't have one for their car software, but I missed the part of that webpage that does actually list "vehicle" as a potential target. I was thrown off by the "This program is focused on Tesla's public facing web application" statement. So you're correct.

sqeaky · on May 9, 2016

From the article:

> High energy particles may have disrupted the onboard electronics.

If this is a problem Humanity has a bigger problem.

Humor aside I agree defects in devices that can rapidly become two ton death machines must be taken seriously. There must be safe guards at every level.

walrus01 · on May 9, 2016

Or, "how not to drop a satellite on the floor":

https://www.spaceflightnow.com/news/n0410/04noaanreport/

Terr_ · on May 9, 2016

The wonderful thing about software is that a large number of our "checklists" can be automated.

Unfortunately, the checklist of checklists to automate is still manual :p

walrus01 · on May 9, 2016

"We've put our automated checklist for site deployment on the internal mediawiki page at https://site.goes.here , problem is that Bob was trying to rack the 85 pound, 4U server containing the hypervisor that runs the wiki instance, and he dropped it on the floor from a 6' height"

B1FF_PSUVM · on May 9, 2016

Glance through ... "The MIB found such violations were routinely practiced." ... what?

Backtrack, backtrack ... "the NOAA N-PRIME Mishap Investigation Board (MIB)" ... ah, OK.

(Men in Black movie, if that was before your time. Where the checkout line tabloid rags - clickbait before its time - were the research reports. ;-)

krapp · on May 9, 2016

> "the NOAA N-PRIME Mishap Investigation Board (MIB)" ... ah, OK.

That's just what they want you to think.

walrus01 · on May 9, 2016

and definitely not an SNMP MIB

hcrisp · on May 10, 2016

Thanks for the link. Earlier, I imagined this while watching the James Webb Space Telescope rollover [0]. Now I know it can really happen.

[0] https://youtu.be/PhGfgREoBj4

sandworm101 · on May 9, 2016

Honestly, this was the best place for this to happen. Better it tip onto the floor than something fail on lift-off or while in orbit. A sat in the garage is worth 1/4 of one in orbit.

walrus01 · on May 9, 2016

Probably wouldn't have happened during launch, though, as the mechanism attaching it to the rotational base would've been replaced with a special pyro platform rigged with explosive bolts to separate the satellite from the final carrier stage. Hopefully the people who rig the explosive bolts are a bit more thorough with their checklist.

https://psemc.com/products/cartridge-actuated-devices-cads/e...

r721 · on May 9, 2016

Previous discussion: https://news.ycombinator.com/item?id=11602536

mizzao · on May 9, 2016

This seems to follow a general pattern in today's devices: expensive hardware ruined by crappy software.

Why is it that good software quality is generally an afterthought in so many systems?

whack · on May 9, 2016

As an ex-hardware engineer, reliability and thorough testing has been drilled into the heads of every single hardware team. Not because we care more about quality, but because of simple economics: finding a single major bug after tape-out is going to be extremely expensive to fix, and will push out your product launch by months. Finding a single major bug after release just might doom your entire company. Hence why reliability and testing is treated like a life-or-death priority.

In software on the other hand, we're spoilt by how easy to is to fix bugs (compiling your code to generate a binary takes minutes/hours, not months, and it doesn't cost you millions of dollars), and also how easy it is to deploy fixes to customers (just release a patch and when someone complains, tell them to download it). Hence why if you're working at Google/Amazon/Apple/99%-of-software-teams, the rational thing to do is to treat reliability/testing as a 2nd class priority, not a 1st class priority.

Except that there's a small number of software applications where this doesn't hold. Where testing really does need to be treated as a 1st class priority. Problem is, the engineers and managers working in these teams belong to an industry that has a completely different mindset. A mindset that translates itself into all parts of the software stack, which these engineers are then forced to rely on. A mindset that will creep into everyone's minds, by osmosis, no matter how much they try to fight it. A mindset that works fine 99% of the time, but every now and then, leads to events exactly like this.

TeMPOraL · on May 9, 2016

> Problem is, the engineers and managers working in these teams belong to an industry that has a completely different mindset. A mindset that translates itself into all parts of the software stack, which these engineers are then forced to rely on. A mindset that will creep into everyone's minds, by osmosis, no matter how much they try to fight it. A mindset that works fine 99% of the time, but every now and then, leads to events exactly like this.

The famous Worse is Better, aka. 80% solution that isn't even correct (it's fine as long as it doesn't segfault in that 80% of cases) today is better than a proper solution next week.

onion2k · on May 9, 2016

There's a lot of software out there. Every car, every phone, every plane, computer, TV, washing machine, factory robot, etc - every thing works pretty much flawlessly on a day to day basis. Occasionally things need a reboot or a patch but that's all. The number of times software fails in our daily lives is pretty low considering we interact with it so often. I'd contend that software quality is actually staggeringly high. It could be higher, but there's the law of diminishing returns and all that.

ssivark · on May 9, 2016

That is not satisfactory. Some instances of software are more crucial than others eg. software running money transfers must have less tolerance for bugs than a game I play on my phone. The question is not whether software works most of the time, but how badly things go wrong when mistakes happen.

Here's an analogy: A weather forecaster can predict "no hurricane" every single day and have a near-perfect success rate. Needless to say, that's next to useless. (False positives and false negatives have wildly different costs in this context.)

onion2k · on May 9, 2016

At a guess I'd say that the probability of a fault in a piece of software does approximately match the cost of it failing. A phone game is very likely to have more bugs than a banking system. My point is that both applications actually work really well. In my experience premium phones games crash maybe once in every thousand runs. Bank money transfer software crashes perhaps once in every few billion runs (that's a total guess but we'd hear about it if it was higher and there are a lot of bank transfers every day). I think that's quite good.

marssaxman · on May 9, 2016

I'd argue the opposite: software quality is generally about as low as it can possibly be while having any value at all. I do my best to interact with as small a quantity of it as I can manage, since it generally frustrates me so much that my life is better if I continue to do most things manually.

A couple decades of work on compilers, dev tools, drivers, firmware, and other sorts of system software may have warped my perceptions somewhat.

(Edit: I don't say that to pretend some kind of superiority; my code is just as buggy as everyone else's. I've just had many occasions to dig around in other people's code and figure out why it stopped working, after we tried to change something that shouldn't have affected it. Hardware, too, you can't trust that either.)

ksk · on May 9, 2016

>Every car, every phone, every plane, computer, TV, washing machine, factory robot, etc - every thing works pretty much flawlessly on a day to day basis.

What is your definition of "pretty much"? In my experience none of those things you mentioned have ever worked pretty much flawlessly.

Retra · on May 9, 2016

"Flawlessly" is definitely not the word I'd use. "Adequate" maybe. But just about every piece of software I use day to day is frustrating to interact with, and those that aren't might be 'upgraded' at any second.

pnathan · on May 9, 2016

Because the hardware folks are king in those shops, not leading to proper respect and care for software. It has to be a yin/yang situation, not a pyramidal one.

InclinedPlane · on May 9, 2016

It's more that software is becoming a dominant factor in everything. Which is both good and bad. It's bad in that insufficient effort and process is going into ensuring software quality. In a very real sense we're in another "software crisis". Software complexity is growing beyond the bounds of the previous era's "best practices" (of which only a subset were ever actually applied) to manage. While at the same time the impact of software problems is growing, especially because the potential for problems due to interactions of software systems is also growing.

Though in truth that's mostly unrelated to this instance. Real-time control systems for satellites have always been a tough problem, and there have traditionally been growing pains for organizations tackling new designs. Even NASA has had its fair share of stumbles in that regard, for example the loss of the Mars Polar Lander and even the Mars Climate Orbiter (the famous metric/english units fuckup) were substantially due to problems with the software and operational management of the spacecraft.

amk_ · on May 9, 2016

Every hardware platform has unique abilities and constraints. The code that provides the "business logic" part of the device is often one-off and domain- or application-specific. Even if you're running on some general purpose platform, that part of the code is going to be about as battle-tested as anything running on some new architecture - not much.

Edit:

Someone alse pointed this out but it's worth reiterating - hardware systems are simulated extensively and analyzed in various ways for performance, cost, reliability and other factors. Software less so, and test cases, if they exist, might not necessarily capture all the power loss/inconsistent state problems that can occur in the real world. Maybe in the future embedded applications could be analysed with a fuzzer or something to help weed out that kind of once-in-the-field-never-in-the-lab bug.

flyinglizard · on May 9, 2016

It's not exactly under a plain software/hardware dichotomy. Rather, this incident indicates a system design problem - no preplanned failure modes, redundancy, cross checking and thorough testing. Good system design regime should overcome both software and hardware issues, either by detecting the issues in advance or by implementing measures to withstand them during operation.

There's so much that can go wrong in a satellite that expecting software to function well is very naive.

Besides, it's not like testing a satellite is easy as running unit testing on a Python module - there are countless operational and environmental variables that are very difficult to synthesize.

outworlder · on May 9, 2016

I have failed to see evidence that the software was faulty.

If anything, it tried to do what it was supposed to: stop the satellite from spinning.

From the report:

> Feb 28, JAXA sent the commands to update the RCS control parameter based on the center of mass changes by deployment of the EOB. Through the post-incident investigation, it is confirmed the RCS control parameter on Feb 28th was not appropriate.

So the software worked as designed, but was fed crappy data.

fixermark · on May 9, 2016

Indeed. And the question of whether we consider software consuming data from a physical device that doesn't correctly account for the possibility of that device feeding it garbage to itself be crappy is a largely academic question.

The whole system has to be considered for success.

gloriousduke · on May 9, 2016

Would you say it's easier to spec out hardware than software? I don't work with hardware, but from a layman's perspective it seems like you can (and have to be) utterly exacting when it comes to product matching designs, with said designs themselves being very exacting, i.e. using physics equations.

With complex software, it's pretty dang hard to layout an entire program in specifications without going down the provably correct approach, which likely will run into limitations when the real world is involved. Waterfall dev for operating systems of the past come to mind.... Probably the best you can do is be very stringent with your coding standards, something space agencies are the masters of.

extrapickles · on May 9, 2016

With hardware if you are not careful you can get parts that meet spec, but are useless because you forgot to specify something correctly (eg: referenced the wrong dimension, error stacking, etc). Generally the only reason why hardware would be easier to spec is that the output can be simpler (eg: a plate out of this material with holes in these locations).

Then you get the stuff that lurks in the middle that could be a software, electrical or mechanical problem (or all 3), and that is where the bugs like this occur. I think its more a result of splitting up work into domain silos, so they do not think about problems that arise in the other domains that can be fixed easily in another. This can lead to one or more domains getting no time at all to implement their part of the system as they spent most of the time budget waiting on another domain to finish. Watchdogs and other "things are going crazy" sort of belt&braces protection are the first parts of a design to get dropped when in a time crunch.

fixermark · on May 9, 2016

I get the sense that the difference often stems from the nature of software as "words you can run" vs. hardware as "physical manifestation of words."

For quite a bit of software these days, the correct program and the description of the correct program are pretty similar artifacts.

gloriousduke · on May 9, 2016

> For quite a bit of software these days, the correct program and the description of the correct program are pretty similar artifacts.

My brain was going mental mobius strip when I was originally responding and thinking of something along these lines. Maybe what I was getting at is that the an idea like the Halting problem doesn't apply as viciously to hardware?

ksk · on May 9, 2016

One of the reasons is that people in the software field (and in a few others as well) have this urge to look for 'challenging' work, i.e. things that they haven't done a million times before to the point where they've become boring. Which is also why you see a lot of s/w engineers working on untested (anything that can't achieve a year of up-time) clever designs/implementations. Its exciting to be clever, but its boring to be conservative with your design to achieve high reliability.

michael_storm · on May 9, 2016

Question for someone who knows more about satellites than me:

> In satellites, the STT typically gets a good fix and sends the data to the IRU. The IRU uses the data to set its current reading and to measure how far it drifted since the last update. After calculating the drift it uses drift adjustments to compensate for the future drift. Clearly if the compensation calculation is wrong the future readings are going to be wrong. This appears to have played a role since the ACS attempted to correct a rotation that didn’t exist. The erroneous configuration information led the ACS to aggravate, not correct, the rotation.

Does this mean that error will compound if either the attitude or compensation are calculated or performed incorrectly? If so, is there a way to reduce that compounding, perhaps by making them more independent systems? Or am I reading too much into a summary? (And, you know, space is hard.)

walrus01 · on May 9, 2016

There are satellite that use the STT system almost entirely independently of all other systems. Particularly ones that need to remain in a certain orientation for 100% of their service life, such as geostationary telecom and weather satellites that orbit along the equator and are always aimed towards the visible hemisphere of earth. On those, the directional hemisphere and spot beam antennas are fixed in place (or can only move a few degrees motion at best, such as Ku band spot beam antennas), relying on the body orientation of the satellite to service a certain area of the visible hemisphere.

Satellites are designed to go into "safe mode" if certain fault protection events happen, or the multiply redundant control systems/onboard computers don't agree with each other. Safe mode usually means shutting down all nonessential electrical loads and trying to orient themselves so that solar panels receive the greatest amount of charge, while listening for command and control data on their omni (L and S band) TT&C antennas.

With this event it sounds like something REALLY went wrong since not only did the satellite try to correct a nonexistant wrong orientation via its reaction wheels (reaction wheels are not nearly as powerful in real life as they are in kerbal space program), it then decided to start expending propellant and spun itself up to such a RPM of revolutions that it tore off its own solar panels, and anything else that would be vulnerable to high centrifugal G forces. Automated code that expends propellant is usually checked much more carefully than this, since the amount of propellant is fixed and non renewable, usually the primary constraint on the total service life of the satellite. Most satellites run out of stationkeeping/orientation propellant (or propellant for ion engine delta-V changes) long before their multiply redundant solar/charge controller/battery/computer control systems fail.

drostie · on May 9, 2016

The way I read it, the bug is something like this:

There are three components, one is a Controller which goes out into the world and calculates how we're oriented, the second is basically just a Model in the MVC sense (the IRU). The first Controller has a responsibility to update the Model with whatever it finds out.

The third component is a separate part of the IRU which is essentially both a view on the model (where am I pointed, where am I supposed to be pointed?) combined with a controller which fires some thrusters to try to point you in the right direction.

The problem was that the first component wasn't updating the model. Therefore the second component set "fireThruster1: true," but then never got any information that it was approaching the right direction -- therefore kept that variable set.

The thruster continuously fired, spinning the satellite around and around until it was ripped apart by centrifugal forces.

2PetitsVerres · on May 10, 2016

> Does this mean that error will compound if either the attitude or compensation are calculated or performed incorrectly? If so, is there a way to reduce that compounding, perhaps by making them more independent systems?

It's more the opposite. (well, not quite exactly the opposite, but I'll try to explain. It's a simplified explanation, I hope I didn't make any obvious error in the simplification)

IRU is mainly composed of gyroscope, measuring the spacecraft angular rate over three axis. They don't give you the absolute orientation. Of course, if you know the initial orientation, you could integrate angular rotation and have the current one. Except that in reality, you have several problems. The first one is that even if you have a perfect unbiased gyro, you need a continuous time integration, and not use sample every 0.X seconds. But the main problem of gyro are the instantaneous bias, and the slow change in bias over time. And when you integrate bias over a long time, the result diverge (all the type of gyroscope have these, but the level are different. There is also added noise on top and quantisation of the output. You get value over 8, 12 or 16 bits, not a real number. But integrating this should average to 0, so that's ok)

So you can't use only an IRU.

Could you use only a star tracker? Again, in an ideal world, yes. The star tracker gives you the absolute orientation and if you want the angular rate, you can get it by derivation (by differentiating, in discrete time) But the star tracker has its own problem. It cannot gives you an orientation when you turn it on, it has an acquisition phase first. It does not work when you have the sun in the field of view, or the earth, and sometimes also the moon. (it's basically a camera trying to take picture of the stars.Plus a lot of complicated software) As it includes a lot of complicated software, you don't necessarily want to use it in safe mode (software has bug. Also it uses electrical power) Sometime STT don't work when the rotation speed is too high also. And complexity also means more failure modes.

So STT only for all phases, including safe mode, if often not accepted by the system engineer/the satellite final client/the quality assurance department/... (sometimes its just not possible)

So what is usually done on satellite including both an IRU and a STT is to blend their data together. That gives you an accurate position (coming from the STT when it is ok, or by integrating the IRU data for "short" period when you get the sun in the field of view, for example), an estimation of the (current) gyro bias, and an accurate rotation rate (by using the IRU data minus the estimated bias)(this is usually based on a (extended) Kalman filter, but there are probably other methods)

With all this, you actually get a better attitude and rotation rate than IRU or STT alone. (when going in safe mode, you will probably switch to an attitude estimation based on IRU and some inaccurate but really simple sun sensor. Safe mode usually only wants to point some satellite axis to the sun with a bad accuracy, up to a few degree)

When everything works fine, it's nice.

But according to what they say, it seems that they have got an error in the algorithm (or some specific bug triggered by some strange sequence). There was the Earth in the field of view of the STT (so it was unavailable), then it has been to acquisition mode (no usable data), the tracking mode (data is good). At this point, there is a huge gap in the bias estimation (this kind of stuff happens when you re initiate a filter, or when conditions change), it's supposed to converge relatively fast to the good value (with "fast" being dependent on your algorithm). But during this convergence phase, they think that the STT did go back to acquisition mode. (for some reason, not completely clear). The estimated bias had a relatively high value (21 deg/h. Big for a bias), not corresponding to the real bias. But the satellite has keep using this value, until it finally reached safe mode.

It's not clear if it continued to use this bias value in safe mode, but it does not really matter (21 deg/h bias is not that big when you start using thrusters. It may use more fuel than needed but that's 0.3 deg/s. Relatively small). An error in other data uploaded in the software had basically transformed the safe mode to a kill mode (from what I understand, any transition to safe would have been extremely bad)

nurblieh · on May 9, 2016

Ironically, the secondary payload that launched with Hitomi was a micro-sat which monitors space debris.

https://www.frontier.phys.nagoya-u.ac.jp/en/chubusat/chubusa...

smegel · on May 9, 2016

I wonder if a physics simulator would have predicted this outcome, and if in fact they have such a simulator for testing both the hardware and software together.

pavel_lishin · on May 9, 2016

It wasn't a matter of them not knowing what would happen if the satellite was stressed, it was a case of bad data that kept reporting that the satellite was spinning. The software then tried to correct the spin, which resulted in an actual spin, in the opposite direction, that kept accelerating since it was under the impression that the corrections weren't working.

fixermark · on May 9, 2016

I'd love to see a full post-mortem. It smells like something was very off in either the hardware design or software configuration, but not knowing their architecture, it's very hard to say with any certainty what could have been improved. A couple of questions I have:

- other systems I know of that care deeply about their attitude have multiple redundant sensors in place to "vote" on a consensus output in case one or more of them fails. Was that the case in this hardware design? If not, why not? If yes, how did the collective answer end up a constant error?

- did they have other sensors (such as a strain gauge) that could have been integrated into the model to spot-check this kind of failure mode? A rule like "If the satellite 'feels' like it's tearing itself apart, stop accelerating" could perhaps have been useful (on the other hand, it'd leave the craft vulnerable to other known failure modes, such as "thruster stuck in the on position and must be countered by another thruster to keep the craft stable," which almost killed one of the U.S. manned missions).

walrus01 · on May 9, 2016

Something as simple as software to prevent extended firings of a thruster for any reason would have worked. In a LEO satellite it's constantly being exposed to night/day cycles and isn't in danger of draining the batteries in safe mode, no matter what orientation it is. LEO satellites have low-bandwidth TT&C (tracking telemetry and control) omnidirectional antennas and radio systems in the L and S bands that don't particularly care about the orientation of the satellite. Code as simple as "if thruster tries to fire for greater than period of time, call exception, place satellite in safe mode" would have worked. Using ground based TT&C systems it's possible to manually reorient a satellite in safe mode, or query what its star tracker sees.

smegel · on May 9, 2016

So was it a faulty sensor then, not bad software? It's not really clear.

> but its data apparently was wrong, reporting a rotation rate of 20 degrees per hour, which was not occurring. The satellite attempted to stop this erroneous rotation using reaction wheels. The satellite configuration information uploaded earlier was wrong and the reaction wheels made the spin worse.

So both the data and configuration were both wrong, or is this one in the same?

2PetitsVerres · on May 10, 2016

It depends on what you call a "physics simulator". I don't know what they do at JAXA, but I would guess that they have at least some kind of simulator. But on thing which is difficult to avoid is to feed your simulator configuration with the same value as your satellite.

From what I understood, the failure (the final one, in safe mode, which is not the initial error but the one which killed the satellite) seems to be an error in thrust parameters. Somewhere in the software satellite, you evaluate your rotation speed. In safe mode, you probably want to nullify this speed. Let's say that you have measured 1 deg/s around the X axis. The controller will say "give me a torque of - 10 Nms around X (10 is a made up values, and I don't take the inertia tensor into account)

The next stage of the software will convert this "-10 Nms around X" to valve opening of one or more thruster. To do this conversion, it must know how much torque is generated by each thruster. This information must be on board, but to compute it, you need: each thruster position, each thruster direction, each thruster force intensity, the satellite center of mass position.

It's not a problem, this information is available somewhere, in a database. The propulsion engineer, together with people doing the CAD model, the people computing the mass parameters, they have got these values. Someone has checked them also.

Now imagine a scenario where this has been tested but the error has not been found. I don't say it happen like this , but I've encountered this kind of situations. Fortunately, always before launches ;-)

Let's look at the simulator side. What does it need to simulate the actuation of this thruster? Basically, the same information. If the valve is open for t secondes, the torque is t*(F x r). Where does it get this value? Same database, it's the same spacecraft, the same thruster, the same position, the same direction, ...

It absolutely makes sense to not duplicate this value.

But what happens if the value is wrong? For example, if the database says that you should have the thrust vector direction, but you put the direction of the mass flow? You just get -T instead of T for this thruster, both on the simulator and the software. Not a problem, your simulation can still works. (if you do it only for one thruster, it may not work, as you may not have control around one axis. But if you do it for all, or maybe for just two opposite thruster, it still works. It may be or may not be noticed that it does not makes sense)

The verification of this database is an extremely difficult (and boring, IMHO) job. You have trap everywhere. You have this kind of sign error, you need to know if the rotation matrix is from reference frame A to B, or B to A, does a 1 in a bit field means actuating your RW CCW or CW, is the tachometer of the wheel in the same direction, how are the thruster numbered from 1 to X, bonus if you have multiple different ordering method because the software guy, the electrical guy and the mechanical guy use a different one...) And this base is huge. (people managing this merit respect, but we prefer to yell at them because they are always late. Due to our own input, of course)

fixermark · on May 9, 2016

I'd be interested to see a full post-mortem on this. What would they change about their process to avoid this failure-mode in the future?

2PetitsVerres · on May 10, 2016

I'm don't completely agree with the fact that it's called a "software update". Reading different article about it, from what I understand, there was an initial error (software error probably triggered by an hardware error probably due to an upset due to radiation), but this error is not very important. A series of event from this error triggered the safe mode (that's expected), and there was the critical problem.

They had updated parameter in the software describing the torque generated by each thruster (or the center of mass position, or the tensor of inertia, or parameters based on all this) These parameter are software parameters, but updating them is not updating the software. It's software data, not software code.

Of course this does not change the fact that it is a critical error, but it's not exactly software update (IHMO). It's a configuration error. It's strange that they didn't see that in a simulator before updating them, but it's possible that they may have used the same value in the part simulating the software and the part simulating the thrusters themselves.

(note: I'm working in the satellite on-board software/attitude control domain, but not for JAXA, in Europe. Anyway at my current position, I must both test this kind of code, and the parameter used in the code. And checking the parameters is much more difficult, because you must be sure that everyone agree on everything. This includes a lot of basic stuff, but it's a pain in the ass ;-) )

This pdf from JAXA is probably the initial source of all articles. http://global.jaxa.jp/press/2016/04/files/20160428_hitomi.pd... I found it interesting to read, if you know how it works. I would of course prefer to have more detains. I always want more details. But it's for the press...

verytrivial · on May 10, 2016

Hindsight is 20:20 of course, backseat driver etc. but if the on-board systems detected a bad rotation, then started a burn to correct it, presumably it could have been possible to detect that the burn was not "helping" during the burn? And halted the burn? The thrusters aren't usually that powerful, so the erroneous death-spin probably took a fair while to spin up. Even if the sensors were wrong, if you're a computer trying to get variable 'X' into a range, and you apply control 'Y', but 'X' moves further and further away, let go of the controls and ask ask an adult help! (Easier said than done, I know. I'm amazed space engineering works as often as it does! Super hard stuff.)

lloeki · on May 10, 2016

Of course it was using a feedback loop, but GIGO applies:

   The STT and IRU disagreed on the attitude of the satellite.
   In this case the IRU takes priority, but its data
   apparently was wrong, reporting a rotation rate of 20
   degrees per hour, which was not occurring.

Starting from there it would have no way of knowing about the true value of 'X', so the feedback loop was fed with wrong data and just kept taking decisions† that made things worse, especially given that:

   The satellite configuration information uploaded earlier
   was wrong and the reaction wheels made the spin worse.
   [...]
   the ACS attempted to correct a rotation that didn’t exist.
   The erroneous configuration information led the ACS to 
   aggravate, not correct, the rotation.

† Even without misconfiguration, stopping an object from spinning in a vacuum isn't as direct and linear as accelerating/braking in a car, requiring precise coordination of multiple fixed thrusters and/or reaction wheels.

Klasiaster · on May 9, 2016

The point is that you can not build a 100% perfect system, there will be always some mistakes even with a theoreme proven code base they appear on other places, just their number is reducable with much efforts.

kibwen · on May 9, 2016

What altitude was the satellite at when it broke up? Will the debris pose a problem for other satellites, or was it low enough that the pieces will reenter the atmosphere quickly?

Nicholas_C · on May 9, 2016

Is satellite debris an issue? Or does the vastness of the area in which satellites orbit take care of that?

dcposch · on May 10, 2016

Yes, it is a serious issue

https://en.wikipedia.org/wiki/Space_debris

https://en.wikipedia.org/wiki/Kessler_syndrome

DarkTree · on May 10, 2016

Very much a problem. [https://en.wikipedia.org/wiki/Kessler_syndrome]

piyush_soni · on May 10, 2016

And you ask why I am afraid to install that "flash player update". :)

samwestdev · on May 9, 2016

Guys always read the change log before upgrading

askafriend · on May 9, 2016

You mean changelogs like these? https://twitter.com/cirbif/status/728114363563839488

aaron695 · on May 9, 2016

I find this useless advice.

I didn't write or are aware of much of the code I'm part of.

Or remember some of the code I did write.

Or even understand the ramifications of what the change log means.

All assuming it's a well written change log.

You need external process to ensure upgrades don't break things.

Not a 'read the change log' policy so someone can blame you post breakage cause then it'll be easy to find issues in the change log.

ourcat · on May 9, 2016

"Are you sure you want to upgrade your satellite app?"

Yes...

boom...

adrianlmm · on May 9, 2016

I hope you get modded down, we are not in Slashdot.

ourcat · on May 10, 2016

Sorry. I was not aware that there was such a low tolerance for a tiny bit of lightheartedness here.

I hope you lighten up. Peace.

carapace · on May 9, 2016

Problems like this (and Nest thermostats bricking during winter) could be mitigated in the design phase by a thorough understanding of Cybernetics.

fixermark · on May 9, 2016

Interesting thought. Could you expand on it? What aspect of cybernetics could have improved this situation, for example?

carapace · on May 11, 2016

Briefly, Information Theory is getting real world phenomena to behave like symbols, while Cybernetics is getting symbols to behave like real world phenomena.

If you want to solve a math problem, then a computer plus a proper algorithm will suffice. If you want to design a system that interacts with its environment and can achieve goals while maintaining homoeostasis, the name for that is Cybernetics.

"Introduction to Cybernetics" by Ashby has been made freely available in PDF form by his estate. A great and noble service for which I commend them. http://pespmc1.vub.ac.be/ASHBBOOK.html

joezydeco · on May 10, 2016

If by "Cybernetics" you mean "Ancient thermopile-driven furnace control systems", then yes.

pmarreck · on May 9, 2016

Perhaps it should have been powered by Erlang (or Elixir):

https://www.youtube.com/watch?v=96UzSHyp0F8

boxy310 · on May 9, 2016

To people downvoting this: Cybernetics as a discipline is the study of systems governance ("kybernetes" being the Greek word for "governor" or "steersman") [1]. A solid understanding of self-reinforcing positive feedback loops like a bad spin reading leading to more spin thrusting is the exact kind of situation that should be planned for in self-adaptive systems not immediately available to direct control.

[1] https://en.wikipedia.org/wiki/Cybernetics

curiousgal · on May 9, 2016

That's really awesome

pmarreck · on May 10, 2016

May not be 100% relevant but I agree, that video is awesome (and so is Erlang/Elixir)

wyattjoh · on May 9, 2016

For some reason I was expecting this to be related to a Windows 10 update.. Glad it wasn't.