> That’s bad practice, full stop. No disagreement, there. > it was a mitigation ...

bumby · on April 4, 2023

In various times in my past I've been a private pilot, airframe mechanic, flight-control-computer engineer, aerospace test & evaluation software quality engineer, and aerospace software safety manager. I've even worked with Boeing. So I am quite familiar with these concepts.

>It wasn't really a risk. It was to make it behave the same.

Hard disagree here. The fact that it did not behave the same and lead to mishaps shows there is a real risk. That risk could have been mitigated in various ways (e.g., engineering via hardware or software, administrative via training etc.) but they did not. You downplaying the risk as not credible is making the same mistake.

">There weren't any software bugs in MCAS. The software was implemented according to the specification.

I'm not claiming there were bugs. This seems to be a misattribution regarding how software fails. There are more failure modes than just "bugs". It can be built to spec but still wrong. This is the difference between verification and validation. Verification means "you built it right" (ie meets specs) while validation means "you built the right thing" (ie it does what we want). You need both and in this instance, there's a strong case they didn't "build the right thing" because their perspective was wrong.

>Your thesis that using software to run it as some unusual risk is simply dead wrong.

My thesis is that they didn't know how to effectively characterize the software risk because, as you point out, software risks are different than the risks of mechanical failure. Software doesn't wear-out or display time-variant hazard rates like mechanical systems. Rather, it incurs "interaction failures." The prevalence of these failures tends to grow exponentially as the number of systems that software touches increases. It's a network effect of using software to control and coordinate more and more processes and is distinct from buggy software failures. Which is why we need to shift our thinking away from the mechanical reliability paradigm when dealing with software risk. Nancy Leveson has some very accessible write-ups on this idea. There's nothing wrong with using software to mitigate risk, as long as you're actually characterizing that risk effectively. If I keep thinking about software reliability with the same hardware mentality you're displaying, I'll let all those risks fall through the cracks. They may have verified the software met specs, but I could also claim they didn't properly validate their software because you usually can't without understanding the larger systemic context in which it operates.

So what does that mean in the context of Boeing and, in a broader sense, Tesla? Boeing did not capture these interaction risks because they had an overly simplified idea of the risk and mitigations. They did not capture the total system interactions because they were myopically focused on the software/controls interface. They did not capture the software/sensor interface risk (even though their HA identified that risk and required redundant sensor input). They did not capture the software/human interface risk, which led to confusion in the cockpit. They thought it was a "simple fix". Tesla, likewise, is trying to mitigate one risk (supplier/cost risk) with software. TFA seems to implicate them in not appropriately characterizing the new risks they introduced with the new approach. I'm saying that is a result of a faulty design philosophy that downplays (or is ignorant of) those risks.