I'm aware that MCAS is still in use. I think you may be conflating two different claims here. I'm not claiming that software is inherently dangerous. I'm claiming that software used as a workaround for sound risk mitigation is dangerous, especially when it's a workaround for a hardware problem, because of the interaction effects. It's pretty clear from the hazard analysis and the subsequent decisions that Boeing didn't understand and mitigate the MCAS risk effectively.
And, yes, I'm aware of the two philosophical camps regarding ultimate authority in command (pilot vs. software). That doesn't negate my point. Software, in either case, shouldn't be a workaround solely because it's easier to implement than a hardware change. Using it as an engineered mitigation is fine, but you have to actually implement the mitigations properly. For example, the hazard analysis classified MCAS as "critical". That classification required redundant sensor inputs, yet Boeing didn't make that the default and instead sold it as an option. (Never mind that the classification arguably should have been higher; they didn't even follow through on their own processes for the classification they did assign.)
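To make the redundancy point concrete, here's a rough sketch of what a two-sensor cross-check buys you (illustrative only; the disagree threshold, the averaging, and the fail-safe behavior are my assumptions, not Boeing's actual logic):

```python
# Illustrative sketch only -- not Boeing's logic. The disagree threshold,
# the averaging, and the fail-safe behavior are assumptions for the example.

def cross_checked_aoa(left_aoa_deg, right_aoa_deg, disagree_threshold_deg=5.5):
    """Cross-compare two AOA vanes before feeding any automation.

    Returns the averaged value when the vanes agree, or None to signal that
    the downstream control law should disengage and annunciate the
    disagreement to the crew.
    """
    if abs(left_aoa_deg - right_aoa_deg) > disagree_threshold_deg:
        return None  # disagree: fail safe rather than trust either channel
    return (left_aoa_deg + right_aoa_deg) / 2.0
```

With only one vane wired in, there's nothing to compare against, so a single failed sensor feeds straight into the control law.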
> I'm making the claim that software as a workaround to sound risk mitigation is dangerous
Mechanical systems aren't inherently better or more resistant to hidden flaws. Remember it was a failure of the mechanical AOA sensor that initiated the MCAS failure.
> Software, in either case, shouldn't be a workaround solely because it's easier to implement than a hardware change.
Software runs our world now. Mechanical computers were on airplanes for decades before software, and they were hardly free of faults and problems.
>Mechanical systems aren't inherently better or more resistant to hidden flaws.
Besides the fact that the failure modes of mechanical systems are generally better understood than those of software, you're again talking past my point and having an entirely different conversation. This isn't about some "mechanical vs. software" dichotomy. I literally said there is nothing about software that makes it inherently dangerous, and that using software as an engineered mitigation is fine if the mitigations are implemented properly.
What isn't fine is all the process and design gaps that occurred. Things like using software as a mitigation not because it was the best alternative, but simply because it was cheaper and faster. Things like not following your own hazard analysis when it comes to mitigation. Or not accurately characterizing the risk or the failure modes because you didn't understand the system interactions.
>Remember it was a failure of the mechanical AOA sensor that initiated the MCAS failure.
This is exactly why, had they followed their own procedures and hazard analysis, redundant sensors would have been the default. A "critical" item (like MCAS in the hazard analysis) is supposed to get redundant input by default.
My post isn't about mechanical vs. software. It's about the allure of using software as a workaround, which leads to process gaps and bad design philosophy. Like removing sensors because of a supplier/cost issue and assuming "we'll fix it with software" instead of doing the hard work to understand, characterize, and mitigate the risk effectively.
> Like removing sensors because of a supplier/cost issue
There was more than one AOA sensor. Not hooking the other one up to the MCAS system could hardly be a cost issue. Nor is it an issue of software vs. hardware. Being implemented in software had nothing to do with MCAS's problems. The software was not buggy, nor was it a workaround. What was wrong was the specification of how the software should work.
I was bringing the context back to the article of this thread, not talking about Boeing there. Sorry if it led to confusion.
You can note that I acknowledged in other replies that there were multiple AOA sensors. Further, they were already “hooked up” to MCAS, but Boeing made this safety-critical redundancy requirement an option within the software. That’s bad practice, full stop.
I still maintain that the software was 100% a mitigation for a hardware change, and I think there’s plenty of other evidence supporting that. E.g., had they not updated their engines, would MCAS have been installed? If the answer is no, then it was a mitigation for a risk introduced by a hardware change.
I’ll say it one more time just to be clear: I’m not saying the concept was bad. I’m saying their design philosophy and implementation were bad. They could have used a software mitigation within the right process/philosophical framework just fine. What doesn’t work is using software as an “easy” risk mitigation strategy when you don’t understand the risk or the processes necessary to fully mitigate it. The problem arises because software is a seductive fix: it looks relatively easy and cheap on the surface, but if your design philosophy and processes aren’t equipped to implement it effectively, that “easy” fix is rolling the dice.
> it was a mitigation to a risk introduced by a hardware change
It wasn't really a risk. It was to make it behave the same.
Allow me to explain something. A jet airliner is full of hardware and software adjustments to the flying characteristics. For example, look at the wing. What do you think the flaps and slats are for? They are to completely change the shape of the wing, because a low speed wing is very very different from a high speed wing. There are also systems to prevent asymmetric flaps, as that would tear the wings off.
The very existence of the stab trim is to adjust the flying characteristics. The stab trim has an automatic travel limiter to constrain the travel as the speed increases because, you guessed it, full travel at high speed will rip the tail off.
The control columns are connected to a "feel computer" which pushes back on the stick to make the airplane feel consistent from low speed to high speed. This feel computer can be mechanical or software. Pilots fly by the force feedback on the stick, not its travel. Without the feel computer, the airplane would feel like a completely different airplane at high speed, and pilots would promptly rip it apart.
There are plenty more of these. The MCAS concept is no different in any substantive way.
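To make the speed-scheduling idea concrete, here's a toy sketch of both mechanisms (all numbers are invented, not from any real aircraft):

```python
# Toy illustration of speed scheduling -- all numbers are invented, not
# taken from any real aircraft.

RHO = 1.225  # sea-level air density, kg/m^3

def dynamic_pressure(tas_mps):
    """q = 0.5 * rho * V^2: aerodynamic loads grow with the square of speed."""
    return 0.5 * RHO * tas_mps ** 2

def stick_force_per_unit_deflection(tas_mps, base_gain=0.02):
    """Feel-computer-style gain: push back harder on the column as q rises,
    so the pilot's force cue tracks what the airframe actually experiences."""
    return base_gain * dynamic_pressure(tas_mps)

def stab_trim_travel_limit_deg(tas_mps, low_speed_limit=17.0, high_speed_limit=4.0):
    """Travel-limiter-style schedule: full stab travel at low speed,
    progressively less as speed builds (a linear blend, purely illustrative)."""
    frac = min(max((tas_mps - 60.0) / (250.0 - 60.0), 0.0), 1.0)
    return low_speed_limit + frac * (high_speed_limit - low_speed_limit)
```

Whether the scheduling lives in cams and bellows or in code, the job is the same: keep the airplane controllable and consistent across the envelope.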
Your thesis that using software to run it poses some unusual risk is simply dead wrong. What was wrong with the MCAS system was:
1. reliance on only one sensor
2. too much travel authority
3. it should have shut itself off if the pilot countermanded it
What was also wrong was:
a. Pilots did not use the stab trim cutoff switch like they were trained to
b. The EA pilots did not follow the Emergency Airworthiness Directive sent to them which described the two step process to counter MCAS runaway
There weren't any software bugs in MCAS. The software was implemented according to the specification. The specification for it was wrong, in points 1..3 above.
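If you want to see how small the specification fix is, here's a rough sketch of an MCAS-like law with points 1..3 addressed (my own illustration with invented names, thresholds, and limits, not the actual 737 MAX software update):

```python
# Rough sketch of an MCAS-like activation law with the three spec fixes
# applied. Illustrative only; thresholds, limits, and names are invented.

def mcas_like_command(left_aoa, right_aoa, aoa_threshold,
                      pilot_trim_input, total_applied_deg,
                      authority_limit_deg=1.0, disagree_threshold=5.5):
    """Return an incremental nose-down stab trim command in degrees."""
    # Fix 1: require both sensors, and require them to agree.
    if abs(left_aoa - right_aoa) > disagree_threshold:
        return 0.0
    # Fix 3: stand down the moment the pilot countermands with manual trim.
    if pilot_trim_input:
        return 0.0
    # Fix 2: cap total authority so the system can never drive the stab to a
    # position the pilot cannot out-trim or out-pull.
    if total_applied_deg >= authority_limit_deg:
        return 0.0
    aoa = (left_aoa + right_aoa) / 2.0
    if aoa > aoa_threshold:
        return min(0.1, authority_limit_deg - total_applied_deg)  # small increment
    return 0.0
```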
P.S. Mechanical/hydraulic computers have their own problems. Component wear, dirt, water getting in it and freezing, jamming, poor maintenance, temperature effects on their behavior, vibration affecting it, leaks, etc. Software does not have those problems. The rudder PCU valve on the 737 had a very weird hardover problem that took years to figure out. It turned out to be caused by thermal shock.
At various times in my past I've been a private pilot, airframe mechanic, flight-control-computer engineer, aerospace test & evaluation software quality engineer, and aerospace software safety manager. I've even worked with Boeing. So I am quite familiar with these concepts.
>It wasn't really a risk. It was to make it behave the same.
Hard disagree here. The fact that it did not behave the same and led to mishaps shows there was a real risk. That risk could have been mitigated in various ways (e.g., engineered mitigations via hardware or software, administrative mitigations via training, etc.), but it wasn't. By downplaying the risk as not credible, you're making the same mistake.
>There weren't any software bugs in MCAS. The software was implemented according to the specification.
I'm not claiming there were bugs. This seems to be a misunderstanding of how software fails. There are more failure modes than just "bugs". Software can be built to spec and still be wrong. This is the difference between verification and validation. Verification means "you built it right" (i.e., it meets the spec), while validation means "you built the right thing" (i.e., it does what we want). You need both, and in this instance there's a strong case they didn't "build the right thing" because their perspective was wrong.
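To make the verification/validation distinction concrete, here's a toy example (my own sketch; the spec, the numbers, and the scenario are invented and have nothing to do with the real code or test artifacts). The first test passes because the function meets its spec; the second fails on purpose because, fed by a single stuck vane, the spec-compliant behavior is still the wrong system behavior:

```python
# Toy verification-vs-validation example. Run with pytest. The "spec" and
# the scenario are invented for illustration; this is not the real MCAS
# logic or test suite.

def nose_down_increment(aoa_deg, threshold_deg=14.0):
    """Spec: command a 0.27-degree nose-down trim increment whenever the
    (single) AOA input exceeds the threshold. Built exactly to spec."""
    return 0.27 if aoa_deg > threshold_deg else 0.0

def test_verification():
    # "Did we build it right?" -- the unit meets its spec. Passes.
    assert nose_down_increment(20.0) == 0.27
    assert nose_down_increment(5.0) == 0.0

def test_validation_stuck_sensor_scenario():
    # "Did we build the right thing?" -- system-level scenario: the lone AOA
    # vane fails stuck at 74 degrees, so the spec-compliant unit keeps
    # commanding nose-down trim every cycle, without bound. Fails.
    total_trim = sum(nose_down_increment(74.0) for _ in range(40))
    assert total_trim < 2.5, f"runaway trim: {total_trim:.1f} degrees"
```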
>Your thesis that using software to run it as some unusual risk is simply dead wrong.
My thesis is that they didn't know how to effectively characterize the software risk because, as you point out, software risks are different from the risks of mechanical failure. Software doesn't wear out or display time-variant hazard rates like mechanical systems do. Rather, it incurs "interaction failures." The prevalence of these failures tends to grow exponentially as the number of systems the software touches increases. It's a network effect of using software to control and coordinate more and more processes, and it's distinct from buggy-software failures, which is why we need to shift our thinking away from the mechanical reliability paradigm when dealing with software risk. Nancy Leveson has some very accessible write-ups on this idea. There's nothing wrong with using software to mitigate risk, as long as you're actually characterizing that risk effectively. If I keep thinking about software reliability with the same hardware mentality you're displaying, I'll let all those risks fall through the cracks. They may have verified the software met its specs, but I could also argue they didn't properly validate it, because you usually can't do that without understanding the larger systemic context in which it operates.
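A quick back-of-the-envelope on why that network effect bites (the component counts are arbitrary, just to show the growth):

```python
# Back-of-the-envelope only: arbitrary component counts, simple on/off modes.
from math import comb

for n in (5, 10, 20, 40):
    pairwise = comb(n, 2)   # pairwise interfaces grow roughly quadratically
    joint_states = 2 ** n   # joint mode combinations grow exponentially
    print(f"{n:>3} components: {pairwise:>4} pairwise interfaces, "
          f"{joint_states:,} joint mode combinations")
```

You can review the pairwise interfaces; it's the joint mode combinations that nobody fully enumerates, and that's where interaction failures hide.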
So what does that mean in the context of Boeing and, in a broader sense, Tesla? Boeing did not capture these interaction risks because they had an overly simplified idea of the risk and mitigations. They did not capture the total system interactions because they were myopically focused on the software/controls interface. They did not capture the software/sensor interface risk (even though their HA identified that risk and required redundant sensor input). They did not capture the software/human interface risk, which led to confusion in the cockpit. They thought it was a "simple fix". Tesla, likewise, is trying to mitigate one risk (supplier/cost risk) with software. TFA seems to implicate them in not appropriately characterizing the new risks they introduced with the new approach. I'm saying that is a result of a faulty design philosophy that downplays (or is ignorant of) those risks.
> the software workarounds don't behave like a hardware-engineered mitigation
You probably should never get on an Airbus, because they won't fly without computers.