E-Stop and Fuel, software that keeps you awake at night

otakucode · on Feb 11, 2018

Rather than the Therac-25 debacle, I would more recommend looking into the Toyota 'unintended acceleration' case. And the legal fallout from that. Because it is terrifying. Toyota was essentially as grossly negligent as it is possible to be. And the result? The court said there existed no standards they could be held legally liable for violating. So your self driving car? It will be developed by junior developers hired as cheaply as possible, driven like slaves by business-oriented managers who only care about meeting schedules, not given the tools or information needed to do an effective job, with testing time cut short and any claimed 'industry standards' for safe coding ignored. The automotive industry had 90+ coding practices they list as either 'required' or 'recommended'. Toyota followed 4 of those in their code. And the court said this was OK. Do you think Toyota spent tens or hundreds of millions of dollars rebuilding their entire development infrastructure, hiring more competent software engineers, firing the business managers who got people killed by rushing an unsafe product to market, and putting the engineers in charge of all future decisions regarding scheduling and release? No, of course not. If anything, they probably saw it as carte blanche to make things worse.

elcritch · on Feb 11, 2018

Wasn’t the cause of the “unintended acceleration issue” hardware and driver error related rather than software related? That’s what I’ve read in one book about Toyota management (the Toyota Way) and also a Wikipedia entry on the topic says:

> Toyota also claimed that no defects existed and that the electronic control systems within the vehicles were unable to fail in a way that would result in an acceleration surge. More investigations were made but were unsuccessful in finding any defect until April 2008, when it was discovered that the driver side trim on a 2004 Toyota Sienna could come loose and prevent the accelerator pedal from returning to its fully closed position.[4]

Based on those two sources it seems the issue was hardware related, and Toyota may have tried papering over the matt issue. The faulty matt design issue doesn’t support your claim of shoddy software practices and hiring underpaid junior developers. That may still be the case but it appears not to have caused the SA issue.

blattimwind · on Feb 11, 2018

This doesn't change anything GP said — the Denso/Toyota code running in the ECU is literally impossible to test or review. Safety mechanisms and watchdogs are faulty and not operational.

This is not the kind of code that should ever have full authority over a 250 kW drive system, even less so with humans in and around that system.

otakucode · on Feb 11, 2018

There was a hardware issue, Toyota told their engineers one of the boards would have ECC RAM, but went cheap and used non-error-correcting RAM, but that wasn't the primary issue. After the court case ended, the code was subjected to static analysis by various researchers and a litany of problems were immediately found. Race conditions, uninitialized values, lack of fail-safe structure, spaghetti code, etc. I read a much better summary article awhile back that talked about automotive industry standards for embedded design and the working environment at Toyota (where developers had no static analysis tools, and didn't even have access to a bug tracker), but this article covers some of the points. The software was definitely at fault in many of the cases that killed 38 people:

http://www.sddt.com/Commentary/article.cfm?SourceCode=201311...

elcritch · on Feb 12, 2018

That article makes a clearer picture of the issues you were hitting upon. Especially the later software audits. Perhaps you could add it to the Wikipedia page?

ineedasername · on Feb 11, 2018

Because proper design for a safety-critical process shouldn't assume anything. Shouldn't assume that an accelerator pedal will have full range of motion when attempting to disengage, and should be programmed to route around unexpected inputs (or lack thereof)

toast0 · on Feb 11, 2018

This isn't reasonable. A traditional mechanically linked accelerator pedal has the same failure condition. If the pedal becomes stuck during a journey, I don't see how you can distinguish between that and user intentionally holding the pedal steady. Having the brakes override the accelerator is a reasonable safety advance that's more possible with direct computer control of the throttle, however.

steve19 · on Feb 11, 2018

Simply put a load cell in the pedal and if a certain amount of force is applied slow down regardless of position of the pedal.

If the pedal is not stuck, there is a limit to the amount of force you could put on it over a period of time.

If it is jammed, a adult could put a lot of force on it when it is not fully pressed.

retSava · on Feb 11, 2018

I think the problem was that the software didn't take into account the situation where the inputs were both the accelerator floored, and braked floored - it should of course result in "disregard the accelerator input, do brake". The user/driver would have to fully disengage the brake, then re-apply it, for the sw to actually brake. (if I remember correctly)

snerbles · on Feb 10, 2018

For modern industrial applications, the safety circuit is often (edit- see child comment's note on safety PLCs) managed by discrete safety relay hardware such as the AB GuardMaster or Pilz PNOZ. There's a good chance these weren't even available at the time of OP's application!

A common configuration involves emergency stops, guard doors, light curtains, etc. being wired in a pair of loops with the relay. The relay continuously monitors both loops (usually with a phased pulse train), and any interruption or crossover will trip the unit. Only when the loop states return to nominal will the relay permit a reset to re-enable the outputs.

The safety relay's outputs are generally connected to dumb hardware interlocks on the various dangerous bits of the machine.

_trampeltier · on Feb 10, 2018

No, most even small machines arrive with a kind of "Safety Integeated" right in the PLC. Even the smallest PLCs like a Siemens S7 1200 are now aviable with Safety Integrated. So Profi-Safe and ASi-Safe are very common. It does reduce A LOT of wiring. But of course brings new problems, for example RJ45 jack and plugs sometimes break the connection for a very short time if you touch it and you lose a safety packet over Profi-Net and .. boom .. emergency stop.

Special if you use a lot of drives in a machine, any kind of Safety Integrated reduce the wiring a lot and makes cabinets much much smaller.

But on the other side, for just once, yes I like Pilz PNOZ. Easy to use .. and I'm pretty sure you can buy a PNOZ even in 100 years.

snerbles · on Feb 11, 2018

It needs a little extra to meet PLd/e, no? I usually work with Allen-Bradley, my Siemens experience is rather lacking.

Most of our customers would run away screaming from anything that's not hardwired safety, and are too cheap to shell out for a fully-blown safety PLC like the GuardLogix.

_trampeltier · on Feb 11, 2018

Yes of course it needs a bit of extra. But for a pretty "low" price these days. Special if you build the same machine many times, you can safe much wiring time.

Most maintenance people like hardware solutions because they are more easy to bridge ... With a software solution you can show easy for each safety button/switch a message on a display. Do it with a wired safety.

MarkSweep · on Feb 10, 2018

As a programmer, I like this approach to human safety for robots. By putting electrical interlocks on the doors that expose humans to the robot you can make it impossible for a software error to hurt a human.

For some applications where you need to have humans working in the same area with the robot things get a lot hard. You probably need some software involved in enforcing speed limits for robots. The compliance engineers I've talked to say getting safety certification for software is quite arduous. In this case the off-the-shelf solutions the parent child comments mention become valuable.

_trampeltier · on Feb 11, 2018

Just slow down machines is not enought. I'm still stunning how much the operators trust on things like E-Stop and Safety-Doors. They open the doors, don't even look and go into the machine with there hands. Most people have no idea what's behind a E-Stop and Safety-Doors these days. Also how much paper work each time .. sadly the paper part grows very fast .. and does not really help at all. Just worthless paper.

bsder · on Feb 11, 2018

> I'm still stunning how much the operators trust on things like E-Stop and Safety-Doors. They open the doors, don't even look and go into the machine with there hands.

Wow. Where I am we sometimes do this, but never without real thought. And, generally, we revisit that part of the process again and again over time.

One example: initial setup of a CNC machine part. If you don't have $50K-$100K, you are setting this up by hand. You are moving the cutting head while your hand is in the same space. If you screw up, it probably won't rip your hand off, but you will likely wind up with a solid, painful gash and it might break a small bone if you get really unlucky.

People don't respect servos enough. They're remarkably powerful and probably moreso than you expect.

froindt · on Feb 11, 2018

> initial setup of a CNC machine part. If you don't have $50K-$100K, you are setting this up by hand.

Curious what specifically you need your hand inside for? Do you simply mean the machine is on (though entirely inactive) while putting the part in, touching off the part slipping the business card in and out, or something else entirely?

_trampeltier · on Feb 11, 2018

It's not just about CNC machines. All automated machines have to have E-Stops and so on. Mostly operators open door because something is mixed um inside. (To many bottles, broken bottles, broken paper, broken sensor, wathever ..

Of course as more often you can build the same machine, as better you can work out details. But sometimes there is just "one" machine. Then usually have to work out the the flaws first ..

A friend from school cleaned a CNC mill while the machine worked. The safety door was manipulated and the 2D table drove over his hand ..

bsder · on Feb 11, 2018

Yeah, touching off the part as well as touching off all the cutting bits.

It's remarkably easy to get your hand pinched if you take it a little too cavalierly.

snerbles · on Feb 11, 2018

FANUC's Collaborative Robot line comes to mind [0]. Padded, force-limited, and will stop on contact. I haven't had a chance to play with these - they're still pretty new.

[0] https://www.fanucamerica.com/home/products-services/robots/c...

gonzo · on Feb 11, 2018

When I was the CTO & VP of Engineering for Wayport (public, mostly hotel Internet) we designed an Ethernet switch that could use Home PNA or Ethernet PHYs. (Later adapted to also offer VDSL to an in-room modem.)

We also designed our own 802.11 access points.

All of our competitors had at least one fire. In a hotel with hundreds of people asleep. It didn’t matter if they used commercial gear or not. Every one of them had a fire.

We never had one, but I was obsessed with not hurting anyone because we had missed something.

And yes, it kept me up at night.

sleepychu · on Feb 11, 2018

I'm not following, what caused the fire?

DiabloD3 · on Feb 11, 2018

I'm not him, but probably either shoddy PSUs and/or there were lithium batteries involved. It's the curse of "modern" equipment.

isatty · on Feb 11, 2018

Heat, if I had to guess (and no I'm not being sarcastic).

When you design your own hardware there's way too many things that can go wrong.

cube00 · on Feb 11, 2018

> needed to be re-written because the source code could not be produced

Is that a diplomatic way of saying someone lost the source code?

asterius · on Feb 11, 2018

They might not have paid for a source code licence. or they did, but they never made sure they had a copy, just left it with the developer. Surprsingly common for companies to get a big binder of paperwork, an installer disk, and consider it done.

rodrigocoelho · on Feb 10, 2018

Can relate. I 'stayed awake at night' when I had to write a program to calculate salaries.

alex_hitchins · on Feb 11, 2018

Ha! I once wrote a program to calculate commission for the sales people at our company. I remember the director telling me he would love my numbers to be true, but thought it best I had another look. Fat fingered decimals could have resulted in some expensive commissions!

hawktheslayer · on Feb 10, 2018

This was an enjoyable read. While I don't often lose sleep over my code, (it's my kids that cause that), I do often find that my mind is working on solving coding puzzles in my sleep as I will frequently wake up with a spark of insight.

kemonocode · on Feb 11, 2018

My own personal "staying awake at night" case is an application that connected to an ancient version of Banner (A big kludge of an ERP system for universities) to handle new applicants' enrollment process and billing. I'm rather skittish when directly handling money, doubly so when the code was all written in Perl (Which I had to learn in order to re-implement in far more readable PHP, the lesser of two evils, natch) and extremely poorly documented. In retrospective, I should not have accepted a job like that.

fnord77 · on Feb 11, 2018

> a fuel estimation program for a small cargo airline > ...floating point computations

Hello, rounding errors. Oh hell no.

jacquesm · on Feb 11, 2018

You make a valid point, pity to see it downvoted. Please keep in mind that in many dialects of BASIC you didn't have more datatypes than 'string' and 'float', and that the original program used floats. Even so, it is definitely possible to make reliable software using floating point, you just have to know exactly what you are doing and you will spend a lot of time on tests to ensure that you do not end up with nonsense output in edge cases.

Floating point is used regularly in avionics and other fields involving critical computations, it is not the floating point data type itself that is problematic, what is problematic is poorly understanding the underlying implementation details and what kind of limitations that will cause.

A good example is trying to count integers with more bits than the underlying precision of the implementation will allow. But in this particular case floating point would have been my 'tool of choice' anyway, fixed point would have introduced a lot of complications for very little gain - if any - gain and would not be worth it.

So in many cases I would agree with you that rounding errors could cause huge problems but in this particular case the inputs were in ranges where this could not happen, the software was tested exhaustively across all input ranges to ensure well defined behavior.

TylerE · on Feb 11, 2018

Please. Don't kneejerk. Any floating point errors will be multiple orders of magnitude smaller than than the accuracy of either the fuel gauges or the pumps.

DiabloD3 · on Feb 11, 2018

Did you know in the US it's essentially against the law to use floating point math on currency? The situation is more complex than I simplified here, however the complexity is on the order than you'd need a lawyer to explain it to you to really appreciate it.

You can call it a knee-jerk reaction, but the law itself clearly has that bias, and for good reason.

All regulations were paved in blood.

_0ffh · on Feb 11, 2018

Not just the US, IIRC financial institutions around the world are required to do decimal rounding. OTOH I don't see why you would even use floating point for money. Just scale your integers to the required precision (and make sure you've got enough bits for the kind of amounts you need to be able to handle) and you're good.

davidcuddeback · on Feb 11, 2018

Having worked in both aerospace (GP's comment) and e-commerce (your comment), I can tell you that these are not the same problem and do not require the same solution.

First week on the job in e-commerce, I educated the junior engineers on the problems that can arise when using floating point math with currencies, then prioritized work to remove any floating point math from the code. That doesn't mean I wouldn't use floating point anywhere else. In aerospace in particular, I've used floating-point without issue, so I don't think you're point about currencies has any relevancy to the GP's comment regarding fuel estimation.

phpnode · on Feb 11, 2018

Which law covers that?

fnord77 · on Feb 12, 2018

fun fact - before 2001, the US stockmarket traded in sixteenths not decimals

MertsA · on Feb 11, 2018

That may or may not be true depending on how the program is written. All of the math can be correct, but make the mistake of relying on floating point math having the associative and distributive properties (it doesn't). I'd imagine the most trouble prone part of the fuel load calculation to be the cancellation error introduced by naively determining the derivative of some function.

E.g.

  step = 0.001
  fprime = (f(x+step) - f(x)) / step

Having a tiny step value means substantially worse error due to rounding. Instead of being a tiny fraction of a percent off now you've got something potentially dangerous like 30% off. You can't just handwave away loss of precision as something to ignore, there are a handful of common pitfalls that will amplify the total error to the point where it's unacceptably high and not intuitive to figure out the cause of the error.

_0ffh · on Feb 11, 2018

Yeah, it' simple to see in this example. Though it's often harder to see in practice, I once heard/read some interesting advice: Test your software in all four rounding modes. If the software works correctly (and approximately agrees) in all four modes, you're good. (I suppose this assumes that you've got reasonable test coverage.)

KirinDave · on Feb 11, 2018

Not if you use multiplication and floats.

We certainly cannot use it for money, for example. Most shops I know of have fancy decimal libraries they never get to use because an unsigned long unit of centicents is simpler, predictable and easy to encode and assume client software will have the same rounding strategy.

monob · on Feb 11, 2018

Fixed point isn't any better than floating point.

If anything errors in floating point numbers are graceful and give you a lot of leeway before they become catastrophic. Fixed point works until you hit 2(n-1) bits then probably breaks unexpectedly. Where n is the last exponent you have seen in business.

KirinDave · on Feb 11, 2018

Fixed point libraries tend to be better because good ones can cleanly identify when they're unable to continue and often tell you why.

Floating point libraries can't even do this.

kurtisc · on Feb 11, 2018

Just like the variation in the output voltage of a pin might be amplified by an op-amp. These things are taken into account by engineers, it's their duty to mathematically prove they're not an issue and fix it if it is.

joezydeco · on Feb 12, 2018

Just a plug for the RISKS Digest, now in it's 32nd year of operation:

http://catless.ncl.ac.uk/Risks/

knodi123 · on Feb 11, 2018

Geez, and I thought I was stressed after I wrote software that would automate computing someone's pay bonus based on efficiency metrics.

w_t_payne · on Feb 11, 2018

So how can we approach safety in a systematic manner?

Clearly 'blame' isn't an appropriate response. It has to involve tooling.

HeyLaughingBoy · on Feb 12, 2018

Well before tooling is considered, it has to involve people and process. At the highest level, you must have a culture of "blame the process, not the people" or people will do what is natural when things go wrong: try to cover it up and avoid being blamed.

There are procedures in various safety-conscious industries for handling this kind of development. I like that you used the word "systemic" because it is literally a systems issue, not a software, or electronics, or mechanical issue. The entire system has to be considered and analyzed for potential faults.

I spent over a decade writing code for medical devices and while the software aspect of these systems was the most advanced in terms of development process (unlike what many on HN seem to think :-), everything we did had to be considered from a system perspective because even if the individual parts were designed properly, it was possible for the interactions between them to cause problems.

w_t_payne · on Feb 13, 2018

I strongly agree with your emphasis on systems.

jacquesm · on Feb 12, 2018

Procedures and documentation seem to work well for the aviation industry. Things will still go wrong, but only very rarely twice in the same way. It makes development a lot more expensive but it does work and probably is the only way that we are aware of right now that will get this done in a way that leads to acceptable outcomes.

This leads to glacial progress but I find that is preferable over the 'move fast and break shit' mentality that pervades the software industry.

w_t_payne · on Feb 13, 2018

I think that it leads to glacial progress only because it is done badly. (It is done badly pretty much everywhere). I'm trying to develop tooling to make developing safe software fast: https://github.com/wtpayne/hiai (Long way from being finished, unfortunately).

fireismyflag · on Feb 10, 2018

Thank you for sharing this. I enjoyed imagining scenarios so different from the ones that rob MY sleep at night.