Great Watchdog Timers For Embedded Systems (2016)

keeda · 2025-02-01T03:14:47 1738379687

His may not be a familiar name 'round these parts, but Jack Ganssle is a legend in the embedded systems industry. And he's very helpful to boot.

Many lifetimes ago, as a freshly baked software engineer, I had a strong interest in Embedded Systems. Juggling interrupts, wrangling registers, counting clock cycles, banging bits, reading sensors, often in raw assembly so as to fit into limited flash memory was my idea of fun.

So much so that I was contemplating doing a Master's degree in that area. However, I couldn't find a US University with a good program for that. I had seen Jack being very active on a few embedded-related forums, and on a lark, I emailed him for advice.

And he responded! He gave me very sound advice, effectively explaining that graduate research in Embedded Systems is quite distinct from the actual low-level work that happens in the industry. This explained why I hadn't found any of the programs appealing. I took his advice gratefully and pursued a different area for my Master's, which shaped the rest of my career. Thanks again, Jack!

I always intended to come back to Embedded Systems at some point, but unfortunately it never worked out, partly because embedded engineers are criminally underpaid for the complexity of the work they do. As the article hints, writing software that runs reliably in arbitrarily harsh environments on cheap, quirky hardware with extremely constrained resources is a different level of challenging.

sephamorr · 2025-02-01T00:52:52 1738371172

I've taken a number of MCUs through radiation testing, including testing their watchdogs. I've generally found that latchups often take out the on-chip watchdogs too, even something like the STM32 independent watchdogs, so they shouldn't be relied on. External hardware or a different system needs to be deputized here.

mofosyne · 2025-02-01T12:29:06 1738412946

It would be interesting if there were a rad-hard watchdog. Maybe you could use standard off-the-shelf parts and then have a reliable watchdog intervene as needed.

Especially if such a watchdog has the ability to switch to a secondary or tertiary backup...

edit: Shoutout to wildzzz for pointing out ISL706ARH Active Rad-Hard, 5.0V/3.3V µ-Processor Supervisory Circuits

sephamorr · 2025-02-01T18:55:04 1738436104

Right, having a second chip functioning as a watchdog which doesn't share the same silicon substrate mitigates the latchup. If you can recover before both theoretically could get struck, statistically you'll be okay. For added safety, have an extra one that is not coplanar with the other two so a single heavy ion strike can't pass through all 3 chips.

wildzzz · 2025-02-01T04:01:16 1738382476

Huge shout-out to the ISL706ARH for saving our butt through heavy ion testing.

vvanders · 2025-02-01T02:58:08 1738378688

WDT patterns are highly underrated. Even in pure software there's value in degrading and recovering gracefully, versus systems that have to be "perfect" 100% of the time and then force user intervention when they go wrong.

One of my favorite blog posts on the topic is https://ferd.ca/the-zen-of-erlang.html, which does a great job of covering how Erlang approached it. There are lots of lessons that can be applied more broadly.

akoboldfrying · 2025-02-01T01:58:31 1738375111

>Toggle the WDT input too slowly, too fast, or not at all, and a timeout will occur.

Reminds me of an article I read a few years ago about designing systems to detect when (human) train drivers fall asleep at the wheel. Apparently it was an arms race for a long time: Designers kept coming up with increasingly complicated tasks for drivers to complete to signal their conscious state, like tapping buttons with their hands or feet at various time intervals, while drivers, for their part, kept figuring out ways to perform those tasks while actually functionally unconscious...
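
For anyone who hasn't met a windowed watchdog, the behavior in the quoted sentence can be modeled roughly like this. It is only a host-side sketch: `window_wdt_t`, the `wwdt_*` functions, and the millisecond timings are hypothetical names made up for illustration, not any real part's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a windowed watchdog (illustrative only). */
typedef struct {
    uint32_t window_open_ms;   /* a kick sooner than this is "too fast"   */
    uint32_t window_close_ms;  /* no kick by this point is "too slow"     */
    uint32_t last_kick_ms;     /* time of the last accepted kick          */
    bool     tripped;          /* latched once any violation is detected  */
} window_wdt_t;

void wwdt_init(window_wdt_t *w, uint32_t open_ms, uint32_t close_ms,
               uint32_t now_ms)
{
    assert(open_ms < close_ms);  /* the window must have nonzero width */
    w->window_open_ms  = open_ms;
    w->window_close_ms = close_ms;
    w->last_kick_ms    = now_ms;
    w->tripped         = false;
}

/* Models the timer itself: trips if the window closed with no kick. */
void wwdt_tick(window_wdt_t *w, uint32_t now_ms)
{
    if ((uint32_t)(now_ms - w->last_kick_ms) > w->window_close_ms) {
        w->tripped = true;
    }
}

/* Models a kick from firmware. Returns true only if it landed inside
 * the window; an early or late kick trips the watchdog instead. */
bool wwdt_kick(window_wdt_t *w, uint32_t now_ms)
{
    uint32_t elapsed = now_ms - w->last_kick_ms;

    if (w->tripped) {
        return false;            /* already timed out; kicks are ignored */
    }
    if (elapsed < w->window_open_ms || elapsed > w->window_close_ms) {
        w->tripped = true;       /* too fast or too slow */
        return false;
    }
    w->last_kick_ms = now_ms;
    return true;
}
```

With a 10 to 50 ms window, a kick 20 ms after the last one is accepted, a kick 5 ms later trips it, and so does going more than 50 ms with no kick at all. The "too fast" case is what catches code stuck in a tight loop that happens to contain the kick.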

astrobe_ · 2025-02-01T09:47:25 1738403245

The "dead man's switch" being defeated by sleepwalkers.

You have a similar situation to the train drivers when you inadvertently kick the watchdog inside an infinite loop, or when a newbie thinks there's nothing wrong with kicking it in some long loop.
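
One common way to avoid the "kick it from inside the loop" trap is to let tasks only check in, and have a single place that kicks the hardware when every task has reported. A minimal sketch, assuming three tasks; `task_checkin`, `watchdog_service`, and `hw_watchdog_kick` are hypothetical names, and the hardware kick is a counter standing in for a real register write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS       3u
#define ALL_TASKS_MASK  ((1u << NUM_TASKS) - 1u)

static uint32_t checkin_flags;   /* one bit per task that has reported */
static unsigned hw_kick_count;   /* stand-in for the real hardware     */

/* Hypothetical placeholder for the actual watchdog register write. */
static void hw_watchdog_kick(void)
{
    hw_kick_count++;
}

/* Exposed so the sketch can be exercised on a host machine. */
unsigned hw_kick_total(void)
{
    return hw_kick_count;
}

/* Each task calls this once per healthy iteration. It never touches
 * the watchdog itself, so a stuck loop cannot keep the system alive. */
void task_checkin(unsigned task_id)
{
    assert(task_id < NUM_TASKS);
    checkin_flags |= (1u << task_id);
}

/* Called from one place only (e.g. a periodic supervisor). Kicks the
 * hardware only if every task has checked in since the last kick. */
bool watchdog_service(void)
{
    if (checkin_flags != ALL_TASKS_MASK) {
        return false;            /* someone is missing: let it time out */
    }
    checkin_flags = 0;           /* everyone must report again */
    hw_watchdog_kick();
    return true;
}
```

If task 2 hangs, tasks 0 and 1 can check in as often as they like and `watchdog_service()` still refuses to kick, so the hardware timeout eventually fires. On a real system the flag updates would need to be atomic or otherwise protected against preemption, which this sketch ignores.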

sho_hn · 2025-02-01T02:26:24 1738376784

And then trains adopted reCAPTCHA.

ZevsVultAveHera · 2025-02-01T14:40:07 1738420807

Is this for real?

petee · 2025-01-31T23:23:21 1738365801

> The 1750's built-in watchdog timer hardware was not used, over the objections of the lead software designer

I wish they had delved into this a little deeper. Was it because the WDT can be disabled with one op? That does seem quite risky on its own.

pjbk · 2025-02-01T06:57:37 1738393057

Unless this was not a Honeywell 1750 but a variant from another vendor (pretty common), the 1750 had only 2 hardware timers with just dedicated interrupts and no true watchdog functionality like a kicking key or handshake. The CPU has a "trigger-go" counter that cannot be disabled that was meant as a monitor for software resets, but not a real watchdog implementation.