At some point, whatever's watching the watchers is going to be vulnerable to bitflip and similar problems.
Even with a triple-redundant quorum mechanism, slightly further up the stack you're going to have some bit of code running that processes the three returned results - if the memory that code is sitting on gets corrupted, you're back where you started.
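To make that concrete, here's a minimal sketch in plain Python (names hypothetical) of a three-way voter - the point being that the voter itself is just ordinary code sitting in ordinary, corruptible memory:

```python
from collections import Counter

def vote(results):
    """Return the majority value among three redundant results.

    The catch: this voter is itself ordinary code operating on
    ordinary memory. A bitflip in `results` (or in the voter's own
    state) after the redundant computations return defeats the
    redundancy entirely.
    """
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no quorum among replicas")
    return value

print(vote([42, 42, 42]))  # 42
print(vote([42, 7, 42]))   # one corrupted replica is outvoted: 42
```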
> At some point, whatever's watching the watchers is going to be vulnerable to bitflip
One advantage of microkernels is that the "watcher" is so small that it could be run directly from ROM, instead of loaded into RAM. QNX has advocated that route for robotics and such in the past.
Minix may not be the best example of the type. While it is a microkernel, its real-world reliability has been poor in the past. More mature microkernel operating systems like QNX and OpenVMS are better examples.
> While it is a microkernel, its real-world reliability has been poor in the past.
Nitpick/clarification: it currently supervises the security posture, attestation state and overall health of several billion(?) Intel CPUs as the kernel used by the latest version of the Management Engine.
If the ME is shut down completely, the CPU apparently switches off within 20 minutes. Presumably this applies across the full uptime of the processor and not just immediately after boot, and if that's the case, the percentage of Intel CPUs that randomly switch off gives you a direct measure of Minix's instability/unreliability in a tightly controlled industrial setting.
Anyone have any idea why there haven't been any open-source QNX clones, at least not any widely known ones? Even before their Photon microGUI patents expired, the clones could have used X11.
I used to occasionally boot into QNX on my desktop in college. It was a very responsive and stable system.
Hypervisors are, to a first approximation, microkernels with a hardware-like interface. All of this kernel bypass work being done by RDBMSes, ScyllaDB, HFTs, etc. is, to a first approximation, making a monolithic kernel act a bit like a microkernel.
There are already well-known open source microkernels, like Minix 3 and L4. A QNX clone is probably just not that attractive.
Why something hasn't been done is always a hard question to answer, since to succeed a lot of things have to go right, and by default none of them do. But one factor is that microkernels were more trendy in the 90s; since then R&D people have mostly been doing things like "the cluster is the computer", unikernels, exokernels, rump kernels, embedded OSes (e.g. Tock), and remote attestation (I'm not up to date on the latest).
Thinking about it a bit more, QNX clones might suffer from something akin to second system syndrome. There's a simple working design, and it likely strongly invites people to jump right to their own twist on the design before they get very far into a clone.
> Minix may not be the best example of the type. While it is a microkernel, its real-world reliability has been poor in the past. More mature microkernel operating systems like QNX and OpenVMS are better examples.
You might be referring to the previous versions. Minix 3 is basically a different OS, and it's more than an educational tool - in fact it's probably running inside your computer right now if you have an Intel CPU (it runs on Intel's ME - for better or worse).
Yes, but this is the entire principle around which microkernels are designed: making the last critical piece of code as small and reliable as possible. Minix 3's kernel is <4000 lines of C.
As far as bitflips are concerned, having the critical kernel code occupy fewer bits reduces the probability of a bitflip causing an irrecoverable error.
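As a rough sketch of that scaling (the per-bit flip probability `p` and the bytes-per-line figure below are purely illustrative assumptions, not real upset rates):

```python
def p_any_flip(n_bits, p=1e-12):
    """Probability of at least one flip among n_bits independent bits,
    each flipping with probability p over some fixed time window.
    The value of p is purely illustrative."""
    return 1 - (1 - p) ** n_bits

bytes_per_line = 80  # rough guess at resident footprint per line of code
small = p_any_flip(4_000 * bytes_per_line * 8)       # ~4k LoC microkernel
large = p_any_flip(20_000_000 * bytes_per_line * 8)  # ~20M LoC monolith
print(f"{large / small:.0f}x")  # ~(20M / 4k) = ~5000x: risk scales
                                # roughly linearly with footprint
```

With flip probabilities this small the exponential is effectively linear, so shrinking the critical footprint by 5000x shrinks the irrecoverable-error risk by about the same factor.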
Yes, I understand this -- basic risk mitigation by reducing the size of your vulnerability.
(I'll indulge in a bit of archaic bragging by mentioning that I used to be a heavy user of Minix - my floppy images came in over an X.25 network - and that I saw Andy Tanenbaum give his Minix 3 keynote at FOSDEM about a decade ago. I'm a big fan.)
Anyway, while reducing risk this way is laudable, and will improve your fleet's health, as per TFA it's a poor substitute, with bad economics and worse politics behind it, for simply stumping up for ECC.
I'll also note that, for example, Google's sitting on ~3 million servers so that ~4k LoC just blew out to 12,000,000,000 LoC -- and that's for the hypervisors only.
Multiply that out by ~50 to include the VMs' microkernels, and the amount of memory you've now got that is highly susceptible to undetected bitflips is well into the mind-blowing range.
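The back-of-envelope arithmetic, using the rough figures above (all of them guesses from this thread, not measurements):

```python
servers = 3_000_000     # ~Google fleet size, per the comment above
kernel_loc = 4_000      # Minix 3 kernel, <4000 lines of C
vms_per_server = 50     # ~50 guest microkernels per host (guess)

hypervisor_loc = servers * kernel_loc        # resident copies, hosts only
total_loc = hypervisor_loc * vms_per_server  # add the guests' copies
print(f"{hypervisor_loc:,} / {total_loc:,}")
# 12,000,000,000 / 600,000,000,000
```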
Oh, I'm not saying it's the single best solution - I guess I got carried away in the argument. It's simply a scenario where the concept shines, but it's an entirely artificial scenario, and I agree ECC is the correct approach.