
Something that is in the process of happening, iirc.

Linux is turning into something of a hybrid kernel, or may even emerge as a microkernel given time.




I read (unfortunately unsure where) that microkernels do have one fundamental issue: having servers do All The Things and then just making a kernel to dispatch calls to those servers falls down horribly if the messaging/dispatch implementation is single-threaded.

And it inevitably is, since if you're generalizing all system operations onto a single bus, that bus would either need to support some generic form of contextualization hinting or use some kind of theorem-solver-inspired system to determine which requests have no dependencies. I suspect Minix incorporates neither approach...

The problem I see is the need to put "these are audio frames" in a different queue than "here are filesystem request packets". (Ideally the filesystem queue would itself allow further sharding, since most filesystems are multithreaded now.)

Writing such a generalized queue sounds like a rather fun exercise to me.
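
To make the sharding idea concrete, here's a toy sketch in C of what I have in mind - one queue (and potentially one worker) per message class, so filesystem traffic can't sit in front of audio frames. All the names and the layout are made up; this isn't modeled on any real kernel:

    #include <pthread.h>
    #include <stddef.h>

    enum msg_class { MSG_AUDIO, MSG_FS, MSG_NET, MSG_CLASS_COUNT };

    struct msg {
        enum msg_class cls;
        void *payload;
        size_t len;
        struct msg *next;
    };

    /* One queue per message class; the FS queue could be sharded further,
       e.g. by hashing the inode, since most filesystems are multithreaded. */
    struct class_queue {
        pthread_mutex_t lock;
        pthread_cond_t  nonempty;
        struct msg *head, *tail;
    };

    static struct class_queue queues[MSG_CLASS_COUNT];

    void queues_init(void) {
        for (int i = 0; i < MSG_CLASS_COUNT; i++) {
            pthread_mutex_init(&queues[i].lock, NULL);
            pthread_cond_init(&queues[i].nonempty, NULL);
            queues[i].head = queues[i].tail = NULL;
        }
    }

    /* Dispatcher side: pick the queue purely by message class. */
    void enqueue(struct msg *m) {
        struct class_queue *q = &queues[m->cls];
        pthread_mutex_lock(&q->lock);
        m->next = NULL;
        if (q->tail) q->tail->next = m; else q->head = m;
        q->tail = m;
        pthread_cond_signal(&q->nonempty);
        pthread_mutex_unlock(&q->lock);
    }

    /* Worker side: each class gets its own consumer, blocking independently. */
    struct msg *dequeue(enum msg_class cls) {
        struct class_queue *q = &queues[cls];
        pthread_mutex_lock(&q->lock);
        while (!q->head)
            pthread_cond_wait(&q->nonempty, &q->lock);
        struct msg *m = q->head;
        q->head = m->next;
        if (!q->head) q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        return m;
    }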

That said, if any such implementations are out there or there are any counter-arguments to make to this, I'd love to hear them. I mean, AFAIK Mach is a microkernel, so it's clearly solved some of this.


I think multithreading the messaging/dispatching implementation would add more overhead than it saved. I remember the Hurd's core message-passing routine is 26 assembly instructions - there's simply not a lot of computation involved, and in general not enough data for the message-passing itself to be the bottleneck. When you're transferring bulk data you'd use shared memory, or at least DMA or the like (in a sensible microkernel you just do it; in a super-purist microkernel you'd have a server that owns bulk data buffers, and your regular processes pass handles around rather than actually owning the data - and that's fine too).
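
(To illustrate the handle-passing idea - none of these names are real Hurd or Mach API, it's just a C sketch of the shape of the thing: the message is a few dozen bytes describing the buffer, and the megabytes being written never pass through the dispatcher.)

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical control message: it names the data instead of carrying it.
       Copying this little struct is all the message-passing routine does. */
    struct write_request {
        uint32_t opcode;       /* e.g. OP_WRITE_FILE (invented)              */
        uint32_t file_handle;  /* names the file                             */
        uint64_t offset;
        uint64_t length;
        uint32_t buf_handle;   /* handle/capability to a shared bulk buffer  */
    };

    /* Receiver side: resolve the handle to a mapping and work on it in
       place - assumed helper provided by a buffer server or the kernel. */
    void *resolve_buffer(uint32_t buf_handle, size_t *len_out);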

If you need a queue with particular properties you write one, as its own userspace process (or system of cooperating processes). The kernel dispatcher isn't assumed to be a fully general messaging system.


Hmm, interesting.

I am wondering about one thing though.

> there's simply not a lot of computation involved

Wow, 26 instructions.

Here's my worst-case scenario: you have 8 concurrent threads (a current reality on POWER8), and let's say all of them are engaged in fetching large amounts of data from different servers - let's say disk and TCP I/O are both servers.

I'm genuinely curious how well a 26-instruction-but-singlethreaded message passing system would hold up. (I honestly don't know.)

Worst case scenario, the cache and branch predictor would perpetually resemble tic-tac-toe after an earthquake.

---

I think it would be genuinely interesting to throw some real-world workloads at Minix, Hurd, etc, and see how they hold up.

Now I'm wondering about ways to preprocess gcc's asm output to add runtime high-resolution function timing information that (eg) just writes elapsed clock ticks to a preallocated memory location (within the kernel)... and then a userspace process to periodically read+flush that area...
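
Something like this is the probe I imagine the asm post-processor splicing in around each function - x86-only via the TSC, and the bucket layout is entirely made up:

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc() */

    #define MAX_FUNCS 4096

    /* Preallocated area: total elapsed ticks and call count per function id.
       A userspace reader would periodically copy these out and zero them. */
    static volatile uint64_t elapsed_ticks[MAX_FUNCS];
    static volatile uint64_t call_count[MAX_FUNCS];

    /* Injected prologue: grab the start timestamp... */
    static inline uint64_t probe_enter(void) { return __rdtsc(); }

    /* ...injected epilogue: accumulate into this function's slot. */
    static inline void probe_exit(unsigned func_id, uint64_t start) {
        elapsed_ticks[func_id] += __rdtsc() - start;
        call_count[func_id]    += 1;
    }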


> Here's my worst-case scenario: you have 8 concurrent threads (a current reality on POWER8), and let's say all of them are engaged in fetching large amounts of data from different servers - let's say disk and TCP I/O are both servers.

Speculating: if you were passing all the data in messages, terribly. But that's not how you'd handle it. You'd use messages as a control channel instead, similar to DMA or SIMD instructions. E.g. if you're downloading a file to disk, the browser asks to write a file, the filesystem server does its thing to arrange to have a file and gets a DMA channel from the disk driver server. The TCP layer likewise does its thing and gets a DMA channel from the network card driver, and either the browser or a dedicated bulk-transfer server connects them up. The bulk data should never even hit the processor, let alone the message-passing routines.
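
To make that flow concrete, a rough sketch in C - every one of these server interfaces is invented for illustration; the point is just that only small setup messages cross the dispatcher:

    /* Hypothetical control-plane calls; each is a small message to a server. */
    int  fs_server_create_file(const char *path);
    int  disk_driver_open_dma_channel(int file_handle);
    int  tcp_server_connect(const char *host, int port);
    int  nic_driver_open_dma_channel(int socket_handle);
    void bulk_server_splice(int src_channel, int dst_channel);
    void wait_for_completion(int channel);

    void download_to_disk(void) {
        /* Control plane: four small messages set up the path. */
        int fd      = fs_server_create_file("/downloads/big.iso");
        int disk_ch = disk_driver_open_dma_channel(fd);
        int sock    = tcp_server_connect("example.com", 80);
        int net_ch  = nic_driver_open_dma_channel(sock);

        /* Data plane: the payload flows NIC -> buffers -> disk without ever
           passing through the message dispatcher; completion arrives as one
           more small message. */
        bulk_server_splice(net_ch, disk_ch);
        wait_for_completion(disk_ch);
    }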

> I think it would be genuinely interesting to throw some real-world workloads at Minix, Hurd, etc, and see how they hold up.

Do. Also look at QNX, which is the big commercially successful microkernel.

> Now I'm wondering about ways to preprocess gcc's asm output to add runtime high-resolution function timing information that (eg) just writes elapsed clock ticks to a preallocated memory location (within the kernel)... and then a userspace process to periodically read+flush that area...

I'd look at something along the lines of perf_events (which I encountered via http://techblog.netflix.com/2015/07/java-in-flames.html ).
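
The perf tool is the usual front end, but the underlying syscall is usable directly if you want in-process counters; a minimal sketch (error handling mostly elided) that counts retired instructions around a loop:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags) {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type     = PERF_TYPE_HARDWARE;
        attr.size     = sizeof(attr);
        attr.config   = PERF_COUNT_HW_INSTRUCTIONS;
        attr.disabled = 1;

        int fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this process, any CPU */
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile long x = 0;                            /* code under test */
        for (int i = 0; i < 1000000; i++) x += i;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count;
        read(fd, &count, sizeof(count));
        printf("instructions: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }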


Using messages as a control channel sounds awesome, wow.

One of the targets I've been trying to figure out how to hit is making message-passing still work when you're using it in the dumbest way possible, eg pushing video frames through the message transport itself. I'm slowly reaching the conclusion that while it'll work, it'll just be terrible, like you say.

I mention this because, at the end of the day, most web developers would just blink at you like "DM-what?" if you suggested this idea to them. These techniques are sadly not in widespread use.

In my own case, I'm not actually sure how you use DMA as a streaming transport. I know that it's a way to write into memory locations, but I don't know how you actually take advantage of it at higher levels - do you use a certain bit as a read-and-flush clock bit? Do you split the DMA banks into chunks and round-robin write into each chunk so that the other side can operate as a "chaser"? I'm not experienced with how this kind of thing is done.
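
Roughly what I'm picturing for the "chaser" is a shared region split into chunks, with a write index the producer advances and a read index the consumer chases - a toy C sketch, and entirely my guess rather than how any real driver does it:

    #include <stdatomic.h>
    #include <stdint.h>
    #include <string.h>

    #define CHUNKS      64
    #define CHUNK_BYTES 4096

    /* Single producer, single consumer, both mapping the same region. */
    struct stream_ring {
        _Atomic uint32_t write_idx;          /* producer's next chunk */
        _Atomic uint32_t read_idx;           /* consumer's next chunk */
        uint32_t chunk_len[CHUNKS];
        uint8_t  data[CHUNKS][CHUNK_BYTES];
    };

    /* Producer: returns 0 if the consumer hasn't caught up yet (ring full). */
    int stream_write(struct stream_ring *r, const void *src, uint32_t len) {
        uint32_t w  = atomic_load_explicit(&r->write_idx, memory_order_relaxed);
        uint32_t rd = atomic_load_explicit(&r->read_idx,  memory_order_acquire);
        if (w - rd == CHUNKS || len > CHUNK_BYTES)
            return 0;
        memcpy(r->data[w % CHUNKS], src, len);
        r->chunk_len[w % CHUNKS] = len;
        atomic_store_explicit(&r->write_idx, w + 1, memory_order_release);
        return 1;
    }

    /* Consumer ("chaser"): copies out one chunk if available, then advances.
       Returns the chunk length, or 0 if there's nothing new yet. */
    uint32_t stream_read(struct stream_ring *r, void *dst) {
        uint32_t rd = atomic_load_explicit(&r->read_idx,  memory_order_relaxed);
        uint32_t w  = atomic_load_explicit(&r->write_idx, memory_order_acquire);
        if (rd == w)
            return 0;
        uint32_t len = r->chunk_len[rd % CHUNKS];
        memcpy(dst, r->data[rd % CHUNKS], len);
        atomic_store_explicit(&r->read_idx, rd + 1, memory_order_release);
        return len;
    }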

Well, workload-testing microkernel OSes is now on my todo list, buried along with "count to infinity twice" :) (I really will try and get to it one day though, it is genuinely interesting)

Regarding QNX, I actually mentioned that to the other person who replied in this thread (https://news.ycombinator.com/item?id=13346822), and I said a few other words about it a couple months ago - https://news.ycombinator.com/item?id=12777520

I really wish the QNX story had gone ever so slightly differently :'(

Regarding perf_events and the linked blog post, thanks for both - this is really interesting!


> In my own case, I'm not actually sure how you use DMA as a streaming transport. I know that it's a way to write into memory locations, but I don't know how you actually take advantage of it at higher levels - do you use a certain bit as a read-and-flush clock bit? Do you split the DMA banks into chunks and round-robin write into each chunk so that the other side can operate as a "chaser"? I'm not experienced with how this kind of thing is done.

I don't know enough to answer this stuff - my last message was already second-hand info (or worse). All I can say is: best of luck.


My knowledge of microkernels is limited.

As I understand it, there is one commercially successful such kernel out there: QNX. And while both the OSX/iOS and Windows NT kernels started out as micro designs, Apple and Microsoft have both been moving things in and out of the kernel proper as they try to balance performance and stability (most famously with Windows, the graphics subsystem).


QNX is such a mixed story of technical ingenuity and frustration.

The OS was cautiously courting a "shared source" model where you could agree to a fairly permissive (but not categorically pure-open-source) license and get access to quite a few components' source code.

It was anybody's guess what might develop from that - an intriguing and hopeful time.

And then BlackBerry came along and bought QNX and killed the shared source initiative. Really mad at BB for deciding to do that.

Nowadays QNX is no longer self-hosting - no more of that cool, characteristic Neutrino GUI :(



