Once, when I was much younger and first encountering NUMA, I came across a bit of code that always did a few magical things with the first physical CPU. We had a fairly elaborate hardware configuration, and several peripherals were bound to the first core.
Ever since, when I do any sort of CPU or core pinning or grouping, I leave the first core or two alone if possible. It's convenient to say the first two cores are for all the ancillary stuff, but the last 30 are for our workload.
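The "first two cores for the ancillary stuff" split is easy to express with sched_setaffinity. A minimal sketch, assuming a plain Linux box; the core numbering and count are just illustrative:

    /* Confine the calling process to every online core except 0 and 1,
     * leaving those for IRQs, housekeeping daemons, etc. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        cpu_set_t set;
        CPU_ZERO(&set);
        for (long cpu = 2; cpu < ncpus; cpu++)   /* skip cores 0 and 1 */
            CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("workload confined to cores 2..%ld\n", ncpus - 1);
        return 0;
    }

The same shape works per-thread with pthread_setaffinity_np if you only want to fence off the latency-critical threads.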
Now I realize why I never witnessed something like this in all the benchmarks I performed... I always ignore the first CPU core.
> But for as long as the kernel defines a cpumask_first() function, I will always worry that some other Linux feature might employ it to force unrelated, latency-spiking workloads upon that core.
Very Linux. It's complex, very complex, and full of footguns like this.
The real answer is to use seL4[0]. Among its extensive formal proofs, it has a proof of WCET (Worst Case Execution Time), so unlike Linux and other hopelessly non-deterministic unorthogonal toy OSs, seL4 can actually guarantee latency bounds.
We've built things on seL4 (https://github.com/auxoncorp/ferros) before. We like to joke that it's the most perfect piece of nearly featureless software ever made.
There's... A LOT... of work to do before seL4 is going to be anywhere near usability parity with something like Linux, unfortunately.
Rather than make a general-purpose OS, we decided to use it more like a unikernel or "library OS", where you're trying to make a well-defined kind of "appliance" image to deploy to specific hardware rather than trying to fake being a POSIX-y shaped OS. At some point we plan to leverage the WASM ABI to help with stability across versions & toolchains and stuff a WASM runtime in there too, so that support for "applications" written in something other than Rust is easier. Though it's a pretty low priority at the moment since it's not at all core to our business presently.
> At some point we plan to leverage the WASM ABI to help with stability across versions & toolchains and stuff a WASM runtime in there too, so that support for "applications" written in something other than Rust is easier.
the GP was talking about guaranteed low latency execution. You're orbiting a different star here.
I love the idea of seL4, and was sorely disappointed to see Australia cut their funding. Decades of exploits have me convinced that microkernels are the only real path towards security.
At one point we created a specialty OS for some exotic network hardware designed for very fancy robots. For a brief period we were using seL4+FerrOS as kind of a hypervisor, used as the integration point for our "system-level" V&V tools based on what we learned making the aforementioned OS, but we abandoned that approach for non-seL4 reasons. We'll offer something of a "black box recorder" for robots and other cyber-physical systems that also does runtime verification.
Well, seL4 is now managed under a foundation as part of the Linux Foundation umbrella. The seL4 Foundation. Ultimately, I think it might be for the best. Now there are more seL4 researchers and practitioners out in industry working to integrate and build upon seL4 as a building block that never would have existed because they'd previously all been living under the same roof.
huh, cool, I had just been looking at ferros a few hours ago, and now coincidentally it comes up for discussion on HN.
I was meaning to circle back and see how difficult it is to throw together uni-kernel type work with it, but didn't see much in the way of documentation for how to ... use it.
Is the general idea here to target use of sel4+Rust for embedded work? Or is it feasible to think about this stuff in the context of server workloads on e.g. x86_64 machines?
I have a fantasy of writing a database engine that sits close to bare metal, completely managing its own memory for page buffers and so not engaging in negotiations with the Linux virtual memory system at all.
But such an application would need a robust networking and block device I/O stack. Something I'm not sure I can expect in sel4 land.
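For the "manage your own page buffers" part at least, much of it is plain mmap + mlock, which you can do on Linux today; the networking and block I/O stack is the harder bit, as you say. A rough sketch of the memory side (sizes are made up, and this ignores NUMA placement, huge pages, and RLIMIT_MEMLOCK handling):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define ARENA_BYTES (1UL << 30)   /* e.g. a 1 GiB buffer pool */
    #define PAGE_BYTES  (8UL << 10)   /* 8 KiB database pages */

    int main(void) {
        /* Reserve the whole pool up front... */
        void *arena = mmap(NULL, ARENA_BYTES, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (arena == MAP_FAILED) { perror("mmap"); return 1; }

        /* ...and keep it resident, so the engine never negotiates with
         * the VM subsystem again. */
        if (mlock(arena, ARENA_BYTES) != 0)
            perror("mlock (check RLIMIT_MEMLOCK)");

        /* The engine slices fixed-size pages out of the arena itself. */
        char *page0 = (char *)arena;
        memset(page0, 0, PAGE_BYTES);
        printf("buffer pool of %lu pages ready\n", ARENA_BYTES / PAGE_BYTES);
        return 0;
    }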
That might be less of a challenge than you think. One of our engineers added a nice little KV store via TickV (https://github.com/jonlamb-gh/ferros-sabrelite-toy-system) to be able to do persistent storage in a useful way on a per-application basis.
Our current plan, whenever we can get to it, is to implement a SQLite VFS layer (they did a really nice job abstracting away the database from the storage) to be able to host SQLite in a FerrOS task so that your local filesystem is a SQLite database.
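To be clear on what that looks like mechanically, this is just the hook SQLite exposes, not our implementation: a VFS is a struct of function pointers you register, and the lazy way to start is to clone the default one and swap methods in as you go. The names below are made up:

    /* Register a custom VFS that starts as a copy of the host's default
     * VFS; a real port would replace xOpen/xDelete/... with its own
     * storage backend. */
    #include <sqlite3.h>

    static sqlite3_vfs toy_vfs;

    int register_toy_vfs(void) {
        sqlite3_vfs *base = sqlite3_vfs_find(NULL);   /* default VFS */
        if (!base) return SQLITE_ERROR;
        toy_vfs = *base;                /* inherit all methods for now */
        toy_vfs.zName = "toyvfs";       /* ...then override them here  */
        return sqlite3_vfs_register(&toy_vfs, 0 /* not the default */);
    }

    /* Databases then opt in by name:
     *   sqlite3_open_v2("data.db", &db, SQLITE_OPEN_READWRITE |
     *                   SQLITE_OPEN_CREATE, "toyvfs");
     */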
Just email me (nathan@auxon.io) or post some issues on GH if you want to connect with folks who work on FerrOS. Heck, depending on what happens with some of our SBIRs we'll probably end up hiring a couple people to work on it full-time to do the stuff I mentioned.
Does ferros+sel4 have a workable story for working on a hypervisor? e.g. is there anything built around virtio & block devices and so on? (Just circling back on my original probing around where the emphasis of the project is... embedded or cloud... Arm SoC or x86_64, etc.)
seL4 is for application processors, so CPUs you'd normally think of, not MCUs like you'd typically use in embedded systems. When developing for something we start out running against targets virtualized in QEMU. Whether or not you could get seL4 booting under a hypervisor that's providing/emulating less of a complete platform? I'm sure it's possible. We never bothered to try propping it up on Xen or anything like that. I have to believe someone has tried to. With respect to FerrOS, it wouldn't make any difference. It will work once the kernel is up and brokering access to resources.
> unlike Linux and other hopelessly non-deterministic unorthogonal toy OSs
You should team up with the Hurd folks. They have, what, 30 years of the same impotently shaking their fists at the rest of the world touting how their perfect little academic exercise and its tens of users is the future of computing if everyone else on the planet wasn't so st00pid. Really. Any day now.
They really have the angry old CS professor schtick down. You could learn a lot.
For HFT you generally just shove the OS out of the way. It exists to provide basic access to the machine and to run the non-fast-paths, while the hot paths never touch it in the first place. You also aren't hard real-time, where it's literally a failure if you ever exceed a latency target.
In this context, Linux is fairly sensible, as it has decent knobs for getting out of the way and is well supported for all of your non-fast-path code.
For clarity: the HFT space does things like CPU pinning (the cores their tasks run on are cut out of the main cgroup into a dedicated one for their tasks). Likewise, memory regions are preallocated and static. Important resources like network stacks are done in user space - sure, the kernel has an IP address for the 1G NIC that's used for management, but the IB card that powers the trading interconnect (you're less likely to find IB these days, so take my story as a little dated) is mapped directly into userspace; there's no driver for it in the kernel.
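The "preallocated and static" part in particular tends to look like the sketch below: allocate and fault in everything at startup, then lock it so nothing gets paged out mid-session. Pool size is made up, and mlockall needs CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define ORDER_POOL_BYTES (64UL << 20)   /* all order state, reserved up front */

    int main(void) {
        /* One big allocation at startup, touched so every page exists... */
        char *order_pool = malloc(ORDER_POOL_BYTES);
        if (!order_pool) return 1;
        memset(order_pool, 0, ORDER_POOL_BYTES);

        /* ...then lock current and future mappings into RAM. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");

        /* The hot path only ever works inside order_pool from here on;
         * the kernel's allocator and page-fault handler stay out of it. */
        printf("pool resident and locked\n");
        return 0;
    }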
It's like a meme in the world of trading: your system needs to survive for 24 hours; whatever happens after that doesn't matter. This can really impact the way you design the system. Ignoring memory leaks in C++, or outright disabling the GC in Java, is one of the legit trade-offs you can make.
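The "never free, the process dies at end of day anyway" version of that trade-off is basically a bump allocator, something like this (pool size illustrative):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define POOL_BYTES (64UL << 20)

    static uint8_t pool[POOL_BYTES];
    static size_t  used;

    /* Bump-allocate out of a fixed pool; there is deliberately no free().
     * If the pool runs dry before the session ends, that's a sizing bug. */
    static void *day_alloc(size_t n) {
        n = (n + 15) & ~(size_t)15;             /* keep 16-byte alignment */
        if (used + n > POOL_BYTES) return NULL;
        void *p = pool + used;
        used += n;
        return p;
    }

    int main(void) {
        void *order = day_alloc(512);
        printf("order at %p, %zu bytes of pool used\n", order, used);
        return 0;
    }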
Early in my career, before "turbo" or "boost" frequencies were a thing in commodity CPU products, I heard a similar tale about boosting the speed of the signal processors in a defensive radar system.
The thought was that you would rather overheat and perform the calculation than protect the circuit and lose the ship. It's kind of a Hail Mary strategy. Even if apocryphal, it definitely gave me a visceral feeling for how one should be ready to re-frame their systems analyses.
There are in-service commercial planes that need to be routinely rebooted to prevent counters from overflowing, so I really wouldn't be surprised by similar sorts of issues in single-use devices.
Another technique used with Java is setting the heap really high, running it for a while against mock data until it's warmed up and the compiler has JITted out all the allocations, and then moving to real data.
If the mock data is good, you can do the whole thing in testing and fail tests if new code introduces GC events. However, the Java code you have to write ends up being very C-like.
The notion of the JIT warming up is nowhere near as rosy as most people believe. In particular, it is neither true that the JIT will lead you to your program performing at a steady level, nor that such a steady level will be faster than what you started with. C.f. https://tratt.net/laurie/blog/2018/why_arent_more_users_more...
Though perhaps one could write a program in a way that gets reliably jitted?
The talk is ostensibly about generating graphs, but it also describes a patch to make Linux prefer the previously-used core for any task, so that massively parallel tasks don't jump around and end up with two threads on the same core.
Hi, I'm the author. I'm glad you enjoyed it, and pleasantly surprised that Tanel posted it already. I was right in the middle of posting it myself an hour ago when I noticed it had been posted and had fostered a discussion already :)
Thanks for writing this up and doing the work preceding the writeup. Every time I read your articles, I learn something important and new (and I consider myself pretty good in the Linux perf/troubleshooting space :-)
Thanks for the props :-) This might not mean that much here, but one very useful thing that came out of my long Oracle RDBMS performance career is that now I'm not afraid of addressing (attacking?) systems of any complexity. The Oracle DB is one of the most complex pieces of software out there (it can also be a cache-coherent distributed system, BTW), so once you learn how to handle that, plus all the OS and storage interactions, you'll be capable of dealing with most (if not all) RDBMSs out there :-)
Agreed - multi-core processing is now ubiquitous, from a Raspberry Pi on down, and negative affinity concepts ought to be in every junior engineer's toolkit (NOT that they should be cargo-culted, but understood and deployed where it is beneficial).
I've always considered the second, or second-fastest, CPU core to be the real king for this reason. The OS will want to use the first CPU core for various things, for whatever reasons its developers came up with, sometimes by accident, and without warning.
I doubt OS's ever purposely utilize the second core lol.