One of the comments on the article is really interesting. The recent Meltdown business has really blown up the cost of privilege transitions, because now people expect non-architectural data not to leak across privilege boundaries; system calls are about twice as slow as they used to be. Meanwhile, I/O is faster than ever, with PCIe and NVMe. io_uring offers the opportunity to avoid privilege transitions through asynchronous calls based on writing to a memory buffer shared between user space and the kernel. That has the potential to fundamentally change the basic system call interface. As the article hints, the trick is designing the API so you can construct as large a block of work to be done asynchronously as possible. At the limit, you could push the core of your I/O loop down into the kernel, hence the suggestions that BPF programs could be submitted through the ring to chain operations entirely within the kernel.
(Incidentally, this is a good illustration of the flexibility of UNIX’s model of describing everything with a file descriptor. The same interface meant for asynchronous file I/O was easily extended to network I/O.)
> hence the suggestions that BPF programs could be submitted through the ring to chain operations all within the kernel
That sounds unbelievably annoying.
No, the limit has been found, and it is not in the kernel. You push the I/O loop up into userspace: more specific, not more general. They call it SPDK [1] (or DPDK for networking), and as far as I can tell the principle is essentially a dummy driver in the kernel that maps the entire PCIe peripheral memory space into your chosen process; everything flows from there.
At the I/O limit, asynchronous notification isn't feasible, because interrupts introduce latency and waste cycles not doing work. All userspace I/O frameworks work purely by polling.
The problem with mechanisms like DPDK is that they bypass all the infrastructure in the kernel and make it hard to play well with others using the same hardware or services. DPDK, for example, bypasses the TCP/IP stack. SPDK bypasses the VFS. You can write your own TCP/IP stack or filesystem on top of those things, but then you can't play well with other processes using those services. While some GPUs can directly multiplex command streams from different processes, most hardware cannot.
That's the point of DPDK: to get the kernel out of the way of packet processing.
Userland packet processing (in a network context) is much more flexible and less brittle than forcing certain functionality to exist solely in the kernel. There are also things that let you (mostly) transparently re-jigger a standard app's TCP/IP calls. One such example is using LD_PRELOAD to "hijack" the syscalls for certain things and snake them over to your (super high performance) userspace app!
There's a lot of exciting stuff happening in the open source networking world (DPDK, VPP/FDio, Network Service Mesh, etc). I really recommend digging into it!
Interestingly, the advent of these asynchronous, context-switch-less interfaces might see microkernels make a comeback, as they were originally decried for performance reasons (they traditionally need many expensive context switches to operate).
I wonder, too, whether in this batched model there's some opportunity to further leverage the 'race to idle' approach you see in mobile devices. If I have 16 cores and half of them are frequently spun down, the thermal budget for the others improves.
Turns out, if you give people a sane API, they actually want to use it! And this is the first relatively sane way on Linux/BSD to do asynchronous IO. :)
(I say "relatively sane" because this isn't really asynchronous; it's just make-believe on top of a kernel-managed thread pool, because fully synchronous I/O is ingrained far too deeply into both the Linux and BSD I/O stacks.)
> "relatively sane" because this isn't really asynchronous
What's your criteria for "really asynchronous"?
> it's just make-believe with a kernel-managed thread pool
To the extent that io_uring uses anything resembling a thread pool, it seems to me that it is used completely differently from how a userspace AIO thread pool operates. When a userspace AIO implementation submits IO to the kernel, it does so with a blocking syscall and that thread stalls until that IO is complete. That means the number of outstanding IOs is limited by the number of threads in the pool. I don't see any such limitation in using io_uring to deliver IO requests to the block layer.
One example might be that related operations, like stat(), opendir(), readdir(), getpeername(), and so on, remain synchronous, and that the async functionality is mostly a bolt-on to very established things: file descriptors, Berkeley sockets, etc.
Also, every improvement is a pretty big patchset, with code to ensure the traditional synchronous operations don't pick up unintended side effects.
Just generally the idea that a "clean start", non-POSIX bound OS might design things differently, I imagine. Google's Fuchsia seems to hit some middle ground, where async is more foundational, for example.
I don't think that observation detracts from the improvements.
> Just generally the idea that a "clean start", non-POSIX bound OS might design things differently, I imagine. Google's Fuchsia seems to hit some middle ground, where async is more foundational, for example.
Microsoft's (research, discontinued) Midori OS was heavily async:
That's exactly where I was going. Criticism of Linux async shouldn't be discouraged because of inherent limitations; that would hamper innovation. I wouldn't be surprised by a new web server, API gateway, load balancer, etc., that mandated "forget about POSIX, adhere to this". The whole cloud abstraction movement seems to enable it.
statx exists as an opcode, so stat through the ring is a solved problem. Directory operations are not implemented yet, but I'd be surprised if they don't arrive sooner rather than later.
> I don't see any such limitation in using io_uring to deliver IO requests to the block layer.
You can already have asynchronous IO from userspace to the block layer via linux-aio and O_DIRECT. But the VFS remains synchronous, so both uring and userspace can effectively only use a thread pool to work around that. Or magic.
Linux finally has something comparable to Windows I/O Completion Ports. Looks like it is flexible enough to be a generic asynchronous system call interface. Awesome!
The cost of syscalls has gone up massively due to all the mitigations against recent side channel attacks. Batching and async syscalls would be an even bigger win than ever.
Wouldn't it be great if glib or something could adopt new mechanisms and everyone got faster?
In the long term I think OS APIs need to be redesigned (or at least the calling mechanism) around something more like io_uring for a lot of things. Shared memory, lock-free data structures, no syscalls.
> Shared memory, lock-free data structures, no syscalls.
You'd still need a syscall to sleep when the ring is empty, and a syscall to wake up the mechanism when the first request is put on an empty ring. But yeah, other than that (and perhaps a "yield" syscall), you could do everything through the ring.
You can actually go totally syscall-free with io_uring, but that requires privileges, as the setup call enabling the SQPOLL flag will fail otherwise. But yeah, I suppose most people don't like running things under sudo.
Well, not totally free. If you don't submit an event before the SQ poll timeout, the kernel thread will sleep, and you need to call io_uring_enter() again to wake it up.
A popular, standard, cross-process-safe API for using mmap() for IPC, KV caching, locking, etc., might help as well. I understand it's already possible, but it's either reinvented every time or eschewed in favor of cross-host solutions...often when not needed.
And back in 2012, cores were expensive. Now we have more cores than software to run on them; having the kernel and user space sit in their respective busy polling loops, waiting for something to arrive, is finally a way to fill them up.
The dynamics and mechanical sympathy of different approaches are changing as hardware goes massively parallel faster than software can.
For users who need macOS and FreeBSD (kqueue) support as well, is there any unified standard for async I/O that covers both file and network? Or is the only choice to go with a library like libuv, which will always pick the optimal native implementation under the hood?
> I wonder though... wouldn't a safe first way of implementing these new syscalls be to make them actually synchronous?
No, because it visibly changes the semantics. Consider for instance IORING_OP_ACCEPT; if you make it synchronous, and nothing connects to your program, it would wait forever, instead of returning immediately and allowing the program to continue. The file-related opcodes are safer (when used with actual files, instead of network sockets), but still would behave differently for instance with a hanging NFS mount.
From last Sunday, at FOSDEM 2020: a talk[1] by QEMU developer Julia Suvorova on integrating io_uring into QEMU (the open source machine emulator and virtualizer), and the related performance implications.
From the abstract:
"io_uring enhances the existing Linux AIO API, and provides QEMU a flexible interface, allowing you to use the desired set of features: submission polling, completion polling, fd and memory buffer registration. By explaining these features we will come to examples of how and when you need to use them to get the most out of io_uring. Expect many benchmarks with different QEMU I/O engines and userspace storage solutions (SPDK).
You will get a brief overview of the new kernel feature, how we used it in QEMU, combined its capabilities to speed up storage in VMs and what performance we achieved. Should io_uring be the new default AIO engine in QEMU? Come and find out!"
That is for file IO, for which there were not many working async options available. I think the parent was rather interested in how it compares for IO loads where epoll & co are already alternatives.
There are several io_uring opcodes intended specifically for the network use case, including: IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_SENDMSG and IORING_OP_RECVMSG.
The reason it was created and what it targets are different things. io_uring is evolving into a very general way to do async interaction with the kernel, and the way it is built enables high performance for things that aren't disk IO as well.
Eh, nothing stops you from using a single buffer, but with different offsets for each open call. So I don't really see the problem unless you have so much buffer space allocated that it becomes problematic.
A "single buffer, but with different offsets for each open call" is identical to "allocate one buffer per socket" (you "allocated" the per-socket buffer by subdividing a larger allocation). And even if you don't "have so much buffer space allocated that it becomes problematic", it's still a waste of memory.
Question: is io_uring only worth using for writing to files, or does it provide performance benefits when reading from them, as well?
I've seen database programmers rave about the speed benefits of asynchronous I/O, because they have to store a lot of data on disk. But the majority of programs I write only have to deal with reading files. I'd love to try using io_uring, but only when it's appropriate.
Asynchronous I/O is only useful if you have something else you can be doing while you wait for file-data to arrive. For example, if you can do calculations on the part of the file you've already read while you wait for the rest to arrive, or if you can process data for one client while data from a different client is in-flight. Databases are a good example - usually they're handling queries for multiple clients at once, and even within a query they can be reading from multiple places simultaneously (like the individual tables involved in a JOIN).
If your program needs to read in an entire JSON blob (for example) before it can do anything, or if it only does light processing on each individual part (like adding up a column of a CSV file) then async I/O probably isn't going to help.
You can for example read from multiple files simultaneously.
You can also build web servers that read from disk as part of request handling, without having to do blocking file reads. Before io_uring, this scenario got absolutely horrible performance, as the whole thread would block while waiting for the file read. Now you can do other things concurrently.
It looks like, in the future, it might be a better way of implementing khttpd: handing the accept -> read -> sendfile -> close sequence of async operations over from a userspace control program to the kernel.
There is an ongoing effort to bring it into libuv: https://github.com/libuv/libuv/pull/2322 Looks like using liburing might be the way to go, since it's MIT licensed.