One of the comments on the article is really interesting. The recent Meltdown business has really blown up the cost of privilege transitions, because now people expect non-architectural data not to leak across privilege boundaries; system calls are about twice as slow as they used to be. Meanwhile, I/O is faster than ever, with PCIe and NVMe. io_uring offers the opportunity to avoid privilege transitions through asynchronous calls based on writing to a memory buffer shared between user space and the kernel. That has the potential to fundamentally change the basic system call interface. As the article hints, the trick is designing the API so you can construct as large a block of work to be done asynchronously as possible. At the limit, you could push the core of your I/O loop down into the kernel, hence the suggestions that BPF programs could be submitted through the ring to chain operations entirely within the kernel.
(Incidentally, this is a good illustration of the flexibility of UNIX’s model of describing everything with a file descriptor. The same interface meant for asynchronous file I/O was easily extended to network I/O.)
> hence the suggestions that BPF programs could be submitted through the ring to chain operations all within the kernel
That sounds unbelievably annoying.
No, the limit has been found, and it is not in the kernel. You push the I/O loop up into userspace: more specific, not more general. They call it SPDK [1] (or DPDK for networking), and as far as I can tell the principle is essentially a dummy driver in the kernel that maps the entire PCIe peripheral memory space into your chosen process; everything flows from there.
At the I/O limit, asynchronous notification isn't feasible, because interrupts introduce latency and waste cycles not doing work. All userspace I/O frameworks work purely by polling.
The problem with mechanisms like DPDK is that they bypass all the infrastructure in the kernel and make it hard to play well with others using the same hardware or services. DPDK, for example, bypasses the TCP/IP stack. SPDK bypasses the VFS. You can write your own TCP/IP stack or filesystem on top of those things, but then you can't play well with other processes using those services. While some GPUs can directly multiplex command streams from different processes, most hardware cannot.
That's the point of DPDK: to get the kernel out of the way of packet processing.
Userland packet processing (in a network context) is much more flexible and less brittle than forcing certain functionality to exist solely in the kernel. There are also things that let you (mostly) transparently re-jigger a standard app's TCP/IP calls. One such example is using LD_PRELOAD to "hijack" the syscalls for certain things and snake them over to your (super high performance) userspace app!
There's a lot of exciting stuff happening in the open source networking world (DPDK, VPP/FDio, Network Service Mesh, etc). I really recommend digging into it!
Interestingly, the advent of these asynchronous, context-switch-less interfaces might see microkernels make a comeback, as they were originally decried for performance reasons (they traditionally need many expensive context switches to operate).
I wonder, too, whether in this batched model there's some opportunity to further leverage the 'race to idle' approach you see in mobile devices. If I have 16 cores and half of them are frequently spun down, the thermal budget for the others improves.
Turns out, if you give people a sane API, they actually want to use it! And this is the first relatively sane way on Linux/BSD to do asynchronous IO. :)
(I say "relatively sane" because this isn't really asynchronous; it's just make-believe on top of a kernel-managed thread pool, because fully synchronous I/O is ingrained far too deeply into both the Linux and BSD I/O stacks.)
> "relatively sane" because this isn't really asynchronous
What's your criteria for "really asynchronous"?
> it's just make-believe with a kernel-managed thread pool
To the extent that io_uring uses anything resembling a thread pool, it seems to me that it is used completely differently from how a userspace AIO thread pool operates. When a userspace AIO implementation submits IO to the kernel, it does so with a blocking syscall and that thread stalls until that IO is complete. That means the number of outstanding IOs is limited by the number of threads in the pool. I don't see any such limitation in using io_uring to deliver IO requests to the block layer.
One example might be that related operations, like stat(), opendir(), readdir(), getpeername(), and so on, remain synchronous, and that the async functionality is mostly a bolt-on to very established things: file descriptors, Berkeley sockets, etc.
Also, every improvement is a pretty big patchset, with code to ensure the traditional synchronous operations don't pick up unintended side effects.
Just generally the idea that a "clean start", non-POSIX bound OS might design things differently, I imagine. Google's Fuchsia seems to hit some middle ground, where async is more foundational, for example.
I don't think that observation detracts from the improvements.
> Just generally the idea that a "clean start", non-POSIX bound OS might design things differently, I imagine. Google's Fuchsia seems to hit some middle ground, where async is more foundational, for example.
Microsoft's (research, discontinued) Midori OS was heavily async:
That's exactly where I was going. Criticism of Linux async shouldn't be discouraged because of inherent limitations; that would hamper innovation. I wouldn't be surprised by a new web server, API gateway, load balancer, etc., that mandated "forget about POSIX, adhere to this". The whole cloud abstraction movement seems to enable it.
statx exists as an opcode, so stat through the ring is a solved problem. Directory operations are not implemented yet, but I'd be surprised if they don't arrive sooner rather than later.
> I don't see any such limitation in using io_uring to deliver IO requests to the block layer.
You can already have asynchronous IO from userspace to the block layer via linux-aio and O_DIRECT. But the VFS remains synchronous, so both uring and userspace can effectively only use a thread pool to work around that. Or magic.
Linux finally has something comparable to Windows I/O Completion Ports. Looks like it is flexible enough to be a generic asynchronous system call interface. Awesome!
The cost of syscalls has gone up massively due to all the mitigations against recent side channel attacks. Batching and async syscalls would be an even bigger win than ever.
Wouldn't it be great if glib or something could adopt new mechanisms and everyone got faster?
In the long term I think OS APIs need to be redesigned (or at least the calling mechanism) around something more like io_uring for a lot of things. Shared memory, lock-free data structures, no syscalls.
> Shared memory, lock-free data structures, no syscalls.
You'd still need a syscall to sleep when the ring is empty, and a syscall to wake up the mechanism when the first request is put on an empty ring. But yeah, other than that (and perhaps a "yield" syscall), you could do everything through the ring.
You can actually go totally syscall-free with io_uring, but that requires privileges, as the setup call enabling the SQPOLL flag will fail otherwise. But yeah, I suppose most people don't like running things under sudo.
Well, not totally free. If you don't submit an event before the SQ poll timeout, the kernel thread will sleep, and you need to call io_uring_enter() again to wake it up.
A popular, standard, cross-process-safe API for using mmap() for IPC, KV caching, locking, etc., might help as well. I understand it's already possible, but it's either reinvented every time or eschewed in favor of cross-host solutions...often when not needed.
And back in 2012, cores were expensive. Now we have more cores than software to run on them; having the kernel and user space sit in their respective busy polling loops, waiting for something to arrive, is finally a way to fill them up.
The dynamics and mechanical sympathy of different approaches are changing as hardware goes massively parallel faster than software can.
For users who need macOS and FreeBSD (kqueue) support as well, is there any unified standard for async I/O that covers both file and network? Or is the only choice to go with a library like libuv, which will always pick the optimal native implementation under the hood?
> I wonder though... wouldn't a safe first way of implementing these new syscalls be to make them actually synchronous?
No, because it visibly changes the semantics. Consider for instance IORING_OP_ACCEPT; if you make it synchronous, and nothing connects to your program, it would wait forever, instead of returning immediately and allowing the program to continue. The file-related opcodes are safer (when used with actual files, instead of network sockets), but still would behave differently for instance with a hanging NFS mount.
From last Sunday, at FOSDEM 2020: a talk[1] by QEMU developer Julia Suvorova on integrating io_uring into QEMU (the open source machine emulator and virtualizer), and the related performance implications.
From the abstract:
"io_uring enhances the existing Linux AIO API, and provides QEMU a flexible interface, allowing you to use the desired set of features: submission polling, completion polling, fd and memory buffer registration. By explaining these features we will come to examples of how and when you need to use them to get the most out of io_uring. Expect many benchmarks with different QEMU I/O engines and userspace storage solutions (SPDK).
You will get a brief overview of the new kernel feature, how we used it in QEMU, combined its capabilities to speed up storage in VMs and what performance we achieved. Should io_uring be the new default AIO engine in QEMU? Come and find out!"
That is for file IO, for which there were not many working async options available. I think the parent was rather interested in how it compares for IO loads where epoll & co are already alternatives.
There are several io_uring opcodes intended specifically for the network use case, including: IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_SENDMSG and IORING_OP_RECVMSG.
The reason it was created and what it targets are different things. io_uring is evolving into a very general way to do async interaction with the kernel, and the way it is built enables high performance for things that aren't disk IO as well.
Eh, nothing stops you from using a single buffer, but with different offsets for each open call. So I don't really see the problem unless you have so much buffer space allocated that it becomes problematic.
A "single buffer, but with different offsets for each open call" is identical to "allocate one buffer per socket" (you "allocated" the per-socket buffer by subdividing a larger allocation). And even if you don't "have so much buffer space allocated that it becomes problematic", it's still a waste of memory.
Question: is io_uring only worth using for writing to files, or does it provide performance benefits when reading from them, as well?
I've seen database programmers rave about the speed benefits of asynchronous I/O, because they have to store a lot of data on disk. But the majority of programs I write only have to deal with reading files. I'd love to try using io_uring, but only when it's appropriate.
Asynchronous I/O is only useful if you have something else you can be doing while you wait for file-data to arrive. For example, if you can do calculations on the part of the file you've already read while you wait for the rest to arrive, or if you can process data for one client while data from a different client is in-flight. Databases are a good example - usually they're handling queries for multiple clients at once, and even within a query they can be reading from multiple places simultaneously (like the individual tables involved in a JOIN).
If your program needs to read in an entire JSON blob (for example) before it can do anything, or if it only does light processing on each individual part (like adding up a column of a CSV file) then async I/O probably isn't going to help.
You can for example read from multiple files simultaneously.
You can also build web servers that read from disk as part of request handling, without having to do blocking file reads. Before io_uring, this scenario got absolutely horrible performance, as the whole thread would block while waiting for the file read. Now you can do other things concurrently.
It looks like, in the future, it might be a better way of implementing khttpd: handing the accept -> read -> sendfile -> close sequence of async operations over from a userspace control program to the kernel.
There is an ongoing effort to bring it into libuv: https://github.com/libuv/libuv/pull/2322 Looks like using liburing might be the way to go, since it's MIT licensed.