I'm not surprised by the result. Another commenter said:
> async I/O is faster because it avoids context switches and amortizes kernel crossings
I think this is widely believed, but it's not particularly true for async I/O (of the coroutine kind meant by async/await in Python, NodeJS and other languages, rather than POSIX AIO).
With non-blocking-based async I/O, there are often more system calls for the same amount of I/O, compared with threaded I/O, and rarely fewer calls. It depends on the pattern of I/O how much more.
Consider: with async I/O, non-blocking read() on a socket will return -EAGAIN sometimes, then you need a second read() to get the data later, and a bit more overhead for epoll or similar. Even for files and recent syscalls like preadv2(...RWF_NOWAIT), there are at least two system calls if the file is not already in cache.
Whereas, threaded I/O usually does one system call for the same results. So one blocking read() on a socket to get the same data as the example above, one blocking preadv() to get the same file data.
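The syscall counting above can be sketched with a non-blocking socket pair. This is a toy illustration, not a benchmark: it just counts the calls the async pattern makes where one blocking recv() would do.

```python
import select
import socket

# Create a connected pair; make one end non-blocking, as an
# async/await runtime would.
a, b = socket.socketpair()
a.setblocking(False)

syscalls = 0

# Syscall 1: recv() on an empty socket fails with EAGAIN
# (surfaced in Python as BlockingIOError).
try:
    a.recv(100)
except BlockingIOError:
    pass
syscalls += 1

# The peer writes; both the async and threaded models pay this
# cost, so it isn't counted.
b.send(b"hello")

# Syscall 2: wait for readiness via select()/poll()/epoll.
select.select([a], [], [])
syscalls += 1

# Syscall 3: recv() now actually returns the data.
data = a.recv(100)
syscalls += 1

print(syscalls, data)  # three syscalls where blocking recv() needs one
a.close()
b.close()
```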
Every system call is two user<->kernel transitions (entry, exit). The number of these transitions is one of the things we're talking about reducing with async/await style userspace scheduling.
Threaded I/O puts all context switches in kernel space, but these add zero user<->kernel transitions, because all the context switches happen inside an existing I/O system call.
Another way of looking at it: async replaces every kernelspace context switch with a kernel entry/exit transition pair plus a userspace context switch.
So the question becomes: does the speed of userspace context switches, plus the kernel entry/exit costs of the extra I/O system calls, compare favourably against kernel context switches, which add no extra kernel entry/exit costs?
If the kernel scheduler is fast inside the kernel, and kernel entry/exit is slow, this favours threaded I/O. If the kernel scheduler is slow even inside the kernel (which it certainly used to be in Linux!), and kernel entry/exit for I/O system calls is fast, it favours async.
This is despite userspace scheduling and context switching usually being extremely fast if done sensibly.
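As a rough illustration of that tradeoff, here's a toy cost model. Every cycle count below is an invented assumption for the sake of the arithmetic, not a measurement; which side wins depends entirely on the real numbers for your CPU and kernel.

```python
# Assumed costs in cycles -- purely illustrative, not measured.
KERNEL_ENTRY_EXIT = 100    # one user<->kernel transition pair
KERNEL_CTX_SWITCH = 2000   # one kernel context switch
USER_CTX_SWITCH = 50       # one userspace coroutine switch

# Threaded model: one blocking syscall (one entry/exit pair),
# and the context switch happens inside the kernel.
threaded = KERNEL_ENTRY_EXIT + KERNEL_CTX_SWITCH

# Async model: three syscalls (EAGAIN read, epoll wait, real read),
# each with its own entry/exit pair, plus a userspace switch.
async_io = 3 * KERNEL_ENTRY_EXIT + USER_CTX_SWITCH

# With these particular numbers async wins; flip the assumptions
# (slow entry/exit, fast kernel scheduler) and threaded wins.
print(threaded, async_io)
```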
Everything above applies to async I/O versus threaded I/O and counting user<->kernel transitions, assuming them to be a significant cost factor.
The argument doesn't apply to async that is not being used for I/O. Non-I/O async/await is fairly common in some applications, so that tilts the balance to userspace scheduling, but nothing precludes using a mix of scheduling methods. In fact doing blocking I/O in threads, "off to the side" of an async userspace scheduler is a common pattern.
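Python's asyncio exposes this "off to the side" pattern directly via run_in_executor, which runs a blocking call in a thread pool while coroutines keep running on the event loop. A minimal sketch (the function and path are made up for illustration):

```python
import asyncio
import time


def blocking_read(path):
    # Stand-in for a blocking system call, e.g. a file read
    # that may sleep in the kernel.
    time.sleep(0.01)
    return "data from " + path


async def main():
    loop = asyncio.get_running_loop()
    # The blocking call runs in a worker thread; the event loop
    # (and any other coroutines) are not blocked meanwhile.
    return await loop.run_in_executor(None, blocking_read, "/tmp/example")


print(asyncio.run(main()))
```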
It also doesn't apply when I/O is done without system calls. For example memory-mapped I/O to a device. Or if the program has threads communicating directly without entering the kernel. io_uring is based on this principle, and so are other mechanisms used for communicating among parallel tasks purely in userspace using shared memory, lock-free structures (urcu etc) and ringbuffers.
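A minimal sketch of that last idea: a single-producer single-consumer ring buffer passing data between threads with no system calls on the data path. This leans on CPython's GIL to make the index handoff safe and uses busy-waiting where a real lock-free queue would spin on atomics; it's an illustration of the shape, not a production structure.

```python
import threading


class RingBuffer:
    """Toy SPSC ring buffer: no locks or syscalls on the data path."""

    def __init__(self, size):
        self.buf = [None] * size
        self.size = size
        self.head = 0  # written only by the producer
        self.tail = 0  # written only by the consumer

    def push(self, item):
        # Full: spin until the consumer frees a slot.
        while (self.head + 1) % self.size == self.tail:
            pass
        self.buf[self.head] = item
        self.head = (self.head + 1) % self.size

    def pop(self):
        # Empty: spin until the producer writes.
        while self.tail == self.head:
            pass
        item = self.buf[self.tail]
        self.tail = (self.tail + 1) % self.size
        return item


rb = RingBuffer(8)
out = []
consumer = threading.Thread(
    target=lambda: out.extend(rb.pop() for _ in range(100)))
consumer.start()
for i in range(100):
    rb.push(i)
consumer.join()
print(out == list(range(100)))
```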