I'm not surprised by the result. Another commenter said:
> async I/O is faster because it avoids context switches and amortizes kernel crossings
I think this is widely believed, but it's not particularly true for async I/O (of the coroutine kind meant by async/await in Python, NodeJS and other languages, rather than POSIX AIO).
With non-blocking-based async I/O, there are often more system calls for the same amount of I/O, compared with threaded I/O, and rarely fewer calls. It depends on the pattern of I/O how much more.
Consider: with async I/O, non-blocking read() on a socket will return -EAGAIN sometimes, then you need a second read() to get the data later, and a bit more overhead for epoll or similar. Even for files and recent syscalls like preadv2(...RWF_NOWAIT), there are at least two system calls if the file is not already in cache.
Whereas, threaded I/O usually does one system call for the same results. So one blocking read() on a socket to get the same data as the example above, one blocking preadv() to get the same file data.
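The syscall counting above can be sketched with a non-blocking socket pair. This is a toy illustration, not a benchmark: it just counts the calls the async pattern makes where one blocking recv() would do.

```python
import select
import socket

# Create a connected pair; make one end non-blocking, as an
# async/await runtime would.
a, b = socket.socketpair()
a.setblocking(False)

syscalls = 0

# Syscall 1: recv() on an empty socket fails with EAGAIN
# (surfaced in Python as BlockingIOError).
try:
    a.recv(100)
except BlockingIOError:
    pass
syscalls += 1

# The peer writes; both the async and threaded models pay this
# cost, so it isn't counted.
b.send(b"hello")

# Syscall 2: wait for readiness via select()/poll()/epoll.
select.select([a], [], [])
syscalls += 1

# Syscall 3: recv() now actually returns the data.
data = a.recv(100)
syscalls += 1

print(syscalls, data)  # three syscalls where blocking recv() needs one
a.close()
b.close()
```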
Every system call is two user<->kernel transitions (entry, exit). The number of these transitions is one of the things we're talking about reducing with async/await style userspace scheduling.
Threaded I/O puts all context switches in kernel space, but these add zero user<->kernel transitions, because all the context switches happen inside an existing I/O system call.
Another way of looking at it: async replaces every kernelspace context switch with a kernel entry/exit transition pair plus a userspace context switch.
So the question becomes: does the speed of userspace context switches, plus the kernel entry/exit costs of the extra I/O system calls, compare favourably against kernel context switches, which add no extra kernel entry/exit costs?
If the kernel scheduler is fast inside the kernel, and kernel entry/exit is slow, this favours threaded I/O. If the kernel scheduler is slow even inside the kernel (which it certainly used to be in Linux!), and kernel entry/exit for I/O system calls is fast, it favours async.
This is despite userspace scheduling and context switching usually being extremely fast if done sensibly.
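As a rough illustration of that tradeoff, here's a toy cost model. Every cycle count below is an invented assumption for the sake of the arithmetic, not a measurement; which side wins depends entirely on the real numbers for your CPU and kernel.

```python
# Assumed costs in cycles -- purely illustrative, not measured.
KERNEL_ENTRY_EXIT = 100    # one user<->kernel transition pair
KERNEL_CTX_SWITCH = 2000   # one kernel context switch
USER_CTX_SWITCH = 50       # one userspace coroutine switch

# Threaded model: one blocking syscall (one entry/exit pair),
# and the context switch happens inside the kernel.
threaded = KERNEL_ENTRY_EXIT + KERNEL_CTX_SWITCH

# Async model: three syscalls (EAGAIN read, epoll wait, real read),
# each with its own entry/exit pair, plus a userspace switch.
async_io = 3 * KERNEL_ENTRY_EXIT + USER_CTX_SWITCH

# With these particular numbers async wins; flip the assumptions
# (slow entry/exit, fast kernel scheduler) and threaded wins.
print(threaded, async_io)
```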
Everything above applies to async I/O versus threaded I/O and counting user<->kernel transitions, assuming them to be a significant cost factor.
The argument doesn't apply to async that is not being used for I/O. Non-I/O async/await is fairly common in some applications, so that tilts the balance to userspace scheduling, but nothing precludes using a mix of scheduling methods. In fact doing blocking I/O in threads, "off to the side" of an async userspace scheduler is a common pattern.
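Python's asyncio exposes this "off to the side" pattern directly via run_in_executor, which runs a blocking call in a thread pool while coroutines keep running on the event loop. A minimal sketch (the function and path are made up for illustration):

```python
import asyncio
import time


def blocking_read(path):
    # Stand-in for a blocking system call, e.g. a file read
    # that may sleep in the kernel.
    time.sleep(0.01)
    return "data from " + path


async def main():
    loop = asyncio.get_running_loop()
    # The blocking call runs in a worker thread; the event loop
    # (and any other coroutines) are not blocked meanwhile.
    return await loop.run_in_executor(None, blocking_read, "/tmp/example")


print(asyncio.run(main()))
```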
It also doesn't apply when I/O is done without system calls. For example memory-mapped I/O to a device. Or if the program has threads communicating directly without entering the kernel. io_uring is based on this principle, and so are other mechanisms used for communicating among parallel tasks purely in userspace using shared memory, lock-free structures (urcu etc) and ringbuffers.
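A minimal sketch of that last idea: a single-producer single-consumer ring buffer passing data between threads with no system calls on the data path. This leans on CPython's GIL to make the index handoff safe and uses busy-waiting where a real lock-free queue would spin on atomics; it's an illustration of the shape, not a production structure.

```python
import threading


class RingBuffer:
    """Toy SPSC ring buffer: no locks or syscalls on the data path."""

    def __init__(self, size):
        self.buf = [None] * size
        self.size = size
        self.head = 0  # written only by the producer
        self.tail = 0  # written only by the consumer

    def push(self, item):
        # Full: spin until the consumer frees a slot.
        while (self.head + 1) % self.size == self.tail:
            pass
        self.buf[self.head] = item
        self.head = (self.head + 1) % self.size

    def pop(self):
        # Empty: spin until the producer writes.
        while self.tail == self.head:
            pass
        item = self.buf[self.tail]
        self.tail = (self.tail + 1) % self.size
        return item


rb = RingBuffer(8)
out = []
consumer = threading.Thread(
    target=lambda: out.extend(rb.pop() for _ in range(100)))
consumer.start()
for i in range(100):
    rb.push(i)
consumer.join()
print(out == list(range(100)))
```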