Linux: What Can You Epoll?

Matthias247 · on Oct 22, 2022

> On Linux write to a regular file never blocks. Writing to a file copies data from our user space buffer to the kernel buffer and returns immediately. At some later point in time the kernel will send it to the disk. A regular file is hence always ready for writing and epoll wouldn’t add anything.

Is that true? If it would be, the amount of data the Kernel would need to buffer is unbounded. I assumed there is a limit on the amount of buffered and not yet committed data, and when that is cross the call would block until more data is flushed to disk. Which is kind of the same as happens for TCP sockets. The `write()` call there doesn't really send data to the peer, it just submits data to the kernels send buffer, from where it will be asynchronously transmitted.

Edit: Actually I will answer my own question and say I know it will block. I had deployed IO heavy applications in the past with instrumented read/write calls for IO operations in a threadpool. Even though typical IO times are well below 1ms, under extremely high load latencies of more than 1s could be observed, which is far from "not blocking".

kentonv · on Oct 22, 2022

Yes, file I/O can block. However, there is an assumption that file I/O will never block "indefinitely" -- unless something is severely broken, the kernel will always finish the operation in finite time probably measured in milliseconds at most. The same is not true of network communications, where you may be waiting for an event that never happens.

There is a temptation to say that, well, milliseconds are a long time, so wouldn't we like to do this in a non-blocking way so we can work on other stuff in the meantime?

But... consider this: Reads and writes of memory also may block. If you really think about it, the only real difference between main memory blocking and disk blocking is the amount of time they may block. And with modern SSDs that time difference is not as large as it used to be.

So do you want to be able to access memory in a non-blocking way? Well... you can make the same logical arguments as you do with file I/O, but in practice, almost no one tries to do this. Instead, you separate work into threads, and let the CPU switch (hyper)threads whenever it needs to wait for memory.

In fact, memory reads may very well block on disk, if you use swap!

Given all this, it stops being so clear that async file I/O really makes sense.

Meanwhile, as it happens, the Linux kernel was never really designed for async file I/O in the first place. When you perform file I/O, the kernel may need to execute filesystem driver code, and it does so within the same thread that invoked the operation from userspace. That filesystem code is blocking. For the kernel to deliver true async file I/O, either all this code needs to be rewritten to be non-blocking (which would probably slow it down in most cases!), or the kernel needs to start a thread behind the scenes to perform the work.

But... you can just as easily start a thread in userspace. So... maybe just do that?

(Or, the modern answer: Use io_uring, which is explicitly designed to allow a userspace thread to request work performed on a separate kernel thread, and get notified of completion later.)

skissane · on Oct 23, 2022

> However, there is an assumption that file I/O will never block "indefinitely" -- unless something is severely broken, the kernel will always finish the operation in finite time probably measured in milliseconds at most. The same is not true of network communications, where you may be waiting for an event that never happens.

Doesn’t this statement assume all filesystems are local? If the filesystem is remote (NFS, CIFS/SMB, WebDAV or SSH via FUSE, etc), then regular file IO is the same thing as network communication - and an NFS server going down (or losing connectivity to it) no more counts as “severely broken” than the same scenario for any other network protocol/application/service.

kentonv · on Oct 23, 2022

A network filesystem is still expected to respond to reads and writes in a short finite time, assuming the network isn't broken.

For an arbitrary network protocol, the amount of time you wait for an event is arbitrary and application-dependent. Even if everything is working exactly as intended, you could be waiting indefinitely for an event.

skissane · on Oct 23, 2022

The network could be completely fine but the NFS server could have crashed (and maybe the backups were faulty so it is never coming back). That’s absolutely no different than if the same thing happened to an SSH server or a HTTP server or any other network service. The assumption is no more valid for NFS than it is for an arbitrary protocol.

izacus · on Oct 23, 2022

That's not even true on devices that use slow flash storage, much less network mounted file systems.

These assumptions tell a lot about why Linux so horribly falls apart when something like NFS lags a bit though.

throwaway09223 · on Oct 23, 2022

> However, there is an assumption that file I/O will never block "indefinitely"

No, there absolutely is not.

> unless something is severely broken, the kernel will always finish the operation in finite time probably measured in milliseconds at most

Systems get busy all the time and encounter delays larger than this all the time. I have fixed many supposedly low latency systems which incorrectly (as you suggest) assume that local disk writes ought to be fast because usually it hits a disk cache. It's very, very easy to break this assumption, like by copying a moderately sized file.

You cannot assume that a disk write will finish quickly. That is not the guarantee.

> Reads and writes of memory also may block.

Not really, no. There are many orders of magnitude of difference.

> If you really think about it, the only real difference between main memory blocking and disk blocking is the amount of time they may block.

All syscalls take time, but the difference here is many orders of magnitude. If a syscall is expected to take a meaningful (or in the case of i/o, infinitely variable) amount of time we label it a blocking call.

> In fact, memory reads may very well block on disk, if you use swap!

System calls can't swap. High latency jobs will often mark themselves unswappable for exactly this reason, or systems will simply run swapless.

> Given all this, it stops being so clear that async file I/O really makes sense.

No. No no no. This is flatly wrong. You will ship broken systems if you do this and people will have to call someone who knows what they're doing to fix it.

The best you can say is that statistically it won't happen often and maybe your users (or, management) won't notice until it's not your problem anymore.

> But... you can just as easily start a thread in userspace. So... maybe just do that?

Async i/o in glibc was historically implemented as simply a separate thread, fyi (prior to io_uring - I haven't looked in a decade). This is a form of async i/o.

anonymoushn · on Oct 23, 2022

> > Reads and writes of memory also may block.

> Not really, no. There are many orders of magnitude of difference.

If the kernel has locked your pages so it can merge or unmerge them under transparent huge pages or if you are using swap, page faults can take a really long time. People who care about async file I/O are probably already ensuring that these circumstances don't come up though.

abiloe · on Oct 22, 2022

> If you really think about it, the only real difference between main memory blocking and disk blocking is the amount of time they may block.

This is a somewhat confusing analysis you have here. Direct read/write from memory for all intents and purposes doesn't block. Why do you say that reads and writes may also block?

The reason memory blocks is because it needs to page in or out from secondary storage - which makes this statement "the only real difference between main memory blocking and disk blocking is the amount of time they may block." not really true

kentonv · on Oct 22, 2022

> Direct read/write from memory for all intents and purposes doesn't block.

Sure it does! Main memory is much slower than cache so on a cache miss the CPU has to stop and wait for main memory to respond. The CPU may even switch to executing some other thread in the meantime (that's what hyperthreading is). But if there isn't another hyperthread ready, the CPU sits idle, wasting resources.

It's not a form of blocking implemented by the OS scheduler, but it's pretty similar conceptually.

> The reason memory blocks is because it needs to page in or out from secondary storage

Nope, that's not what I was referring to (other than in the line mentioning swap).

abiloe · on Oct 22, 2022

> Sure it does! Main memory is much slower than cache so on a cache miss the CPU has to stop and wait for main memory to respond. The CPU may even switch to executing some other thread in the meantime (that's what hyperthreading is).

Cache is a memory. And which cache, by the way? Even L1 cache on modern processors doesn't have 0 latency. And this is a rather poor way of describing hyperthreading - the CPU doesn't really "switch" - the context for the alternate process is already available and the resource stealing can occur for any kind of stall (including cache loads), not just memory. Calling this a "switch" suggesting it is like a context switch is very misleading. It's not similar conceptually.

In any event, by this definition even a mispredicted branch or a divide becomes "blocking" - which sort of tortures any meaningful definition of blocking.

The essential difference is - memory accesses to paged in memory (and branch mispredictions, cache misses) are not something you typically or reasonably trap outside of debugging. mmaps, swaps, disk I/O, network accesses are all something delegated to an OS - and at that point are orders of magnitude more expensive than even most NUMA memory accesses. I sort of see where you're coming from - but I don't think it's a useful point.

kentonv · on Oct 22, 2022

None of this seems to contradict my point?

My argument is that disk I/O is more like memory I/O than it is like network I/O, and so for concurrency purposes it may make more sense to treat it like you would memory I/O (use threads) than like you would network I/O (where you'd use non-blocking APIs and event queues).

abiloe · on Oct 22, 2022

> My argument is that disk I/O is more like memory I/O than it is like network I/O

It depends on your network and disk - and yes SSD and "slow" ethernet are the common case, but there is enough variation (say an relatively sluggish embedded eMMC on one end and 100 GbE for the networking case), that there's no point in making some distinction about disk vs network latency - for a general concurrency abstraction they're both slow IO and you might as well have a common abstraction like IOCP or io_uring.

> concurrency purposes it may make more sense to treat it like you would memory I/O (use threads) than like you would network I/O (where you'd use non-blocking APIs and event queues).

No, case in point, Windows had IOCP for years such that you could use the same common abstraction for network and disk. The fact that the POSIX/UNIX world was far behind the times in getting its shit together doesn't mean much.

And why, fundamentally, can you not use blocking APIs with threads for networking?

kentonv · on Oct 22, 2022

> distinction about disk vs network latency

Ah but that's not the distinction I'm making at all. This has nothing to do with the physical latency of the network, which can easily be less than that of a disk. It's about the application-level semantics of your communication.

When you are waiting for a connection to arrive on a listen socket, you have no idea how long you might wait. It might never happen. Someone elsewhere on the network has to take the initiative to connect. Depending on the protocol, the same can be true of waiting for a read or even a write.

> Windows had IOCP

I think Windows is over-engineered in this respect. It doesn't seem to have given Windows any recognized fundamental advantage in running databases.

> And why, fundamentally, can you not use blocking APIs with threads for networking?

Well, you can, especially if you're only talking to a single peer. But when you are talking to multiple parties (especially as part of the same logical task, e.g. talking to multiple back-ends to handle one incoming request) it gets unwieldy to manage a thread for each connection.

jandrewrogers · on Oct 23, 2022

I think the important practical distinction is whether or not these stalls imply a context switch. Software that avoids blocking calls is largely trying to minimize context switches, with measurable adverse effects being increasingly common due to improvements in hardware. Stalls that do not imply context switches, such as filling a cache line, are not "blocking" as a matter of practical semantics because there is no context switch that has to be accounted for. Of course, this gets into a gray area with things like hyper-threading that have some of the side-effects of a context switch without an actual context switch.

bch · on Oct 22, 2022

With the utmost respect, I’ve never heard “blocking” described as “takes some measurable amount of time” (which is how I’m reading your above statement); by that definition, async blocks to a degree too.

You’re throwing traditional blocking/non-blocking distinctions on their ear.

Volundr · on Oct 22, 2022

Blocking in this case is referring to the CPU thread sitting idle whilst the operation is performed. This is what it means when your blocked on a network request, blocked on a disk operation, or blocked on a memory request. It's all blocking.

A cache miss and going to RAM is usually fast enough that we as software engineers don't care about it, and in fact our programming language of choice may not even give us a way of telling the difference between a piece of data coming from a CPU register or L1 cache vs going to RAM, but that doesn't mean the blocking isn't happening.

EDIT: to maybe make this a little clearer for those who might not be aware the CPU doesn't go fetch something from RAM. It puts in a request to the memory controller (handwaving modern architecture a bit here) then has to wait ~100-1000 CPU cycles before the controller gets back to it with the data. Depending on the circumstances the kernel may let that core sit idle, or it may do a context switch to another thread. The only difference between this process and say a network request is how many CPU cycles before you get your results. In the meantime the thread isn't progressing and is blocked.

bch · on Oct 22, 2022

> A cache miss and going to RAM is usually fast enough that we as software engineers don't care about it, and in fact our programming language of choice may not even give us a way of telling the difference between a piece of data coming from a CPU register or L1 cache vs going to RAM, but that doesn't mean the blocking isn't happening.

Yes, this is the line being discussed, and I guess (historically) I’ve just considered “a cost” without dragging “blocking” into the equation. We know that not accessing memory is cheaper than accessing it, and we can tune (pack our structs, mind thrashing the cache), but calling that blocking is still new to me. I’ll have to consider what it means. And also, does it imply the existence of non-blocking memory (I don’t think DMA is typically in a developers toolkit, but…)?

jmalicki · on Oct 22, 2022

> And also, does it imply the existence of non-blocking memory

Prefetching instructions, to tell the processor to load before you use it!

The first google hit [0] even calls it non-blocking memory access!

In [1] you can see some of the available prefetching instructions, and in [2] some analysis on how they deal with TLB misses (another extremely expensive way memory access can be blocking short of a page fault).

Another thing not mentioned above is that accessing a page of newly allocated memory often causes a page fault, since allocation is often delayed until use of each page, for overcommitting behavior - same for writing to memory that is copy-on-write from a fork!

[0] https://www.sciencedirect.com/topics/computer-science/prefet....

[1] https://docs.oracle.com/cd/E36784_01/html/E36859/epmpw.html

[2] https://stackoverflow.com/a/52377359/435796

Volundr · on Oct 22, 2022

> And also, does it imply the existence of non-blocking memory

Yes actually! If you know your going to need a block of memory before you actually need it, you can put in a request to the memory controller before you need it, then proceed to do some other work and check back in when your ready for the data or when the memory controller signals you it's done. It's just that this kind of thing is usually the scope of compiler optimizations or hyper optimized software like Varnish cache rather than something your average web developer thinks about. It's again conceptually the same as an async network request, but you bother with one while considering the other just "a cost" because of the different timescales.

jamiek88 · on Oct 22, 2022

Is that the same thing as a prefetch?

Volundr · on Oct 23, 2022

kentonv · on Oct 22, 2022

OK but whether or not this is a proper definition of "blocking" doesn't really change my point.

Alternatively, maybe my point is that disk I/O is not "blocking" either, it just "takes some measurable amount of time". :)

EDIT: What this says: https://news.ycombinator.com/item?id=33302587

YZF · on Oct 22, 2022

Historically I/O or peripherals were generally distinct from different types of memory.

Technically the CPU also "blocks" when executing an instruction, let's say adding two numbers, because obviously the sum isn't available in 0 cycles. One might imagine semantics where you explicitly ask some CPU unit to add some numbers and then wait for the result and having blocking or non-blocking wrappers around it but it sort of becomes moot. Generally async/non-blocking operations are abstractions over I/O peripherals that have an async nature, e.g. you submit something and you get some reply later, or you wait for a network packet to come in etc. Reply is often an interrupt.

I agree the lines have blurred a little with modern CPUs (and arguably were always blurry with some peripherals being memory mapped) but something like waiting on a packet to come in or a disk operation to complete is outside the box that you'd draw around the CPU and its more tightly integrated components.

tremon · on Oct 22, 2022

Why do you say that reads and writes may also block?

Let's define "may block" first, perhaps? What do we mean when we say "network I/O may block"? Usually, this means that the kernel may see your network request and raise you a context switch while it waits for the network response on your behalf. In your last sentence you appear to argue that the reason why the kernel performs a context switch is relevant in determining if an operation "may block", and the GP is arguing that that's a distinction without a difference.

If the definition of "may block" is really just "the kernel may decide to context-switch away from your program", then yes, the GP's assertion that file I/O, memory I/O (mmap) and memory access (swap) are all operations that may block is correct -- the only difference is in degree: from microsecond delays for nvm-backed swap to multi-second delays for network transfers.

Or, of course, I may have misunderstood the GP's train of thought.

jesboat · on Oct 22, 2022

>> If you really think about it, the only real difference between main memory blocking and disk blocking is the amount of time they may block. > > This is a somewhat confusing analysis you have here. Direct read/write from memory for all intents and purposes doesn't block. Why do you say that reads and writes may also block?

Reads and writes from actual, physical, hardware memory might block, depending on how you define "block", in the sense that some reads may miss CPU cache. But once you get to that point, you could argue that every branch might block if the branch misprediction causes a pipeline stall. This is not a useful definition of "block".

The thing is, most programs are almost never low-level enough to be dealing with memory in that sense: they read and write virtual memory. And virtual memory can block for any number of reasons, including some pretty non-obvious ones like. For example:

- the system is under memory pressure and that page is no longer in RAM because it got written to a swap file

- the system is under memory pressure and that page is no longer in RAM because it was a read-only mapping from a file and could be purged

-- e.g. it's part of your executable's code

- this is your first access to a page of anonymous virtual memory and the kernel hadn't needed to allocate a physical page until now

- you're in a VM and the VMM can do whatever it wants

- the page is COW from another process

kentonv · on Oct 22, 2022

> This is not a useful definition of "block".

I think what I'm saying is that calling file I/O "blocking" is also not a useful definition of "block". Because I don't really see the fundamental difference between "we have to wait for main memory to respond" and "we have to wait for disk to respond".

> this is your first access to a page of anonymous virtual memory and the kernel hadn't needed to allocate a physical page until now

And said allocation could block on all sorts of things you might not expect. Once upon a time I helped debug a problem where memory allocation would block waiting for the XFS filesystem driver to flush dirty inodes to disk. Our system generated lots of dirty inodes, and we were seeing programs randomly hang on allocation for minutes at a time.

abiloe · on Oct 22, 2022

> I think what I'm saying is that calling file I/O "blocking" is also not a useful definition of "block". Because I don't really see the fundamental difference between "we have to wait for main memory to respond" and "we have to wait for disk to respond".

In addition to the point elsewhere made that you're sort of implicitly denying the magnitude of the differences here - the latency differences are on the order of 1000s.

The other way of separating is if the OS (or some kind of software trap handler more generally) has to get involved. A main memory read to a non-faulting address doesn't involve the OS - ie it doesn't ever block. However faulting reads, calls to "disk" IO, and networking IO (ie just I/O in general) involving the OS/monitor/what have you are all potentially blocking operations.

cout · on Oct 23, 2022

It does not matter whether the OS is involved. Consider a spinlock; if it is spinning, waiting on the lock to be released, then it is blocking.

What matters is whether control returns to the process before the operation is complete. If the process waits, it is blocking (aka synchronous); if the process does not wait, it is non-blocking (and possibly also asynchronous if it checks later to see if the operation succeeded).

dahfizz · on Oct 22, 2022

> Because I don't really see the fundamental difference between "we have to wait for main memory to respond" and "we have to wait for disk to respond".

The difference, conservatively, is a factor of 1000.

There are plenty of times in software engineering where scaling 1000x will force you to reconsider your architecture.

kentonv · on Oct 23, 2022

Sure, fair enough.

To be clear I do not believe that async disk I/O is never useful, I just think that it's not as useful as people at first imagine when they learn about async I/O.

Yes, it may be 1000 times slower than memory. But there's a fundamental paradigm difference from network events, in that with network events you are waiting for some other entity to take action, with no implicit expectation that they will do so in any particular timeframe. Like, if you're waiting for connections on a listen socket, there's no telling how long you will be waiting.

Disk I/O is fundamentally different in that once you submit an operation, you expect it to complete within a reasonable, finite time period.

jandrewrogers · on Oct 23, 2022

Async disk I/O is primarily useful for implementing read-ahead / write-behind scheduling behaviors. While databases tend to be the obvious use case, the OS is often so poor at this that there are large performance improvements even for much simpler use cases that are otherwise disk I/O intensive.

wtallis · on Oct 23, 2022

I'm not sure that's the primary use case any more. Fast SSDs require high queue depths to use their full throughput, so async IO is desirable to use any time an application knows it has several IO requests to issue in parallel—one thread per request has too much overhead.

jandrewrogers · on Oct 23, 2022

Sure, but that behavior is effectively read-ahead / write-behind on your I/O buffers. That doesn't mean much more than anticipating future I/O operations before completion of that I/O is required by the code for efficient forward progress.

wtallis · on Oct 23, 2022

They're really not equivalent. Read-ahead only helps for predictable IO patterns. Issuing multiple read requests in parallel from the application is useful in a far broader range of scenarios. And for both reads and writes, being able to submit IO in batches (without having to wait for the entire batch to complete) can drastically cut down on overhead compared to submitting IOs sequentially as if they were a linear dependency chain, and makes it possible to keep the storage properly busy instead of it idly waiting on the host software to prepare and submit the next IO.

jandrewrogers · on Oct 23, 2022

All cache replacement algorithms are literally equivalent to universal sequence prediction problems, per the optimality theorem. There is no implication of sequential decisions here. When you schedule a batch of disk I/O, you are essentially front-running the sequence predictor to avoid classes of prediction failure where successful prediction would be computationally intractable (and therefore not implemented in real systems), which is expected to produce better I/O throughput on average if done competently per the same theory. There is nothing magic about this, it is in the literature, and databases in particular have explicitly exploited non-sequential scheduling to circumvent fundamental sequence prediction limits for decades. Optimally anticipating future requirements for reads and writes can be called whatever you like, but that remains the primary use case for async I/O since you can't do it with blocking I/O in a single thread.

This becomes more important as caches become larger because cache efficiency increases are strongly sublinear as a function of size, as expected. Servers are already at the scale where very deep async I/O scheduling is required for consistent throughput with high storage density, beyond what can be done via traditional buffered disk I/O architectures, async or not. It is an active area of research with some interesting ideas.

cout · on Oct 23, 2022

The reason memory access can block is because it can cause the page fault handler to be invoked (https://www.kernel.org/doc/gorman/html/understand/understand...). There are many reasons the page fault handler might cause the process to block. The kernel made need to swap the page in from disk, or it might be a copy-on-write page that was just written to.

To copy the page, the kernel needs to first find a free page frame. If there are no free page frames, the kernel will attempt to reclaim pages that are in use (https://www.kernel.org/doc/gorman/html/understand/understand...). This may cause in process-mapped pages being swapped to disk, but it also may result in disk writeback activity (https://lwn.net/Articles/396561/). In either case, control cannot be returned to the process until there is a free page to map into the process.

p12tic · on Oct 22, 2022

It's complicated, memory accesses can really block for relatively long periods of time.

Consider that regular memory access via cache takes around 1 nanosecond.

If the data is not in top-level cache, then we're looking at roughly 10 nanoseconds access latency.

If the data is not in cache at all, we are looking into 50-150 nanoseconds access latency.

If the data is in memory, but that memory is attached to another CPU socket, it's even more latency.

Finally, if the data access is via atomic instruction and there are many other CPUs accessing the same memory location, then the latency can be as high as 3000 nanoseconds.

It's not very hard to find NVMe attached storage that has latencies of tens of microseconds, which is not very far off memory access speeds.

slashdev · on Oct 22, 2022

I just want to add to your explanation, that even in the absence of hard paging from disk, you can have soft page faults where the kernel modifies the page table entries or assigns a memory page, or copies a copy on write page, etc.

In addition to the cache misses you mention there's also TLB misses.

Memory is not actually random access, locality matters a lot. SSDs reads, on the other hand, are much closer to random access, but much more expensive.

loeg · on Oct 22, 2022

> For the kernel to deliver true async file I/O, either all this code needs to be rewritten to be non-blocking

This is, I believe, the NT model.

jeffbee · on Oct 22, 2022

io_uring just racked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it. The idea that you can just hand off infinite amounts of work for the kernel to do on your behalf is pretty fundamentally broken. It is a concrete implementation of wishful thinking.

vlovich123 · on Oct 22, 2022

That analysis would seem smart but let's try a game of Mad Libs:

The Linux Kernel just racked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it.

KDE just racked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it.

Firefox just racked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it.

Chrome just racked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it.

Windows just racked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it.

Photoshop just racked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it.

All CPUs just rucked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it.

What's the theme? Racking up CVEs is something all software & hardware does. Mistakes can happen in design and in implementation and no one is immune. Using presence of CVEs as an indication of immaturity / fundamental design flaw isn't helpful. In fact, it's probably the opposite. Software that has no CVEs probably just means no one is paying attention to it. Sure, in a theoretical case maybe you've built a formal proof and translated that into a memory safe language somehow (& you assume you've made no mistakes modelling your entire system in your proof), then maybe. However, that encompasses 0% of all software.

> The idea that you can just hand off infinite amounts of work for the kernel to do on your behalf is pretty fundamentally broken. It is a concrete implementation of wishful thinking

How is that any different from a file descriptor? The kernel is free to setup limits on how much work you can have outstanding at any given time (now maybe those bits are missing right now, but it doesn't feel like an intractable problem).

tankenmate · on Oct 22, 2022

All "work" you want to do that interfaces with anything on an OS is handed off to the kernel; want to read a file? kernel, want to sleep for a while? kernel, etc. Besides things like network traffic is also asynchronous like io_uring (even if the socket() interfaces make it look somewhat synchronous). Outside of toy system asynchronicity is always a thing, especially when running on multiple cores.

I kind of get where you are coming from but at the same time, the kernel always gets the last say, so as long as io_uring has a good design and implementation it will always be just as good or bad as the OS as a whole. Whether run of the mill programmers are up to the task of being able to properly conceptualise and use such an OS is probably not the same thing.

jeffbee · on Oct 22, 2022

Yeah but it's not well-designed, that's my point. It has obliviously shrugged off the tricky question of object lifetime, that's why it has already collected 16 different CVEs for things like use-after-free. Considering its short history, io_uring has already rocketed to the top of the list of dangerous kernel features.

nathants · on Oct 22, 2022

with linux 6.0, lsm got the ability to filter io_uring. deny all and carry on.

insanitybit · on Oct 23, 2022

Can you provide a source? Eager to read about how this is done.

nathants · on Oct 23, 2022

https://github.com/torvalds/linux/blob/d47136c2801540e80f41e...

insanitybit · on Oct 23, 2022

Are they just going to keep adding hooks to every new system call implemented for io_uring?

nathants · on Oct 23, 2022

my understanding is that this hook is the only one, but to properly secure io_uring one has to implement a check for every call here. or just disable it.

caf · on Oct 22, 2022

The term "blocking" in UNIX-like OSes is jargon with a particular meaning. It means an interruptible wait.

Disk files do not block - they may Disk Wait instead, which is an uninterruptible wait (this is what the 'D' process status stands for). Disk Wait doesn't interact with O_NONBLOCK, select(3), poll(3) etc.

(Back in the bad old days it wasn't even possible for a Disk Waiting process to wake up to process a SIGKILL and die, which was the bane of system administrators everywhere when NFS introduced the idea of disks that could disappear when the network went down. Now it's common for OSes to make some kinds of Disk Waits at least killable).

pram · on Oct 23, 2022

Yep, last time I janitored NFS at scale anything stuck in D required an entire system reboot.

Snild · on Oct 22, 2022

> I assumed there is a limit on the amount of buffered and not yet committed data, and when that is cross the call would block until more data is flushed to disk.

There is. It's tunable through /proc/sys/vm/dirty_ratio. When there is that much write cache, application writes will start to writeback synchronously.

There is also dirty_background_ratio, which is the threshold at which writeback starts happening in the background (that is, in a kernel thread).

cout · on Oct 22, 2022

It is bounded by available memory. Writes to a socket go to a FIFO queue (the socket's write buffer), but writes to disk are different; they go through the page cache (https://www.kernel.org/doc/html/latest/admin-guide/mm/concep...):

> The physical memory is volatile and the common case for getting data into the memory is to read it from files. Whenever a file is read, the data is put into the page cache to avoid expensive disk access on the subsequent reads. Similarly, when one writes to a file, the data is placed in the page cache and eventually gets into the backing storage device. The written pages are marked as dirty and when Linux decides to reuse them for other purposes, it makes sure to synchronize the file contents on the device with the updated data.

There are many advantages to doing it this way. One is that multiple writes to the same page will result in a single physical write, if the page has not yet been flushed to disk.

There are many reasons that you might have seen a write to a file block. One is that the number of dirty pages has reached the threshold (nr_dirty_threshold in /proc/vmstat). After that happens, any process doing disk IO will block.

Another reason is memory pressure. Since all writes go through the page cache, the kernel must first allocate a page before the call to write(2) can be completed. If there are many pages in the page cache, this can take a long time (I once witnessed an old kernel bug cause all page allocations to result in kswapd attempting to reclaim pages, due to active pages being placed ahead of inactive pages in the LRU lists).

In general, if you are writing a lot to disk but you have no intention of reading it in the near future, it is a good idea to call posix_fadvise(2) with FADV_DONTNEED to ensure the pages will be reused for something else more quickly.

lanstin · on Oct 22, 2022

It is pretty easy to completely hork a large box with a very disk intensive process; hit a local file system hard enough and you can get a majority of the processes into D state, uninterruptible IO Disk wait. Maybe not from inside a container, haven’t see it, but definitely on a box with shared processes. Even just too much logging can harm unrelated processes that aren’t even doing much with the disk.

kotlin2 · on Oct 22, 2022

The write call returns how many bytes were accepted: https://man7.org/linux/man-pages/man2/write.2.html

> The number of bytes written may be less than count if, for example, there is insufficient space on the underlying physical medium, or the RLIMIT_FSIZE resource limit is encountered (see setrlimit(2)), or the call was interrupted by a signal handler after having written less than count bytes. (See also pipe(7).)

wtallis · on Oct 22, 2022

That doesn't answer the question. Blocking isn't a matter of how much data is written, but a matter of when the system call completes. Other parts of that man page imply that write(2) may block, unless the fd was opened with O_NONBLOCK (in which case you'll get an EAGAIN error instead of it blocking).

throwaway09223 · on Oct 22, 2022

> (in which case you'll get an EAGAIN error instead of it blocking).

You won't. O_NONBLOCK cannot be used with regular files. That part of the manpage is discussing other non-socket file types.

Disk i/o via write(2) is always a blocking call. Always. 100% of the time, no exceptions.

icedchai · on Oct 22, 2022

"It's complicated." Generally, with a regular file, write(2) will complete as soon as the the data makes it to filesystem buffers/cache. The data is probably not on disk when the call completes. This depends on how the file was opened (O_FSYNC, O_DIRECT, etc.) and the underlying filesystem itself. There are many other details at work, like actual file system, memory pressure (there may not be enough buffers), cache in the physical disk device or controller, etc. So the write call itself is "blocking", but the physical writes are (generally) not synchronous with the call.

wtallis · on Oct 22, 2022

Yes, whether a write blocks is really about whether the application can do anything else while the write is processed; whether the application is told the write is done when it lands in a cache or when it is actually on stable storage is a separate question.

fsckboy · on Oct 23, 2022

I don't know if I'm contributing anything, but to clarify by going back to basics (rather than advanced or non standard system calls),

all writes (and reads) are individually atomic, so if you and I are writing to the same file at the same time (in real time), how our overwriting behaves is relatively well behaved and can be relied on, so there is at least that level of atomic blocking, and that's important because

the kernel also buffers files transparently, in that our two processes will share the same buffers, so beyond atomicity in a shared piece of the file, we won't get blocked wrt each other. But your reads need to pick up my writes (atomicity!) so there isn't going to be buffered queuing of writes on top of the disk buffering.

so if the kernel decides it needs to spend more time syncing dirty buffers to disk, it's not blocking our processes in the regard above, but the system could become net i/o bound. This is 1/2 the reason to make sure to use virtual memory, so plenty of a flexible amount of disk buffer will be available (the other half being to swap out resident but no longer referenced dirty pages, as from the initialization code in your app)

throwaway09223 · on Oct 22, 2022

No, as you reasoned out it is absolutely incorrect. Write calls to regular files will block until they are complete, unless some kind of error situation is encountered.

This effect is often particularly pronounced with NFS, where calls might block for hours or even indefinitely if the underlying network filesystem goes away.

inetknght · on Oct 22, 2022

> This effect is often particularly pronounced with NFS, where calls might block for hours or even indefinitely if the underlying network filesystem goes away.

I can't tell you how many times I've had to debug a stuck process and it turns out that the logs indicated the NFS had a hiccup a day or two ago during a file read or write and the process was never notified of a file error. It's f!@#ing frustrating. Worse, though, was CIFS.

lanstin · on Oct 22, 2022

I routinely have to run file system scans on a giant NFS filer, and even without a hiccup, out of a 100M stat or read calls, ten or so will just never finish. In Go, I have to wrap the call with a channel thing and a time out and hope I don’t run out of threads before scanning all 400 M files.

tankenmate · on Oct 22, 2022

Just in case anyone isn't aware there is a mount flag called "soft" that allows the NFS client (and some other network filesystems) to timeout or be interrupted, i.e. the process won't get stuck in 'D' (device wait) state.

remram · on Oct 23, 2022

That is the other extreme, and is also wrong. write(2) does not block until the write to disk is completed, that's what we have fsync(2).

throwaway09223 · on Oct 23, 2022

No, that is incorrect. I think you're confused as to what "blocking syscall" means.

A blocking syscall doesn't always block, but it can always block because its interface is designed to block. Write(2) to a disk file can always block, 100% of the time. It is always considered a blocking syscall, even when it happens to not take very long due to cache optimizations.

Every single blocking syscall can sometimes return right away. This has no bearing on whether it is, or isn't, a potentially blocking call.

remram · on Oct 23, 2022

GGP was objecting to the fact that write always return immediately (writing to memory buffer, no back pressure), and GP said:

> Write calls to regular files will block until they are complete

I therefore understood "complete" to be the opposite to "only into buffers" in that context, e.g. all the way to disk. That is what I'm objecting to.

As to write(2) being a "blocking syscall" in the technical sense, I don't think I agree either. Of course it has that potential, but a write(2) to a regular file does not block in the sense of an interruptible block, or in the sense that O_NONBLOCK or select() can be used to avoid it. It is a "disk sleep" that is generally very different.

throwaway09223 · on Oct 23, 2022

Blocking calls often (perhaps usually) don't block. This has nothing to do with the definition.

> As to write(2) being a "blocking syscall" in the technical sense, I don't think I agree either.

Well, you are wrong.

> Of course it has that potential, but a write(2) to a regular file does not block in the sense of an interruptible block, or in the sense that O_NONBLOCK or select() can be used to avoid it.

These have nothing whatsoever to do with whether a syscall is blocking or not. You can't use them on wait(2) either, can you?

Please stop arguing about things you clearly know nothing about.

remram · on Oct 23, 2022

> Please stop arguing about things you clearly know nothing about.

You could have shown your true color outright and saved me the time. You don't care what I mean, you don't care what anybody means, you will just argue semantics to find a possible meaning where you're right. Well congrats.

throwaway09223 · on Oct 23, 2022

[flagged]

remram · on Oct 23, 2022

Sure, I will stop argumenting because you told me to. Thanks for the advice given while "you didn't care what I meant", it's as invaluable as you think.

graham_king_3 · on Oct 23, 2022

Thanks for starting a very interesting thread. I re-wrote part of that section based on all the feedback.

kentonv · on Oct 22, 2022

I highly recommend that you do NOT use signalfd to get notification of signals through epoll. Instead, block (mask) the signal, set a signal handler, and use epoll_pwait() to atomically unblock it while you wait for events. Note that in this setup, your signal handler callback need not be async-signal-safe, since you know the precise state of the calling thread: it's invoking epoll_pwait(). This sidesteps most of the pain of using signals which might otherwise make you think you want signalfd.

Two reasons not to use signalfd:

1. signalfd has weird semantics that don't match what you'd normally expect from a file descriptor. When you read from a signalfd, it tells you signals queued on the thread that called read(), NOT the thread that created the signalfd. Worse, if you add signalfd to an epoll, the epoll will report readiness based on the thread that used epoll_ctl() to add the signalfd, which may be different from the thread that is reading from the epoll. So you might get a notification that the signalfd is ready, but then read the signalfd and find there are no signals, and then wait on the epoll again just to have it tell you again that this signalfd is ready.

2. It turns out that signalfd's implementation has some severe lock contention issues. I learned this through my own experimentation recently. In my experiment, I had 5000 threads each waiting on an epoll that included a signalfd. When delivering a thread-specific signal to each of the 5000 threads at once, the process spent 2+ MINUTES of CPU time spinning on spinlocks in the kernel before completing all the event deliveries. The time spent was O(n^2) to the number of threads. When I switched to an epoll_pwait()-based implementation, the same task took a few milliseconds.

Here's the PR where I switched KJ's event loop (used in Cap'n Proto and Cloudflare Workers) to use epoll_pwait(): https://github.com/capnproto/capnproto/pull/1511

tlsalmin · on Oct 22, 2022

I have to disagree here. Not recommending signalfd for the mentioned use cases might be reasonable, just as reasonable as it is to use threads for a specific use case. For a single threaded non-blocking-FD using client/server signalfd removes the risk of doing too much in the signal handler and brings signals nicely into the event loop. This just happens to be 99% of the functionality I have to do.

I'd only use more than one signalfd if each signalfd only catches a specific signal. E.g. main context handles Sigterm and a background process library handles sigchld.

kentonv · on Oct 22, 2022

> removes the risk of doing too much in the signal handler

Not a concern here, since the signal handler is restricted from running at any time other than during epoll_pwait(), so the usual async-signal-safety concerns don't apply. In fact I think the code ends up cleaner than with signalfd, and there are fewer syscalls (no separate read() needed).

tlsalmin · on Oct 23, 2022

This is getting into taste territory but sure just getting the signal in epoll_pwait removes the need for FD reading. But introduces the need for the thread local global context and the need to handle EINTR from epoll right? I can see how checking if signals were called in the EINTR-branch is nice though.

Anyway nice to know the p-version use case.

cryptonector · on Oct 23, 2022

Using a self-pipe (signal handler writes one byte to one end, the other end is used in your event loop) is the most portable and reliable way to handle signals.

kentonv · on Oct 23, 2022

Yeah but it involves several extra syscalls, two file descriptors, and a kernel buffer that never has any actual data in it...

poll() + longjmp() out of the signal handler is actually surprisingly portable and much more efficient (but breaks all sorts of abstractions in ways that terrify people).

cryptonector · on Oct 24, 2022

The overhead is very limited. The memory overhead is fixed, and the compute overhead is linear with the rate at which signals arrive (which is typically super low). Plus the whole thing is very easy to reason about. And you never have to worry about masking signals that you handle with a self-pipe. Add in the portability, and it's a very clear win.

kelnos · on Oct 22, 2022

The big downside of using a traditional signal handler is that the only way to get your own data into the handler function is through global variables (or thread locals). While you can certainly make an exception just for that one thing, it feels gross to do so. And you can also just defer processing to your main loop by setting a flag or writing to a pipe, but those things still need to be global variables.

I didn't know about signalfd's limitations before reading your post, and was happy that signalfd could eliminate the need for global variables when doing signal handling. Shame that's not really the case.

kentonv · on Oct 22, 2022

In my case I use a thread_local pointer that I initialize right before epoll_pwait and set back to null immediately after. The pointer points to the same data structures that I would otherwise use to handle signalfd events. Yeah it's a little icky to use the global but I think it ends up semantically equivalent.

wahern · on Oct 22, 2022

Unfortunately, thread-local storage is not async-signal safe. You're relying (knowingly, I presume, but others should be warned) on implementation details.

But, yeah, signalfd leaves much to be desired. *BSD kqueue EVFILT_SIGNAL has much saner semantics.

kentonv · on Oct 22, 2022

> Unfortunately, thread-local storage is not async-signal safe.

Doesn't matter, because the signal handler in this case is strictly called "during" invocation of epoll_pwait, so there's no risk of it interrupting the initialization of a TLS object. The usual rules about async signal safety do not need to be followed here; it's as if epoll_wait()'s implementation made a plain old function call to the signal handler.

(Also, since we're talking about epoll, we can assume Linux, which means we can assume ELF, which means it's pretty easy to use thread_local in a way that requires no initialization by allocating it in the ELF TLS section. But yes, that's relying on implementation details I suppose.)

> kqueue EVFILT_SIGNAL

Having recently implement kqueue support in my event loop I have to say I'm disappointed by EVFILT_SIGNAL. It does not play well with signals that target a specific thread (pthread_kill()) -- on FreeBSD, all threads will get the kqueue event, while on MacOS, none of them do. Fortunately EVFILT_USER provides a reasonable alternative for efficient inter-thread signaling.

(I don't like using a pipe or socketpair as that involves allocating a whole two file descriptors and a kernel buffer, and it requires a redundant syscall on the receiving end to read the dummy byte out of the buffer. If you're just trying to tell another thread "hey I added something to your work queue, please wake up and check", that's a waste.)

wahern · on Oct 22, 2022

> Doesn't matter, because the signal handler in this case is strictly called "during" invocation of epoll_pwait, so there's no risk of it interrupting the initialization of a TLS object. The usual rules about async signal safety do not need to be followed here; it's as if epoll_wait()'s implementation made a plain old function call to the signal handler.

And if dlopen() is called in another thread, needing to (re)allocate TLS space and/or rebuild the index? You're relying on implementation details.

TLS is async-signal safe in musl libc, but musl libc also fudges (arguably) dlclose--it leaks memory as its async-signal safe data structures prevents it from deallocating per-module TLS space.

glibc may work by accident, or in the past 10 years or so they may have refactored things to work for your case. But, again, strictly speaking you're definitely relying on implementation details. (Which is perfectly acceptable as long as the dependency is transparent.)

> It does not play well with signals that target a specific thread (pthread_kill()) -- on FreeBSD, all threads will get the kqueue event, while on MacOS, none of them do.

That's because file descriptors are independent of threads--a thread doesn't "own" a descriptor--so the semantics of per-thread signals cannot cleanly map. (Multiple threads can wait on kqueue/epoll/signalfd descriptor, and moreover on both BSD and Linux you can install a kqueue/epoll descriptor as an event on another kqueue/epoll descriptor.) That Linux tries to fit a square peg into a round-hole with signalfd in this regard is probably related to the horrendous locking issues you encountered.

How are EVFILT_SIGNAL semantics better as a general matter? Among other things, it permits different libraries or components to listen for a signal without stepping on the toes of any other component. That's something that is simply impossible on Linux, period, where even signalfd is basically implemented as a signal handler--which is why you can't have both a signal handler and signalfd responding to a signal, not to mention multiple signalfd's responding to a signal.

There really aren't a lot of great solutions, here. OTOH, there aren't many use cases for per-thread signals, at least as originally conceived on Unix. EPIPE should be masked and handled in-band for anything with an event loop. SIGBUS and SIGSEGV must be handled by a signal handler, if at all. I'm sure you have your reasons, but at the end of the day it sounds like you're dealing with an issue at the nexus of signals, threading, and process semantics that cannot be correctly[1] resolved without a new, dedicated kernel API.

[1] Not strictly true. You could install a dedicated stack using sigaltstack for each thread, which could then be used to smuggle per-thread data to the signal handler, e.g. placing the data at a fixed offset from the stack given to sigaltstack. I did this once for an app that needed to longjmp from SIGSEGV--the SIGSEGV occurred by design when indexing off the end of an mmap'd array. Bounds checking and growing of the array inline was too slow as the constant inlined conditional checks destroyed pipelining. Catching SIGSEGV and longjmp'ing back to a safe point to regrow the array and then restarting the operation improved performance by some multiple > 2. Using the sigaltstack trick meant the library could be safely used from multiple threads, and without the rest of the application needing to know anything.

kentonv · on Oct 22, 2022

> And if dlopen() is called in another thread, needing to (re)allocate TLS space and/or rebuild the index? You're relying on implementation details.

This could also happen when I'm not in a signal handler, and the exact same mechanism that protects me in that case would also apply inside the signal handler (if the signal handler is restricted to running only during epoll_pwait()).

Again, a signal handler that is blocked from running except during epoll_pwait() has no special safety concerns different from those of a regular function call.

> That's because file descriptors are independent of threads--a thread doesn't "own" a descriptor--so the semantics of per-thread signals cannot cleanly map.

Well, file descriptors are independent of processes, too! You can send them to another process via parent->child inheritance or across a unix domain socket via SCM_RIGHTS. So what happens if you add EVFILT_SIGNAL in one process and then read the kqueue from another? Whose signals do you get?

Anyway, what they could have done is provided a way to specify, when adding EVFILT_SIGNAL, that it should notify only for signals deliverable to a particular thread.

wahern · on Oct 23, 2022

> Again, a signal handler that is blocked from running except during epoll_pwait() has no special safety concerns different from those of a regular function call.

You're absolutely correct. Mea culpa for objecting too soon.

I would've deleted my comment immediately after submitting as discussing my sigaltstack made me realize my error. But I got locked out by the procrastination setting. :(

kelnos · on Oct 22, 2022

Makes sense, and is probably the "safest" you can get. Since, as you say, you know exactly the state of everything on that thread when you're in your handler, you can also know that your thread local was set properly before the epoll_pwait() call.

It's probably code I'd want to isolate somewhere, with big warnings so any future reader understands why it is how it is, but I agree it's probably the safest way to do it.

cbsmith · on Oct 23, 2022

> The big downside of using a traditional signal handler is that the only way to get your own data into the handler function is through global variables (or thread locals). While you can certainly make an exception just for that one thing, it feels gross to do so. And you can also just defer processing to your main loop by setting a flag or writing to a pipe, but those things still need to be global variables.

That's not true. Ever since POSIX.4's real-time signals, we've had sigqueue(), which allows the you to attach an arbitrary sigval (integer/pointer union) that is passed to the signal handler, thereby allowing it to receive data without using globals/thread-local variables.

signalfd()'s signalfd_siginfo structure has two fields: ssi_int & ssi_ptr, which can receive said arbitrary data along with the signal. There was a brief period of time, when signalfd() was initially created, where those fields were not populated by the Linux kernel, but that hasn't been true since... (checks man page), 2.6.25 (and signalfd() has only been in the kernel since 2.6.22).

FPGAhacker · on Oct 22, 2022

You should do a write up of item 2.

tlsalmin · on Oct 22, 2022

Just skimmed through the article, since I'm just here to testify that the most important revelation for me on writing APIs was that you can put and epoll_fd in an epoll_fd. This allows the API to have e.g. a single epoll_fd that contains all outbound connections, timers, signalfds and inotifys mentioned in the articled. Then the e.g. daemon using the APIs can have an epoll_fd per library it is using and just be sitting in the epoll_wait loop ready to fire library_x_process() call when events arrive.

kentonv · on Oct 22, 2022

Another use case for this: Say you have a set of "jobs" each composed of many "tasks" (each waiting for some event). The "jobs" are able to run concurrently on different threads, but the "tasks" must not run concurrently with other tasks in the same job because they might share data structures without synchronization.

(This is a pretty common pattern in a lot of big servers.)

Now you want to make sure you utilize multiple cores effectively. The naive approaches are:

1. Create a thread per job, each waiting on its own epoll specific to the job. This may be expensive if there are many jobs, and could allow too much concurrency.

2. Have a single epoll and a pool of threads waiting on it. Each thread must lock a mutex for the job that owns the task it's going to run. But a thread could receive an event for a task belonging to a job that's already running on another thread, in which case it has to synchronize with that other thread somehow, which is a pain. Be careful not to create a situation where all threads are blocked on the mutex for one job while other jobs are starved.

Epoll nesting presents a clean solution:

3. Create an epoll per job, plus an outer epoll that waits on other epolls. A pool of threads waits on the outer epoll, which signals when a per-job epoll becomes ready. The thread receiving that event then takes ownership of the per-job epoll until the event queue is empty.

remram · on Oct 23, 2022

> The thread receiving that event then takes ownership of the per-job epoll until the event queue is empty.

What do you mean by that? Take that per-job epoll out of the outer epoll?

kentonv · on Oct 23, 2022

I'm handwaving a little because I haven't actually built something like this yet.

But I imagine you'd add the per-job epoll to the global epoll with EPOLLONESHOT, so that once an event is reported, it is unregistered with the global epoll. Whatever thread received that event then owns the job. When that thread decides there's nothing more to do, it adds the job epoll back to the global epoll with EPOLLONESHOT again.

rwmj · on Oct 22, 2022

It's weird that (according to this document) you can epoll Unix domain sockets but not sockets created by socketpair(2). I thought socketpair created essentially two pre-connected Unix domain sockets.

kentonv · on Oct 22, 2022

Hmm, I don't think that's it says (unless they edited it since your post?). It mentions socketpair explicitly as something that is epoll-friendly, and which you can use to communicate with another thread, in the case where you must create a thread to perform some blocking task but still want to get completion notification in the main thread via epoll.

ajross · on Oct 22, 2022

Indeed, I am all but certain you can epoll on socketpairs. That sounds like a mistake in the article.

graham_king_3 · on Oct 23, 2022

You can and they do. The article says:

    There are four epoll-able ways for processes or threads to communicate:
    - UNIX socket pairs: man socketpair.

bfrog · on Oct 22, 2022

why epoll at all, the new hotness is io_uring, fire away your iovecs, check back later

rwmj · on Oct 22, 2022

You can go from select/poll to epoll relatively easily, but I've found that to use io_uring you have to substantially rearchitect your whole program (if you want any performance benefit).

Actually I'd love to be wrong about this, but I've not found a way to easily retrofit io_uring into programs/libraries that are already using either synchronous operations or poll(2).

gavinray · on Oct 22, 2022

You can use userspace Coroutines/fibers implementations to wire in async io_uring into existing synchronous code and maintain the facade of the code still being synchronous

How easy/feasible this is depends on the language.

In C++, Rust, Zig, Java (Loom fibers), and Kotlin I know for a fact it's doable

Other languages I'm not sure what the experience is like

darksensei · on Oct 23, 2022

But in this case are you really reaping any benefit of io_uring, if your existing code is still synchronous (except your io_uring layer)?

It seems to me to see any substantial benefit, you'd have to propagate async throughout your codebase, which would require rewriting your existing code to use some async primitive. I guess something like Java Loom does help you mask this, however.

cbsmith · on Oct 23, 2022

It's all proactor vs. reactor. Basically the same as the differences between select/poll and AIO. If you're familiar with the two models, it's not THAT hard to shift between them, but I wouldn't deny that there is work involved.

jasonzemos · on Oct 22, 2022

io_uring is basically a drop-in for epoll. It has an intrinsic performance benefit because multiple operations can be both submitted and completed in a single action. Rearchitecting is only optional when going further by replacing standalone syscalls with io_uring operations. In the case of poll(2) I believe it should be no more difficult than refactoring for epoll.

wahern · on Oct 22, 2022

With io_uring, every line in an application that calls read/recv needs to be refactored, along with much of the surrounding context. io_uring doesn't replace poll/epoll, it effectively replaces typical event loop frameworks. You can integrate io_uring into pre-existing event loop frameworks, but the event loop framework will end up as a 99% superfluous wrapper, at least on Linux.

Note that many applications don't use event loop frameworks. For simple applications they can be overkill. Even for more complex applications, it may be cleaner to use restartable semantics (i.e. same semantics as read--just call me again), especially for libraries or components that want to be event loop agnostic.

jasonzemos · on Oct 24, 2022

> io_uring doesn't replace poll/epoll, it effectively replaces typical event loop frameworks.

I'm sorry but that statement is incorrect. The same functionality is accomplished by either io_uring or epoll.

> every line in an application that calls read/recv needs to be refactored

If the GP finds it easy to refactor from poll to epoll they will find it no different to refactor from poll to io_uring; the only caveat being they won't exceed epoll's performance with the io_uring until the standalone syscalls are further refactored. They don't need to be. The statement is incorrect.

drpixie · on Oct 22, 2022

Does anyone feel that the Linux API (and so the kernel) is slowly getting more and more complex and cumbersome?

rwmj · on Oct 23, 2022

That ship sailed a very long time ago - arguably after V7 if you consider Unix as a whole. The more important questions: Do the various APIs interact in ways which are insecure? Are corner-cases being properly tested? Are old interfaces properly documented as deprecated (with replacements)? Are the interfaces that you actually need easy to use? Does the kernel maintain backwards compatibility. Linux does fairly well on many of these with some exceptions (hello there extended filesystem attributes API).

sylware · on Oct 22, 2022

I wrote many of my own programs on elf/linux: I do epoll as much as I can.

The only troubling thingy is the lack of classification of signals, those who are synchronous by nature and the other ones. For instance, in a monothreaded application, segfault won't be delivered via epoll...

At the same time, it is still important to keep the asynchronous API for signals that for lower latency, but then, only the realtime behaviour should be kept since this is really where latency does matter.

cryptonector · on Oct 23, 2022

Synchronous signals can't possibly be delivered via an asynchronous mechanism like epoll. Well, I suppose they could, but only if the thread that tripped the synchronous signal is put to sleep until some other thread runs a handler for that signal, but this requires APIs that (to my knowledge, except... ptrace) don't exist, and there's no need for them. If you can handle a SIGBUS or SIGSEGV, you don't need to handle them asynchronously, but on the other hand if you can't handle them, then the process will die, so there's just no need to handle these asynchronously.

sylware · on Oct 23, 2022

Exactly what I said: this needs clarification. Maybe adding explicitely the 2 flavors of signals.

cryptonector · on Oct 24, 2022

No, it doesn't need clarification. Synchronous signals are always always synchronous, therefore it's obvious that they cannot be handled asynchronously with <fill-in-the-blank-async-system>. It's really clear. And if you stop to think about why synchronous signals are synchronous, it's even more clear.

sylware · on Oct 25, 2022

I disagree, I think it is not clear enough in POSIX.

guenthert · on Oct 22, 2022

Thanks for the reminder that there is no non-blocking i/o for files residing on block devices.

yxhuvud · on Oct 22, 2022

But there is, io_uring.

m00dy · on Oct 22, 2022

io_uring, a magical keyword I used to use in job interviews...

yxhuvud · on Oct 22, 2022

Having done my own share of uring bindings I wish I had found work places that appreciated that.

guenthert · on Oct 22, 2022

That is async i/o afaiu and not classic Unix non-blocking i/o (O_NONBLOCK given to 2 open()).

yxhuvud · on Oct 22, 2022

Sure. But why does the difference matter? It is not as if epoll is classic Unix either.

guenthert · on Oct 22, 2022

epoll might not be, but poll is (depending on how one would interpret 'classic').

Anyhow, I wrongly assumed the difference mattered in respect of whether one could use io_uring in combination with epoll(). It turns out, one can [1] or [2].

[1] https://stackoverflow.com/questions/70132802/waiting-for-epo...

[2] https://unixism.net/loti/tutorial/register_eventfd.html

healthandsafety · on Oct 22, 2022

Care to elaborate?

kortilla · on Oct 22, 2022

Everyone says it’s better on paper but you rarely get to actually use it in real code.

emilfihlman · on Oct 22, 2022

Regular files not having a non blocking mode is one of the biggest and gravest idiocies on linux land.

And there's one even worse: even having the concept of uninterruptiple sleep (D).

kocial · on Oct 23, 2022

Your link to `Go` project has incorrect link at footer "in practice" section.

graham_king_3 · on Oct 23, 2022

Fixed, thank you.

bitwize · on Oct 22, 2022

Why epoll when you can io_uring? In Rust?

karthikmurkonda · on Oct 22, 2022