>> If you really think about it, the only real difference between main memory blocking and disk blocking is the amount of time they may block.
>
> This is a somewhat confusing analysis you have here. Direct read/write from memory for all intents and purposes doesn't block. Why do you say that reads and writes may also block?
Reads and writes from actual, physical, hardware memory might block, depending on how you define "block", in the sense that some reads may miss CPU cache. But once you get to that point, you could argue that every branch might block if the branch misprediction causes a pipeline stall. This is not a useful definition of "block".
The thing is, most programs are almost never low-level enough to be dealing with memory in that sense: they read and write virtual memory. And virtual memory can block for any number of reasons, including some pretty non-obvious ones. For example:
- the system is under memory pressure and that page is no longer in RAM because it got written to a swap file
- the system is under memory pressure and that page is no longer in RAM because it was a read-only mapping from a file and could be purged
  - e.g. it's part of your executable's code
- this is your first access to a page of anonymous virtual memory and the kernel hadn't needed to allocate a physical page until now
- you're in a VM and the VMM can do whatever it wants
- the page is COW from another process
I think what I'm saying is that calling file I/O "blocking" is also not a useful definition of "block". Because I don't really see the fundamental difference between "we have to wait for main memory to respond" and "we have to wait for disk to respond".
> this is your first access to a page of anonymous virtual memory and the kernel hadn't needed to allocate a physical page until now
And said allocation could block on all sorts of things you might not expect. Once upon a time I helped debug a problem where memory allocation would block waiting for the XFS filesystem driver to flush dirty inodes to disk. Our system generated lots of dirty inodes, and we were seeing programs randomly hang on allocation for minutes at a time.
> I think what I'm saying is that calling file I/O "blocking" is also not a useful definition of "block". Because I don't really see the fundamental difference between "we have to wait for main memory to respond" and "we have to wait for disk to respond".
In addition to the point made elsewhere: you're implicitly downplaying the magnitude of the difference here. The latency gap is on the order of 1000x.
Another way of drawing the line is whether the OS (or, more generally, some kind of software trap handler) has to get involved. A main memory read to a non-faulting address doesn't involve the OS, i.e. it never blocks. However, faulting reads, "disk" I/O, and networking I/O (i.e. I/O in general), which do involve the OS/monitor/what have you, are all potentially blocking operations.
It does not matter whether the OS is involved. Consider a spinlock; if it is spinning, waiting on the lock to be released, then it is blocking.
What matters is whether control returns to the process before the operation is complete. If the process waits, it is blocking (aka synchronous); if the process does not wait, it is non-blocking (and possibly also asynchronous if it checks later to see if the operation succeeded).
> Because I don't really see the fundamental difference between "we have to wait for main memory to respond" and "we have to wait for disk to respond".
The difference, conservatively, is a factor of 1000.
There are plenty of times in software engineering where scaling 1000x will force you to reconsider your architecture.
To be clear I do not believe that async disk I/O is never useful, I just think that it's not as useful as people at first imagine when they learn about async I/O.
Yes, it may be 1000 times slower than memory. But there's a fundamental paradigm difference from network events, in that with network events you are waiting for some other entity to take action, with no implicit expectation that they will do so in any particular timeframe. Like, if you're waiting for connections on a listen socket, there's no telling how long you will be waiting.
Disk I/O is fundamentally different in that once you submit an operation, you expect it to complete within a reasonable, finite time period.
Async disk I/O is primarily useful for implementing read-ahead / write-behind scheduling behaviors. While databases tend to be the obvious use case, the OS is often so poor at this that there are large performance improvements even for much simpler use cases that are otherwise disk I/O intensive.
I'm not sure that's the primary use case any more. Fast SSDs require high queue depths to reach their full throughput, so async I/O is desirable any time an application knows it has several I/O requests it can issue in parallel; one thread per request has too much overhead.
Sure, but that behavior is effectively read-ahead / write-behind on your I/O buffers. "Read-ahead / write-behind" doesn't mean much more than anticipating future I/O operations and issuing them before the code needs their completion to make efficient forward progress.
They're really not equivalent. Read-ahead only helps for predictable IO patterns. Issuing multiple read requests in parallel from the application is useful in a far broader range of scenarios. And for both reads and writes, being able to submit IO in batches (without having to wait for the entire batch to complete) can drastically cut down on overhead compared to submitting IOs sequentially as if they were a linear dependency chain, and makes it possible to keep the storage properly busy instead of it idly waiting on the host software to prepare and submit the next IO.
All cache replacement algorithms are literally equivalent to universal sequence prediction problems, per the optimality theorem. There is no implication of sequential decisions here. When you schedule a batch of disk I/O, you are essentially front-running the sequence predictor to avoid classes of prediction failure where successful prediction would be computationally intractable (and therefore not implemented in real systems), which is expected to produce better I/O throughput on average if done competently per the same theory. There is nothing magic about this, it is in the literature, and databases in particular have explicitly exploited non-sequential scheduling to circumvent fundamental sequence prediction limits for decades. Optimally anticipating future requirements for reads and writes can be called whatever you like, but that remains the primary use case for async I/O since you can't do it with blocking I/O in a single thread.
This becomes more important as caches become larger because cache efficiency increases are strongly sublinear as a function of size, as expected. Servers are already at the scale where very deep async I/O scheduling is required for consistent throughput with high storage density, beyond what can be done via traditional buffered disk I/O architectures, async or not. It is an active area of research with some interesting ideas.