>> the guarantee that data has been committed to somewhere durable when a write() call returns without an error is a semantic aspect of the POSIX write() API
> No, that's just flat out wrong.
from "man 3p write":
After a write() to a regular file has successfully returned:
* Any successful read() from each byte position in the file that was
modified by that write shall return the data specified by the
write() for that position until such byte positions are again
modified.
* Any subsequent successful write() to the same byte position in the
file shall overwrite that file data.
>> POSIX I/O is stateful
> This is fundamental to the authorization model. Authorization happens at file open time. It's also what enables the stream abstraction.
This is very true, but in the workloads the author is talking about, a stateless API would often enable a more efficient implementation. Think about what is going on in your file server when you have 100k clients all accessing the same open file.
> I don't know anything about the domain the author talks about (HPC), but it seems what he needs is basically direct access to the block device. Or writing away through network sockets / using a database.
The author is talking about (possibly distributed) networked filesystems backing clusters with extreme levels of parallelism (minimum 100s of nodes with 10s of processors on each node, and it gets much bigger). As for "using a database", that falls under the category of a user-space I/O stack, where the (userspace) database is proxying the I/O to reduce state.
The title of the article isn't at all bold in context, because it is well accepted in HPC that POSIX I/O is the bottleneck for certain types of loads, and the author is clarifying to those not familiar with the details why this is true.
> Any successful read() from each byte position in the file that was
> modified by that write shall return the data specified by the write()
... perhaps I can re-open() and re-read() the same byte value written by another process for the same file, but the file contents may not have been fully flushed all the way to disk. The file contents may be "durable" across processes on the same running OS that mount the same filesystem ... but if the OS happens to die before the data is flushed, then perhaps after reboot the open()/read() will return an older value previously written.
The semantics of "durability" are a squishy concept.
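To make the visibility-versus-durability split concrete, here is a minimal sketch in Python using the thin `os` wrappers around the POSIX calls. The helper name `write_visible_then_durable` and the file name are made up for illustration. The read-back succeeding is what the man page excerpt promises; surviving a crash is a separate request made with fsync(), and even then how far the data actually gets depends on the filesystem and the hardware underneath.

```python
import os
import tempfile


def write_visible_then_durable(path, payload):
    """Write payload, read it back, then ask the OS to flush it to storage.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # write() may be short, so loop until every byte is accepted.
        view = memoryview(payload)
        while len(view) > 0:
            n = os.write(fd, view)
            view = view[n:]

        # Visibility: a read after a successful write must see the new bytes,
        # even if they only live in the page cache so far.
        seen = os.pread(fd, len(payload), 0)

        # Durability is a separate, explicit request. Without this call a
        # crash may leave older contents on disk.
        os.fsync(fd)
        return seen
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_visible_then_durable(os.path.join(d, "demo.bin"), b"hello"))
```

On a distributed filesystem, the expensive part is the first guarantee, because every client has to see the dirty data immediately, not only the process that wrote it.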
Yes, the term "durable" was perhaps a poor choice of words, but the paragraph that followed made it clear that they were aware of the specific requirements (specifically mentioning making dirty caches available to all processes).
>> This is fundamental to the authorization model. Authorization happens at file open time. It's also what enables the stream abstraction.
> This is very true, but in the workloads the author is talking about, there are often times that a stateless API would enable a more efficient implementation. Think about what is going on in your file server when you have 100k clients all accessing the same open file.
The kernel can already cache such checks, I suppose. If you open a file 100k times for 100k users, you definitely have other scaling problems to solve first. Even if that becomes a bottleneck, a simple userspace LRU can solve the problem.
I don't understand how having the permission check performed upon every operation is faster than performing it only once at open.
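For anyone who wants to see "authorization happens at open time" in action, here is a rough sketch. The function name `read_after_revoking` is invented for the example, and it assumes a typical local Unix filesystem; a network filesystem may enforce things differently on the server side.

```python
import os
import tempfile


def read_after_revoking(path):
    """Open a file, strip every permission bit from it, then read via the fd.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_RDONLY)   # the permission check happens here
    try:
        os.chmod(path, 0o000)         # revoke all access on the path
        # The read goes through the already-open descriptor, so the mode
        # bits are not consulted again.
        return os.pread(fd, 64, 0)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "secret.txt")
        with open(p, "wb") as f:
            f.write(b"still readable")
        print(read_after_revoking(p))
```

The price of checking once is that something has to remember the open, which is exactly the per-client state the article complains about when the "something" is a shared file server.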
But you don't need to keep the state of "this file is opened".
What if the task that is being processed is idempotently writing something to the disk and then fails (and we assume we should be able to repeat it)?
Statefulness forces us to write code that handles closing the file, and also code for every failure path where the file was never closed, just to get proper RAII.
You don't keep the state "this file is opened", the kernel does. All you have is a ticket, the file descriptor.
The problem is that the bookkeeping also includes the position in the file. I guess an API with a position argument could work, where you can leave it null for network operations (or unseekable file streams).
But again, this does not make the API completely stateless, and for good performance reasons.
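POSIX does already have calls with a position argument, pread() and pwrite(), which take an explicit offset and leave the descriptor's own file position alone. A small sketch (the helper name `positional_roundtrip` is made up):

```python
import os
import tempfile


def positional_roundtrip(path):
    """Write and read at explicit offsets, then report the fd's own offset.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.pwrite(fd, b"AAAA", 0)
        os.pwrite(fd, b"BBBB", 4)
        # Repeating a positional write is harmless: same bytes, same place.
        os.pwrite(fd, b"BBBB", 4)

        data = os.pread(fd, 8, 0)

        # pread/pwrite never move the kernel's file position for this fd.
        offset = os.lseek(fd, 0, os.SEEK_CUR)
        return data, offset
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(positional_roundtrip(os.path.join(d, "demo.bin")))
```

This only removes the offset from the bookkeeping. You still need the descriptor, so the open/close state discussed above is still there.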