
>> the guarantee that data has been committed to somewhere durable when a write() call returns without an error is a semantic aspect of the POSIX write() API

> No, that's just flat out wrong.

from "man 3p write":

       After a write() to a regular file has successfully returned:

        *  Any successful read() from each byte position in the file that  was
           modified  by  that  write  shall  return  the data specified by the
           write() for that position until such byte positions are again modi‐
           fied.

        *  Any  subsequent successful write() to the same byte position in the
           file shall overwrite that file data.

>> POSIX I/O is stateful

> This is fundamental to the authorization model. Authorization happens at file open time. It's also what enables the stream abstraction.

This is very true, but in the workloads the author is talking about, there are often cases where a stateless API would enable a more efficient implementation. Think about what is going on in your file server when you have 100k clients all accessing the same open file.

> I don't know anything about the domain the author talks about (HPC), but it seems what he needs is basically direct access to the block device. Or writing away through network sockets / using a database.

The author is talking about (possibly distributed) networked filesystems backing clusters with extreme levels of parallelism (minimum 100s of nodes with 10s of processors on each node, and it gets much bigger). As far as "using a database" that falls under the category of a user-space I/O stack, where the (userspace) database is proxying the I/O to reduce state.

The title of the article isn't at all bold in context, because it is well accepted in HPC that POSIX I/O is the bottleneck for certain types of loads, and the author is clarifying to those not familiar with the details why this is true.




For this ...

    Any successful read() from each byte position in the file that  was
    modified  by  that  write  shall  return  the data specified by the write()
... perhaps I can re-open() and re-read() the same bytes written by another process to the same file, but the file contents may not have been fully flushed all the way to disk. The file contents may be "durable" across processes on the same running OS that mount the same filesystem ... but if the OS happens to die before the data is flushed, then after a reboot the open()/read() may return an older, previously written value.

The semantics of "durability" are a squishy concept.


Yes, the term "durable" was perhaps a poor choice of words, but the paragraph that followed made it clear that the author was aware of the specific requirements (specifically, mentioning making dirty caches visible to all processes).


Durable has a very specific meaning for I/O and is just not correct here.

POSIX intentionally does not specify any durability at all (e.g. a no-op fsync is explicitly permitted).


And in practice it is close to that, because hardware often cheats with a write cache.


You can actually do that with mmap, and with shm if you want. (mmap works well on files.) Of course, then you have to deal with all the dirty-page handling yourself.


>>> POSIX I/O is stateful

>> This is fundamental to the authorization model. Authorization happens at file open time. It's also what enables the stream abstraction.

> This is very true, but in the workloads the author is talking about, there are often times that a stateless API would enable a more efficient implementation. Think about what is going on in your file server when you have 100k clients all accessing the same open file.

The kernel can already cache such checks, I suppose. If you open a file 100k times for 100k users, you definitely have other scaling problems to solve first. Even if that becomes a bottleneck, a simple userspace LRU can solve the problem.

I don't understand how having the permission check performed upon every operation is faster than performing it only once at open.


But you don't need to keep the state of "this file is opened".

What if the task that is being processed is idempotently writing something to the disk and then fails (and we assume we should be able to repeat it)?

Statefulness makes us write code to handle closing the file, and also to handle all the failure paths if we fail to close it, for proper RAII.


You don't keep the state "this file is opened", the kernel does. All you have is a ticket, the file descriptor.

The problem is that the bookkeeping also includes the position in the file. I guess an API with a position argument could work, where you could leave it null for network operation (or unseekable file streams).

But again, this does not make the API completely stateless, and that's for good performance reasons.



