Things Unix can do atomically (2010) (rcrowley.org)
170 points by pabs3 on Aug 14, 2023 | 49 comments



The rename system call is not quite atomic. From https://linux.die.net/man/2/rename:

> However, when overwriting there will probably be a window in which both oldpath and newpath refer to the file being renamed.

A better way to think about rename is that it's given by two atomic operations in sequence: (1) An atomic link call to make the file available under the new name, and then (2) an atomic remove operation to delete the old name. Thus, it's possible to observe the state after (1) but before (2).
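
A rough sketch of that decomposition in Python (hypothetical paths; the real rename(2) is a single syscall, this just makes the intermediate window visible):

  import os

  def rename_in_two_steps(oldpath, newpath):
      # Step 1: atomically make the inode visible under the new name.
      # (Unlike rename(2), this fails with EEXIST if newpath already exists.)
      os.link(oldpath, newpath)
      # At this point both oldpath and newpath refer to the same inode,
      # which is the transient state the man page warns about.
      # Step 2: atomically remove the old name.
      os.unlink(oldpath)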


Dan Luu also says rename is not atomic, at the bottom of this post: https://danluu.com/file-consistency/

Ever since I read that, I've wondered what the first bullet there means. It can't be what you're saying here, since what you're saying would apply to racing processes as well as crashes?

So then.. what does he mean when he says rename isn't atomic if there's a crash? What's the failure condition / in-between state that can occur if there's a crash?


>what does he mean when he says rename isn't atomic if there's a crash?

Not sure. One of the papers he cites [1] has this to say about rename atomicity:

>Directory operations such as rename() and link() are seemingly atomic on all file systems that use techniques like journaling or copy-on-write for consistency

So it seems to depend on the file system. But the authors claim that not just link but also rename is atomic without qualification, which appears to be false, so I'm not sure how trustworthy this is.

There's a stackoverflow comment [2] that suggests that ext4 guarantees the kind of sequential atomicity property (atomic link followed by atomic remove) I mentioned:

>Kernel implementation depends on filesystem but here's implementation of Linux ext4 filesystem: elixir.bootlin.com/linux/v5.19.3/source/fs/ext4/namei.c#L3887 – it seems to first mark the inode as dirty, then create new link and after that delete the old link. If the kernel crashes in the middle, the end result could be two links and dirty mark. I would guess (but didn't investigate if this true) that journal recovery during the next mount would fix the end result to match atomic behavior of leaving either the old or new state depending on exact crash location.

In general, I find it rather weird how little authoritative documentation there is about whether or not this property holds given how important it is.

[1] https://www.usenix.org/system/files/conference/osdi14/osdi14...

[2] https://stackoverflow.com/questions/7054844/is-rename-atomic...


> I find it rather weird how little authoritative documentation there is about whether or not this property holds given how important it is.

Well, it can't be that important then, can it? :)


Usually the atomicity I care about is with regard to newpath: I either observe no file/previous file at newpath, or I observe the new file at newpath. This is in opposition to a copy for example, where I might observe a partially-written file.

This is a great point though, when taking oldpath into account the operation is not atomic. TIL
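
That newpath guarantee is what the usual write-temp-then-rename pattern relies on. A minimal sketch, assuming the temporary file lives on the same filesystem as the destination (names are made up):

  import os

  def atomic_replace(path, data):
      tmp = path + ".tmp"   # must be on the same filesystem as path
      fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
      try:
          os.write(fd, data)
          os.fsync(fd)      # make the contents durable before the name swap
      finally:
          os.close(fd)
      os.rename(tmp, path)  # readers see either the old file or the new one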


The type of atomicity I've always seen it used for is: two processes try to rename a file at the same time, only one will succeed; the winner does what it planned and then renames the file back.

I guess you shouldn't ignore the error when both processes use the same destination name, which is why folks generally tend to put their pid in the lock file name.


Umm, two concurrent rename(2) syscalls to the same destination path will likely both succeed. One of them will win (not be overwritten), but both should succeed.

If you want atomic creation of a path, where no one else can succeed also, you want link(2) not rename(2).
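
A sketch of that kind of link(2)-based acquisition, with made-up names (a per-pid scratch file linked to a shared lock path):

  import errno
  import os

  def try_lock(lockpath):
      # Create a uniquely named file, then try to link it to the shared name.
      # link(2) fails with EEXIST if lockpath already exists, so exactly one
      # caller wins; rename(2) would silently replace the destination instead.
      unique = "%s.%d" % (lockpath, os.getpid())
      fd = os.open(unique, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
      os.close(fd)
      try:
          os.link(unique, lockpath)
          return True
      except OSError as e:
          if e.errno == errno.EEXIST:
              return False          # someone else holds the lock
          raise
      finally:
          os.unlink(unique)         # the scratch name is no longer needed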


The author intends the list to be read as 'rights' or 'entitlements' of user land code:

> The philosophy here is to let the kernel do as much work as possible. At my most pessimistic, I trust the kernel developers more than I trust myself. More practically, it’s stupid to spend CPU time locking around an operation that’s already atomic.

From a system's point of view, this list is also a list of demands on file system design. And you can tell that POSIX was invented before distributed computing really was a thing.

Some of the POSIX demands are really quite costly to satisfy in a distributed, networked filesystem, and often unnecessarily degrade performance.


It's impossible to reliably satisfy anything in a distributed networked anything. These aren't costs so much as tradeoffs. Reliable, fast, accurate: pick one.


That's a trite observation that doesn't actually _say_ anything.

But formulated in that language, my comment becomes: POSIX picks less than useful trade-offs for distributed file systems.

An extreme example: POSIX's atime gives you something you seldom need, but imposes a huge burden. That's such a bad trade-off, that approximately no-one implements the original spec, even if nothing is distributed at all.


It’s not a bad idea, atime, just that our hardware never developed any tech where writes are as cheap as reads. Or even close. I wonder if that’s theoretically possible.


When they were coming up with 'atime' and thought it was a good idea, it was because with their tech writes were about as cheap as reads.


> just that our hardware never developed any tech where writes are as cheap as reads. Or even close. I wonder if that’s theoretically possible.

I had it in my head that modern flash actually does read in a way that involves writing, and we just paper over that for the sake of the abstractions we use. Am I wildly misremembering?


You might be thinking of DRAM, which does work that way. Reads discharge the memory array capacitance so the data has to be amplified and rewritten.


As far as I know, reads are reads. Writes, though, aren't quite so simple.

Sometimes blocks may have to be read and re-written elsewhere to make room... and this is what makes using TRIM and having cache on-disk so important.


Oh, I might have gotten it backwards; writes actually involving reads could make sense.


Trite?

He basically stated the CAP theorem. It's been proven mathematically.

Essentially if you distribute your data, you have to choose some degree of consistency versus availability.


The CAP theorem says you can satisfy two of C,A,P, not that "it's impossible to reliably satisfy anything".


And the CAP theorem only has a small relationship with why POSIX is bad for filesystems.


The path to really big distributed datastores goes through the CAP theorem.

I'm not fully familiar with POSIX requirements for filesystems, but having atomic changes and good performance and a host of other aspects to keep the illusion of available and consistent files in a distributed store runs right through the CAP theorem.


The problem isn't so much requiring atomic changes at all, but what kind of changes are required to be atomic, and which changes are not atomic. The POSIX tradeoffs are just horrible here.

As an example of how to do better, squint a bit and have a look at git interpreted as (part of) a distributed data store.

Git doesn't pretend that you can make a change in one location, and have it atomically show up in all locations. On the other hand, Git does allow you to make changes to multiple files, and only have them show up together or not at all.

Git also has explicit mechanisms for resolving conflicts when multiple people tried to make changes to the same files.

Git doesn't get around the CAP theorem; it just has much better trade-offs than POSIX, because Git was designed explicitly for distributed use.

POSIX also doesn't allow you to have different trade-offs for read-only and read-write data.


What's the difference between "reliable" and "accurate"?

In any case, I have seen enough of these "A, B, C : pick N" (where 0 < N < 3) mantras shot down (usually by re-examining whether the constraints/assumptions of the original mantra apply in a specific situation, or just through plain algorithmic engineering) that whenever someone drops something like this, I mostly tune it out.

In the case of distributed network <anything>, implementing programs in languages that help you manage important invariants automatically (and preferably not at run-time, but at compile-time) is key in creating robust systems that are both reliable, and fast.

But if you have such tools, and you are trying to also meet a set of standards that were designed without any care for distributed computing, then...yeah.


I think they probably are misremembering consistency, partition-tolerance, availability; pick 2.


No, I'm familiar with CAP. My theorem is different.

Reliable: A system can be available but not reliable; if a machine is down 6 minutes every hour, it's 90% available but less than 1 hour reliable. A networked distributed system can be reliable but not accurate if you keep it online while it slowly accrues errors, such as the mentioned link/unlink race condition. The machine still works; it just has occasional corruption. It tends not to be very fast if it's reliable.

Fast: You can do whatever you want quickly. Think UDP, or filesystems that don't commit their log journal frequently. Tends not to be reliable or accurate.

Accurate: Whatever you do has no corrupted state. Think ACID. Most systems don't do this because it's expensive/complicated.


Apologies - fair enough. I can't quite tell if that is a genuine set of three alternatives or not[0], but I sort of see your point. Couple of unsure places:

> Think ACID. Most systems don't do this because it's expensive/complicated.

Is this true? A lot of systems are built on databases.

> Accurate: Whatever you do has no corrupted state

How does accuracy work if the system is distributed and there's a partition, and two nodes get different data? Presumably it's reliable if they're different, as accuracy is a different item?

> Fast: You can do whatever you want quickly. Think UDP, or filesystems that don't commit their log journal frequently. Tends not to be reliable or accurate.

Why wouldn't this be reliable?


Most databases aren't ACID compliant, and even if they can be, they're often used in a non-ACID way.

> How does accuracy work if the system is distributed and there's a partition, and two node get different data?

You don't allow writes during a partition. All nodes (or a quorum) must confirm on each write, and if that fails then a partition is detected. Everything that was accidentally written before you detected partition is thrown away. Usually you just have one writer node for simplicity.

> Fast: > Why wouldn't this be reliable?

If you prioritize reliability you will need a slower method to ensure reliability. TCP is slower than UDP so it can be reliable.


> You don't allow writes during a partition. All nodes (or a quorum) must confirm on each write, and if that fails then a partition is detected. Everything that was accidentally written before you detected partition is thrown away. Usually you just have one writer node for simplicity.

Wouldn't this be done better with a majority/minority approach, where if enough nodes are still communicating then they still work?

> If you prioritize reliability you will need a slower method to ensure reliability. TCP is slower than UDP so it can be reliable.

Agree - that is true in the TCP sense of the word reliable. But in your sense of the word, as I understood it, the TCP/UDP difference is more like "accuracy", is it not? Reliability is more like availability, as I read it in your previous comment.



Is this a theorem, or a rule of thumb? I can't quite tell, because I thought theorems had proofs to go along with them?


You can prove it, if you pick some assumptions.


> Some of the POSIX demands are really quite costly to satisfy in a distributed, networked filesystem, and often unnecessarily degrade performance.

Especially things like `stat(2)`, which complicate things for distributed filesystems like Lustre. Or the lack of a readdir()+stat().


There's a lot wrong or at least weirdly described in the symlink entry.

> symlink(oldpath, newpath) operates very much like link(2) but creates a symbolic link at a new inode rather than a hard link to the same inode.

This is a really rough description. What really happens with a hard link is that a new directory entry is created pointing to the same inode; with a symlink, a new inode of type symlink is created that refers to its target by path/name.

> Symbolic links can point to directories, which hard links cannot, making them a perfect analogy to link(2) when locking entire directories. This will fail with the error code EEXIST if newpath already exists, making this a perfect analogy to link(2) that works for directories, too.

Except there's no way to count the number of symlinks that point to a given path like there is with the link count of the inode with multiple hardlinks.

> Be careful of symbolic links whose target inode has been removed ("dangling" symbolic links) — open(2) will fail with the error code ENOENT.

As symlinks point to a path, a dangling symlink is the result of the target path being removed (or not existing), not removal of an inode. You can put something else at that path, and the symlink will take you there instead. ENOENT is returned because the symlink doesn't point to anything.

The lstat(2) call can be used to examine symlinks directly; normally, file operations access the file through the symlink, making it mildly challenging to distinguish between a regular file and a symlink.
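
A small illustration of the difference, with a made-up path (lstat(2) looks at the link itself, stat(2) follows it):

  import os
  import stat

  def classify(path):
      st = os.lstat(path)              # examines the link itself
      if not stat.S_ISLNK(st.st_mode):
          return "not a symlink"
      target = os.readlink(path)
      try:
          os.stat(path)                # follows the link to its target
          return "symlink -> " + target
      except FileNotFoundError:
          return "dangling symlink -> " + target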

> It should be mentioned that inodes are a finite resource (this particular machine has 1,245,184 inodes).

This is unrelated information. It is filesystem format dependent.


There also is a limit to the number of hardlinks to an inode (filesystem dependent).


The POSIX specification defers most semantics of rename to ISO C. It does mention atomicity, but only in the non-normative “Rationale” section, and also again only with reference to ISO C, so this is an aspirational rather than a strict requirement of POSIX. On referral, however, it turns out that the C standard omits an explicit requirement for rename to be atomic. Consequently, rename is not definitively atomic in either POSIX or C.

I’m reminded that NFS was for a long time a popular storage for ISP mail but notoriously prone to non-atomic renames (or anything else, for that matter), and that is why many maildir implementations instead prefer a hard link followed by an unlink, via a tmp/ subdirectory on the same filesystem, during delivery and move/copy operations.


The faulty renames can be handled by looking for stale files and restoring or removing them. Once mail gets into the delivered directory, any duplicate link in a temporary directory gets deleted


They can, but it’s easier to immediately observe the outcome of link(2). What’s more, retrying the hard link is idempotent, whilst retrying a rename is not, especially in distributed or eventually-consistent circumstances. The link/unlink dance sidesteps a whole class of uncertainties. Assuming the filesystem at hand actually supports it, of course.


Right; so if you aren't sure about the filesystem, rename is the simplest most portable thing, plus the dumb workarounds. In practice I've never actually seen the race condition mentioned, but I'm sure it happens on large enough scales.


This is not a race condition. Any filesystem syscall can fail, and not always spectacularly. I can't recommend using rename for moving files into place, except as a fallback for filesystems where hard links are unavailable.


Any time you see "atomic", you can make a queue.

A practical application of atomic mv is building simple file based queuing mechanisms.

For example I wrote this SMTP buffer server which moves things to different directories as a simple form of message queue.

https://github.com/bootrino/arniesmtpbufferserver

It's pretty simple - ~100 lines of Python without comments etc.

https://raw.githubusercontent.com/bootrino/arniesmtpbufferse...

Caveat: I think this needs examination from the perspective of fsync, i.e. I suspect the code should be fsyncing at certain points, but I'm not sure.

I also wrote (in Rust) a simple file based message queue using atomic mv. It instantly maxed out the SSD performance at about 30,000 messages/second.
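
Not the code from that repo, just a sketch of the general pattern with made-up directory names (both assumed to exist on the same filesystem): producers rename fully written files into the queue, consumers claim a message by renaming it out, and the loser of a race simply sees ENOENT:

  import os

  QUEUE, WORKING = "queue", "working"

  def enqueue(name, data):
      tmp = os.path.join(QUEUE, "." + name + ".tmp")  # consumers skip dotfiles
      with open(tmp, "wb") as f:
          f.write(data)
          f.flush()
          os.fsync(f.fileno())         # see the fsync caveat above
      os.rename(tmp, os.path.join(QUEUE, name))       # message appears atomically

  def claim(name):
      try:
          # Only one consumer's rename of this source can succeed.
          os.rename(os.path.join(QUEUE, name), os.path.join(WORKING, name))
          return os.path.join(WORKING, name)
      except FileNotFoundError:
          return None                  # another worker already took it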


One application of this could be writing to a SQLite database.

If you wrap your writes in a shell script controlled by a lock directory, you are guaranteed to never receive SQLITE_BUSY.

  #!/bin/sh

  until mkdir /var/tmp/mylock
  do sleep 1;
  done

  trap 'rmdir /var/tmp/mylock' ABRT EXIT FPE HUP INT QUIT SEGV TERM

  sqlite3 mydatabase.db 'INSERT INTO MYTABLE VALUES (1,2,3);'
Since the mkdir() is atomic, the SQLite busy handler is now irrelevant with this approach.

On Linux, there is also a flock utility, but it tends to need subshells, which are not as efficient.

EDIT: This does reduce the write performance to one of these executions per second. Simple inserts like this can hit a few dozen per second otherwise, as I have heard.


The way to deal with link() on NFS is to stat() the old file after and see if it has two links. That's how you know you won.
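
Something like this, with made-up names; the point is to trust the link count rather than link(2)'s own return status, since over NFS the reply can be lost even when the operation succeeded on the server:

  import os

  def nfs_safe_lock(tmpname, lockname):
      # tmpname is a file we created beforehand, e.g. with our pid in the name.
      try:
          os.link(tmpname, lockname)
      except OSError:
          pass                         # the error may be spurious on NFS
      # Authoritative check: two links means lockname points at our inode.
      return os.stat(tmpname).st_nlink == 2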


Indeed:

https://github.com/jaysoffian/dotlock/

I think I first learned that trick from procmail and then confirmed it with the NFS spec? Been a long time...


These kinds of guidelines always fall short in that there isn't one UNIX.

There is POSIX, and then whatever each OS that implements it decided to do, beyond what isn't explicitly required for UNIX trademark certification.

Hence why it is so much "fun" writing portable code across UNIX/POSIX platforms.


Writes to pipes of at most PIPE_BUF bytes are guaranteed atomic by POSIX:

https://man7.org/linux/man-pages/man7/pipe.7.html
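
A sketch of what that buys you: as long as each message is no larger than PIPE_BUF, concurrent writers' messages land in the pipe whole, never interleaved (Python exposes the limit as select.PIPE_BUF):

  import os
  import select

  r, w = os.pipe()
  limit = select.PIPE_BUF        # 4096 on Linux; POSIX only guarantees at least 512

  def send(msg):
      assert len(msg) <= limit   # writes this small are all-or-nothing, so
      os.write(w, msg)           # concurrent writers never interleave bytes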


This is a can of worms. What's the size of PIPE_BUF? Surely that's the output of 'getconf PIPE_BUF /tmp'. Whoa, why's that so small? That can't be right. I'll run this test with dd on my FreeBSD box and see what it really is. What in... it CHANGES!? Okay, I'll just use 512.

So writes are atomic but what about reads? Oh that's why pathconf(9) returned that small number! You sure? What if a signal sneaks in while reading?


> By making frequent use of msync(addr, length, MS_INVALIDATE), data written in this manner can be shared between processes that both map the same file. mmap(2), msync(2).

MAP_SHARED means that the same physical memory pages are available via different virtual addresses. All the writes are visible immediately, at least in the same way as in regular memory (you may still need mfence-es or a cache flush depending on the CPU).

MS_INVALIDATE is essentially a no-op on Linux.
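
A minimal illustration of that visibility, with a made-up file name: under MAP_SHARED the child's stores show up in the parent's mapping without any msync:

  import mmap
  import os

  fd = os.open("shared.dat", os.O_RDWR | os.O_CREAT, 0o644)
  os.ftruncate(fd, 4096)
  buf = mmap.mmap(fd, 4096, mmap.MAP_SHARED)

  if os.fork() == 0:          # child: plain stores into the mapping...
      buf[0:5] = b"hello"
      os._exit(0)
  else:
      os.wait()
      print(buf[0:5])         # ...are visible to the parent with no msync()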


unlink(2) and rmdir(2) are atomic as well.

Also, there's a bit of atomicity in O_APPEND writes. Specifically, the end of the file will be adjusted atomically, though the actual writes may not happen atomically (beware!). Thus processes that use O_APPEND to write to the same file will not have their writes stepping on each others', but if there's a power outage you can have a situation where the file's length was adjusted as if one O_APPEND write completed but you might get zeroes instead of the actual bytes meant to be written.
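
A sketch of that append pattern, with made-up names, assuming each record fits in a single write:

  import os

  def append_record(path, record):
      fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
      try:
          # O_APPEND repositions to end-of-file atomically for each write,
          # so concurrent appenders don't clobber each other's records.
          os.write(fd, record)
      finally:
          os.close(fd)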

As for mmap() and msync()... I don't recommend it. First, if the buffer and page cache are shared in the filesystem, then msync() is essentially a no-op. Second, if the mapping is not anonymous and you're writing to a file and anything goes wrong you'll get SIGBUS (ugh). Third, if the buffer and page caches are not shared then you really do need msync(MS_SYNC), and that can be expensive. Reading files via mmap() is ok, but writing to them via mmap() is fraught.

"Leases" are a Linux thing, not POSIX or Unix.


Really cool use of strace to peek under the covers of what userland is doing. Very cool blog. Thanks to Pabs3 for posting it.


His first example -- "mv -T" -- is not POSIX, and is not available, e.g., under Busybox.



