PATH_MAX Is Tricky (eklitzke.org)
67 points by eklitzke on April 25, 2017 | 58 comments



Windows has a similarly disastrous situation, where most tools and APIs follow MAX_PATH, which is defined to be 260 characters. That doesn't affect the actual filesystem or syscall interface, just common APIs and tools, but it can make it impossible to delete files with over-long paths from Windows Explorer, for example.

If you want to work around this you basically have to bypass the limit by prefixing the full path with "\\?\". The situation gets especially messy when you're trying to write an installer that deals with node packages.

https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...
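
For illustration, a minimal sketch of opening a long path through the "\\?\" escape hatch. The path here is hypothetical; "\\?\" paths must be absolute, and each component is still subject to the filesystem's own limits:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        /* Hypothetical long path; the "\\?\" prefix tells the Win32 layer to
           skip MAX_PATH checks and normalization and hand the path to the
           filesystem (nearly) verbatim. The doubled backslashes are just C
           string escaping. */
        const wchar_t *path = L"\\\\?\\C:\\very\\deep\\tree\\file.txt";
        HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) {
            fprintf(stderr, "CreateFileW failed: %lu\n", GetLastError());
            return 1;
        }
        CloseHandle(h);
        return 0;
    }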


Even using "\\?\" you may hit NTFS's maximum filename (i.e. path component) length of 255, aka maxComponentLength, analogous to NAME_MAX. The Win32 error is 123: "The filename, directory name, or volume label syntax is incorrect."

https://msdn.microsoft.com/en-us/library/aa365247(VS.85).asp...


I've deleted such long paths by first moving the subdirectory to the root directory. In that case the directory path alone fit, but with the filename appended it overflowed. For very long paths you have to disassemble them piece by piece, moving parts into the root, before you can move or delete anything.

A built-in Tower of Hanoi game.


That's an issue that lies at the unfortunate intersection of 16/32-bit Windows, Windows NT, and MS libc (MSVCRT), which between them support some combination of two and a half system designs.


Microsoft could have fixed it with 64-bit Windows (Win64).

They had taken such a step before with the switch from Win16 to Win32, helped along by Win32s. With Win32 they cleaned up the old API and fixed things, yet kept it source-compatible where possible. Microsoft could have fixed so many things with Win64, starting with Windows Server 2003 64-bit. But no: Microsoft invested little in the native Windows API between 2002 and 2012; Longhorn (later Vista) and .NET were the hype of the day.


Microsoft has partially fixed this in Windows 10 but it is not simple or easy. https://blogs.msdn.microsoft.com/jeremykuhne/2016/07/30/net-...


> Microsoft could have fixed it with 64-bit Windows (Win64).

Ha, they wouldn't even rename System32 when converting it to 64 bit. There was no chance of an API cleanup.


It would not be as simple as you think though. Example: https://msdn.microsoft.com/en-us/library/windows/desktop/bb7...


Seems like they're just starting to fix it now in Windows 10, and only for applications that ship a special manifest (or machine-wide, if you set a specific registry entry).
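
For reference, the opt-in looks something like this, going by the Windows 10 long-path documentation (treat the exact element names as a sketch from memory):

    <!-- Application manifest fragment opting in to long paths (Windows 10
         1607+). The machine-wide switch is the LongPathsEnabled (DWORD) = 1
         value under HKLM\SYSTEM\CurrentControlSet\Control\FileSystem. -->
    <application xmlns="urn:schemas-microsoft-com:asm.v3">
      <windowsSettings xmlns:ws2="http://schemas.microsoft.com/SMI/2016/WindowsSettings">
        <ws2:longPathAware>true</ws2:longPathAware>
      </windowsSettings>
    </application>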


> The problem is that you can’t meaningfully define a constant like this in a header file. The maximum path size is actually going to be something like a filesystem limitation, or at the very least a kernel parameter.

AFAIK, "paths" aren't a thing filesystems think about. As far as a filesystem driver is concerned, paths are either—for open/read/write/etc.—plain inodes (a uint64), or—for directory-manipulation calls—an inode plus a ptrdiff_t to index into the dirent list†. The only things that care about NAME_MAX are lookup(2) [get inode given {inode, dirent}], and link(2) [put inode in {inode, dirent}].

So it's really only the kernel, through its syscall interface, that cares about paths—and so PATH_MAX is just a representation of the maximum size of a path the kernel is willing to accept in those syscalls. As if they each had a statically-allocated path[PATH_MAX] buffer your path got copied into.

† Writing a FUSE filesystem is a great way to learn about what the kernel thinks a filesystem is. It's very different from the userland perspective. For example, from a filesystem driver's perspective, "file descriptors" don't exist! Instead, read(2) and write(2) calls are stateless, each call getting passed a kernel-side io-handle struct that must get re-evaluated for matching permissions on each IO operation. (You can do some up-front evaluation during the open(2) call, but, given that a file's permissions might change while you have a descriptor open to it, there's not much point.)
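
To make that concrete, here's a minimal FUSE sketch (libfuse's high-level API, one read-only file; readdir and most error handling omitted for brevity, and the myfs_* names are just made up). Notice the shape of the read handler: it gets a path and a fuse_file_info, not any userland file descriptor, and keeps no per-open state:

    /* Build (assuming libfuse 2.x):
       gcc myfs.c -o myfs `pkg-config fuse --cflags --libs` */
    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <errno.h>
    #include <string.h>
    #include <sys/stat.h>

    static const char *contents = "hello\n";

    static int myfs_getattr(const char *path, struct stat *st) {
        memset(st, 0, sizeof(*st));
        if (strcmp(path, "/") == 0) {
            st->st_mode = S_IFDIR | 0755;   /* the mount point itself */
            st->st_nlink = 2;
        } else if (strcmp(path, "/hello") == 0) {
            st->st_mode = S_IFREG | 0444;   /* one read-only file */
            st->st_nlink = 1;
            st->st_size = (off_t)strlen(contents);
        } else {
            return -ENOENT;
        }
        return 0;
    }

    /* Stateless read: no fd, no open-file table on our side -- just a path,
       a buffer, a size, and an offset, fresh on every call. */
    static int myfs_read(const char *path, char *buf, size_t size, off_t off,
                         struct fuse_file_info *fi) {
        (void)fi;
        if (strcmp(path, "/hello") != 0)
            return -ENOENT;
        size_t len = strlen(contents);
        if ((size_t)off >= len)
            return 0;
        if (size > len - (size_t)off)
            size = len - (size_t)off;
        memcpy(buf, contents + off, size);
        return (int)size;
    }

    static struct fuse_operations myfs_ops = {
        .getattr = myfs_getattr,
        .read    = myfs_read,
    };

    int main(int argc, char *argv[]) {
        return fuse_main(argc, argv, &myfs_ops, NULL);
    }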


> For example, from a filesystem driver's perspective, "file descriptors" don't exist! Instead, read(2) and write(2) calls are stateless

That seems hard to believe. It would be inefficient to look up the file on every write(), for example. I'd guess a filesystem defines its own implementation-defined open-file handle and the kernel translates that into an (integer) FD.

> each call getting passed a kernel-side io-handle struct that must get re-evaluated for matching permissions on each IO operation. (You can do some up-front evaluation during the open(2) call, but, given that a file's permissions might change while you have a descriptor open to it, there's not much point.)

That's not how it works. File modes are checked when you open the file. Once you have an open file (a file handle), the file modes no longer matter.

File modes are more a "PATH" thing (there aren't per-filesystem variations and I assume permission checking is not done in file system code), although the file system must allocate the space to save the file mode bits and implement the VFS API.


> That seems hard to believe. It would be inefficient to look up the file on every write(), for example. I'd guess a filesystem defines its own implementation-defined open-file handle and the kernel translates that into an (integer) FD.

At least in BSD, the syscall layer calls into the kernel virtual filesystem, which translates file descriptors into file objects. File objects have fo_read/fo_write methods implemented by different kinds of file object.

Files backed by filesystems have an associated "vnode" (virtual inode object). Then the fo_read/fo_write layer invokes the VOP_READ or VOP_WRITE (for example) method on the vnode. VOP ("vnode operation") methods are implemented by individual filesystems. Multiple user-level file descriptors can refer to a single vnode.

The filesystem's inode object hangs off a pointer from the generic vnode. Vnodes are unique (1:1) per real file (inode).

So:

  read(fd, ...) ->
    sys_read -> kern_readv ->
      fget() (translate fd number into struct file object;
        check that the descriptor was opened read/write as appropriate for the operation)
      dofileread() ->
        fo_read() ->
          vn_io_fault() (fo_read method for vnodes) -> vn_read ->
            VOP_READ(vnode, uio, ioflags, ucred)
            (vnode for the file, an io descriptor ("uio"), io flags, and user credentials)
So yeah, at the filesystem layer (VOP_READ), you no longer have the fd (or even the file object) in the BSD model.

Linux's model is similar, but not completely identical. They call "vnodes" "inodes" instead, and may not have the intermediary file layer.


Yep, this is misleading.

"file descriptors" taken _literally_ don't exist (they are just the interface exposed to user space programs). Linux uses `struct file` internally as the generic file tracking, and multiple file descriptors can refer to a single `struct file`. `struct file` in turn refers to a `struct inode`, which represents an actual "something" on disk. Each file system driver stores a bunch of info in the `struct inode`.

In addition to that tracking, there is generally a bunch of fs-specific data structures to handle lookup of names (in the typical case, these are essentially cached versions of what is stored on disk).
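
You can observe that sharing from userland: descriptors produced by dup() share one `struct file` (and therefore one file offset), while a second open() of the same path gets its own. A small demonstration (the file path is just a stand-in; any readable file works):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int a = open("/etc/hostname", O_RDONLY); /* hypothetical readable file */
        int b = dup(a);                          /* same struct file as a */
        int c = open("/etc/hostname", O_RDONLY); /* a second struct file */
        char ch;
        if (read(a, &ch, 1) != 1)                /* advances the shared offset */
            return 1;
        /* b reports offset 1 (shared with a); c still reports 0. */
        printf("b offset: %lld\n", (long long)lseek(b, 0, SEEK_CUR));
        printf("c offset: %lld\n", (long long)lseek(c, 0, SEEK_CUR));
        return 0;
    }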


> there aren't per-filesystem variations and I assume permission checking is not done in file system code

There are! Keep in mind that "filesystems" includes things like NFS and SMB. The local kernel has no ultimate authority over security policy of a remotely-mounted filesystem, right? In open(2), the kernel has to ask the filesystem what's what, because the filesystem might know something the kernel doesn't.

Now, the interpretation of stat(2) values (UID/GID, ACLs, etc.) is up to the VFS; but the filesystem is the one that created that stat struct when the kernel called its stat(2) impl, and it expects to get that stat struct passed back unchanged to open(2), and then to decide for itself what those stat-struct members mean.

Which is to say, it's perfectly possible to write a filesystem that says you have 0000 permissions on a file, but which still lets you open(2), read(2), write(2), readdir(2), etc. that file! It's up to the filesystem to enforce file permissions or ACLs, and it can do that however it wishes; stat(2) is just an indication, in a common VFS language, of the policy the filesystem is (probably) going to enforce. It's not a baton passed to the kernel to do the enforcement for it. Linux has no equivalent to NT's kernel-object ACLs.

> File modes are more a "PATH" thing (there aren't per-filesystem variations and I assume permission checking is not done in file system code), although the file system must allocate the space to save the file mode bits and implement the VFS API.

Ah, sorry, I didn't mean file permissions; that was a typo. I mean things like: the process on the other end of a pipe closing its write end will cause your read(2) on that pipe's FD to fail, because the IO permissions (not file permissions) on the FD have changed between two successive read(2) calls.

When your open(2) impl gets called, you receive a stat struct [that you previously created yourself when the kernel called your stat(2) implementation] and a set of open(2) flags; you compare the two and decide whether to grant each permission requested by open(2), given the stat struct. Essentially, the open(2) impl is a pure mask function on the requested flags, determining which permissions actually end up in the descriptor. (Conveniently, the kernel then returns a permissions error if it doesn't get back the perms it asked for. But it could always end up with more perms than it asked for!)

Then, later, the kernel can modify that set of IO permissions without telling you, and your next read(2) or write(2) might get called with different IO permissions.

Another interesting fact: in the VFS struct file_operations (where you put your pointers to your filesystem's implementations of file operations), there is no member representing close(2). No FS-driver-level function gets called by the kernel in response to close(2)! Instead:

• There is a flush(2) that gets passed a file struct, to indicate to the FS that a given file's handle has been closed. But this is only there so that, if the file is part of a filesystem with synchronous commit (e.g. NFS in sync mode), closing the file will trigger a flush of the entire device. This is stateless and idempotent; a given file struct might get flush(2)ed any number of times. It's there to ask the file's extents' backing store to checkpoint itself, not to do anything with the file itself.

• There is a release(2) that gets called when all handles to a file have been closed—i.e. when the kernel's "open(2) refcount" on the file-struct drops to zero. If the filesystem, say, caches some things about the file when you open(2), you can release that cache-entry on release(2).

Notice that neither of these operations has semantics that would let you clean up local state allocated in a table keyed off anything passed to open(2), because there's no call that happens 1:1 with open(2) calls. Thus, you really can't key local state to a struct-file in a way where you can later look it up again. And there's nowhere inside a struct-file to stash a key for your local state, either. So, like I said, read(2) and write(2) are "stateless."


> There are! Keep in mind that "filesystems" includes things like NFS and SMB...

Of course; I would call those "augmentations". But each still has to implement the dreaded POSIX modes to be compliant.

Thanks for the point about there not being any per-open-file state in the filesystems themselves. I browsed around LXR a bit and couldn't find any either. That's insightful! (And I think it's a sensible design choice.)


One of the things filesystems provide is a mapping from paths to data (files, directories, etc.). How they provide this varies (a tree of path elements, etc.), but saying they don't think about paths (and implying they only think about path elements) is misleading. Nothing prevents a filesystem from simply putting all its path-to-data mappings into a hash table (except the slight difficulty of implementing some common filesystem operations).


That would make for some very expensive lookup operations on subdirectories (you have to scan the entire hash table) and would entirely preclude hardlinks. So yeah, usually people do not write byzantine filesystems :-).


I mean, you could add some indirection and some sort of directory data, if desired, to allow hard links and keep directory-listing performance reasonable.

This was mainly an example to show that "paths" in some sense are (or could be) handled by file systems, not an actual design I'd aspire to :)


Fair enough. In fact, I think Windows leaves path parsing and component lookup to individual filesystems. (Hence one filesystem per drive letter, and no mount points.)


Windows has mount points.


I don't think Windows NT did, which is about where my Windows knowledge ends.


But AFAIK, at the filesystem level, only NTFS supports them?


Maybe on different OSes, but like I said above, on Linux, many operations will literally just pass a filesystem driver an inode. Your hypothetical driver would need to keep a double-mapping: from inodes to paths, and then from paths to whatever else. At which point, why bother with the second mapping? It's just slowing things down.


The typical (generalized) setup is paths -> inodes (called "refers to data" or ""something"" in my other comments, to avoid the weird terminology we've somehow adopted for naming filesystem things), and inodes -> actual data on disk. Whether you actually have that layout on disk is up to the FS format; but regardless of the format, it is typically useful (and, as you note, generic filesystem interfaces like Linux's require it) to have this extra indirection when tracking things on the software side (i.e. not necessarily the storage side, though many filesystems do have this type of indirection on disk too, because it is also useful there).

> Your hypothetical driver would need to keep a double-mapping: from inodes to paths, and then from paths to whatever else.

As I've noted in another comment, inodes (`struct inode` in linux) typically contain a lot of information. No additional mapping would be required.


My point was that paths are part of the Linux VFS, and that lookup is already done and invisible by the time calls reach the FS driver. So there's no point in implementing an FS driver in terms of paths; it's redundantly working backward from something that was already done for you.


> Maybe on different OSes,

Perhaps on a mainframe OS or something else without a VFS, but all VFSes work pretty much the same way conceptually (BSD/*nix, Linux, Windows).


All the UNIX/Linux/POSIX functions which take output "char *" params without a length should have been moved to deprecated header files a long time ago. Like 1990 or so. It's not too late.


Sometimes I dream of a world where C had a built-in string type. Imagine how much time could have been saved, and how many crashes prevented.


The GNU Hurd approach to PATH_MAX is to set it to something ridiculous like SIZE_MAX, a size that cannot possibly be allocated, to illustrate to programmers that it is a fiction.

I don't think that's necessarily the best approach, but it matches reality more closely than typical Linux/BSD values (1024 or 4096).
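
The portable alternative to the constant is to ask at runtime with pathconf(3), which may report "no limit" by returning -1 without setting errno. A small sketch:

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        errno = 0;
        long max = pathconf("/", _PC_PATH_MAX);
        if (max < 0) {
            if (errno == 0)
                puts("no fixed path limit on this filesystem");
            else
                perror("pathconf");
        } else {
            /* Max length of a relative path with "/" as the base directory. */
            printf("max relative path under /: %ld\n", max);
        }
        return 0;
    }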



You are correct. I was thinking of the very similar standard C99 constant, FILENAME_MAX:

https://www.gnu.org/software/libc/manual/html_node/Limits-fo...


That seems like a bad idea because people will then use inconsistent max path lengths.


The idea is that you allocate, then resize larger if your initial buffer was too small. It doesn't matter what inconsistent length you start with, as long as you scale it up as needed.
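
For example, the classic grow-and-retry loop around getcwd(3), which reports a too-small buffer with ERANGE (the starting size is an arbitrary guess):

    #include <errno.h>
    #include <stdlib.h>
    #include <unistd.h>

    char *xgetcwd(void) {
        size_t size = 256;               /* arbitrary starting guess */
        for (;;) {
            char *buf = malloc(size);
            if (buf == NULL)
                return NULL;
            if (getcwd(buf, size) != NULL)
                return buf;
            free(buf);
            if (errno != ERANGE)         /* a real error, not "too small" */
                return NULL;
            size *= 2;                   /* grow and retry */
        }
    }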


I actually like how some Windows APIs that you pass a buffer to will tell you when the buffer was too small, along with the size needed to accommodate the result; it requires a bit less guessing.
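
GetCurrentDirectoryW is one example of that pattern: called with a zero-length buffer it returns the required size (in wchar_ts, including the terminator), so you can allocate exactly once in the common case. A sketch:

    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        DWORD needed = GetCurrentDirectoryW(0, NULL); /* required size */
        wchar_t *buf = malloc(needed * sizeof(wchar_t));
        if (buf != NULL && GetCurrentDirectoryW(needed, buf) > 0)
            wprintf(L"%ls\n", buf);
        free(buf);
        return 0;
    }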


It's still a race though - you have to loop (possibly indefinitely).


Ok, I didn't think of another thread changing the directory. Still, there's an upper bound of around 32K code units, so not indefinitely. Unless UNC paths aren't bounded; I don't know about that right now.


True, thought the wrong way about it.


It's funny because I know this but I'm pretty sure I forget it when coding...


That's a DoS in the kernel waiting to happen.


Userspace is responsible for constructing paths. Also, Hurd has some interesting views on what belongs in the kernel (and anything to do with paths probably doesn't). (I don't use or hack on Hurd, it's just an interesting example for PATH_MAX.)


I will still need to pass the path to the kernel. If I open a file with a 3 GB filename, the kernel will have an issue.


Why do you think it would be a problem?


What is the correct consistent max path length?


It simplifies implementations. For example, you typically need to be able to store paths in a contiguous area of memory or storage. Knowing that paths don't exceed a certain size means you know that a particular allocation strategy (static allocation, for a simple example) will be sufficient.


I don't question that using wrong values can simplify implementations. But if we're going to complain about people using inconsistent lengths, what's the consistent one they should be using? If you can't name one, then the inconsistency criticism is null.

As long as I'm already typing, the real problem here is that path lengths do not in fact fit into any practical statically-sized buffer. While statically-sized buffers have their utility and aren't going anywhere, they're an optimization, or a special case, in a world that is fundamentally dynamically-sized. In the 1970s it made sense for hardware reasons to treat "static buffer" as the default case and make dynamically-sized buffers the harder case, but that's not correct today on any level, code correctness, developer convenience, API cleanliness, ease of use, anything.

Yes, I understand that this is a kernel API and I am aware I'm making a deep structural criticism of UNIX kernels here, and that fixing it would be a significant challenge. It would take quite a bit of fundamental rethinking of how things work to do something like pass the kernel a function pointer to allocate a given bit of user-space RAM or something, or pass it a static buffer with a function to call on overflow, or something, and I am not claiming this would be easy.

(Note the distinction between "dynamically sized" and "arbitrarily large"; a 3GB path is 99.9999+% either some sort of bug or an attack, so having a total max path length has advantages too. But you can do something like make it 1MB, something very generous, without making everything that uses paths allocate 1MB per path.)


I can understand your side, but on the other hand... I'm sure there are a lot of (sub)systems even today that just can't have objects larger than 4096 bytes. Also, has 4096 bytes ever been limiting for you? I once heard from someone doing evil things and hitting NAME_MAX, but there's no problem just treating that rare case as some sort of I/O error. I/O can always fail anyway.


Mostly it's for stuff like readlink(), which returns a path in a user-supplied buffer. How big should the buffer be? The Linux man page actually suggests doing lstat() to size the buffer, with a note that this is susceptible to a race condition!


readlink(2) takes a buffer and the buffer's size, so it's safe. You just enter a race-controlled loop, as always.
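
Something like this (a sketch; the doubling and the start size are arbitrary). The result fits for certain only when readlink(2) returns strictly less than the buffer size, no matter how the target changes between calls:

    #include <stdlib.h>
    #include <unistd.h>

    char *xreadlink(const char *path) {
        size_t size = 128;
        for (;;) {
            char *buf = malloc(size);
            if (buf == NULL)
                return NULL;
            ssize_t n = readlink(path, buf, size);
            if (n < 0) {
                free(buf);
                return NULL;
            }
            if ((size_t)n < size) {      /* fit, with room to NUL-terminate */
                buf[n] = '\0';
                return buf;
            }
            free(buf);                   /* possibly truncated; retry larger */
            size *= 2;
        }
    }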


I should have been clearer: the complaint was about the API usability, not about safety. The POSIX API is a total mess when it comes to things like this.


This post got me curious, so I did a quick search on macOS 10.12. I found the values defined in "/System/Library/Frameworks/Kernel.framework/Headers/sys/syslimits.h". PATH_MAX is 1024, and NAME_MAX is 255.

There's also an amusing todo question that looks like it might've been there for at least close to 20 years now:

    #define	OPEN_MAX		10240	/* max open files per process - todo, make a config option? */


The essence of the article: PATH_MAX applies to the syscall interface. It's not related to file systems. Paths aren't a file system thing, but simply a convenient means of addressing files. Basically they are URLs for local resources.

And that totally makes sense once you understand that files are basically "objects" (in the OO sense) identified by inodes instead of memory addresses. A file system implements the graph of these objects (linked by special file objects called directories). The fact that one can cross file system boundaries using file paths also indicates that file paths are none of a file system's business.


> It's not related to file systems. Paths aren't a file system thing, but simply a convenient means of addressing files.

> The fact that one can cross file system boundaries using file paths also indicates that file paths are none of a file system's business.

The filesystem knows about file names, stores them, and puts limits on them (often 255 code units though some are lower — FAT16's 8.3, HFS's 31 — and some are higher — Reiser4's 3976 bytes).

A file path is nothing but a concatenation of a bunch of file names and separators ergo file paths are, in fact, an FS's business.

And while that's mostly fallen out of style there are still length-limited-path filesystems: ISO-9660 and UDF for instance.


> A file path is nothing but a concatenation of a bunch of file names and separators ergo file paths are, in fact, an FS's business.

This is a non sequitur.


> This constant [PATH_MAX] is defined by POSIX

Well, it’s allowed by POSIX. A POSIX compatible system doesn’t have to define it if it has no such inherent restriction on path lengths. Indeed, the GNU Hurd does not have such a restriction, and consequently does not define it. This leads to many porting adventures for those trying to compile a program on GNU Hurd, believing their source code to be correct for any POSIX-compliant system.
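
Hence the shim you'll find near the top of many ported programs; a blunt workaround, since the fallback number is arbitrary (4096 here is just Linux's usual value):

    #include <limits.h>

    /* Fallback for platforms (like the Hurd) that don't define PATH_MAX.
       Arbitrary: it papers over the fact that no real limit exists there. */
    #ifndef PATH_MAX
    #define PATH_MAX 4096
    #endif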


And to add to the confusion, unix domain sockets have a maximum path length of somewhere between 92 and 108 bytes; it's an implementation detail of the platform you're running on. This one in particular has bitten me before.
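
The limit is just the size of the sun_path array in struct sockaddr_un (108 on Linux, 104 on many BSDs), so you can check it directly rather than hardcoding it:

    #include <stdio.h>
    #include <sys/un.h>

    int main(void) {
        struct sockaddr_un sa;
        /* Includes room for the terminating NUL, so usable length is less. */
        printf("sun_path capacity on this platform: %zu\n", sizeof(sa.sun_path));
        return 0;
    }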


I played around with glibc's getcwd() some time ago. With strace one can easily see how it works. If the current path is longer than PATH_MAX, the getcwd syscall fails; then, as I recall, glibc falls back to walking up with '..' recursively, so it never has to pass a long relative path to any single syscall.

If there is a non-user-readable directory in the path then the fallback method fails but the getcwd syscall works if the path is short enough.

Bash also "cheats" by caching the working directory and updating it on 'cd' commands.


If I understand the conclusion of the article right, it's that we should actually just use PATH_MAX? In particular, he points to the glibc implementation of realpath as being very correct. But it (as the man page description says) appears to prefer the hard-coded value of PATH_MAX, querying the kernel's _PC_PATH_MAX value only when that constant is unavailable.

That's not what I would have expected. Did I miss something obvious?


That's tricky indeed. And doing things properly seems quite involved.

For now I'll keep my limits.h. At least until I get real-world bug reports telling me this is causing real-world issues :)



