Unix shells and the current directory

ChrisSD · on Nov 26, 2023

> Complicating this picture is shells. For a long time, many shells have kept track of a name for their current directory themselves, often materializing this in the '$PWD' environment variable. The shell has to keep track of this name as a text string or the rough equivalent, which makes it potentially less accurate than the kernel's version. However, it has some advantages, because unlike the kernel, the shell knows what name you typed in order to get to the directory, which may not be the actual filesystem name of the directory because of things like symbolic links. Shells often use this knowledge so that names like '..' and even '.' work on the text version, not the filesystem version.

Related reading: Lexical File Names in Plan 9 or Getting Dot-Dot Right (https://9p.io/sys/doc/lexnames.html)

seligman99 · on Nov 26, 2023

Along the same lines, in the Windows world:

The current directory is managed with SetCurrentDirectory/GetCurrentDirectory; however, the cmd.exe command-line shell also stores the current directory for each drive in an environment variable like "=C:", and the CRT and shell hide all environment variables that start with a "=".

It gets mightily confused if these two concepts of current directory ever diverge.

c0pium · on Nov 26, 2023

Who is still using cmd.exe? I understand that there are system processes that still need it, but if you see a human using cmd in the year of our lord 2023, that’s a cry for help!

schemescape · on Nov 26, 2023

What should I be using instead?

I don’t mind cmd.exe and it launches instantly (same reason I frequently use notepad.exe for quick edits). That latter quality is very hard to find :)

Edit: but if you meant for scripting, yeah, batch files are terrible.

switch007 · on Nov 26, 2023

On my Windows 10 with no profile it takes 1-2 seconds (Ryzen 3600/M2/32GB RAM). Like, what is it doing? I get annoyed if bash on Linux takes like 250ms.

schemescape · on Nov 27, 2023

Opening cmd.exe or PowerShell (or something else)?

switch007 · on Nov 27, 2023

PowerShell

hiccuphippo · on Nov 26, 2023

Personally I use the bash that comes with git for Windows. I only need to use cmd.exe for creating symlinks since mklink is a built-in.

PhilipRoman · on Nov 26, 2023

It's installed everywhere on any version of Windows and works fine for interactive tasks (personally I wouldn't write anything but the simplest scripts for it; anything with for loops is a big no-no).

vel0city · on Nov 26, 2023

PowerShell is installed everywhere on any version of Windows that still receives security updates.

toast0 · on Nov 26, 2023

I do. I don't like PowerShell (and it took me years to realize it wasn't a diagnostic tool for power management), and I find bash for Windows to be ill fitting. I don't do a lot of stuff in the command line on Windows, so working like it has for decades is a plus.

comex · on Nov 26, 2023

> Shells often use this knowledge so that names like '..' and even '.' work on the text version, not the filesystem version.

Which has the odd result that '..' behaves differently between shell builtins and normal commands. `cd ..; ls` uses the text version, but `ls ..` uses the filesystem version. `cat < ../x` uses the text version, but `cat ../x` uses the filesystem version.

I like the text behavior in theory, but this inconsistency is weird enough that I question the benefit of having the text behavior at all.
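A minimal sketch of the two interpretations of '..', written in Python so both can be computed side by side. It assumes a POSIX system where symlinks can be created; the temporary layout (`real/sub`, `link`) and the function names are made up for illustration and are not what any shell actually runs.

```python
import os
import tempfile


def logical_parent(logical_cwd):
    """Resolve '..' on the text of the path, roughly what `cd ..` does."""
    return os.path.normpath(os.path.join(logical_cwd, ".."))


def physical_parent(logical_cwd):
    """Resolve '..' through the filesystem, roughly what `ls ..` sees."""
    return os.path.realpath(os.path.join(logical_cwd, ".."))


def demo():
    # Hypothetical layout: <tmp>/real/sub is a directory, <tmp>/link -> <tmp>/real/sub
    root = os.path.realpath(tempfile.mkdtemp())
    os.makedirs(os.path.join(root, "real", "sub"))
    link = os.path.join(root, "link")
    os.symlink(os.path.join(root, "real", "sub"), link)
    # Pretend the shell ran `cd <tmp>/link`, so its $PWD would be <tmp>/link.
    return logical_parent(link), physical_parent(link)
```

Here the logical parent is the temporary root (where `link` lives), while the physical parent is `<tmp>/real`, the parent of the symlink's target.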

Someone · on Nov 26, 2023

Apart from history and standards, what is the reason for having the path to the current directory or even the current directory known to the kernel?

The shell already seems to track it, so presumably, that logic could have been part of the standard library, and get tracked from user-mode.

If the kernel has to track the current directory (e.g. for performance reasons, to make accessing files relative to a particular directory more efficient), wouldn’t just remembering the device ID and inode be easier for the kernel?

Alternatively, there could be kernel calls taking (device, inode) pairs, and the kernel could be completely ignorant of the ‘current directory’ concept.

That can work; except for naming them ‘directory ID’ instead of ‘inode’, that’s what the first Mac OS hierarchical file system did; paths were second-class citizens here.

o11c · on Nov 26, 2023

For accessing files, the kernel does keep an open FD of sorts (better than "dev, inode" pair).

But you can't punt this entirely to the shell - the shell has to look at the kernel's idea of the current directory name at startup; all the in-shell tracking can only be done for subsequent changes.

One major caveat is that the kernel's API stupidly relies on a single-step global record and is limited to one page (usually 4096 bytes), rather than reconstructing it component-by-component. So if you change into a deeply nested folder, `getcwd` falls back to the `open(".."); readdir` loops. Of course if `PWD` is set correctly it can be used, but if it's not canonical you might have to do the nasty version later.

A more subtle caveat is all the possible end cases:

* you reach the current mount namespace's sense of `/`

* you reach the current mount namespace's sense of `//`, if your environment supports such a thing (note that `readdir` likely fails at the last level though!).

* you reach some other sense of `/` (e.g. from an FD kept open across `chdir`, or an FD passed across a Unix socket from a different mount namespace)

* the directory was not found in the readdir loop (a directory moved due to a race condition, or special filesystems that aren't fully enumerable - this includes /proc/ if you use a thread ID directly - this has a different inode than the main PID which it mostly acts like!).
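The `open(".."); readdir` fallback and two of the end cases above can be sketched roughly as follows. This is an illustration only, not how any particular libc implements `getcwd`; the name `slow_getcwd` is hypothetical, and the sketch assumes every ancestor directory is readable.

```python
import os


def slow_getcwd(start="."):
    """Rebuild an absolute path by scanning each parent for our (dev, inode)."""
    parts = []
    here = start
    while True:
        st = os.stat(here)
        parent = os.path.join(here, "..")
        pst = os.stat(parent)
        if (st.st_dev, st.st_ino) == (pst.st_dev, pst.st_ino):
            break  # '..' is the same directory: this is our sense of '/'
        for entry in os.scandir(parent):
            try:
                est = entry.stat(follow_symlinks=False)
            except OSError:
                continue  # entry vanished or is not stat-able; keep scanning
            if (est.st_dev, est.st_ino) == (st.st_dev, st.st_ino):
                parts.append(entry.name)
                break
        else:
            # Moved by a race, or a filesystem that is not fully enumerable.
            raise FileNotFoundError("directory not found in its parent")
        here = parent
    return "/" + "/".join(reversed(parts))
```

The sketch only handles the "reached `/`" and "not found in the readdir loop" cases; it does not try to tell different senses of `/` apart.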

siebenmann · on Nov 26, 2023

The current directory is a long-standing Unix concept, so you'd have to trace its history back quite far to hear arguments about why it was there. One obvious reason is that relative paths are convenient for all sorts of reasons and they require a point to be relative to, which is basically 'the current directory' in some form.

The kernel knowing the name for the current directory is not specific to current directories; it is part of a general system of caching the name mappings for directory entries ('dentries' in Linux, a 'name cache' in FreeBSD). Unix kernels added these caches because Unix programs spend a lot of time looking up names, making the operation worth optimizing in general. Once you have a general name cache, you might as well pin the entries for actively used entities like current directories and open files so that they don't get expired out of the cache and you always know (some) name for them.

(One useful complexity of name caches is that you can cache negative entries, i.e. that a given name is not present in a directory. In the modern Unix shared library environment where shared libraries may be probed for in a whole collection of directories every time a program starts up, I suspect this saves a nice chunk of kernel CPU time.)

didntcheck · on Nov 26, 2023

Yeah, I remember being surprised by two things when I first started learning about Unix implementation

* That $PATH is just an ordinary env var, that many programs use by convention

* That CWD isn't, and is in fact a first-class kernel concept. I had assumed that it was just a conventional envvar that stdlibs prepended before passing absolute paths to syscalls

I'm sure there's good reasons why the other way wouldn't work, it just amused me that I'd got it wrong in both ways

toast0 · on Nov 26, 2023

A reference to the current directory is needed in order to open relative filenames. You could conceivably retain a string path rather than a reference, but the behavior would be different when directories are renamed or unlinked.
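A small sketch of that difference, assuming a POSIX system that allows renaming a directory while a process has it as its current directory. The directory names `old` and `new` and the function name are made up for illustration.

```python
import os
import tempfile


def rename_under_cwd():
    root = os.path.realpath(tempfile.mkdtemp())
    old = os.path.join(root, "old")
    new = os.path.join(root, "new")
    os.mkdir(old)
    saved_cwd = os.getcwd()
    try:
        os.chdir(old)
        remembered = os.getcwd()          # a string snapshot of the path
        os.rename(old, new)               # rename the directory under us
        with open("note.txt", "w") as f:  # relative open uses the kernel's reference
            f.write("hi")
        return {
            "remembered": remembered,
            "kernel_now": os.getcwd(),
            "remembered_exists": os.path.isdir(remembered),
            "file_in_new": os.path.exists(os.path.join(new, "note.txt")),
        }
    finally:
        os.chdir(saved_cwd)
```

The relative open still succeeds and the file ends up in the renamed directory, whereas the remembered string no longer names anything. On Linux, `os.getcwd()` would be expected to report the new name afterwards.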

cryptonector · on Nov 26, 2023

The kernel doesn't have to know the current directory's path. It's enough that it know -and retain an open file reference to- the current directory's inode/dnode.

However, if you want things like DTrace, eBPF, or even just reading the /proc/PID/cwd symlink to be useful, it helps to cache the actual path in the kernel. A DTrace/eBPF script will not be able to loop to chase ..s, much less will it be able to do the I/O needed to work out the cwd.

The same applies to the names of the files that each FD refers to.

Caching these things is just for observability.

benou · on Nov 26, 2023

My guess is the kernel needs to know the current directory of a process so that when said process tries to open a file without an absolute path (e.g. just "file.txt" and not "/tmp/file.txt"), it can open "$CWD/file.txt".

This must be tracked by the kernel, because not all syscalls go through libc; you can issue the open syscall directly from a process.

There might be other reasons, but I'd bet it's the main one.
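For comparison, a rough sketch of resolving a relative name against an explicit directory reference instead of the process-wide current directory. It assumes a platform where Python's `os.open` supports `dir_fd` (the `openat` family); `read_relative` is a hypothetical helper name.

```python
import os


def read_relative(dirpath, name):
    """Open `name` relative to a directory FD rather than the cwd."""
    dfd = os.open(dirpath, os.O_RDONLY)  # reference to the directory itself
    try:
        # Equivalent in spirit to openat(dfd, name, O_RDONLY). A plain
        # os.open(name, ...) would resolve against the kernel's cwd instead.
        fd = os.open(name, os.O_RDONLY, dir_fd=dfd)
        try:
            return os.read(fd, 4096)
        finally:
            os.close(fd)
    finally:
        os.close(dfd)
```

Either way the kernel starts the lookup from a directory it already holds a reference to; the current directory is just the implicit default.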

tyingq · on Nov 26, 2023

Where it exists, there's also /proc/self/cwd, so you can do: readlink -e /proc/self/cwd
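The same lookup from a program, as a minimal sketch. It assumes a Linux-style /proc; `kernel_cwd` is a hypothetical name, and it returns None where the link is absent or unreadable.

```python
import os


def kernel_cwd():
    """Return the kernel's name for our cwd via /proc, or None if unavailable."""
    try:
        return os.readlink("/proc/self/cwd")
    except OSError:
        return None
```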

rezonant · on Nov 26, 2023

While readlink is the correct way to just read the link, I often just use `file ...` since it will also show the symbolic link destination (on GNU at least).

didntcheck · on Nov 26, 2023

ls -l also works. I often do

  ls -l /proc/`pidof foo`/fd

to see what files a taciturn process is working on
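A rough Python equivalent of that listing, again assuming a Linux-style /proc. The function name `open_files` is made up for illustration, and reading another process's `fd` directory needs suitable permissions.

```python
import os


def open_files(pid="self"):
    """Map FD number to link target, like `ls -l /proc/<pid>/fd`."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # FD was closed between listdir and readlink
    return result
```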

bobbyi · on Nov 27, 2023

lsof is also good for seeing what files a process is using https://www.thegeekstuff.com/2012/08/lsof-command-examples/

demondemidi · on Nov 27, 2023

After reading this, I recall how weird it was to me to think of a process as a person inhabiting a planetary directory tree. I think this metaphor is what kept me from really understanding processes the first time I touched an AIX system coming from my Apple DOS / MS-DOS world. Prior to meeting AIX, there was no persistent directory structure; my first computer didn't even have a tape drive (Commodore PET), so the analogy never even took shape.

I mean, the article talks about the "what", but it makes me wonder about the "why".

1vuio0pswjnm7 · on Nov 26, 2023

"Sometimes people then write shell scripts and other code that assumes '$PWD' is accurate if it's present, which is not necessarily true."

Unless one only wants the current directory name at the start of a script, why not just use the builtin pwd command, $(pwd)? Or getcwd() if it's "other code".
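For the "other code" case, a minimal sketch of one way an inherited `$PWD` can go stale: a program calls `chdir` without updating its environment. The function name is hypothetical, and the `PWD` value is set by hand here to stand in for what a shell would have exported.

```python
import os
import tempfile


def pwd_after_chdir():
    saved_cwd = os.getcwd()
    saved_pwd = os.environ.get("PWD")
    target = os.path.realpath(tempfile.mkdtemp())
    try:
        os.environ["PWD"] = saved_cwd  # stand-in for the shell's exported value
        os.chdir(target)               # chdir does not touch the environment
        return os.environ["PWD"], os.getcwd(), target
    finally:
        # Restore the process state so the sketch has no lasting side effects.
        os.chdir(saved_cwd)
        if saved_pwd is None:
            del os.environ["PWD"]
        else:
            os.environ["PWD"] = saved_pwd
```

After the `chdir`, `os.getcwd()` reports the new directory while `PWD` still holds the old one, and any child process would inherit that stale value.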

o11c · on Nov 27, 2023

`$()` creates a subshell (a whole separate process) so is significantly slower than mere string manipulation.

`$PWD` is always set accurately at startup (for both interactive and non-interactive shells) in bash, dash, zsh, ksh93, mksh, and busybox ash.

So I'm really not sure where this assumption can be violated.