Unix shells and the current directory (utcc.utoronto.ca)
99 points by ingve on Nov 26, 2023 | 26 comments



> Complicating this picture is shells. For a long time, many shells have kept track of a name for their current directory themselves, often materializing this in the '$PWD' environment variable. The shell has to keep track of this name as a text string or the rough equivalent, which makes it potentially less accurate than the kernel's version. However, it has some advantages, because unlike the kernel, the shell knows what name you typed in order to get to the directory, which may not be the actual filesystem name of the directory because of things like symbolic links. Shells often use this knowledge so that names like '..' and even '.' work on the text version, not the filesystem version.

Related reading: Lexical File Names in Plan 9 or Getting Dot-Dot Right (https://9p.io/sys/doc/lexnames.html)
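A minimal sketch of the two names described in the quote, assuming a POSIX-style shell whose `pwd` accepts `-L` and `-P`. The scratch layout (`real/dir`, `link`) is made up for illustration.

```shell
#!/bin/sh
# Hypothetical scratch layout: real/dir is the true directory,
# link is a symlink pointing at it.
tmp=$(mktemp -d)
tmp=$(cd "$tmp" && pwd -P)   # canonicalize, in case the temp dir sits behind a symlink
mkdir -p "$tmp/real/dir"
ln -s "$tmp/real/dir" "$tmp/link"

cd "$tmp/link"
logical=$(pwd -L)    # the text name the shell remembers (what was typed)
physical=$(pwd -P)   # the name with symlinks resolved through the filesystem

echo "logical:  $logical"
echo "physical: $physical"

cd /
rm -rf "$tmp"
```

The logical name should end in `/link` and the physical one in `/real/dir`; both refer to the same directory.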


Along the same lines, in the Windows world:

The current directory is managed with SetCurrentDirectory/GetCurrentDirectory; however, the cmd.exe command-line shell also stores the current directory for each drive in an environment variable like "=C:", and the CRT and shell hide all environment variables that start with a "=".

It gets mightily confused if these two concepts of current directory ever diverge.


Who is still using cmd.exe? I understand that there are system processes that still need it, but if you see a human using cmd in the year of our lord 2023, that’s a cry for help!


What should I be using instead?

I don’t mind cmd.exe and it launches instantly (same reason I frequently use notepad.exe for quick edits). That latter quality is very hard to find :)

Edit: but if you meant for scripting, yeah, batch files are terrible.


On my Windows 10 with no profile it takes 1-2 seconds (Ryzen 3600/M2/32GB RAM). Like, what is it doing? I get annoyed if bash on Linux takes like 250ms.


Opening cmd.exe or PowerShell (or something else)?


PowerShell


Personally I use the bash that comes with git for Windows. I only need to use cmd.exe for creating symlinks since mklink is a built-in.


It's installed everywhere on any version of Windows and works fine for interactive tasks (personally I wouldn't write anything but the simplest scripts for it; anything with for loops is a big no-no).


PowerShell is installed everywhere on any version of Windows that still receives security updates.


I do. I don't like PowerShell (and it took me years to realize it wasn't a diagnostic tool for power management), and I find bash for Windows to be ill fitting. I don't do a lot of stuff in the command line on Windows, so working like it has for decades is a plus.


> Shells often use this knowledge so that names like '..' and even '.' work on the text version, not the filesystem version.

Which has the odd result that '..' behaves differently between shell builtins and normal commands. `cd ..; ls` uses the text version, but `ls ..` uses the filesystem version. `cat < ../x` uses the text version, but `cat ../x` uses the filesystem version.

I like the text behavior in theory, but this inconsistency is weird enough that I question the benefit of having the text behavior at all.
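A rough sketch of the `cd ..` versus `ls ..` half of that inconsistency, assuming a shell whose `cd` defaults to the logical (text) behavior, as POSIX specifies. The directory names are invented for the demo, and it does not cover the redirection case.

```shell
#!/bin/sh
# Hypothetical layout: b/link is a symlink to a/target, so the textual parent
# of b/link is b, while the filesystem parent of a/target is a.
tmp=$(mktemp -d)
tmp=$(cd "$tmp" && pwd -P)
mkdir -p "$tmp/a/target" "$tmp/b"
touch "$tmp/a/in_a" "$tmp/b/in_b"
ln -s "$tmp/a/target" "$tmp/b/link"

cd "$tmp/b/link"
external=$(ls ..)            # ls hands ".." to the kernel, which lists a
builtin_view=$(cd .. && ls)  # cd strips "/link" from the text name, so ls lists b

echo "ls ..     -> $external"
echo "cd ..; ls -> $builtin_view"

cd /
rm -rf "$tmp"
```

Using `cd -P ..` instead should make the builtin agree with `ls ..`.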


Apart from history and standards, what is the reason for having the path to the current directory or even the current directory known to the kernel?

The shell already seems to track it, so presumably, that logic could have been part of the standard library, and get tracked from user-mode.

If the kernel has to track the current directory (e.g. for performance reasons, to make accessing files relative to a particular directory more efficient), wouldn’t just remembering the device ID and inode be easier for the kernel?

Alternatively, there could be kernel calls taking (device, inode) pairs, and the kernel could be completely ignorant of the ‘current directory’ concept.

That can work; except for naming them ‘directory ID’ instead of ‘inode’, that’s what the first Mac OS hierarchical file system did; paths were second-class citizens here.


For accessing files, the kernel does keep an open FD of sorts (better than "dev, inode" pair).

But you can't punt this entirely to the shell - the shell has to look at the kernel's idea of the current directory name at startup; all the in-shell tracking can only be done for subsequent changes.

One major caveat is that the kernel's API stupidly returns the whole path in a single step and is limited to one page (usually 4096 bytes), rather than letting you reconstruct it component by component. So if you change into a deeply nested folder, `getcwd` falls back to the `open(".."); readdir` loops. Of course if `PWD` is set correctly it can be used, but if it's not canonical you might have to do the nasty version later.

A more subtle caveat is all the possible end cases:

* you reach the current mount namespace's sense of `/`

* you reach the current mount namespace's sense of `//`, if your environment supports such a thing (note that `readdir` likely fails at the last level though!).

* you reach some other sense of `/` (e.g. from an FD kept open across `chdir`, or an FD passed across a Unix socket from a different mount namespace)

* the directory was not found in the readdir loop (a directory moved due to a race condition, or special filesystems that aren't fully enumerable - this includes /proc/ if you use a thread ID directly - this has a different inode than the main PID which it mostly acts like!).
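The fallback walk can be sketched in shell. This is an illustration under stated assumptions, not how any libc actually implements it: `slow_getcwd` is a made-up name, and it relies on `test` supporting `-ef` (same device and inode), which bash and dash do. It covers only the first end case (reaching `/`) and the "not found" case from the list above.

```shell
#!/bin/sh
# slow_getcwd: hypothetical helper that rebuilds the physical path by climbing
# ".." and searching each parent for the entry that is the directory we just left.
slow_getcwd() (
  path=
  while ! [ . -ef .. ]; do            # at the root, "." and ".." are the same directory
    found=
    for entry in ../* ../.[!.]* ../..?*; do
      # Skip symlinks: -ef follows them, and a link to "." would match wrongly.
      if [ -d "$entry" ] && [ ! -L "$entry" ] && [ "$entry" -ef . ]; then
        found=${entry#../}
        break
      fi
    done
    # The "directory was not found in the readdir loop" end case.
    [ -n "$found" ] || { echo "slow_getcwd: not found in parent" >&2; exit 1; }
    path="/$found$path"
    cd -P .. || exit 1
  done
  printf '%s\n' "${path:-/}"
)

# Usage: compare against the shell's own physical answer.
tmp=$(mktemp -d)
mkdir -p "$tmp/deep/er"
cd "$tmp/deep/er"
walked=$(slow_getcwd)
builtin_answer=$(pwd -P)
echo "walked: $walked"
echo "pwd -P: $builtin_answer"
cd / && rm -rf "$tmp"
```

The function body runs in a subshell, so the `cd -P ..` steps do not move the calling shell.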


The current directory is a long-standing Unix concept, so you'd have to trace its history back quite far to hear arguments about why it was there. One obvious reason is that relative paths are convenient for all sorts of reasons and they require a point to be relative to, which is basically 'the current directory' in some form.

The kernel knowing the name for the current directory is not specific to current directories; it is part of a general system of caching the name mappings for directory entries ('dentries' in Linux, a 'name cache' in FreeBSD). Unix kernels added these caches because Unix programs spend a lot of time looking up names, making the operation worth optimizing in general. Once you have a general name cache, you might as well pin the entries for actively used entities like current directories and open files so that they don't get expired out of the cache and you always know (some) name for them.

(One useful complexity of name caches is that you can cache negative entries, ie that a given name is not present in a directory. In the modern Unix shared library environment where shared libraries may be probed for in a whole collection of directories every time a program starts up, I suspect this saves a nice chunk of kernel CPU time.)


Yeah, I remember being surprised by two things when I first started learning about Unix implementation:

* That $PATH is just an ordinary env var, that many programs use by convention

* That CWD isn't, and is in fact a first-class kernel concept. I had assumed that it was just a conventional envvar that stdlibs prepended before passing absolute paths to syscalls

I'm sure there's good reasons why the other way wouldn't work, it just amused me that I'd got it wrong in both ways


A reference to the current directory is needed in order to open relative filenames. You could conceivably retain a string path rather than a reference, but the behavior would be different when directories are renamed or unlinked.
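A small sketch of that difference, using the shell's own `$PWD` as the "string path" and the kernel's reference as seen through `pwd -P`. The directory names `old` and `new` are invented, and the exact behavior of `pwd -P` after a rename is assumed from how bash and dash appear to handle it.

```shell
#!/bin/sh
# Rename the current directory out from under the shell and compare the
# remembered string with what the kernel's reference resolves to.
tmp=$(mktemp -d)
tmp=$(cd "$tmp" && pwd -P)
mkdir "$tmp/old"
cd "$tmp/old"

mv "$tmp/old" "$tmp/new"   # the process stays in the same directory, now renamed

stale=$PWD                 # the string the shell kept; still the old name
actual=$(pwd -P)           # recomputed from the kernel's reference

echo "string:    $stale"
echo "reference: $actual"

cd /
rm -rf "$tmp"
```

Relative opens keep working after the rename because they go through the reference, while anything that pastes `$PWD` into an absolute path would now point at a name that no longer exists.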


The kernel doesn't have to know the current directory's path. It's enough that it know -and retain an open file reference to- the current directory's inode/dnode.

However, if you want things like DTrace, eBPF, or even just reading the /proc/PID/cwd symlink to be useful, it helps to cache the actual path in the kernel. A DTrace/eBPF script will not be able to loop to chase ..s, much less will it be able to do the I/O needed to work out the cwd.

The same applies to the names of the files that each FD refers to.

Caching these things is just for observability.


My guess is the kernel needs to know the current directory of a process so that when said process tries to open a file without an absolute path (e.g. just "file.txt" and not "/tmp/file.txt"), it can open "$CWD/file.txt".

This must be tracked by the kernel, because not all syscalls go through libc; you can issue the open syscall directly from a process.

There might be other reasons, but I'd bet it's the main one.


Where it exists, there's also /proc/self/cwd, so you can do: readlink -e /proc/self/cwd


While readlink is the correct way to just read the link, I often just use `file ...` since it will also show the symbolic link destination (on GNU at least).


ls -l also works. I often do

  ls -l /proc/`pidof foo`/fd
to see what files a taciturn process is working on


lsof is also good for seeing what files a process is using https://www.thegeekstuff.com/2012/08/lsof-command-examples/


After reading this, I recall how weird it was to me to think of a process as a person inhabiting a planetary directory tree. I think this metaphor is what kept me from really understanding processes the first time I touched an AIX system coming from my Apple DOS / MS-DOS world. Prior to meeting AIX, there was no persistent directory structure; my first computer didn't even have a tape drive (Commodore PET), so the analogy never even took shape.

I mean, the article talks about the "what", but it makes me wonder about the "why".


"Sometimes people then write shell scripts and other code that assumes '$PWD' is accurate if it's present, which is not necessarily true."

Unless one only wants the current directory name at the start of a script, why not just use the builtin pwd command, $(pwd)? Or getcwd() if it's "other code".


`$()` creates a subshell (a whole separate process) so is significantly slower than mere string manipulation.

`$PWD` is always set accurately at startup (for both interactive and non-interactive shells) in bash, dash, zsh, ksh93, mksh, and busybox ash.

So I'm really not sure where this assumption can be violated.



