Why “process substitution” is a late feature in Unix shells (utcc.utoronto.ca)
168 points by r4um on Jan 7, 2022 | 82 comments



Fun fact, in bash you lose the exit codes of process subs. It doesn't even wait() on those processes in many cases.

So there is no way to abort a bash script if something like <(sort nonexistent) fails.

OSH lets you opt into stricter behavior:

    $ osh -c '
    set -e
    shopt -s oil:basic

    diff <(sort left) <(sort nonexistent)
    echo "should not get here"
    '
    sort: cannot read: nonexistent: No such file or directory

    diff <(sort left) <(sort nonexistent)
                      ^~
    [ -c flag ]:1: fatal: Exiting with status 2 (command in PID 29359)

In contrast, bash will just keep going and ignore failure.
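A quick way to see that behaviour (my own sketch, plain bash, not from the OSH docs):

  cat <(sort nonexistent)
  echo "status seen by the script: $?"   # prints 0: that's cat's status, not sort's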

You can also get all the exit codes with @_process_sub_status, which is analogous to PIPESTATUS.

(I should probably write a blog post about this: http://www.oilshell.org/blog/)


> So there is no way to abort a bash script if something like <(sort nonexistent) fails.

The process ID of the last executed background command in Bash is available as $!.

  cat <(sort nonexistent)
  wait $! || echo fail
gives

  sort: cannot read: nonexistent: No such file or directory
  fail


> The process ID of the last executed background command in Bash is available as $!.

After a quick search it doesn't seem like this is an option for anything but the last executed background process. So if you have the typical `diff <(left) <(right)`, there is no easy way to get the pid of `<(left)`.

I suppose you could output the pid to a file, and use it from there, though I haven't tested if that would work.
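A related workaround (a sketch of the same idea, untested beyond the basics, with made-up file names): have each substitution write its exit status to a temp file rather than its PID:

  status_dir=$(mktemp -d)
  diff <(sort left;        echo $? > "$status_dir/left") \
       <(sort nonexistent; echo $? > "$status_dir/right")
  cat "$status_dir/left" "$status_dir/right"   # e.g. 0 and 2
  rm -r "$status_dir"

There's still a small race if the reading command bails out before the substitutions finish, so it's not bulletproof either.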


A fun trick is using the `paste` command with process substitution to combine lines of output from different commands.

  ./cmd1
outputs:

  a
  b
  c

  ./cmd2
outputs:

  1
  2
  3

  paste <(./cmd1) <(./cmd2)
outputs:

  a 1
  b 2
  c 3

You can then use xargs to use each line as arguments for another command:

  paste <(./cmd1) <(./cmd2) | xargs -L1 ./cmd3
calls:

  ./cmd3 a 1
  ./cmd3 b 2
  ./cmd3 c 3


Should’ve been named zip. Is there an unzip?

But of course it can’t be named zip because that’s the compressor, so I’m really asking whether there’s an unpaste.

I guess the closest is awk {print $1;}, but it makes me wince every time I write it. (I already know that would give syntax errors as written here.)


Now you've got me curious which came first, "paste" or "zip".

Python's zip function was added in PEP 201 [1], which credits the Haskell 98 report with the name. The same function appears in all prior versions of the Haskell language that I could find, dating back to Haskell 1.0 in 1990.

On the other hand, the GNU coreutils version of paste[2] has copyright dates going all the way back to 1984, which presumably means it was derived from UNIX System V or an even earlier version.

[1]: https://www.python.org/dev/peps/pep-0201/

[2]: https://github.com/coreutils/coreutils/blob/master/src/paste...


The FreeBSD man page for `paste` claims it first appeared in Unix 32V (1979; derived from v7 Unix, and ancestral to both 4BSD and SysV).


Here's the commit from 1978: https://github.com/dspinellis/unix-history-repo/blob/Bell-32...

The comments refer to an "old 127 paste command" but I have no idea what that is.


I hadn't pursued it further than seeing that it was added in a 1978 commit.

Looking at it further, there are a few answers at https://minnie.tuhs.org/pipermail/tuhs/2020-January/019955.h...

> GWRL stands for Gottfried W. R. Luderer, the author of cut(1) and paste(1), probably around 1978. Those came either from PWB or USG

I can verify that neither cut nor paste were in PWB 1: https://minnie.tuhs.org/cgi-bin/utree.pl?file=PWB1

> Also "127" was the internal department number for the Computer Science Research group at Bell Labs where UNIX originated


The Miranda language (which predates Haskell) had a zip function. According to Wikipedia, Miranda was released in 1985, but I don't know if zip was in the standard Miranda environment back then. The comments in the source code say:

    The following is included for compatibility with Bird and Wadler (1988).
    The normal Miranda style is to use the curried form `zip2'.

    > zip :: ([*],[**])->[(*,**)]
    > zip (x,y) = zip2 x y
which suggests a slightly later origination of zip.


It's possible a Lisp of some sort predated the Haskell version.


If you know the separator, you can use `cut` -- not an unzip per se (it doesn't give you the full separated lists without calling it multiple times) but roughly equivalent to the awk line.


    alias awk1="awk '{print \$1;}'"
Counting to 9, inclusive, is left as an exercise for the reader.
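Or let the shell do the counting (a throwaway sketch; works in interactive bash or zsh):

  for i in {1..9}; do
    alias "awk$i"="awk '{print \$$i}'"
  done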


Amusingly, I apparently added these to scrap https://github.com/shawwn/scrap many years ago.

I named them 1st, 2nd, 3rd, and 4th. So I guess I never needed to count to 5, but I'm still two ahead of Valve. https://www.youtube.com/watch?v=jpw2ebhTSKs

EDIT: Actually, I'd never committed 4th. It was just sitting here in my local scrap. So I guess I counted to three, realized at some later date I needed four, copied 3rd to 4th, then never committed.

Oh. Apparently I also have 'nth', which was exactly what I wanted:

  > echo a b c d | nth 0 3
  a d
The solution I came up with at the time:

  #!/bin/bash

  c=""
  for n in "$@"
  do
    n=$((n+1))
    args="${args}${c}\$$n"
    c=', '
  done
  awk "{ print ${args}; }"

Nowadays I'd write it like:

  awk "{ print $(echo "$@" | args math '1+{}' | replace '^' '$' | trim | joinlines ', '); }"

But I admit both are equally cryptic, and the former might even be more readable.


    awp () {
     local a
     a=()
     while [[ -z ${1:#-*} ]]
     do
      a=($a $1)
      shift
     done
     local n
     n=$1
     shift
     awk $a '{print $'$n'}' $*
    }


Depending on the data, the right options for xargs would work, and there's the pr command I haven't used much.


Hmm. Does xargs give you a way to skip the first arg? xargs -n 2 -I {} almost works, but you’d need a way to extract a specific arg out of the {}. Which I just realized I don’t know how to do.

“(at 0 {})” would be a nice construction.

But then of course now you’re writing xargs -n 2 -I {} echo (at 1 {}) instead of just awk {print $2}, so it’s mission failed successfully.
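One hedged way to pull an individual arg out with xargs is to hand each line to a tiny sh -c script (no shorter than the awk, admittedly):

  paste <(./cmd1) <(./cmd2) | xargs -L1 sh -c 'echo "$2"' _

With the example above this should print 1, 2, 3; the `_` fills $0 so the pasted fields land in $1 and $2.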


If you're just looking for a single column, then as zamfi mentioned, cut does the job:

  $ paste <(printf "a\nb\nc\n") <(printf "1\n2\n3\n") | cut -f2
  1
  2
  3
It makes sense that "cut" and "paste" would be inverse operations, but too bad that GUIs repurposed these words for clipboard operations.

But awk can do both, and does so in a single pass:

  $ paste <(printf "a\nb\nc\n") <(printf "1\n2\n3\n") | awk '{ print $1 > "letters"; print $2 > "numbers" }'
  $ cat letters
  a
  b
  c
  $ cat numbers
  1
  2
  3


Kinda-sorta related, you can pipe to a multivariable loop:

    paste <(printf "a\nb\nc\n") <(printf "1\n2\n3\n") | while read x y; do echo "$y -- $x"; done

Kinda silly for this example, but very useful in things like `kubectl` pipelines -- you can do something like this:

    kubectl get pods --all-namespaces | awk '{print $1 " " $2}' | grep "doesntexist" | while read x y; do kubectl delete pod $y --namespace $x --wait=false; done
This lets you list all pods in a cluster, grep a selection of them by name, and delete them in a batch, regardless of which namespace they are in, while shuffling the order of args into different flags.


Sorry, but my shellcheck OCD forces me to say "read without -r will mangle backslashes." ;)


Most useful comment I’ve read in months, for me. Cheers!


TIL you can pipe into files in awk. I've never come across this feature somehow. Is it commonly used?


What! I didn’t know awk could do that within single quotes. The heck, it implements a subset of sh?

Thanks!


awk has some basic redirection features.

gawk documentation: https://web.mit.edu/gnu/doc/html/gawk_6.html#SEC41

posix documentation: https://pubs.opengroup.org/onlinepubs/009695299/utilities/aw...


> Is there an unzip?

xargs printf


How do you know that the outputs a, b, and c correspond to outputs 1, 2, and 3, and that they will always occur in the same order? This technique seems like it's full of possibly invalid assumptions.


I love zsh’s take on process substitution:

  cat =(echo foo)
It’s almost identical to <(echo foo), but crucially it drains the output before performing the substitution. It’s like <(... | sponge), but somehow much more reliable.

I’ve used it in a few situations where process substitution failed, though being old and feeble I can’t remember why it failed. But for example you can copy the file safely, knowing that it will contain the full output of the shell pipeline, unlike with process substitution.

I don’t even know the name of =(...). Hmm. Sponge substitution?


Thought of an example.

  cp <( ( echo foo; sleep 1; echo bar) | sponge ) foobar
vs

  cp =( ( echo foo; sleep 1; echo bar) | sponge ) foobar
The first will complete instantly and result in a file named foobar whose contents are either empty or just "foo", depending on how unlucky you are.

The second takes one second to complete and results in a file named foobar containing foobar.

It’s strange that vanilla process substitution can’t seem to solve this at all.


I'm not sure what you're trying to say here.

With GNU cp on Linux, the commands

  cp <( ( echo foo; sleep 1; echo bar) | sponge ) foobar
  cp =( ( echo foo; sleep 1; echo bar) | sponge ) foobar
both have the same result: the file "foobar" containing the string "foo\nbar\n".


I didn't test it. You're right, I'm simply mistaken.

Here's a better (but awkward) illustration of the problem:

  :~$ python3 -c 'import sys; import shutil; shutil.copy(sys.argv[1], "foobar")' <(echo foo ; sleep 1 ; echo bar)
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/opt/homebrew/Cellar/python@3.9/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/shutil.py", line 427, in copy
      copyfile(src, dst, follow_symlinks=follow_symlinks)
    File "/opt/homebrew/Cellar/python@3.9/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/shutil.py", line 257, in copyfile
      raise SpecialFileError("`%s` is a named pipe" % fn)
  shutil.SpecialFileError: `/dev/fd/11` is a named pipe
  :~$ python3 -c 'import sys; import shutil; shutil.copy(sys.argv[1], "foobar")' =(echo foo ; sleep 1 ; echo bar)
Notice the difference:

  :~$ echo =(echo foo ; sleep 1; echo bar)
  /tmp/zshbusU5D
  :~$ echo <(echo foo ; sleep 1; echo bar)
  /dev/fd/11
/dev/fd/11 is a named pipe, whereas /tmp/zshbusU5D is a plain file. There are situations where named pipes fail, where you actually do need a plain file, and I can't think of a single one offhand. :)

"I've run into it in the past" is all I can say. And the above snippet at least gives one specific example of a hypothetical case.


> There are situations where named pipes fail, where you actually do need a plain file, and I can't think of a single one offhand. :)

My go-to use-case for =() is

  okular =(dot -Tpdf foo.dot)
PDF viewers and their ilk typically don't work on non-file arguments.


> There are situations where named pipes fail, where you actually do need a plain file

In general: If the command you are passing the named pipe to expects a file which it can randomly access, a named pipe won’t work.

I ran into this today using a Python tool, yamldiff. I was trying to diff a local YAML file with one from a cluster obtained via kubectl, e.g.:

  yamldiff localManifest.yaml <(kubectl get … --output yaml)

and it failed with Python telling me the file wasn’t seekable. Then I remembered this thread, which I had initially read a few days ago, and tried =() (since I use zsh), and it worked!


I quite like how fish does process substitution.

  foo (bar | psub)
Instead of special syntax, it's just a plain function.

docs: https://fishshell.com/docs/current/cmds/psub.html source of psub.fish: https://github.com/fish-shell/fish-shell/blob/master/share/f...
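So the diff example from upthread becomes something like (roughly as in the fish docs):

  diff (sort left | psub) (sort right | psub)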


While the syntax might be nice, strictly speaking, process substitution isn't currently possible in Fish: https://news.ycombinator.com/item?id=29845845


But fish's psub only supports files or named pipes, not the method mentioned in the post, which is described as "the best way".


It doesn't even support named pipes - the function's been broken for years. Interestingly it doesn't actually seem to matter a great deal, at least now that temporary files are usually on a fast disk.


The usability of unixy shells generally falls off a cliff when you need to deal with more than one input and one output. The awkwardness of trying to shoehorn in process substitution is just one example of that.


Yeah. It's caused by this single standard input/output model. Even using standard error is unergonomic and leads to magic like 2>&1.

What if programs could have any number of input and output file descriptors and the numbers/names along with their contents and data types were documented in the manual? Could eliminate the need to parse data altogether. I remember Common Lisp had something similar to this.


My understanding is that they can. stdin, stdout and stderr are just pipes assigned to file descriptors 0, 1 and 2 by convention. There's nothing stopping a program from having more file descriptors passed in, and some programs do. There's just no standard convention on how to do it.


> My understanding is that they can. stdin, stdout and stderr are just pipes assigned to file descriptors 0,1 and 2 by convention.

Yes. Standard input and output are so ubiquitous that shells were entirely designed around them, with syntax that allows you to work with them implicitly.

> There's nothing stopping a program from having more file descriptors passed in, and some programs do.

Can you please cite examples? I've never seen any software do that.

> There's no just no standard convention on how to do it.

Yeah. Such conventions would be nice. Perhaps a plan9 style virtual file system for all programs...

  program/
    input/
      x
      y
    output/
      x+y
      x-y


> Can you please cite examples? I've never seen any software do that.

With `gpg --verify` you can specify which file descriptor you want it to output to. I've previously used it to ensure a file is verified by a key with ultimate trust, something that otherwise requires outputting to a file.

Something like this:

    tmpfifo="$(mktemp -u -t gpgverify.XXXXXXXXX)"
    gpg --status-fd 3 --batch --verify "$sigFile" "$file" 3>"$tmpfifo"
    grep -Eq '^\[GNUPG:] TRUST_(ULTIMATE|FULLY)' "$tmpfifo" || exit 1
GPG also has `--passphrase-fd`. Possibly other options too.


> Can you please cite examples? I've never seen any software do that.

Well, almost no one actually says “input is read from fd 0 (stdin) and 4”, for example. Generally you say “input is read from file1 and file2”, and then the user can pass “/dev/fd/0 /dev/fd/4” as arguments. This copes better when the parent process doesn’t want to close any of its inherited file descriptors.


Here's an example of how you would allow reading from stdin by using a different descriptor (3) for the input you're iterating over. I knew this was possible mainly because I also recently needed to receive user input while iterating over file output in bash.

https://superuser.com/a/421713
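The shape of that trick, as I understand it (a sketch, not the code from the linked answer):

  # iterate over the command's output on fd 3, so that `read` inside the
  # loop can still prompt on the terminal's normal stdin
  while read -r -u 3 line; do
    printf 'keep %s? [y/N] ' "$line"
    read -r answer
    [ "$answer" = y ] && echo "$line" >> keep.txt
  done 3< <(ls)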


> What if programs could have any number of input and output file descriptors and the numbers/names along with their contents and data types were documented in the manual?

What you’re describing is similar to one of the more common ways that programs are run on mainframes by using Job Control Language (JCL).


Do you know alternatives to that? I assume PowerShell, but I don't know if there's anything beyond that.


You can always set up some named pipes. E.g.:

  mkfifo named_pipe
  echo "Hi" > named_pipe &
  cat named_pipe

I used to do this in bash scripts to keep them cleaner and the lines simpler.


Named pipes and tee (or GNU parallel, depending on the problem) make this semantically much clearer. It's so much better than bracket-and-sed hell spread out over different lines.


Where does sed come into this?


It doesn't necessarily -- I just meant that if you have <(<(...) <(...)) type structures then adding a punctuation-jumble like command in the middle of each subsubshell is a good way to murder readability quickly. Sed, and to a lesser extent, awk, tend to be good examples of tools that (can) use a _lot_ of brackets and symbols...


How does error handling work together with this? Can pipefail catch this, or does one explicitly need to ‘wait’ for the background processes and check them there?


Guessing that pipefail doesn't care at all which file descriptors are used, only about the exit codes of the processes in the pipeline.


I remember seeing some academic work on extending shell semantics to more complicated pipe networks, but nothing particularly promising. In industry, I think that is generally the point where people pick up a "real" programming language instead of trying to work in shell; off the top of my head, I imagine golang with its channels and goroutines would be particularly well suited for these sorts of problems. I can't say if there is something in golang that shells could adopt somehow.


But of course you can do the same thing in 30 seconds that Go would take 30 minutes for. Especially if you’re trying to process-substitute a shell pipeline, not just one command.


Great read, but worth noting from the end of the article that "late feature" here means it was added in the early '90s. The late addition to Unix that surprises me is ssh. It was only invented in the late '90s. Before that everyone used unencrypted remote shells like telnet.

Encryption in general was in a pretty bad state in the '90s: original http was unencrypted and early mobile phone standards that are still in use have very weak encryption.


Everything was unencrypted until the late 90s (and in many cases until the late 00s). Email (both smtp and pop3/imap), irc, web, gopher, telnet, ftp, local disks, removable storage, network storage (smb/nfs etc), everything. Computing and the internet were a much nicer place; there wasn't such an adversarial attitude where everything would be broken just because it's out there, like today.


I started before CompuServe, Internet, or the internet were nouns.

It wasn't nicer back then, it was lazy and naïve.

3DES was widespread in the payment card industry, but the attitude towards protecting any/all parts of networks corresponding to the 7-layer OSI model was generally lax.

IPv4 public address ranges (mostly registered Class B's and C's) were wasted frivolously for internal corporate networks where they weren't suited or even necessary.

Unless they didn't know what they were doing, bank logins weren't unencrypted. Ever.

I and some lab peeps played with ARP and IP spoofing to steal each other's telnet sessions in the late 90's. It was obvious telnet, rcp, rsh, echo, char, finger, and nfs needed major reworking and/or abandonment.

Later, the Equifax hack broke SSN's as universal American private "UUIDs" (primary keys).

Things still broken as of 2022:

0. Without deploying 802.1X, DHCP by itself is still terrible because anyone can spoof being a server and disrupt many communications on a LAN. Properly managed campus ELANs/WLANs should authenticate all WiFi and Ethernet connections equally and disconnect any misbehaviors at the port or AP-association level.

1. PII should be held by a secure, independent, nongovernmental nonprofit where it can be updated in one place and where access policies can be set by the individual. Companies then can request access to it. That way, PII is treated more like medical records (PHI) and payment card info. For the most part, corporate customer data should be anonymized as much as possible by law.

2. There is no global universal standard identity / proximity card / secret keys HSM. Similarly, it should not be held or managed by any country, only issued by their organizations.

3. There is simultaneously too much anonymity for launching cyberattacks while not enough for protecting dissidents. Social media app operators should understand how much anonymity and identity-revealing/-proving is appropriate to ensure people invest-in and maintain a minimum amount of decency and empathy vs. cyberdisinhibitionism.


Yeah, on the time sharing Unix systems I would use in the 80s and 90s, everyone’s home directory (and most everything under it) was world readable by default. You could change the permissions, but most people didn’t.

I feel like those old folks who tell of a time when people didn’t bother to lock their doors at night.


The home directory of the 1980s was the GitHub and Stack Overflow of today. When I had a problem I just ran grep to see what others had done. There was no internet to ask anybody. And people did not do banking, store photos or anything like that on their computer. I guess mbox was read protected for group and others already back then.


I think on Ubuntu/Debian this is still the default; the UMASK in /etc/login.defs is 022.


But multiuser computers are much less the default than back then. Even kids have their own one because they need it in school (at least in this country).


I think OpenSuse too. I recently converted mine to user private group handling

https://access.redhat.com/documentation/en-us/red_hat_enterp...


Encryption is CPU-heavy and CPUs weren't nearly as fast then as they are now. Unix was developed on systems like a VAX which could do 1 MIPS (millions of instructions per second). For comparison an M1 chip can do about 10 trillion instructions per second. It just wasn't possible to encrypt data in real time like it is now.


Before Eternal September the Internet was a trustworthy place. And criminals had not found it yet.


Telnet has encryption options; they just weren't widely implemented.


For whatever reason I can never remember the syntax of <(command) and end up "rediscovering" it every year. It's seldom used but when it's needed it's rather elegant.

Another somewhat related useful bash feature is using this form:

    wc <<< 'a b c' 
instead of:

    echo 'a b c' | wc


With the caveat that the here string causes a tempfile to be written¹, so they're not quite equivalent. How much that matters for your use cases though is a different question, but it is worth thinking about if you're doing lots of repeated calls.

¹ With Bash v5 it may use a pipe if the data is small enough, but you can't guarantee people will have that because of GPLv3 phobia. I believe it is always a tempfile with zsh.
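On Linux you can peek at which you got (hedged: this relies on /proc, so it's Linux-only, and the exact output varies by bash version):

  readlink /proc/self/fd/0 <<< 'a b c'

which prints an unlinked /tmp path on bashes that use a temp file, and something like pipe:[12345] on newer ones when the data fits in the pipe buffer.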


Didn't even know about process substitution; I had been using FIFOs to achieve this!


Fish tries to use FIFOs to emulate process substitution, and it leads to deadlock. Not sure why.

By default, Fish actually runs the processes in a strict sequence. But this is to avoid the above deadlock situation. And it therefore isn't process substitution.


To be sure, fish runs external processes in parallel. It's only internal functions which are serialized against each other. I'm hoping to lift this restriction eventually.

This SO answer explains a bit about the issues with FIFOs and fish shell. Basically it comes down to evaluating redirections before fork instead of after, which is because fish uses threads. https://stackoverflow.com/questions/61946995/ls-fifo-blocks-...


Thanks. Love the Fish shell!

Does this mean that every time you direct output to a file, like so:

  the_prog > the_file
The output of the_prog passes into Fish first and then into the_file? Because then if the write to the_file blocks then Fish would itself get blocked.


The claim is false; process substitution can be cobbled together with named fifos,* and those are "ancient".

The only problem is that those are temporary objects that have to be created in the file system, and cleaned up.
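Something along these lines (a sketch; error handling and cleanup-on-signal omitted):

  fifo=$(mktemp -u)        # just a unique name; mkfifo creates the actual FIFO
  mkfifo "$fifo"
  sort left > "$fifo" &    # writer runs in the background
  diff "$fifo" right       # reader consumes the FIFO like a file argument
  wait                     # collect the writer's exit status
  rm "$fifo"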

However, temporary objects (files, not fifos) are also used in here-doc implementations.

Process substitution is a late feature simply because the creativity juice in Unix (tm) dried up some time before the mid-1990s, leaving the FOSS reimplementations of Unix to carry the development torch.

Those projects had to balance among other goals like quality/robustness and compatibility.

(If we look at the quality of the FOSS tools compared to the Unix originals, we could also remark that "quality and robustness was late in coming to Unix". But we equivocate on Unix, because GNU stands for GNU is Not Unix!)

Features appearing in FOSS utilities like GNU Bash take time to make it into Unix (tm).

Process substitution is not yet in the standard; therefore it is not, in fact, in Unix (tm).

Shell scripting is a conservative activity. The language isn't very good and so improving it is like kicking a dead horse in some ways; the most important matter in any new shell release is that old scripts keep working. (Like configuration scripts for the build systems of nicer languages).

---

* See GNU Bash manual: https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash....: " Process substitution is supported on systems that support named pipes (FIFOs) or the /dev/fd method of naming open files. "




In practice I end up caching the output often. I have used process substitution but the iteration process feels more useful to me if I've slowly built up data and I can inspect the internal pieces each time and reuse them in different ways.

But I can see if it's relatively fast. I like it. I just don't end up using it often.


0. Process substitution is a potential DoS vector as it could take up all of RAM and/or disk space.

1. Also, not all commands are compatible with it, especially if they need rewinding or reopening. diff often has issues when it's used for both arguments. It's likely due to the use of memory-mapped files, but I could be wrong.

2. Shells ought to implement a flag for process substitution to allow temporary files to reside on disk for the lifetime of the command line. This way, it can operate on extremely large files.


An unfortunate thing is that process substitution does not work in Git Bash on Windows. (At least that was the case last time I tested; googling around I found a random comment in a random GitHub repo saying it's been fixed in 2.25, but I don't have a laptop handy to test it now.)


it does


This is a great explanation! I've wondered many times why we had to play with obscure xargs incantations instead of being able to pipe a command's output to an argument.


Which is the best shell, fish or zsh?


There is no possible way to usefully answer that question as given; every shell has its own advantages and disadvantages. Plain Bourne shell is universal but bare-bones, bash is mostly ubiquitous on Linux, zsh is powerful but just different enough to occasionally bite you, fish is very user friendly but doesn't even try to be compatible with anything else, ksh is a nice option and is built in on BSDs, dash sucks for interactive work but is great for running scripts...


When everything is both nails and screws, a hammer isn't the best tool for everything (unless it has a screwdriver on the end and the hammer part is big enough to be actually useful as a hammer).


tcl


Nawh, it's all about tk. ;)



