Fun fact, in bash you lose the exit codes of process subs. It doesn't even wait() on those processes in many cases.
So there is no way to abort a bash script if something like <(sort nonexistent) fails.
OSH lets you opt into stricter behavior:
$ osh -c '
set -e
shopt -s oil:basic
diff <(sort left) <(sort nonexistent)
echo "should not get here"
'
sort: cannot read: nonexistent: No such file or directory
diff <(sort left) <(sort nonexistent)
^~
[ -c flag ]:1: fatal: Exiting with status 2 (command in PID 29359)
In contrast, bash will just keep going and ignore failure.
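For comparison, something like this in plain bash keeps going (a sketch; with both inputs missing, diff compares two empty streams and exits 0, so even set -e doesn't help):

$ bash -c '
set -e
diff <(sort nonexistent) <(sort alsomissing)
echo "should not get here"
'
sort: cannot read: nonexistent: No such file or directory
sort: cannot read: alsomissing: No such file or directory
should not get here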
You can also get all the exit codes with @_process_sub_status, which is analogous to PIPESTATUS. (I should probably write a blog post about this: http://www.oilshell.org/blog/)
> The process ID of the last executed background command in Bash is available as $!.
After a quick search it doesn't seem like this is an option for anything but the last executed background process. So if you have the typical `diff <(left) <(right)`, there is no easy way to get the pid of `<(left)`.
I suppose you could output the pid to a file, and use it from there, though I haven't tested if that would work.
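An untested sketch of that idea (paths made up), relying on $BASHPID giving the PID of the substitution's subshell:

diff <(echo "$BASHPID" > /tmp/left.pid; sort left) \
     <(echo "$BASHPID" > /tmp/right.pid; sort right)
cat /tmp/left.pid /tmp/right.pid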
Now you've got me curious which came first, "paste" or "zip".
Python's zip function was added in PEP 201 [1], which credits the Haskell 98 report with the name. The same function appears in all prior versions of the Haskell language that I could find, dating back to Haskell 1.0 in 1990.
On the other hand, the GNU coreutils version of paste[2] has copyright dates going all the way back to 1984, which presumably means it was derived from UNIX System V or an even earlier version.
The Miranda language (which predates Haskell) had a zip function. According to Wikipedia, Miranda was released in 1985, but I don't know if zip was in the standard Miranda environment back then. The comments in the source code say:
The following is included for compatibility with Bird and Wadler (1988).
The normal Miranda style is to use the curried form `zip2'.
> zip :: ([*],[**])->[(*,**)]
> zip (x,y) = zip2 x y
which suggests a slightly later origination of zip.
if you know the separator, you can use `cut` -- not an unzip per se (doesn't give you the full separated lists, without calling it multiple times anyway) but approximately equal to the awk line
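For example, pulling the 1st and 4th space-separated fields:

$ echo "a b c d" | cut -d' ' -f1,4
a d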
EDIT: Actually, I'd never committed 4th. It was just sitting here in my local scrap. So I guess I counted to three, realized at some later date I needed four, copied 3rd to 4th, then never committed.
Oh. Apparently I also have 'nth', which was exactly what I wanted:
> echo a b c d | nth 0 3
a d
The solution I came up with at the time:
#!/bin/bash
# Build an awk field list like "$1, $4" from the 0-based column numbers
# given as arguments, then print those columns.
c=""
args=""
for n in "$@"
do
  n=$((n+1))    # awk fields are 1-based
  args="${args}${c}\$$n"
  c=', '
done
awk "{ print ${args}; }"
Nowadays I'd write it like:
awk "{ print $(echo "$@" | args math '1+{}' | replace '^' '$' | trim | joinlines ', '); }"
But I admit both are equally cryptic, and the former might even be more readable.
Hmm. Does xargs give you a way to skip the first arg? xargs -n 2 -I {} almost works, but you’d need a way to extract a specific arg out of the {}. Which I just realized I don’t know how to do.
“(at 0 {})” would be a nice construction.
But then of course now you’re writing xargs -n 2 -I {} echo (at 1 {}) instead of just awk {print $2}, so it’s mission failed successfully.
Kinda-sorta related, you can pipe to a multivariable loop:
paste <(printf "a\nb\nc\n") <(printf "1\n2\n3\n") | while read x y; do echo "$y -- $x"; done
Kinda silly for this example, but very useful in things like `kubectl` pipelines -- you can do something like this:
kubectl get pods --all-namespaces | awk '{print $1 " " $2}' | grep "doesntexist" | while read x y; do kubectl delete pod $y --namespace $x --wait=false; done
This lets you list all pods in a cluster, grep a selection of them by name, and delete them in a batch, regardless of which namespace they are in, while shuffling the order of args into different flags.
How do you know that the outputs a, b, and c correspond to 1, 2, and 3, and that they will always occur in the same order? This technique seems like it's full of possibly invalid assumptions.
It’s almost identical to <(echo foo), but crucially it drains the output before performing the substitution. It’s like <(... | sponge), but somehow much more reliable.
I’ve used it in a few situations where process substitution failed, though being old and feeble I can’t remember why it failed. But for example you can copy the file safely, knowing that it will contain the full output of the shell pipeline, unlike with process substitution.
I don’t even know the name of =(...). Hmm. Sponge substitution?
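You can see the difference with something like this (zsh; the exact fd number and temp path will vary):

% ls <(echo foo) =(echo foo)
/dev/fd/11  /tmp/zshbEF1D9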
/dev/fd/11 is a named pipe, whereas /tmp/zshbEF1D9 is a plain file. There are situations where named pipes fail, where you actually do need a plain file, and I can't think of a single one offhand. :)
"I've run into it in the past" is all I can say. And the above snippet at least gives one specific example of a hypothetical case.
> There are situations where named pipes fail, where you actually do need a plain file
In general: If the command you are passing the named pipe to expects a file which it can randomly access, a named pipe won’t work.
I ran into this today using a python tool, yamldiff. I was trying to diff a local yaml file with one from a cluster obtained via kubectl, e.g.:
yamldiff localManifest.yaml <(kubectl get … --output yaml)
and it failed with python telling me the file wasn’t seekable. Then I remembered this thread, which I had initially read a few days ago, and tried =() (since I use zsh as my shell), and it worked!
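i.e. roughly (the resource name here is just a placeholder):

yamldiff localManifest.yaml =(kubectl get deployment myapp --output yaml)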
It doesn't even support named pipes - the function's been broken for years. Interestingly it doesn't actually seem to matter a great deal, at least now that temporary files are usually on a fast disk.
The usability of unixy shells generally falls off a cliff when you need to deal with more than just one input and one output. The awkwardness of trying to shoehorn in process substitution is just one example of that.
Yeah. It's caused by this single standard input/output model. Even using standard error is unergonomic and leads to magic like 2>&1.
What if programs could have any number of input and output file descriptors and the numbers/names along with their contents and data types were documented in the manual? Could eliminate the need to parse data altogether. I remember Common Lisp had something similar to this.
My understanding is that they can. stdin, stdout and stderr are just streams attached to file descriptors 0, 1 and 2 by convention. There's nothing stopping a program from having more file descriptors passed in, and some programs do. There's just no standard convention on how to do it.
> Can you please cite examples? I've never seen any software do that.
With `gpg --verify` you can specify which file descriptor you want it to output to. I've previously used it to ensure a file is verified by a key that is trusted ultimate. Something that otherwise requires outputting to a file.
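Roughly like this (a sketch from memory; file names are made up, and gpg's doc/DETAILS describes the status-line format):

gpg --status-fd 3 --verify release.sig release.tar.gz 3>gpg.status
grep -q '^\[GNUPG:\] TRUST_ULTIMATE' gpg.status && echo "signed by an ultimately trusted key"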
> Can you please cite examples? I've never seen any software do that.
Well, almost no one actually says “input is read from fd 0 (stdin) and 4”, for example. Generally you say “input is read from file1 and file2”, and then the user can pass “/dev/fd/0 /dev/fd/4” as arguments. This copes better when the parent process doesn’t want to close any of its inherited file descriptors.
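A concrete illustration (Linux-flavoured, file names made up): diff neither knows nor cares that its "file" arguments are really inherited descriptors:

diff /dev/fd/3 /dev/fd/4 3<old.txt 4<new.txt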
Here's an example of how you would allow reading from stdin by using a different descriptor (3) for the input you're iterating over. I knew this was possible mainly because I also recently needed to receive user input while iterating over file output in bash.
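The pattern looks something like this (a sketch; the find command and the prompt are just placeholders):

while read -r -u 3 file; do
  read -r -p "Delete $file? [y/N] " answer   # reads from the terminal; fd 3 carries the loop input
  [ "$answer" = y ] && rm -- "$file"
done 3< <(find . -name '*.bak')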
> What if programs could have any number of input and output file descriptors and the numbers/names along with their contents and data types were documented in the manual?
What you’re describing is similar to one of the more common ways that programs are run on mainframes by using Job Control Language (JCL).
Named pipes and tee (or gnu parallel, depending on the problem) make this semantically much clearer. It's so much better than bracket-and-sed-hell spread out over different lines.
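A rough sketch of that style (all names made up), fanning one stream out to two consumers without any nesting:

mkfifo words.fifo lines.fifo
tee words.fifo lines.fifo < input.txt > /dev/null &
wc -w < words.fifo &
wc -l < lines.fifo
wait
rm words.fifo lines.fifo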
It doesn't necessarily -- I just meant that if you have <(<(...) <(...)) type structures then adding a punctuation-jumble like command in the middle of each subsubshell is a good way to murder readability quickly. Sed, and to a lesser extent, awk, tend to be good examples of tools that (can) use a _lot_ of brackets and symbols...
How does error handling work together with this? Can pipefail catch this, or does one explicitly need to ‘wait’ for the background processes and check them there?
I remember seeing some academic work on extending shell semantics to more complicated pipe networks, but nothing particularly promising. In industry, I think that is generally the point where people pick up a "real" programming language instead of trying to work in shell; off the top of my head I imagine golang with its channels and goroutines would be particularly well suited for this sort of problem. I can't say if there is something in golang that shells could adapt somehow.
But of course you can do the same thing in 30 seconds that Go would take 30 minutes for. Especially if you’re trying to process-substitute a shell pipeline, not just one command.
Great read, but worth noting from the end of the article that "late feature" here means it was added in the early '90s. The late addition to Unix that surprises me is ssh. It was only invented in the late '90s. Before that everyone used unencrypted remote shells like telnet.
Encryption in general was in a pretty bad state in the '90s: original http was unencrypted and early mobile phone standards that are still in use have very weak encryption.
Everything was unencrypted until the late 90s (and in many cases until the late 00s). Email (both smtp and pop3/imap), irc, web, gopher, telnet, ftp, local disks, removable storage, network storage (smb/nfs etc), everything. Computing and the internet were a much nicer place; there wasn't such an adversarial attitude where everything would be broken just because it's out there, like today.
I started before CompuServe, Internet, or the internet were nouns.
It wasn't nicer back then, it was lazy and naïve.
3DES was widespread in the payment card industry, but the attitude towards protecting any/all parts of networks corresponding to the 7-layer OSI model was generally lax.
IPv4 public address ranges (mostly registered Class B's and C's) were wasted frivolously for internal corporate networks where they weren't suited or even necessary.
Unless they didn't know what they were doing, bank logins weren't unencrypted. Ever.
I and some lab peeps played with ARP and IP spoofing to steal each other's telnet sessions in the late 90's. It was obvious telnet, rcp, rsh, echo, char, finger, and nfs needed major reworking and/or abandonment.
Later, the Equifax hack broke SSN's as universal American private "UUIDs" (primary keys).
Things still broken as of 2022:
0. Without deploying 802.1X, DHCP by itself is still terrible because anyone can spoof being a server and disrupt much of the communication on a LAN. Properly managed campus ELANs/WLANs should authenticate all WiFi and Ethernet connections equally and disconnect any misbehavior at the port or AP-association level.
1. PII should be held by a secure, independent, nongovernmental nonprofit, where it can be updated in one place and where the individual can set access policies. Companies could then request access to it. That way, PII is treated more like medical records (PHI) and payment card info. For the most part, corporate customer data should be anonymized as much as possible by law.
2. There is no global universal standard identity / proximity card / secret keys HSM. Similarly, it should not be held or managed by any country, only issued by their organizations.
3. There is simultaneously too much anonymity for launching cyberattacks while not enough for protecting dissidents. Social media app operators should understand how much anonymity and identity-revealing/-proving is appropriate to ensure people invest-in and maintain a minimum amount of decency and empathy vs. cyberdisinhibitionism.
Yeah, on the time sharing Unix systems I would use in the 80s and 90s, everyone’s home directory (and most everything under it) was world readable by default. You could change the permissions, but most people didn’t.
I feel like those old folks who tell of a time when people didn’t bother to lock their doors at night.
The home directory of the 1980s was the github and Stackoverflow of today. When I had a problem I just ran grep to see what others had done. There was no internet to ask anybody. And people did not do banking, store photos or anything like that on their computer. I guess mbox was read-protected for group and others even back then.
But multiuser computers are much less the default now than they were back then. Even kids have their own machine because they need it for school (at least in this country).
Encryption is CPU-heavy and CPUs weren't nearly as fast then as they are now. Unix was developed on systems like a VAX which could do 1 MIPS (millions of instructions per second). For comparison an M1 chip can do about 10 trillion instructions per second. It just wasn't possible to encrypt data in real time like it is now.
For whatever reason I can never remember the syntax of <(command) and end up "rediscovering" it every year. It's seldom used but when it's needed it's rather elegant.
Another somewhat related useful bash feature is using this form, the here string: `cmd <<< "$var"`.
With the caveat that the here string causes a tempfile to be written¹, so they're not quite equivalent. How much that matters for your use cases though is a different question, but it is worth thinking about if you're doing lots of repeated calls.
¹ With Bash v5 it may use a pipe if the data is small enough, but you can't guarantee people will have that because of GPLv3 phobia. I believe it is always a tempfile with zsh.
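For comparison, the three forms side by side (a sketch; as noted above, newer bash may use a pipe for small here strings):

data=$'foo\nbar'
printf '%s\n' "$data" | grep foo        # ordinary pipe
grep foo <<< "$data"                    # here string: traditionally a temp file behind the scenes
grep foo < <(printf '%s\n' "$data")     # process substitution: a pipe, not a temp file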
Fish tries to use FIFOs to emulate process substitution, and it leads to deadlock. Not sure why.
By default, Fish actually runs the processes in a strict sequence. But this is to avoid the above deadlock situation. And it therefore isn't process substitution.
To be sure, fish runs external processes in parallel. It's only internal functions which are serialized against each other. I'm hoping to lift this restriction eventually.
The claim is false; process substitution can be cobbled together with named fifos, and those are "ancient".
The only problem is that those are temporary objects that have to be created in the file system, and cleaned up.
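For instance, something like this (a sketch) approximates diff <(sort left) <(sort right):

tmp=$(mktemp -d)
mkfifo "$tmp/a" "$tmp/b"
sort left  > "$tmp/a" &
sort right > "$tmp/b" &
diff "$tmp/a" "$tmp/b"
rm -r "$tmp"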
However, temporary objects (files, not fifos) are also used in here doc implementations.
Process substitution is a late feature simply because the creativity juice in Unix (tm) dried up some time before the middle 1990's, leaving the FOSS reimplementations of Unix to carry the development torch.
Those projects had to balance among other goals like quality/robustness and compatibility.
(If we look at the quality of the FOSS tools compared to the Unix originals, we could also remark that "quality and robustness was late in coming to Unix". But we equivocate on Unix, because GNU stands for GNU is Not Unix!)
Features appearing in FOSS utilities like GNU Bash take time to make it into Unix (tm).
Process substitution is not yet in the standard, therefore it is in fact not in Unix (tm).
Shell scripting is a conservative activity. The language isn't very good and so improving it is like kicking a dead horse in some ways; the most important matter in any new shell release is that old scripts keep working. (Like configuration scripts for the build systems of nicer languages).
In practice I end up caching the output often. I have used process substitution but the iteration process feels more useful to me if I've slowly built up data and I can inspect the internal pieces each time and reuse them in different ways.
But I can see the appeal if it's relatively fast. I like it. I just don't end up using it often.
0. Process substitution is a potential DoS vector as it could take up all of RAM and/or disk space.
1. Also, not all commands are compatible with it, especially if they need rewinding or reopening. diff has issues with using it for both arguments often. It's likely the use of memory mapped files, but I could be wrong.
2. Shells ought to implement a flag for process substitution to allow temporary files to reside on disk for the lifetime of the command line. This way, it can operate on extremely large files.
An unfortunate thing is that process substitution does not work in git bash on Windows. (At least that was the case last time I tested; googling around I found a random comment in a random github repo saying it's been fixed in 2.25, but I don't have a laptop handy to test it now.)
There is no possible way to usefully answer that question as given; every shell has its own advantages and disadvantages. Plain bourne is universal but bare-bones, bash is mostly ubiquitous on Linux, zsh is powerful but just different enough to occasionally bite you, fish is very user friendly but doesn't even try to be compatible with anything else, ksh is a nice option and is built-in on BSDs, dash sucks for interactive work but is great for running scripts...
When everything is both nails and screws, a hammer isn't the best tool for everything (unless it has a screwdriver on the end and the hammer part is big enough to be actually useful as a hammer).