Bash patterns I use weekly (will-keleher.com)
278 points by gcmeplz on Nov 23, 2021 | 112 comments



> git bisect is the "real" way to do this, but it's not something I've ever needed

git bisect is great and worth trying; it does what you're doing in your bash loop, but faster and with more capabilities such as logging, visualizing, skipping, etc.

The syntax is: $ git bisect run <command> [arguments]

https://git-scm.com/docs/git-bisect
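
A minimal end-to-end sketch (the good revision and the test command here are hypothetical):

    git bisect start
    git bisect bad HEAD           # current commit is broken
    git bisect good v1.2.0        # last revision known to work
    git bisect run make test      # exit 0 = good, 1-127 (except 125) = bad
    git bisect reset              # done: return to where you started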


Yes, git bisect is the way to go: in addition to the stuff you mentioned, his method only dives into one parent branch of merge commits. git bisect handles that correctly. A gem of a tool, git bisect.


Bisect also does a binary search so if you're looking for one bad commit amongst many others, you'll find it much more quickly than linearly testing commits, one at a time, until you find a working one.


How does bisect help in a large project? Seems like it would be best to use personal expertise to find it


I've only used bisect a handful of times, but I think it can be most useful in large projects with many contributors.

In smaller projects you can spot a bug and often surmise "I bet this is related to Fred's pull request yesterday that updated the Foo module", but in a larger repository where you don't have all the recent history in your head you might not even know where to start looking. In such cases being able to binary search through history is handy.


This is a hilariously absurd proposition. The existence of `git bisect` presupposes issues that personal expertise cannot find without some help.

Personal expertise tells you to use a tool like git bisect to find a problem, so you are more effective in your work. The same as it tells you to use gdb to debug a stack trace. Or do you eyeball the CPU executing instructions in realtime?

An absence of personal expertise will convince you that you are smart enough to do it on your own.


So let’s say there are 50 commits in a short time between good and bad. Of those 50, there are a few that are “maybes”. Bisect is completely useless.


git-bisect is nice if you are looking for a git commit.

If you are looking for a limit or the failing part of a file, have a look at: https://gitlab.com/ole.tange/tangetools/-/tree/master/find-f...


I've always had trouble getting `for` loops to work predictably, so my common loop pattern is this:

    grep -l -r pattern /path/to/files | while read x; do echo $x; done
or the like.

This uses bash read to split the input line into words, then each word can be accessed in the loop with variable `$x`. Pipe friendly and doesn't use a subshell so no unexpected scoping issues. It also doesn't require futzing around with arrays or the like.
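
One small tweak that makes it even more predictable is `IFS= read -r`, which preserves leading whitespace and backslashes in each line:

    grep -l -r pattern /path/to/files | while IFS= read -r x; do echo "$x"; done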

One place I do use bash for loops is when iterating over args, e.g. if you create a bash function:

    function my_func() {
        for arg; do
            echo $arg
        done
    }
This'll take a list of arguments and echo each on a separate line. Useful if you need a function that does some operation against a list of files, for example.
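
For example (hypothetical file names):

    my_func notes.txt "report draft.txt" todo.md
    # notes.txt
    # report draft.txt
    # todo.md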

Also, bash expansions (https://www.gnu.org/software/bash/manual/html_node/Shell-Par...) can save you a ton of time for various common operations on variables.
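
A few expansions that cover most day-to-day cases (the variable names are just placeholders):

    file="/path/to/archive.tar.gz"
    echo "${file##*/}"             # archive.tar.gz   (strip the directory part)
    echo "${file%.tar.gz}"         # /path/to/archive (strip a suffix)
    echo "${file/archive/backup}"  # /path/to/backup.tar.gz (substitution)
    echo "${name:-default}"        # fall back to "default" if $name is unset or empty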


Once I discovered functions are pipeline-friendly, I started pipelining all the things. Almost any `for arg` can be re-designed as a `while read arg` with a function.

Here's what it can look like. Some functions I wrote to bulk-process git repos. Notice they accept arguments and stdin:

  # ID all git repos anywhere under this directory
  ls_git_projects ~/src/bitbucket |
      # filter ones not updated for pre-set num days    
      take_stale |
      # proc repo applies the given op. (a function in this case) to each repo
      proc_repos git_fetch
Source: https://github.com/adityaathalye/bash-toolkit/blob/master/bu...

The best part is that sourcing pipeline-friendly functions into a shell session allows me to mix-and-match them with regular unix tools.
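
For anyone wondering what "pipeline-friendly" means here, the shape is just a function that reads stdin line by line. A minimal hypothetical example (not taken from the linked toolkit):

    # read repo paths from stdin, pass through only the ones with uncommitted changes
    take_dirty() {
        while IFS= read -r repo; do
            git -C "$repo" diff --quiet 2>/dev/null || printf '%s\n' "$repo"
        done
    }

    # usage: ls -d ~/src/bitbucket/*/ | take_dirty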

Overall, I believe (and my code will betray it) functional programming style is a pretty fine way to live in shell!


> I've always had trouble getting `for` loops to work predictably, so my common loop pattern is this:

for loops were exactly the pain point that led me to write my own shell > 6 years ago.

I can now iterate through structured data (be it JSON, YAML, CSV, `ps` output, log file entries, or whatever) and each item is pulled intelligently rather than having to consciously consider a tonne of dumb edge cases like "what if my file names have spaces in them"

eg

    » open https://api.github.com/repos/lmorg/murex/issues -> foreach issue { out "$issue[number]: $issue[title]" }
    380: Fail if variable is missing
    379: Backslashes and code comments
    378: Improve testing facility documentation
    377: v2.4 release
    361: Deprecate `swivel-table` and `swivel-datatype`
    360: `sort` converts everything to a string
    340: `append` and `prepend` should `ReadArrayWithType`

Github repo: https://github.com/lmorg/murex

Docs on `foreach`: https://murex.rocks/docs/commands/foreach.html


Powershell is also a good option nowadays (although a lot of people on HN seem to dismiss it for various, imo rather superficial, reasons).

  PS> (irm https://api.github.com/repos/lmorg/murex/issues) | % { echo "$($_.number): $($_.title)" }
  380: Fail if variable is missing
  379: Backslashes and code comments
  378: Improve testing facility documentation
  377: v2.4 release
  361: Deprecate `swivel-table` and `swivel-datatype`
  360: `sort` converts everything to a string
  340: `append` and `prepend` should `ReadArrayWithType`
Or just

  PS> (irm https://api.github.com/repos/lmorg/murex/issues) | format-table number, title

  number title
  ------ -----
     380 Fail if variable is missing
     379 Backslashes and code comments
     378 Improve testing facility documentation
     377 v2.4 release
     361 Deprecate `swivel-table` and `swivel-datatype`
     360 `sort` converts everything to a string
     340 `append` and `prepend` should `ReadArrayWithType`
Or even `(irm https://api.github.com/repos/lmorg/murex/issues) | select number, title | out-gridview`, which would open a GUI list (with sorting and filtering), but I think that only works on Windows.


The reason I dismissed Powershell was that it doesn't always play nicely with existing POSIX tools, which is very much not a superficial reason :)

Murex aims to give Powershell-style types while still working seamlessly with existing CLI tools. An attempt at the best of both worlds. But I'll let others be the judge of that.

It's also worth noting that Powershell wasn't available for Linux when I first built murex so it wasn't an option even if I wanted it to be.


> I've always had trouble getting `for` loops to work predictably, so my common loop pattern is this:

The problem with the pipe-while-read pattern is that you can't modify variables in the loop, since it runs in a subshell.


It’s a trade-off. I had to use a piped loop earlier this year to extract Fallout: New Vegas mods and textures since they all had spaces in their file names. For this it was perfect to pipe a list of the names to a loop, but for 99% of things I just use a for loop.


Yup. Nearly everything has tradeoffs.

BTW, the problem I mentioned earlier can be avoided by using `< <()`:

  $ x=1
  $ seq 5 | while read n; do (( x++ )); done
  $ echo $x
  1
  $ while read n; do (( x++ )); done < <(seq 5)
  $ echo $x
  6
Almost makes me wonder what the benefit of preferring a pipe here is. I guess it's just about not having to specify what part of the pipeline is in the same shell.


It’s funny: I’ve been using Linux for a decade and a half, professionally for about half that time, and yet I still go to Python when arithmetic is involved. I’ve been learning a lot about the shell lately; it’s like I did the bare minimum with bash just to be able to run programs and slightly automate things, and it took this long for it to click with me that it’s a productive programming language in its own right (and probably faster than Python).


I use "python calc" for quick calculations at the command line:

  pc () { python3 -c "print($*)" ; } 

  $ pc 3.5**2.4
  20.219169193375105
Or "awk calc", which seems much faster: (and ^ and ** both work for powers)

  calc () { awk "BEGIN{print $*}" ; }

  $ calc 3.5**2.4
  20.2192


With the original Bourne shell, the expr command was used for arithmetic.

The POSIX shell allows the $(( )) form for native shell arithmetic, but not the (( )) alternative found in bash and Korn.
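
For example:

    expr 2 + 3          # Bourne-era external command; operands must be separate words
    echo $(( 2 + 3 ))   # POSIX arithmetic expansion
    (( count += 1 ))    # bash/ksh-only arithmetic command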


BTW you can make it work in bash by setting shopt -s lastpipe (it only takes effect when job control is off, i.e. in scripts). It runs the last part of the pipeline in the main shell, so mutations of variables will persist.

Both OSH and zsh behave that way by default, which was tangential to a point in the latest release notes https://news.ycombinator.com/item?id=29292187

Another trick I've seen in POSIX shell is to wrap everything after the pipe in a subshell, and keep that subshell open until the last place you need to read the variable. Like

    cat foo.txt | ( while read line; do
      f=$line
    done

    echo "we're still in the subshell f=$f"
    )


For me, I always used for loops and only recently (after a decade of using Linux daily) learned about the power of piped loops. It’s strange to me that you are more comfortable with those than with for loops, but I think it does make sense, as you’re letting a program generate the list to iterate over. A pain point in for loops is getting that right, e.g. there isn’t a good way to iterate over files with spaces in them using a for loop (this is why I learned about piped loops recently).


> A pain point in for loops is getting that right, e.g. there isn’t a good way to iterate over files with spaces in them using a for loop

If those files came as arguments, you can use a for-loop as long as they're kept in an array:

  for f in "${files[@]}";
That handles even newlines in the filenames, while I'm not sure if you can handle that with a while-read-loop. IFS=$'\0' doesn't seem to cut it.

for-loops seem preferable for working with filenames. If a command is generating the list, then something like `xargs -0` is preferable.


My problem was that I had a directory with probably 200+ subdirectories, and each one, plus the files and subdirectories below them, had a couple of spaces in the name. I typically use

    for f in `ls`;
for operations like that, but it was obviously built on Windows (I run Steam on Ubuntu) and I never interact with Windows, so tbh I had never thought of this problem before.


The GNU way for handling files that have inconvenient characters in their names is:

    find ... -print0 | xargs -0 ...
It makes all the problems go away.
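
And if you want the loop in the shell rather than in xargs, the same null-delimited stream can be read with bash's `read -d ''` (a sketch):

    while IFS= read -r -d '' f; do
        printf '%s\n' "$f"
    done < <(find /path/to/files -type f -print0)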


Also you can use readarray to store the found filenames in a bash array (to use with a for loop).
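
Something like this (a sketch; readarray -d needs bash 4.4+, and the find pattern is just an example):

    readarray -d '' -t files < <(find . -name '*.log' -print0)
    for f in "${files[@]}"; do
        echo "$f"
    done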


You can also have your script change the bash field separator (IFS).

https://bash.cyberciti.biz/guide/$IFS

Something I wish I'd learned 23 years ago instead of 3 years ago :(
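
For example, setting it to a newline so only line breaks split words (note this still breaks if a filename itself contains a newline):

    OLDIFS=$IFS
    IFS=$'\n'
    for f in $(find . -name '*.txt'); do
        echo "$f"
    done
    IFS=$OLDIFS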


I have something like this in my bashrc:

   preexec ()
   {
       # shellcheck disable=2034
       _CMD_START="$(date +%s)"
   }

   trap 'preexec; trap - DEBUG' DEBUG

   PROMPT_COMMAND="_CMD_STOP=\$(date +%s)
       let _CMD_ELAPSED=_CMD_STOP-_CMD_START

       if [ \$_CMD_ELAPSED -gt 5 ]; then
           _TIME_STR=\" (\${_CMD_ELAPSED}s)\"
       else
           _TIME_STR=''
       fi; "

    PS1="\n\u@\h \w\$_TIME_STR\n\\$ "

    PROMPT_COMMAND+="trap 'preexec; trap - DEBUG' DEBUG"
Whenever a command takes more than 5 s it tells me exactly how long at the next prompt.

I didn't know about `$SECONDS` so I'm going to change it to use that.


FWIW that is supported by many zsh themes like pure, p9k, p10k, ... (and often enabled by default).

Also:

    REPORTTIME
    If nonnegative, commands whose combined user and system execution times (measured in seconds) are greater than this value have timing statistics printed for them.
which is slightly different but generally useful, and built-in.
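
e.g. in ~/.zshrc:

    # report timing stats for anything using more than 5 seconds of CPU
    REPORTTIME=5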


Came here to +1 REPORTTIME for zsh


Bash 5.0 also has $EPOCHREALTIME


Installing GNU stuff with the 'g' prefix (gsed instead of sed) means having to remember to include the 'g' when you're on a Mac and leave it off when you're on Linux, or use aliases, or some other confusing and inconvenient thing, and then if you're writing a script meant for multi-platform use, it still won't work. I find it's a much better idea to install the entire GNU suite without the 'g' prefix and use PATH to control which is used. I use MacPorts to do this (/opt/local/libexec/gnubin), and even Homebrew finally supports this, although it does it in a stupid way that requires adding a PATH element for each individual GNU utility (e.g. /usr/local/opt/gnu-sed/libexec/gnubin).
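
In practice that means something like this in your shell profile (the exact paths depend on your MacPorts or Homebrew setup):

    # MacPorts: one directory covers the whole GNU suite
    export PATH="/opt/local/libexec/gnubin:$PATH"
    # Homebrew: one gnubin directory per formula, e.g. gnu-sed
    export PATH="/usr/local/opt/gnu-sed/libexec/gnubin:$PATH"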


You can use `wait` to wait for jobs to finish.

    some_command &
    some_other_command &
    wait


One issue with that is it won't reflect the failure of those commands.

In bash you can fix that by looping around, checking for status 127, and using `-n` (which waits for the first job of the set to complete), but not all shells have `-n`.
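
A sketch of that bash-only loop (`wait -n` needs bash 4.3+; 127 means there are no unwaited-for children left, and cmd_one etc. are placeholders):

    cmd_one & cmd_two & cmd_three &
    failures=0
    while true; do
        wait -n; status=$?
        [ "$status" -eq 127 ] && break              # nothing left to wait for
        [ "$status" -ne 0 ] && failures=$((failures + 1))
    done
    echo "$failures job(s) failed"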


  > 1. Find and replace a pattern in a codebase with capture groups
  > git grep -l pattern | xargs gsed -ri 's|pat(tern)|\1s are birds|g'
Or, in IDEA, Ctrl-Shift-r, put "pat(tern)" in the first box and "$1s are birds" in the second box, Alt-a, boom. Infinitely easier to remember, and no chance of having to deal with any double escaping.


What you are doing here is proposing a very specialist approach ("Why not just use a SpaceX Merlin engine?") when a slightly more cumbersome general approach ("This is how you get from A to B") was described.

IDEA is nice if you have IDEA.

"But not everyone uses Bash" - very correct (more fond of zsh, personally), but this article is specifically about Bash.


Yea, I've yet to come across a regex replacement tool as easy to use as JetBrains find & replace. Invaluable for certain tasks.


Using an IDE kind of handicaps you to only working with your IDE though. The shell works everywhere for every use case.


I've heard this many times, and I don't really understand the argument. I've used the shell plenty of times for this sort of work, but it's much more complicated to do certain things correctly than in an IDE. Things like looping over any filename, escaping arbitrary strings for use within sed patterns, xargs, multi-line replacements and regex lookahead require quite some in-depth knowledge, trial and error, and sometimes installing special tools because the defaults don't support the relevant patterns (like path NUL terminators or lookaheads). I don't have to deal with any of these in IDEA.


Of course it requires knowledge, but the point is that when you have that knowledge, it transfers to other things as well, making you far more capable in many more situations.


That particular IDE works on Windows though… no idea how to use Powershell…


WSL2?


From experience, WSL is pretty slow vs. native on fs operations.


For sure, but for the convenience of being able to use bash commands on Windows it's well worth it.


> git bisect is the "real" way to do this, but it's not something I've ever needed

uh, yeah, you did need it, that's why you came up with "2. Track down a commit when a command started failing". Seriously though, git bisect is really useful to track down that bug in O(log n) rather than O(n).


For the given command, if the assumption is the command failed recently, it's likely faster than bisect. You can start it and go grab a coffee. It's automatic.

I wish my usage of bisect were that trivial, though. Usually I need to find a bug in a giant web app. Which means finding a good commit, doing the npm install/start dance, etc. for each round.


I've had to git bisect a react native app in a monorepo before. Took me nearly a day to track down the commit, due to the size of node_modules and the need to do a `pod install` every time.


> and the need to do a `pod install` every time.

Phrasing makes it sound like you were doing it manually, so if you didn't know, you can just drop a script in the directory, say, "check.sh":

  #!/bin/bash
  if ! pod install; then
      exit 125
  fi
  do-the-actual-test
make it executable, and:

  git bisect run ./check.sh
The special exit code 125 tells bisect "this revision can't be tested, skip it", otherwise it just uses exit code 0 for "good" and 1-127 (except for 125) for "bad".


Thanks so much for posting this :D

It did require manual user interaction, so I had to interact with the app for quite a while in between bisect checkouts; the waiting for the installs was the worse part, more than the actual tedium of running commands.


Since we're in a "bash patterns" thread anyway... the script it's running can do whatever, so let's extend it so it works in your case too, by asking you if it's good or not, so you can still do manual app interaction with "bisect run":

   #!/bin/bash

   if ! pod install; then
       exit 125
   fi

   Q=
   while [[ "$Q" != 'y' && "$Q" != 'n' ]]; do
       read -p 'Is this revision good? [y/n] ' Q
   done

   [[ "$Q" == 'y' ]]


As long as you can detect your bug from a script, you can pass the script to git bisect run and it'll run automatically.


Also provides super useful commands like skipping commits because some of your colleagues are assholes and commit non-working code.


This thread seems like a good place to ask this:

When you're running a script, what is the expected behaviour if you just run it with no arguments? I think it shouldn't make any changes to your system, and it should print out a help message with common options. Is there anything else you expect a script to do?

Do you prefer a script that has a set of default assumptions about how it's going to work? If you need to modify that, you pass in parameters.

Do you expect that a script will lay out the changes it's about to make, then ask for confirmation? Or should it just get out of your way and do what it was written to do?

I'm asking all these fairly basic questions because I'm trying to put together a list of things everyone expects from a script. Not exactly patterns per se, more conventions or standard behaviours.


> When you're running a script, what is the expected behaviour if you just run it with no arguments? I think it shouldn't make any changes to your system, and it should print out a help message with common options. Is there anything else you expect a script to do?

Most scripts I use, custom-made or not, should be clear enough in their name for what they do. If in doubt, always call with --help/-h. But for example, it doesn't make sense that something like 'update-ca-certificates' requires arguments to execute: it's clear from the name it's going to change something.

> Do you prefer a script that has a set of default assumptions about how it's going to work? If you need to modify that, you pass in parameters.

It depends. If there's a "default" way to call the script, then yes. For example, in the 'update-ca-certificates' example, just use some defaults so I don't need to read more documentation about where the certificates are stored or how to do things.

> Do you expect that a script will lay out the changes it's about to make, then ask for confirmation? Or should it just get out of your way and do what it was written to do?

I don't care too much, but give me options to switch. If it does everything without asking, give me a "--dry-run" option or something that lets me check before doing anything. On the other hand, if it's asking a lot, let me specify "--yes" as apt does so that it doesn't ask me anything in automated installs or things like that.


IMO any script that makes any real changes (either to the local system or remotely) should take some kind of input.

It's one thing if your script reads some stuff and prints output. Defaulting to the current working directory (or whatever makes sense) is fine.

If the script is reading config from a config file or envvars then it should still probably get some kind of confirmation if it's going to make any kind of change (of course with an option to auto-confirm via a flag like --yes).

For really destructive changes it should default to dry run and require an explicit --execute flag, but for less destructive changes I think a path as input on the command line is enough confirmation.

That being said, if it’s an unknown script I’d just read it. And if it’s a binary I’d pass --help.
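
A minimal sketch of that kind of guard (the flag names and the action are just examples):

    #!/bin/bash
    execute=0
    for arg in "$@"; do
        case "$arg" in
            --execute) execute=1 ;;
            -h|--help) echo "usage: $0 [--execute]"; exit 0 ;;
        esac
    done
    if [ "$execute" -eq 1 ]; then
        rm -rf ./scratch
    else
        echo "dry run: would remove ./scratch (pass --execute to do it)"
    fi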


Thanks for the reply! I really appreciate being able to pick other folks' brains on here :)


A script is just another command, the only difference in this case is that you wrote it and not someone else. If its purpose is to make changes, and it's obvious what changes it should make without any arguments, I'd say it can do so without further ado. poweroff doesn't ask me what I want to do - I already told it by executing it - and that's a pretty drastic change.

Commands that halt halfway through and expect user confirmation should definitely have an option to skip that behavior. I want to be able to use anything in a script of my own.


What you're getting at seems to be more about CLI conventions as opposed to script conventions specifically. As such, you might want to have a look at https://clig.dev/ which is a really comprehensive document describing CLI guidelines. I can't say I've read the whole thing yet, but everything I _have_ read very much made sense.

It's been discussed here on HN before.

https://news.ycombinator.com/item?id=25304257


What is your intended audience? Is it you? A batch job or another program calling it? Or a person who may or may not be able to read bash and will call it manually?

A good and descriptive name comes first; the action and the people who may have to run it come next.


I am confused how this works. I would assume `SECONDS` would just be a shell variable: it was first assigned `0`, and then it should stay the same. Why did it keep counting the seconds?

    > SECONDS
    bash: SECONDS: command not found
    > SECONDS=0; sleep 5; echo $SECONDS;
    5
    > echo "Your command completed after $SECONDS seconds";
    Your command completed after 41 seconds
    > echo "Your command completed after $SECONDS seconds";
    Your command completed after 51 seconds
    > echo "Your command completed after $SECONDS seconds";
    Your command completed after 53 seconds


`SECONDS` is like `PWD`: the shell keeps track of updating it somewhere, and it'll tell you how long your shell has been running.

https://www.oreilly.com/library/view/shell-scripting-expert/...


Thank you. That makes sense now.


I really love this style of blog post. Short, practical, no backstory, and not trying to claim one correct way to do anything. Just an unopinionated share that was useful to the author.

It seems like a throwback to a previous time, but honestly I can't remember when that was. Maybe back to a time when I hoped this was what blogging could be.


We need a simpler regex format. One that allows easy searching and replacing in source code. Of course, some IDEs already do that pretty well, but I'd like to be able to do it from the command line with a standalone tool I can easily use in scripts.

The simplest thing I know that is able to do that is coccinelle, but even coccinelle is not handy enough.


I think what we really need is one regex format. You have POSIX, PCRE, and also various degrees of needing to double-escape the slashes to get past whatever language you're using the regex in. Always adds a large element of guesswork even when you are familiar with regular expressions.



Yeah I know all these dialects, the question is, which is the one I'm supposed to use for (this tool)?


> I think what we really need is one regex format.

See https://xkcd.com/927/ How standards proliferate


It's not perfect but I like sed for this.


My git-gsr (global search replace) command:

https://gist.github.com/jaysoffian/0eda35a6a41f500ba5c458f02...

Uses perl instead of gsed, defaults to fixed strings but supports perl regexes, properly handles filenames with whitespace.


To conserve host resources RFC 2616 recommends making multiple HTTP requests over a single TCP connection ("HTTP/1.1 pipelining").

The cURL project said it never properly supported HTTP/1.1 pipelining, and in 2019 it said the feature was removed once and for all.

https://daniel.haxx.se/blog/2019/04/06/curl-says-bye-bye-to-...

Anyway, curl is not needed. One can write a small program in their language of choice to generate HTTP/1.1, but even a simple shell script will work. Even more, we get easy control over SNI, which the curl binary does not have.

There are different and more concise ways, but below is an example, using the IFS technique.

This also shows the use of sed's "P" and "D" commands (credit: Eric Pement's sed one-liners).

Assumes valid, non-malicious URLs, all with same host.

Usage: 1.sh < URLs.txt

       #!/bin/sh
       (IFS=/;while read w x y z;do
       case $w in http:|https:);;*)exit;esac;
       case $x in "");;*)exit;esac;
       echo $y > .host
       printf '%s\r\n' "GET /$z HTTP/1.1";
       printf '%s\r\n' "Host: $y";
       # add more headers here if desired;
       printf 'Connection: keep-alive\r\n\r\n';done|sed 'N;$!P;$!D;$d';
       printf 'Connection: close\r\n\r\n';
       ) >.http
       read x < .host;
       # SNI;
       #openssl s_client -connect $x:443 -ign_eof -servername $x < .http;
       # no SNI;
       openssl s_client -connect $x:443 -ign_eof -noservername < .http;
       exec rm .host .http;


Here's another way to do it without the subshell, using tr.

       #!/bin/sh
       IFS=/;while read w x y z;do
       v=$(echo x|tr x '\34');
       case $w in http:|https:);;*)exit;esac;
       case $x in "");;*)exit;esac;
       echo $y > .host
       printf '%s\r\n' "GET /$z HTTP/1.1";
       printf '%s\r\n' "Host: $y";
       printf 'Connection: keep-alive'$v$v;done \
       |sed '$s/keep-alive/close/'|tr '\34\34' '\r\n' > .http;
       read x < .host;
       # SNI;
       #openssl s_client -connect $x:443 -ign_eof -servername $x < .http;
       # no SNI;
       openssl s_client -connect $x:443 -ign_eof -noservername < .http;
       exec rm .host .http;


   case $x in "");;*)exit;esac
is better written as

   test ${#x} = 0||exit


The curl binary will reuse the TCP connection when fed multiple URLs. In fact it can even use HTTP/2 and make the requests in parallel over a single TCP connection. A common pattern I use is to construct URLs with a script and use xargs to feed them to curl.


For HTTP/1.1 pipelining (not HTTP/2, which not all websites support), the curl binary must be slower because the program tries to do more than just make HTTP from URLs and send the text over TCP. It tries to be "smart" and that slows it down. But don't take my word for it, test it.

For example, compare the retrieval speed of the above to something like

     sed -n '/^http/s/^/url=/p' URLs.txt|curl -K- --http1.1


I was responding to your original comment, which has since been edited:

> When fed mutiple URLs, the curl binary will open multiple TCP connections, consuming more resources on the host.

Which I felt was a bit of an unfair thing to say.

I have no issue with the rest :)


Although historically it was true, I agree it was an unfair statement thus I removed it. Thank you for the correction. I will be testing curl a bit more; I rarely ever use it. I could never trust cURL for doing fast HTTP/1.1 pipelining so I am admittedly skeptical. It is a moving target that is constantly changing. The author clearly has a bias toward HTTP/2, and never really focused on HTTP/1.1 pipelining support even though HTTP/1.1 has, IME, worked really well for bulk text retrieval from a single host using a wide variety of smaller simpler programs,^1 not a graphical web browser, for at least fifteen years.^2 HTTP/2 is designed for graphical web pages full of advertising and tracking; introduced as a "standard" by a trillion dollar web advertising company who also controls the majority share web browser. Thankfully, HTTP/1.1 does not have that baggage.

1. Original netcat, djb's tcpclient, socat, etc.

2. It could be used to retrieve lots of small binary files too. See phttpget.


I'm not sure if you realize it but connection reuse and pipelining are not the same thing. Curl does connection reuse with HTTP/1.1 but not pipelining. Connection reuse is what was described up-thread, using a single connection to convey many requests and responses. Pipelining is a step further where the client pushes multiple requests down the connection before receiving responses.

Pipelining is problematic with HTTP/1.1 because servers are allowed to close a connection to signal the end of a response, rather than announcing the Content-Length in the response header. The 1.1 protocol also requires that responses to pipelined requests come in the same order as the original requests. This is awkward and most servers do not bother to parallelize request processing and response writing. Even if the client sends requests back-to-back, the responses will come with delays between them.

With HTTP/1.1 curl (like most clients) will do non-pipelined connection reuse. They push one request, read one response, then push the next request, etc. The network traffic will be small bursts with gaps between them, as each request-response cycle takes at least a round-trip delay. This is still faster than using independent connections, where extra TCP connection setup happens per request, and this is even more significant for HTTPS.


I do realise it. I have little interest in curl. It is far too complicated with too many features, IMO. Anyway, libcurl used to support "attempted pipelining". See https://web.archive.org/web/20101015021914if_/http://curl.ha... But the number of people writing programs using CURLMOPT_PIPELINING always seemed small. Thus, I started using netcat instead. No need for libcurl. No need for the complexity.

HTTP/1.1 pipelining never was, nor is, "problematic" for me. Thus I cannot relate to statements that try to suggest it is "problematic", especially without providing a single example website.

I am not looking at graphical webpages. I am not trying pull resources from multiple hosts. I am retrieving text from a single host, a continuous stream of text. I do not want parallel processing. I do not want asynchronous. I want synchronous. I want responses in the same order as requests. This allows me to use simple methods for verifying responses and processing them. HTTP/2 is far more complicated. It also has bad ideas like "server push" which is absolutely not what I am looking for.

There are some sites that disable HTTP/1.1 pipelining; they will send a Connection: close header. These are not the majority. There are also some sites where there is a noticeable delay before the first reponse or between responses when using HTTP/1.1 pipelining. That is also a small minority of sites. Most have no noticeable delay. Most are "blazingly fast". "HOB blocking" is not important to me if it is so small that I cannot notice it. If HTTP/1.1 pipelining is "blazingly fast" 98% of the time for me, I am going to use it where I can, which, as it happens, is almost everywhere.

Even with the worst possible delay I have ever experienced, HTTP/1.1 pipelining is still faster than curl/wget/lftp/etc. That is in practice, not theory. People who profess expertise in www technical details today struggle to even agree on what "pipelining" means. For example, see https://en.wikipedia.org/wiki/Talk:HTTP_Pipelining. Trying to explain that socket reuse differs from HTTP/1.1 pipelining is not worth the effort and I am not the one qualified to do it. But I am qualified to state that for text retrieval from a single host, HTTP/1.1 pipelining works on most websites and is faster than curl. Is it slower than nghttp2? If yes, how much slower? We know the theory. We also know we cannot believe everything we read. Test it.


One more thing: cURL only supported pipelining GET and HEAD. And on www forums where people try to sound like experts, I have read assertions that POST requests cannot be pipelined. Makes sense in theory, but I know I tried pipelining POST requests before and, surprisingly, it worked on at least one site. Using curl's limited "attempted pipelining", one could never discover this, which is another example of how programs with dozens of features can still be very inflexible. If anyone doubts this is true, I can try to remember the site that answered pipelined POST requests.


I think the limitation on methods is related to error handling and reporting, and ambiguity as to whether the "extra" requests on the wire have been processed or not. It's the developers of the tools, who read the specifications, who find the topic "problematic." For whatever reason, there is a mildly paternalistic culture among web plumbing developers. More than in some other disciplines, they seem to worry more about offering unsafe tools, and focusing on easy to use "happy paths" rather than offering flexibility which requires very careful use.

Going back to pipelining, the textbook definition is all about concurrent use of every resource along the execution path in order to hit maximum throughput. As if the "pipe" from client application to server and back to client is always full of content/work from the start of the first input in the stream until the end of the last output. That was rarely achieved with HTTP/1.1 because of the way most of the middleware and servers were designed. Even if you could managed to pipeline your request inputs to keep the socket full from client to server, the server usually did not pipeline its processing and responses. Instead, the server alternated between bursts of work to process a request and idle wait periods while responses were sent back to the client. How much this matters in practice depends on the relative throughput and latency measures of all the various parts in your system.

I measured this myself in the past, using libcurl's partial pipelining with regular servers like Apache. I could get much faster upload speeds with pipelined PUT requests, really hitting the full bandwidth for sending back-to-back message payloads that kept the TCP path full. But, pipelined GET requests did not produce pipelined GET responses, so the download rate was always a lower throughput with measurable spikes and delays as the socket idled briefly between each response payload. For our high bandwidth, high latency environment, the actual path could be measured as having symmetric capacity for TCP/TLS. The pipelined uploads got within a few percent of that, while the non-pipelined downloads had an almost 50% loss in throughput.

If I were in your position and continued to care about streaming requests and responses from a scripting environment, I might consider writing my own client-side tool. Something like curl to bridge between scripts and network, using an HTTP/2 client library with an async programming model hooked up to CLI/stdio/file handling conventions that suit my recurring usage. However, I have found that the rest of the client side becomes just as important for performance and error handling if I am trying to process large numbers of URLs/files. So, I would probably stop thinking of it as a conventional scripting task and instead think of it more like a custom application. I might write the whole thing in Python and worry about async error handling, state tracking, and restart/recovery to identify the work items, handle retries as appropriate, and be able to confidently tell when I have finished a whole set of work...


s/HOB/HOL/


I want to like this, but the for loop is unnecessarily messy, and not correct.

   for route in foo bar baz do
   curl localhost:8080/$route
   done
That's just begging to go wonky. Should be

  stuff="foo bar baz"
  for route in $stuff; do
  echo curl localhost:8080/$route
  done
Some might say that it's not absolutely necessary to abstract the array into a variable and that's true, but it sure does make edits a lot easier. And, the original is missing a semicolon before the 'do'.

I think it's one reason I dislike lists like this: a newb might look at these and stuff them into their toolkit without realizing they don't work. It slows down learning. Plus, faulty tooling can be unnecessarily destructive.


Even more correct would be to use an array:

  stuff=("foo foo" "bar" "baz")
  for route in "${stuff[@]}"; do
    curl localhost:8080/"$route"
  done


"${stuff[@]}" would be turning it back into "foo bar baz" though, right? I think if you were using arrays for this, it'd be something like:

    stuff=("foo" "bar" "baz");
    array_length=${#stuff[@]};
    for i in $(seq 0 $array_length); do
      curl localhost:8080/${stuff[i]}
    done
I'm betting even that isn't right: as soon as bash arrays are a thing, I reach for a different language.

[edit]: trying to get formatting correct


"${stuff[@]}" expands into the elements of the array, with each one quoted - so even that first "foo foo" element with a space will be handled correctly. That is how I would write the loop as well, in general.

However, your technique also works, with some tweaks:

* The loop goes one index too far, and can be fixed with seq 0 $(( array_length - 1 ))

* There should be quotes around ${stuff[i]} in case it has spaces


Oh, that's way better for sure! Thanks for the explanation


Yes that is more correct, I describe that here:

Thirteen Incorrect Ways and Two Awkward Ways to Use Arrays http://www.oilshell.org/blog/2016/11/06.html

In Oil the syntax is simplified to

    const stuff = %("foo foo" bar baz)
    for route in @stuff {
      curl localhost:8080/$route  # no quotes needed
    }


The bash syntax for arrays is obtuse and no one will spot when an error creeps in. It's worth making an effort to stay with space-separated strings as long as you are with bash. More advanced data structures are often an indication that it's worth glancing at something like Python.

Should you need to handle spaces, it is often much easier to go with newline-separated strings and use "| while read".

This construction has the added benefit of the data not needing to fit in your environment. This can be a real issue, and is not at all obvious when it happens.


and on the off-chance that `stuff` must be a space-separated string for w/e reason:

  IFS=' ' read -ra routes <<<"${stuff}"
  for route in "${routes[@]}"; do
    curl localhost:8080/"$route"
  done


These are the kinds of things you typically run on the fly at the command line, not full-blown scripts you intend to reuse and share.

Here's a few examples from my command history:

  for ((i=0;i<49;i++)); do wget https://neocities.org/sitemap/sites-$i.xml.gz ; done

  for f in img*.png; do echo $f; convert $f -dither Riemersma -colors 12 -remap netscape: dit_$f; done
Hell if I know what they do now, they made sense when I ran them. If I need them again, I'll type them up again.


Seems like you could say the same thing about every snippet, e.g. running jobs is the original purpose of a shell; (3) can be achieved using job specifications (%n) and the `jobs` command for oversight.


Thanks for the note about the semicolon! Added it


Sure thing. I apologize if my comment came across as overly negative. My goal was constructive criticism, but based on the downvotes, I'm guessing I missed the mark there.


If I find myself running a set of commands in parallel, I'd keep a cheap Makefile around: individual commands I want to run in parallel will be written as phony targets:

  .PHONY: all cmd1 cmd2
  all: cmd1 cmd2
  
  cmd1:
      sleep 5
  
  cmd2:
      sleep 10

And then

  make -j[n]


My daily bash pattern.

  cat file
  cat file | grep something



You could just `grep "something" file`.


Indeed - it's a known anti-pattern[0].

[0] https://en.wikipedia.org/wiki/Cat_(Unix)#Useless_use_of_cat


I prefer less tbh, less pollution of the term output.


Huh, never knew about $SECONDS.


> Use for to iterate over simple lists

I definitely use this all the time. Also, generating the list of things over which to iterate using the output of a command:

    for thing in $(cat file_with_one_thing_per_line) ; do ...


Why not?

    xargs -n1 command < file_with_one_thing_per_line


that's like bash 101, not sure why it's in there


It’s simple enough to use macOS sed here instead of needing gsed: just use `-i ''` instead of `-i`.


curl localhost:8080/{foo,bar,baz}


A lot of genius moves here I have never seen or thought of. Brilliant. Grabbing PIDs was mind-blowingly effective.


This post and comments section means I no longer wonder why 99% of shell scripts I come across look inept. I'm sorry guys but seriously, please actually learn bash (and ideally not from this blog post). There are so many things wrong in the post and the comments that it's difficult to enumerate them.

To start with, if you ever feel the need to write a O(n) for loop for finding which commit broke your build, you DID need git-bisect.

Definitely DON'T wait for PIDs like that, and if you do want to write code like that, maybe actually use the arrays bash provides?

If your complaint is that for a in b c d; do ...; done is unclear, then maybe also use lists there, because what's definitely LESS clear is putting things in a variable and relying on splitting.

And most importantly, DO quote things (except when it's unnecessary).


Please don't call names or post supercilious putdowns. It's not the culture we want here.

If you know more than others, that's wonderful and it's great to share some of what you know so the rest of us can learn. But please do it without putdowns, and please do it in a way people can actually learn from. "Definitely DON'T wait for PIDs like that" doesn't actually explain anything.

https://news.ycombinator.com/newsguidelines.html


> Definitely DON'T wait for PIDs like that, and if you do want to write code like that, maybe actually use the arrays bash provides?

What pattern would you recommend for waiting for PIDs / parallelizing commands & preserving exit codes? Fair point about arrays being a better fit rather than a string there.


So in this case there's a couple notable things:

`wait` can take no parameters; this means that if you just ran a bunch of things in the background in your script and want to wait for all of them to finish, you don't need to track the PIDs or write a loop, you can just `wait`.

`wait` is a bash builtin (in this case) and as such it has no parameter limit (although I am told that actually there are some weird limits but it's very unlikely you will be able to spawn enough processes at once from a bash script to hit the limits). Given an array `pids` you can just do: `wait "${pids[@]}"`

The only problem with the two above approaches is that in the former case, wait loses the return status and in the second case wait loses all but the status of the last ID you pass it, so the third option is:

    pids=()
    do_thing_1 &
    pids+=("$!")
    do_thing_2 &
    pids+=("$!")
    for pid in "${pids[@]}"; do
        wait "$pid" || status=$?
    done
    exit "${status-0}"
Now you only have the issue left that this will report the LAST failing status.


Thanks for the explanations! Tracking whether anything fails is normally important for what I'm doing, so I quite like that array-based solution. I really like that `wait "$pid" || status=$?`; it's much nicer than the `if ! wait $pid; then status=1; fi` I have in there.


Try GNU parallel.



