Stop Piping Cats (ibm.com)
178 points by helwr on Feb 10, 2010 | 75 comments



This is my least-favorite Internet meme. People "pipe cats" because they want the entire pipeline to read left-to-right, like:

   cat file | xargs foo | grep bar | sort | wc -l
It just looks nicer than:

   < file xargs foo | grep bar ...


I would also add that not all tools follow the '-' convention, and the ones that do can break in corner cases. Why invest the effort of remembering how every tool works and which ones don't, for an obfuscating micro-optimization? /proud useless user of cat


Exactly. I tried to figure out how to make xargs read from a file instead of stdin, and was unsuccessful. My "cat ... |" works every time (at the expense of 5 milliseconds of CPU time. oh noes.)


There's a nice consistency in doing it that way--it's very easy to think of cat as turning an "inert" file into a stream of data, each program as a function that applies a transformation to a stream with a particular kind of content, and the pipe as function composition (given compatible stream content). A very comfortable programming-like perspective, at least to me. (Pop quiz: What programming idiom closely resembles this "repackage content and apply sequential transformations" approach?)


Spoiler alert: See also Oleg Kiselyov's "Monadic i/o and UNIX shell programming" (http://okmij.org/ftp/Computation/monadic-shell.html).


Seriously?

You failed at that?

I'm intrigued because it always occurred to me that you were a particularly fastidious person (with all due respect).

I often come across issues that seem for all the world to be pointless artifacts of an accepted convention. If such a convention is an acquired trait, and one learns exclusively from a limited medium (irc frustrates me), I find it very curious that these conventions persist, since they sometimes act as a difficult barrier to a thorough understanding. This xargs thing, though I know almost nothing about it, seems to parallel similar problems I've encountered, and I for one would be interested to hear your take on this (and HN in general).

Or just some food for thought for your next blog post maybe?


Well, my guess is that xargs simply does not accept a filename as the argument. This comes from reading the --help and grepping the manpage for "file".

The point was, when I used "cat file |", I knew it was going to work, and it did. When I tried to eliminate the cat by using the program's built-in ability to read a file, I had to read several manpages before determining that it was not possible. All because some tutorial says "cat is useless", when it clearly saved me much more time than the extra CPU time it used.

And if you are actually asking; xargs is just a utility to read command-line-arguments from stdin. `echo -e "foo\nbar" | xargs rm' === `rm foo; rm bar' (or depending on the xargs implementation; `rm foo bar'). It kind of reminds me of a "functor map" operation, where stdin is a functor (of command-line arguments), and command-line programs are functions. (I will now mention that xargs also does "join" on the "results" of the "function", which is very ... monad-like. But "Monads are teh awesome and everything is one" is my second-least-favorite Internet meme, so I will spare you. :)
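For instance, a quick sketch (whether the arguments get batched into one rm or split across several depends on the implementation and its limits):

  printf 'foo\nbar\n' | xargs rm        # typically runs: rm foo bar
  printf 'foo\nbar\n' | xargs -n 1 rm   # one argument per call: rm foo; rm bar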


The whole 'useless use of cat' meme is basically designed to let Randal Schwartz make fun of people. At one point it may have made a difference, and certainly if you have a shell script that is getting called 10,000 times a day it might make sense to optimise it. But for doing stuff from the command line, weird shell acrobatics are a premature optimisation.


The whole 'useless use of cat' meme is basically designed to let Randal Schwartz make fun of people.

So true.

My requirement for a shell command is that it be reasonably easy to assemble, return something resembling the correct answer, and that it run in a reasonable amount of time.

If your requirements are more strict than those, it's time to write a real program.


  I tried to figure out how to make xargs read from a file 
  instead of stdin, and was unsuccessful.
"xargs -a <filename>" First option described in the man page on Red Hat.


My home machine (Debian GNU/Linux) says "--arg-file", and it looks like a relatively new feature. My RHEL box at work definitely didn't have anything with the word "file" in it.
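So on a new enough GNU xargs, something like this should do it (untested on older boxes):

  xargs --arg-file=file foo    # GNU findutils; -a is the short form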


The whole thread descending from here is an excellent example of what I mean by corner-cases and memorization. :)


could be a GNU vs. BSD thing.


BSD xargs doesn't seem to have the -a option.


Sometimes you can use /dev/fd/0, but I agree, it's a stupid micro optimization, and I don't understand why people get so righteous about it. My co-worker would correct you every time he saw it, but then again he's a pedantic geek.
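For the record, the trick hands stdin to a tool that insists on a filename, assuming your OS exposes /dev/fd as Linux and the BSDs do. Something like this (saved-ps.txt being a hypothetical earlier snapshot):

  ps | diff /dev/fd/0 saved-ps.txt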


It's still good to keep it in mind, in case there are instances where it's being executed many times causing a bottleneck (though one-liners aren't really meant to be used in such situations).


zsh and rc provide primitives for this; this is rc:

    diff <{ps} <{sleep 3; ps}


Equivalent in bash:

    diff <(ps) <(sleep 3; ps)


Speak for yourself; I don't know anyone who puts the input redirection first. I do this:

  (xargs foo | grep bar | sort | wc -l) < file
It makes the pipeline one command, both lexically (the verb comes first) and concretely (the pipeline is kicked off in a subshell).


At the expense of the data flowing from right to left to right, from the outside in. At least with cat, data flows unambiguously from left to right.


But using cat totally fucks up the calling conventions: it puts the operand as the second of many arguments. Which one of these doesn't belong?

  verb file
  cat file | verb1 | verb2 | verb3
  (verb1 | verb2 | verb3) < file
  verb123 file
EDIT: Consider the calling conventions; which one of these handles its arguments differently? Assume that verb123 is an equivalent to the pipeline -- the subshell+stdin construction lends itself to shell aliases:

  alias verb123="(verb1 | verb2 | verb3) < "


I have no idea which one doesn't belong. Is 'verb123' the moral equivalent of '(verb1 | verb2 | verb3)'? In that case, it's the first one. But that's visible from the first word on the line, so I'm still not getting the picture.

Of course, with the usual argument vs stdin conventions, either of the middle two could also be rewritten:

    verb1 file | verb2 | verb3
This is probably how I'd write such a chain.


Spoken like a Haskell programmer, who might write:

   (wc Lines) . sort . (grep "bar") . (xargs "foo") =<< file
But note in this case that all the data "flows in the same direction".


has been a Haskell programmer since 2005, before the dawn of dons

  wc Lines $ sort $ grep "bar" $ xargs "foo" =<< file


That does not parse the way you think it does; the fixity of =<< is greater than $. You would need the shell-style parens if you insist on application instead of composition :)

    Prelude> :info $
    ($) :: (a -> b) -> a -> b 	-- Defined in GHC.Base
    infixr 0 $
    Prelude> :info =<<
    (=<<) :: (Monad m) => (a -> m b) -> m a -> m b
  	-- Defined in Control.Monad
    infixr 1 =<<
    Prelude> :info .
    (.) :: (b -> c) -> (a -> b) -> a -> c 	-- Defined in GHC.Base
    infixr 9 .
Incidentally, the parens in my example are actually unnecessary. Function application is about 10, and (.) is 9.


I almost never use <

It's too easy to make a mistake and type >

..and that can ruin your whole day.


I have fucked myself more than once by typing > instead of >>

I find that to be a far more pernicious design error -- they should have made the longer token the destructive one, or used another character in it.


zsh with 'setopt noclobber' won't overwrite an existing file with '>', instead requiring '>|'. The history recall conveniently fills in the '|' for you if you up-arrow after a failed call. It works pretty nicely - I imagine other shells have similar.
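bash does: set -o noclobber behaves the same way. A quick demonstration (the exact error text varies by version):

  $ set -o noclobber
  $ echo new > existing.txt
  bash: existing.txt: cannot overwrite existing file
  $ echo new >| existing.txt   # >| forces the overwrite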


But does the time it takes to launch the subshell equal the amount of time it takes to launch cat? If so, your improvement is a NOOP.


fork + 3x(fork+exec) is going to be cheaper than 4x(fork+exec), especially if cat isn't resident.

The point is a logical improvement anyway (not burying the input argument near the beginning). I'm kind of surprised that the bash folks haven't turned cat into a builtin like they did with time and some of the other coreutils.

It's too bad it's about 30 years too late to stem the tide of shit like cat -v: http://harmful.cat-v.org/cat-v/


By the reasoning in your link, your suggestion of bash making 'cat' one of the built-in commands is the "cancer that's bloating UNIX."


The shell has always been the kitchen sink full of glue holding the whole thing together. There have always been builtins: language control structures, job control, etc. -- there are some things you can't trust others not to fuck up, and where the coupling would just get ridiculous.

Here's cat as a pure sh builtin:

  shcat() {
    for arg in "$@"; do
      exec 3<"$arg"                  # open the file read-only on fd 3
      while IFS= read -r line <&3; do
        printf '%s\n' "$line"        # echo would mangle backslashes and option-like lines
      done
      exec 3<&-                      # close fd 3
    done
  }
The shell by its very nature can't just do exactly one task well; it's a programmable environment for living in. The cancer that's bloating UNIX was the way that the BSD and especially the GNU crews took simple tools and cross-pollinated them randomly with stupid shit. Try running "/bin/true --help" on a GNU system sometime -- there's a damn good reason why "your shell may have its own version of true".


  % /bin/true --help
  Usage: /bin/true [ignored command line arguments]
    or:  /bin/true OPTION
  Exit with a status code indicating success.

        --help     display this help and exit
        --version  output version information and exit

  NOTE: your shell may have its own version of true, which usually supersedes
  the version described here.  Please refer to your shell's documentation
  for details about the options it supports.

  Report bugs to <bug-coreutils@gnu.org>.
I wouldn't necessarily call that 'bloated' unless you feel that any program that uses one bit more than absolutely necessary should be scrapped as 'bloated beyond belief.'

> there's a damn good reason why "your shell may have its own version of true".

Because why exactly?


If all you want is to group commands, you could use {...} instead of (...) so you don't have to spawn a subshell.
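Note the fussier syntax braces require -- a space after the opening brace and a semicolon before the closing one:

  { xargs foo | grep bar | sort | wc -l; } < file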


For ad-hoc stuff, I tend to do different things depending on where my cursor is and how I built up to a complex pipeline.


You get mostly both with

    xargs foo < file | grep bar | sort | wc -l


Exactly. On a modern computer, the gain from not piping cats is almost always negligible. I don't think most shell scripts today ever get to the point where you have to worry about inner loops and such.

These days, if you hit a performance wall from spawning too many cats, I would think you switch to some scripting language where you have everything in one process. Premature optimization, people...


What's the rationale for piping sort into wc?


That looks like it was a sample line that wasn't particularly useful, but if you do:

  cat file.txt | sort | uniq | wc -l   ## notice uniq
it gives you the count of unique lines. If you omit the sort, uniq only folds adjacent duplicates, so a/b/a is three lines, not two.


Instead of "sort | uniq" you could just use "sort -u"


There's also the argument about debugging: one can insert a tee at any stage of

   cat file | xargs foo | grep bar | sort | wc -l
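For instance, to snapshot an intermediate stage (/tmp/after-foo is just an illustrative path):

   cat file | xargs foo | tee /tmp/after-foo | grep bar | sort | wc -l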


I use cat because often I wind up doing a number of different commands on the same file. It's a lot easier to edit the end of the line, especially if you're just adding another filter, than to go back and modify the beginning.

eg

     cat file | less
     cat file | grep thing
     cat file | grep otherthing
     cat file | grep otherthing | cut stuff
instead of

     less file
     grep thing file
     grep otherthing file
     grep otherthing file | cut stuff


Be aware that "cat file | less" is much more expensive than "less file" because it forces less to buffer everything it reads (it can't seek on a pipe). "cat huge-logfile | tail" is especially bad because it uselessly reads the whole file (and evicts a bunch of more important data from your buffer cache) where "tail huge-logfile" would just seek backwards from the end until it has enough text.


Sounds like you might want to learn about !$ (last token of previous command).
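For example:

  $ less file
  $ grep thing !$    # history expansion; !$ becomes 'file'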


Yeah that's interesting, but the second version is still more complicated to assemble and takes more keystrokes, assuming up arrow gives the entire previous command.


Let's count for your example. In all cases, I'll exclude the actual name of the file. We type in the first command:

     cat file | less
     less file
The second version is 5 fewer keystrokes. On to the next command:

     cat file | grep thing
     grep thing !$
The second example is the same or fewer keystrokes. In both cases, you have to type "grep thing". In the first, you have to press the up arrow and backspace over "less" (at least three keystrokes), and in the second, you have to type an extra " !$".

I'll skip moving to "cat file | grep otherthing" or "grep otherthing !$", and consider the change to get to

     cat file | grep otherthing | cut stuff
     grep otherthing file | cut stuff
In both cases, you have to type " | cut stuff". If you key in the second example as "!! | cut stuff", that's an extra two keystrokes. If you key in the first as up arrow + "| cut stuff", that's only 1 extra keystroke.

In total, my version saves keypresses in the specific example and doesn't seem much worse in general.


The most useful thing I took away from this was not 'piping cats' but instead the interesting syntax for creating multiple directories in one go:

  ~ $ mkdir -p tmp/a/b/c
  ~ $ mkdir -p project/{lib/ext,bin,src,doc/{html,info,pdf},demo/stat/a}


It should be pointed out, since the article does not, that this is merely one example of brace expansion, which can be used anywhere. It is not a mkdir feature.

    $ echo project/{lib/ext,bin,src,doc/{html,info,pdf},demo/stat/a}
    project/lib/ext project/bin project/src project/doc/html 
    project/doc/info project/doc/pdf project/demo/stat/a
(I added a linefeed to prevent wrapping.)


My favourite use is when installing packages using package managers, especially when I need the dev packages.

Old fink example:

   sudo fink install lib{png,jpeg,ssl,whatever}{,-{dev,shlibs}}
which expands to:

   sudo fink install libpng libpng-dev libpng-shlibs libjpeg libjpeg-dev libjpeg-shlibs libssl libssl-dev libssl-shlibs libwhatever libwhatever-dev libwhatever-shlibs


Brilliant.

I am ashamed to admit that I usually say "apt-get install libfoo.*", wait for the downloads to start, hit Control-c, and then cut-n-paste the package names I actually want onto the command-line.

It sounds bad when I type it out, but it's really not the most horrible thing ever. But your way is definitely like 83x better.


If you don't know the exact package names, Ubuntu happily auto-completes apt-get and aptitude on package names. So I would just type "aptitude install libfoo" and hit tab a couple of times to see what libfoos I can install.


Or `M-*` to expand all the completions and then, if necessary, delete the ones you didn't want. Beats typing the extensions.


This is golden, I didn't know that.

A perfect example of the infinite features that can be found in the bash manual page. I've been using bash since the early '90s and have written lots of non-trivial programs in it; I know a lot that many other people don't, and yet I had somehow managed to miss this pearl.


Another fun thing to do is use an empty entry, like

  mv .xinitrc{,.bak}
or, if you have some stuff mounted under another mountpoint, such as when installing gentoo:

  umount /mnt/gentoo/{proc,dev,boot,}
That empty entry expands to the base path, which can be a nice shortcut.


Huh, I'm definitely going to have to play with that!

I only drop into a shell for at most 20 minutes a day at the moment, so a lot of the really neat time-saving tricks simply don't stick in my head due to disuse...


Shells are one of those things that I find you should basically just plan on reading the man page every couple of months. Every time you do, you are virtually sure to discover something useful you'd swear you never saw before, even though you're pretty sure it's the exact same man page as before....


It works in makefiles, too. (100% sure about BSD make, pretty sure about GNU make and others.)


let's remember that the -p option is cool, but could cause pain if you mistype one of the directories. for example:

  $ cd
  $ ls tmp
  - some files listed -
  $ mkdir -p tmm/a/b/c
oops (remember your esc key)


Why would this cause you any more pain than normal? You can just remove that chain with rm -r, surely?


normally you would get an error on the first mkdir, but you could be tired, just reusing your history and tweaking the commands, and realize a little too late that you did a lot of stuff somewhere you shouldn't. I had a bad night once with this (still use the -p anyway, but...).


rmdir -p might be a bit safer, but it can get greedy and remove all the way up to root if you specify an absolute path.
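With a relative path it just unwinds the typo, since rmdir only removes empty directories:

  mkdir -p tmm/a/b/c   # the typo
  rmdir -p tmm/a/b/c   # removes c, then b, a, and finally tmm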


That 0.005 seconds I saved by not piping cat will significantly increase my productivity! No more wasting time!


Oh come on, you can't mention that without mentioning the Useless Use of Cat Award: http://partmaps.org/era/unix/award.html


Great practical advice. I still pipe cats though because it's just more intuitive to me for setting up complex pipes.


especially if you're using a test file instead of the program you might eventually use


"Great advice. I don't follow it." Mayhap it ain't so great?


You also can't follow a file with grep, but you can with tail. Doing so requires the grep option "--line-buffered", which sacrifices some performance, but not much compared to actually viewing the log data.

  tail -f access.log | grep --line-buffered "GET /blog/post "


I do this a lot, and I've never used "--line-buffered", nor apparently needed it. Google shows me a lot of people using that as a solution to a problem (not getting results immediately) I've never seen. Weird.


It's more likely if you have grep and several other operations piped together - the normal buffering leads to grep only printing every time it gets a full block of data, typically several lines.
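A sketch of the difference (stdio blocks are typically 4-8 KB, so figure dozens of log lines per flush):

  tail -f access.log | grep "GET" | awk '{print $1}'                  # arrives in blocks
  tail -f access.log | grep --line-buffered "GET" | awk '{print $1}'  # arrives per line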


I only started using it because results were not appearing in "real-time", so it may depend on your OS, or your definition of "real-time".


Why? The extra 3 letters of "cat" usually take less time to type than the mental overhead of "where do I put the input file for this command?" or the additional pipeline reasoning costs. I actually sometimes do it one way and sometimes the other, whatever comes to my fingers first (which varies according to how I'm visualizing the pipeline of tasks in my head).

In general I far prefer the (it seems to me) more Unixy way of having lots of very simple commands and chaining them together over the use of extra options and arguments.


There is some really great info on that site. I'm a sysadmin of 3 years, and seeing the performance comparison is going to force me to change my scripting habits.


Really? You're changing your scripting habits based on gaining milliseconds?


You have to define usage here. They are giving an example of a single piped command on a file of unknown size, probably not very big -- hence milliseconds. But when it comes to scripting large log processing, backups, and other forms of automation, I think it will yield better performance. Not to mention good coding practices: I am looking to improve in any form, and something like this changes the way you think, which proves very useful when programming in ruby, c, java, etc. Not necessarily the use of piping commands, but just how to be more efficient and less wasteful.




