Shell commands are great for data processing pipelines because you get parallelism for free. For proof, try a simple example in your terminal.
sleep 3 | echo "Hello world."
That doesn't really prove anything about data processing pipelines, since echo "Hello world." doesn't need to wait for any input from the other process; it can run as soon as the process is forked.
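A variant that actually depends on the pipe makes the blocking visible:

# wc -l blocks reading from the pipe until sleep exits and the write
# end closes, so this prints 0 after roughly three seconds:
sleep 3 | wc -l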
cat *.pgn | grep "Result" | sort | uniq -c
Does this have any advantage over the more straightforward version below?
grep -h "Result" *.pgn | sort | uniq -c
Either the cat process or the grep process is going to be waiting for disk I/Os to complete before any of the later processes have data to work on, so splitting it into two processes doesn't seem to buy you any additional concurrency. You would, however, be spending extra time in the kernel to execute the read() and write() system calls to do the interprocess communication on the pipe between cat and grep.
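If you want to measure that overhead on your own data, a rough sketch (run each pipeline twice so the page cache is warm for both):

time sh -c 'cat *.pgn | grep "Result" | sort | uniq -c > /dev/null'
time sh -c 'grep -h "Result" *.pgn | sort | uniq -c > /dev/null'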
Also, the parallelism of a data processing pipeline is going to be constrained by the speed of the slowest process in it: all the processes after it are going to be idle while waiting for the slow process to produce output, and all the processes before it are going to be idle once the slow process has filled its pipe's input buffers. So if one of the processes in the pipeline takes 100 times as long as the other three, Amdahl's Law[1] suggests that you won't get a big win from breaking it up into multiple processes.
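One way to watch this happen, if you have pv (a pipe throughput meter) installed, is to put a meter on each side of a deliberately slow stage; here gzip -9 stands in for any CPU-bound filter:

# Once the pipe buffer ahead of gzip fills, the "in" meter's rate
# drops to whatever the slow stage can consume:
cat *.pgn | pv -cN in | gzip -9 | pv -cN out > /dev/null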
"grep <pattern> <files>" is not the same as "cat <files> | grep <pattern>", in that the former will prefix lines with filenames if there is more than one input file. What you want instead is "grep -h <pattern> <files>".
The advantage of using cat, therefore, is just the few seconds saved by not reading the manual.
> The -F for grep indicates that we are only matching on fixed strings and not doing any fancy regex, and can offer a small speedup, which I did not notice in my testing.
I guess grep is probably clever enough to choose a faster matching algorithm once it's parsed the pattern and discovered it doesn't contain any regex fun.
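That's easy enough to test; with a hypothetical big.pgn, if the guess above is right the two timings should come out close:

time grep -c "Result" big.pgn
time grep -cF "Result" big.pgn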
You'd think grep would generally be smart enough to do that, but that hasn't been my experience. Just last week I was searching through a couple hundred gigs of small XML files, and I found that:
$ LC_ALL=C fgrep -r STRING .
was much faster than plain grep. This was on a CentOS 5 box, so maybe newer versions of grep are smarter.
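To reproduce the comparison (fgrep is the traditional spelling of grep -F; timings will vary with grep version and locale):

time grep -r STRING .
time LC_ALL=C fgrep -r STRING .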
But then again, if I were on a newer box I'd just install and use ack or ag.
LC_ALL=C makes grep faster because text matching is normally locale-sensitive: in a multibyte encoding like Big5, 'S' ('\x53') is not a substring of '兄' ('\xA5\x53'), so grep has to decode characters instead of just comparing bytes.
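A quick sketch of the difference, assuming the zh_TW.Big5 locale is installed and a printf that understands \x escapes (bash's builtin does):

printf '\xa5\x53\n' > big5.txt       # the raw Big5 bytes for '兄'
LC_ALL=zh_TW.Big5 grep S big5.txt    # no match: \x53 is the tail byte of a multibyte character
LC_ALL=C grep S big5.txt             # match: plain byte comparison finds \x53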
[1] https://en.wikipedia.org/wiki/Amdahl%27s_law
Edit: As someone pointed out, my example needed "grep -h". Fixed.