Shell commands are great for data processing pipelines because you get parallelism for free. For proof, try a simple example in your terminal.
sleep 3 | echo "Hello world."
That doesn't really prove anything about data processing pipelines, since echo "Hello world." doesn't need to wait for any input from the other process; it can run as soon as the process is forked.
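A variant that actually depends on the pipe makes the blocking visible:

# wc -l blocks reading from the pipe until sleep exits and the write
# end closes, so this prints 0 after roughly three seconds:
sleep 3 | wc -l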
cat *.pgn | grep "Result" | sort | uniq -c
Does this have any advantage over the more straightforward version below?
grep -h "Result" *.pgn | sort | uniq -c
Either the cat process or the grep process is going to be waiting for disk I/Os to complete before any of the later processes have data to work on, so splitting it into two processes doesn't seem to buy you any additional concurrency. You would, however, be spending extra time in the kernel to execute the read() and write() system calls to do the interprocess communication on the pipe between cat and grep.
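If you want to measure that overhead on your own data, a rough sketch (run each pipeline twice so the page cache is warm for both):

time sh -c 'cat *.pgn | grep "Result" | sort | uniq -c > /dev/null'
time sh -c 'grep -h "Result" *.pgn | sort | uniq -c > /dev/null'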
Also, the parallelism of a data processing pipeline is going to be constrained by the speed of the slowest process in it: all the processes after it are going to be idle while waiting for the slow process to produce output, and all the processes before it are going to be idle once the slow process has filled its pipe's input buffers. So if one of the processes in the pipeline takes 100 times as long as the other three, Amdahl's Law[1] suggests that you won't get a big win from breaking it up into multiple processes.
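One way to watch this happen, if you have pv (a pipe throughput meter) installed, is to put a meter on each side of a deliberately slow stage; here gzip -9 stands in for any CPU-bound filter:

# Once the pipe buffer ahead of gzip fills, the "in" meter's rate
# drops to whatever the slow stage can consume:
cat *.pgn | pv -cN in | gzip -9 | pv -cN out > /dev/null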
"grep <pattern> <files>" is not the same as "cat <files> | grep <pattern>", in that the former will prefix lines with filenames if there is more than one input file. What you want instead is "grep -h <pattern> <files>".
The advantage of using cat, therefore, is just the few seconds saved by not reading the manual.
> The -F for grep indicates that we are only matching on fixed strings and not doing any fancy regex, and can offer a small speedup, which I did not notice in my testing.
I guess grep is probably clever enough to choose a faster matching algorithm once it's parsed the pattern and discovered it doesn't contain any regex fun.
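That's easy enough to test; with a hypothetical big.pgn, if the guess above is right the two timings should come out close:

time grep -c "Result" big.pgn
time grep -cF "Result" big.pgn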
You'd think grep would generally be smart enough to do that, but that hasn't been my experience. Just last week I was searching through a couple hundred gigs of small XML files, and I found that:
$ LC_ALL=C fgrep -r STRING .
was much faster than plain grep. This was on a CentOS 5 box, so maybe newer versions of grep are smarter.
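To reproduce the comparison (fgrep is the traditional spelling of grep -F; timings will vary with grep version and locale):

time grep -r STRING .
time LC_ALL=C fgrep -r STRING .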
But then again, if I were on a newer box I'd just install and use ack or ag.
LC_ALL=C makes grep faster because text matching is normally locale-sensitive: in a multibyte encoding like Big5, 'S' ('\x53') is not a substring of '兄' ('\xA5\x53'), so grep has to decode characters instead of just comparing bytes.
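A quick sketch of the difference, assuming the zh_TW.Big5 locale is installed and a printf that understands \x escapes (bash's builtin does):

printf '\xa5\x53\n' > big5.txt       # the raw Big5 bytes for '兄'
LC_ALL=zh_TW.Big5 grep S big5.txt    # no match: \x53 is the tail byte of a multibyte character
LC_ALL=C grep S big5.txt             # match: plain byte comparison finds \x53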
[1] https://en.wikipedia.org/wiki/Amdahl%27s_law
Edit: As someone pointed out, my example needed "grep -h". Fixed.