GNU Parallel (gnu.org)
205 points by ingve on Dec 26, 2016 | 83 comments



As is always brought up when GNU Parallel is mentioned: xargs covers most of the use cases you'd need parallel for.

xargs -n1 -P4

That runs at most one argument from the list per invocation, with up to 4 jobs in parallel. http://stackoverflow.com/questions/28357997/running-programs...
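
For illustration, a hedged example of that invocation (GNU or BSD xargs assumed for -P; the gzip job and file pattern are placeholders):

    # Compress each .log file, running up to four gzip processes at a time.
    find . -name '*.log' -print0 | xargs -0 -n1 -P4 gzip --best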


I have a small cluster of machines that I run experiments on. GNU parallel makes the dispatch of jobs on remote machines very easy.

In addition, I often use it to search for sequences by running grep in parallel. For example

$ parallel 'grep {1} haystack.txt' :::: many_needles.txt

where {1} is replaced by each line of many_needles.txt.


If you find yourself searching lots of haystacks, and your needles are just text and not a regex, a better approach is to stuff all the needles into some kind of index, then chop up the haystack into overlapping tiles (of variable width from the smallest needle to the largest), then search each tile against the index of needles. This effectively searches all the needles at once and turns the operation from O(n) where n is the number of needles to O(m) where m is the number of tiles in haystack.txt.

It may seem to be a trivial difference, but then you can search multiple haystacks at once fairly easily, and this approach scales to hundreds of millions of needles at once. The code for it isn't very difficult either; heck, you can just use an in-memory SQLite DB to get a searchable, temporary index and rely on some of the most tested software in history.
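
To make that concrete, here is a minimal sketch of the idea, assuming the needles are fixed strings of 4 to 8 characters; it uses an awk associative array as the in-memory index in place of the SQLite table mentioned above (file names are the ones from the earlier example):

    awk 'NR == FNR { needle[$0] = 1; next }       # pass 1: index every needle
         { n = length($0)
           for (w = 4; w <= 8; w++)               # pass 2: tile each haystack line
             for (i = 1; i + w - 1 <= n; i++)
               if (substr($0, i, w) in needle)
                 print FNR ": " substr($0, i, w)  # report tiles that hit the index
         }' many_needles.txt haystack.txt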


This also works for sorting and is typically called Radix Sort or Bucket Sort.

Basically, you use attributes of the data to divide and conquer on those attributes (e.g. for Radix Sort you make buckets based on the digits).
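
As a rough illustration in shell, assuming one integer between 0 and 999 per line of a hypothetical numbers.txt: partition by the hundreds digit, sort each bucket in parallel, then concatenate the buckets in order.

    awk '{ print > ("bucket_" int($0 / 100)) }' numbers.txt
    parallel 'sort -n {} -o {}.sorted' ::: bucket_?
    cat bucket_?.sorted > numbers.sorted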


Unless those patterns are regexes, you should just be using

    $ fgrep -f many_needles.txt haystack.txt


How much faster is a plain text search really than a regexp without special characters? You'd think this would be quite easy to optimise for a regexp engine.

I admit I try to use -f all the time but your post suddenly made me realise I'd never actually measured the effect. :/

Edit: yes sorry I meant -F


There are a few things to clear up here. Namely, fgrep (equivalent to grep -F) is orthogonal to -f. fgrep/`grep -F` is for "fixed string" search, whereas -f is for reading multiple patterns from a file. So `fgrep -f file` means "do a simple literal (non-regexp) search for each string in `file`."
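
A quick illustration of the two flags side by side, on a hypothetical patterns file and log:

    printf 'foo.bar\nerr[0-9]\n' > patterns
    grep  -f patterns app.log    # each pattern line is a regex: '.' and '[0-9]' are special
    grep -Ff patterns app.log    # each pattern line is a plain string, matched literally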

There are two possible reasons why one might want to explicitly use fgrep/`grep -F`: 1) to avoid needing to escape a string that otherwise contains special regex characters and 2) because it permits the search tool to avoid the regexp engine entirely.

(1) is always a valid reason and is actually quite useful because escaping regexes can be a bother. But whether (2) is valid or not depends on whether your search tool is smart enough to recognize a simple literal search and automatically avoid the regex engine. Another layer to this is, of course, whether the regex engine itself is smart enough to handle this case for you automatically. Whether these optimizations are actually applied or not is difficult for a casual user to know. I don't actually know of any tool that doesn't optimize the simplest case (when no special regex features are required and it's just a simple literal search), so it seems to me that one should never use fgrep/`grep -F` for performance reasons alone.

However, if you use the `-f` flag, then you've asked the tool to do multiple string search. Perhaps in this case, the search tool doesn't try as hard to do simple literal optimizations. Indeed, I can actually see evidence in favor of this guess. The first command takes 15s and the second command takes 10s:

    LC_ALL=C grep -c -f queries /tmp/OpenSubtitles2016.raw.en
    LC_ALL=C grep -c -F -f queries /tmp/OpenSubtitles2016.raw.en
The contents of `queries`:

    $ cat queries 
    Sherlock Holmes
    John Watson
    Professor Moriarty
    Irene Adler
grep in this case is GNU grep 2.26. The size of /tmp/OpenSubtitles2016.raw.en is 9.3GB. The only difference between the commands is the presence of the -F switch in the second command. My /tmp is a ramdisk, so the file was already in memory and therefore isn't benchmarking the speed of my disk. The corpus can be downloaded here (warning, multiple GB): http://opus.lingfil.uu.se/OpenSubtitles2016/mono/OpenSubtitl...

Interestingly, performing a similar test using BSD grep shows no differences in the execution time, which suggests BSD grep isn't doing anything smart even when it knows it has only literals (and I say this because BSD grep is outrageously slow).

As a small plug, ripgrep is four times faster than GNU grep on this test and shows no difference whether you pass -F or not.

(This is only scratching the surface of literal optimizations that a search tool can do. For example, a good search tool will search for `foo` when matching the regex `\w+foo\d+` before ever entering the regex engine itself.)
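
For the curious, that trick can be approximated by hand on the corpus above: prefilter on the literal "foo" and run the full regex only on the survivors (this assumes a GNU grep built with PCRE support for -P):

    LC_ALL=C grep -F foo /tmp/OpenSubtitles2016.raw.en | grep -P '\w+foo\d+'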


This is a great comment and ripgrep deserves a more prominent plug, clickable: https://github.com/BurntSushi/ripgrep


BSD grep decides on a pattern-by-pattern basis which match engine to use. -F is unlikely to affect performance.


Oh dear, you appear to be correct. Adding additional queries to `queries` (while being careful not to increase total match count by much) appears to increase search time linearly. From that, it looks like BSD grep is just running an additional pass over each line for each query.


(Sorry, this is mildly off-topic.) Not sure if this fits your usecase, but you should check out codesearch if you haven't already: https://github.com/google/codesearch

(Russ Cox's excellent writeup is here: https://swtch.com/~rsc/regexp/regexp4.html)


I've also always gotten everything to run with just xargs and minimal scripting.

For my taste, GNU Parallel gets recommended a bit too quickly, e.g. on Stack Overflow, when the standard tools would do just fine. Your linked SO question is a prime example of that: it's an xargs-specific question, yet there's a response that dismisses xargs entirely and suggests parallel when xargs can easily do the task at hand.


I've taken to using GNU parallel over xargs at all opportunities because I find it much easier to use. I find the parallel commands much shorter and easier to understand. Add to that the fact that it's more featureful and that I find it pretty much everywhere already, so why not?


xargs -P is not POSIX. Please don't use it if you ship your scripts to end users.


Well, GNU Parallel is not POSIX either, so it boils down to which tool you want to ship instead.


Well, it's a different matter to depend on one standalone tool than to depend on a specific coreutils implementation.


I'd bet on GNU xargs being more likely to be installed than GNU Parallel.

One of the two is part of a standard package that all Linux systems, and many non-Linux systems, have installed. The other is a special-purpose tool.


>I'd bet on GNU xargs being more likely to be installed than GNU Parallel.

Your bet is irrelevant in the face of widely implemented standards. POSIX will outlive GNU coreutils. I'd bet your system also probably has a package manager that makes it easy to mark parallel as a dependency of your software.

>many non-Linux systems

Linux is pretty much the only system that typically ships GNU coreutils. The only other one that comes to mind is Hurd.


For use within scripts, I think it's a much easier matter to determine that parallel is on $PATH than to try to do feature detection on xargs. Granted, it's always possible that the parallel on $PATH might not be the one you're thinking of, but... Whatevs.
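
For what it's worth, here is a rough sketch of that feature detection, assuming a POSIX shell (the messages are placeholders):

    # Check whether the local xargs accepts -P at all before relying on it.
    if printf 'x\n' | xargs -P2 -n1 true 2>/dev/null; then
        echo "xargs supports -P"
    else
        echo "xargs lacks -P; fall back to serial execution" >&2
    fi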


> Granted, it's always possible that the parallel on $PATH might not be the one you're thinking of, but... Whatevs.

For exactly this reason, GNU Parallel has the option --minversion.

So in your script you put:

parallel --minversion 20140722 || exit

if the rest of your script depends on functionality only present from version 20140722.

If the parallel in the $PATH is not GNU Parallel it will fail (and thus exit). If it _is_ GNU Parallel it will fail if the version is < 20140722 and succeed otherwise.
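
For illustration, a minimal sketch of how that check might sit at the top of a POSIX shell script (the version number is the one above; the gzip job is just a placeholder workload):

    #!/bin/sh
    # Abort early unless a new-enough GNU Parallel is on $PATH.
    parallel --minversion 20140722 > /dev/null || {
        echo "this script needs GNU Parallel >= 20140722" >&2
        exit 1
    }
    # Placeholder workload: compress every .log file in the current directory.
    parallel 'gzip --best {}' ::: *.log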


It is supported outside GNU though, e.g. on FreeBSD.


xargs is fucked up when it comes to handling spaces in filenames and such. In fact, this is pretty much the only reason I use GNU parallel — I rarely need to actually run stuff in parallel (but it usually doesn't hurt either), but I need to do something pretty complicated (list | grep | sort | uniq | feed it to feh / whatever) over multiple arguments, without having to worry if these arguments contain spaces, quotes, unicode symbols, etc.
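
For example, a rough sketch of that kind of pipeline (the tools and patterns are placeholders; parallel does the quoting, so names with spaces or quotes come through intact):

    find . -name '*.jpg' | grep -iv thumbnail | sort | uniq | parallel -X feh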


Protip: for "strange" characters, see the -0 argument to xargs. If the file list comes from find, see the -print0 argument there.

    find -print0 | xargs -0 -n1 echo

    echo -n "file '#1'\0file '#2'\0" | xargs -0 -n1 -I{} echo "the file is \"{}\", ok?"


Oh yeah, great idea. Now insert sorting somewhere between the pipes and use at least one filename with a '"' character in it. Good luck with your xargs. parallel handles all this automatically; you don't even have to know there might be any problems with escaping here.


Thanks - I will take a closer look at parallel.


I guess this is very much a case of YMMV. Virtually all my parallel use cases involve copying files to and from remote machines and starting processes there.


This reminded me to file a bug for BusyBox/Alpine's version of xargs, which does not support parallel operations.

https://bugs.busybox.net/show_bug.cgi?id=9511


GNU Parallel - an amazing tool with the most user unfriendly brick-wall-in-your-face documentation imaginable. Shame really - it's great.


To be honest, the same can be said about many GNU tools. I myself still experience that moment of being totally lost in front of the man screen from time to time. Parallel has a nice tutorial: https://www.gnu.org/software/parallel/parallel_tutorial.html Have you seen it?


I always get the feeling that most man pages are written for people who already know how to use the commands, rather than people new to them. For example, if I already understand how tar works and have a general idea of how to use it, man tar is great to drill down and find specific options and switches that I need or don't remember, but if I have no idea what tar files are or why I would want them, the man page really doesn't help much in explaining things.

I guess you could argue this is supposed to be so and man pages are doing their job, since they are documentation and not tutorials, but still.


> I always get the feeling that most man pages are written for people who already know how to use the commands, rather than people new to them.

This is actually true. In some of the early Unix documentation or an interview (I can't remember which), one of the core Unix developers said pretty much this, i.e., that man pages were more like a collection of rough notes to help the developers of the Unix system itself not to forget how the commands should be invoked.


Info pages are better for that.


Info pages should be banished from the face of the earth.

https://xkcd.com/912/


I don't get it. The info pages usually have way more information available, and are proper software manuals with plenty of worked out examples, voluminous descriptive text, and hyperlinks to related resources. Try `info sed` and compare that with the bare breakdown of command line options and basic syntax of `man sed`.

I'm not sure what that comic has to do with anything.


Man pages are (in my experience) almost always viewed on a text console or terminal. Info pages work poorly there, and it's just frustrating to be looking for some information in a man page and have to start up some other program with a different UX to complete the task.

Man page standards have places for "See also" and "Examples" and that is good enough.


The options section is no doubt for people who know the tool. But the huge section with examples is not.

Did you read the "Reader's guide" which is presented before the first option in the man page?


There was a time when many man pages had a healthy EXAMPLES section. It seems less frequent now. And info pages are a disaster.

We have bropages for that; alas, they don't see much activity.


Can I ask you to run these and then comment?

LESS=+/EXAMPLE: man parallel

LESS=+/Reader.s man parallel

man parallel_tutorial


I'm sorry if you got the idea that the EXAMPLE section is lacking in the parallel man page.

I wasn't checking that; I was ranting about the general state of man pages. I checked ls, curl, wget, mv - none of which had EXAMPLE sections - and that's when I wrote the comment. Didn't even have parallel on my system.


GNU man pages almost universally suck. You're expected to use the info docs or their HTMLified versions.


The man page isn't too bad. It is better to just look at examples in my case and see what you can find.

I do agree it needs some work to be a little more approachable.


parallel --help is not friendly, and the man page is too comprehensive - really! - most of it belongs in /usr/share/doc

The examples are a pain to find - jump to the end (here comes the smack in the face) and see pages of differences to other tools.


The page with differences to other tools will be moved to 'man parallel_alternatives' in the next version. It made sense when the section was short, but it has grown so big that it makes the page harder to navigate.

Thanks for input.

(The trick to find the examples: LESS=+/EXAMPLE: man parallel )


It's a GNU project. It may have a texinfo manual from which the manpage is generated; if so, the info version will have hyperlinks and the like. (But you're probably better off reading the HTML version which should be, as you say, in /usr/share/doc.)


No one looks at texinfo docs.


pinfo is better than info for viewing info pages, but I agree: nobody looks at info pages. It's much easier to google documentation than to navigate to it in info.


I've been using Linux since about 1999 and have never looked at a texinfo page.

Or, at least, I don't think so, because I don't know what they are or how to access them.


At the risk of sounding obnoxious, people might consider writing better documentation - perhaps in a wiki where they can collaborate on it.


I try to convince my teams to use a wiki for the first draft and then export to whatever tool you prefer for delivery versions. Sometimes it works, sometimes it doesn't.


A lot of effort has gone into the documentation of GNU Parallel: Intro videos, a one hour tutorial, tons of examples, a man page on the design decisions behind the code, and even a "Reader's guide" in the man-page.

Can you be a bit more specific why you believe it is a "brick-wall-in-your-face"? Did you follow the "Reader's guide", which is before even the first option is introduced?


You should check out the documentation for my Rust implementation of Parallel. Same syntax applies.

https://github.com/mmstick/parallel


> Same syntax applies.

Except that is not entirely true: for instance, according to the author, the '{= perl expression =}' construct will probably never be supported.

(Full disclosure: I am the author of GNU Parallel. I fully support building other parallelizing tools, but to avoid user confusion, I would recommend calling them something other than 'parallel' if they are not actually compatible with GNU Parallel).


There is also an in-development GNU Parallel clone/alternative written in Rust. https://github.com/mmstick/parallel


Being written in Rust is not its main feature (and I think you're being down-voted because people don't like fanboys).

This project is cool because it has really low overhead, which matters if you want to parallelize tasks that are not CPU intensive (but is mostly useless if the CPU usage of each task is high).


Rust-parallel _is_ fast, and there is clearly a niche here that GNU Parallel is unlikely to fill: by design GNU Parallel will never require a compiler; this is so you can use GNU Parallel on old systems with no compilers (think an old, dusty AIX box that people have forgotten the root password to). This design decision limits how fast GNU Parallel can be compared to compiled alternatives.

But the main problem with rust-parallel is that it is not compatible with GNU Parallel (and according to the author, it probably never will be 100% compatible). If you use rust-parallel to walk through GNU Parallel's tutorial (man parallel_tutorial) you will see it fails very quickly.

(Full disclosure: I am the author of GNU Parallel. I fully support building other parallelizing tools, but to avoid user confusion, I would recommend calling them something other than 'parallel' if they are not actually compatible with GNU Parallel. History has shown that using the same name will lead to a lot of unnecessary grief: e.g. GNU Parallel vs. Parallel from moreutils).


Developer here. Noticed significant traffic coming from this domain. Not sure if this specific comment is the cause of that though. Kind of difficult to search this site.

Anyway, I did and am continuing to develop this with Rust as one of its biggest points. It's a cool project to showcase Rust to both users and other software developers. I can use it to teach others Rust by example with well documented code, and possibly gain some buzz for this great language.

That the application has low overhead is mainly a side effect of choosing to develop it in a low-level language like Rust, but the speed is indeed my biggest selling point in general, for now at least. I'm continuing to put further development into reducing that overhead as much as possible.

I can also point out that there is a benefit to using this over the Perl-based GNU Parallel for large-scale computational tasks: tasks where you are processing hundreds of thousands to hundreds of millions of inputs. This will keep memory consumption low and reduce the number of CPU cycles required to process those inputs. You could save anywhere from a few minutes to an hour.


Does anyone else remember Slashdot and the endless threads about Beowulf clusters? That was back when "parallel computing" was overly complicated and rather opaque. And most of us had no idea how to take advantage of multiple machines.


Sure do. Makes you feel that the future is now!


I use parallel for pretty much all batching these days. It's useful even when you don't need parallelisation – here's a simple transcoding example:

  parallel -j1 'ffmpeg -i {2} -c:v libx264 -tune film -preset veryslow -crf 18 -vf scale={1} -c:a libfdk_aac -vbr 5 conv/{2.}-{1}.mp4' ::: hd480 hd720 hd1080 ::: *.mp4
Obviously in the real world you would want to take some extra steps to avoid making upscaled versions, but this is just a rough example.


I use it daily for parallel build process and I love it. Easy to set up, easy to deal with. Documentation was tough though.


That's interesting. Make is parallelizable too, although I occasionally run into projects that won't build correctly if you use the feature.


prll https://github.com/exzombie/prll is a very pleasant and approachable alternative that I've used in preference to GNU Parallel.


Just be aware that Stderr will contain stuff you did not ask for: https://www.gnu.org/software/parallel/man.html#DIFFERENCES-B...

The syntax difference is small for simple tasks.


parallel is nice. In the past I would run a small script like https://gist.github.com/CMAD/3077918 because I had to migrate email accounts from a list, and it was all very old servers with broken package managers. Good old times; nowadays I would have to run some super convoluted orchestration instead.


cat 100GB_data_file_with_DUPLICATE_lines | parallel --pipe awk \'\!a[\$0]++\' > data_file_with_UNIQUE_lines

is BETTER since it uses less computer resources than

gawk '!a[$0]++' 100GB_data_file_with_DUPLICATE_lines > data_file_with_UNIQUE_lines


The parallel example will definitely not work as you expect, and will likely leave most duplicate lines in place. With --pipe used this way, unless you set --block to the size of the whole file (in which case there's no benefit to using parallel), each parallel job runs on a separate 1 MB chunk (the default --block size) of the file, dedupes only within that chunk, and then writes its own results to the output file.

If you're looking to spread work across CPUs and correctly get the desired output, I'd do something like:

    parallel -a input.txt --pipepart mawk \'\!a[\$0]++\' | mawk '!a[$0]++' > output.txt
I used mawk because it is typically much more performant on large files.


That's a classic misuse of cat.

Just use '<'.


Yeah, but to me, this "misuse" actually makes the input and output clearer than using "<" and ">" together. After all, it places input on the left side of the command, and output on the right side.


Except for GNU Parallel --pipe. When you use '<' what you typically mean is '--pipepart -a file'.


GNU Parallel rules! I made $100K with just a few lines of code involving Parallel.


Tell the story!


Very simple really, but so is UNIX :-)

parallel -j 1000 -a urls.txt 'wget --recursive {} | lynx --dump | (other unix shell power tools) > url.txt'

gets you a powerful web scraper that maxes out your internet connection.

Throw some machine learning on top of it and you can do a lot with that. I can't tell you exactly what I did with the ML, that being a trade secret about to get a patent, but that part doesn't involve parallel anyway.


OK, forget technical details; how's the money made?


better sales targeting


Am I the only person who finds GNU parallel way too complicated? I tried to perform a very easy parallel task with it and spent hours reading the documentation and various tutorials. If a person with Unix command-line skills can't easily pick it up, what's the point of having it?


There do seem to be some very complex use cases. On the other hand, their example of parallel gzip of files seems straightforward:

find . -name '*.html' | parallel gzip --best

Generally, using it in places where you would normally use xargs seems uncomplicated.
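
For comparison, here is a hedged xargs spelling of the same job (GNU find/xargs assumed for -print0/-0 and -P; nproc is from GNU coreutils):

    find . -name '*.html' -print0 | xargs -0 -P"$(nproc)" -n1 gzip --best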


Probably a stupid question, but how do the remote machines translate the pathnames to local pathnames? And what happens if they fail?


The example above is just running in parallel locally.


A good start is looking at the section 'Remote execution' in 'man parallel_tutorial'.

Spoiler: You can have GNU Parallel kick the machines if they fail, and rerun the job on another machine.
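
For example, a rough sketch of remote dispatch with retries, assuming passwordless ssh to host1/host2 and GNU Parallel installed on both ends (host names and the gzip job are placeholders; --trc transfers each input file, returns the named result, and cleans up afterwards):

    parallel -S host1,host2 --retries 3 --trc {}.gz 'gzip --best {}' ::: data/*.txt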


What was it that you tried to achieve? I find it very easy for trivial parallelisation over files but am quickly lost when it becomes more complex.


Did you start by reading the "Reader's guide" which is presented before the first option is introduced in the man page?


GNU Parallel sucks. Use xargs when possible and paexec when needing fancy features. BTW, paexec even supports piping next to process invocation.


Do you have an example of "piping next to process invocation"? It sounds interesting.



