For GNU Parallel, if you pass it -k it will buffer each subcommand's output and print it in input order, as if the jobs had all run serially, removing the need for those subcommands to write their output atomically -- a guarantee that usually can't be made.
This is useful if you are solving embarrassingly parallel problems that generate lots of log data (e.g. timings for benchmarks) and write it to stdout/stderr.
AFAIK writes of up to PIPE_BUF bytes to a pipe are guaranteed to be atomic, and POSIX requires PIPE_BUF to be at least 512 -- but then your subprograms shouldn't write more than that much log data in a single write, I would guess.
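For reference, the actual limit is easy to check. POSIX requires writes of up to PIPE_BUF bytes to a pipe to be atomic, with a guaranteed floor of 512; on Linux PIPE_BUF is typically 4096:

```shell
# Print the PIPE_BUF limit for pipes on the root filesystem.
# POSIX guarantees this is at least 512 (_POSIX_PIPE_BUF);
# Linux typically reports 4096.
getconf PIPE_BUF /
```

Any single write() no larger than this value will not be interleaved with writes from other processes on the same pipe.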
I love gnu parallel, but that citation thing is a bit of a drag on parallel. You only have to do this once, but I see why you'd write a blog post to include the flag and not have to discuss the issue.
I wish Ole would relax on this. It's not really that appropriate to demand that anyone using parallel for academics cite a magazine article. It's only appropriate for people doing research in parallel algorithms, and in any case it would make citing easier if there was a journal article to reference.
There's nothing wrong with asking nicely, and I try to spread the word about gnu parallel. But the heavy handed demand is annoying, even if I only deal with it like once a year.
Oh, I've seen the --tollef option in the help doc, but I still didn't realize there was a complete alternative to gnu parallel. Definitely will have to try.
Mr. Hess makes a lot of good points, most I think I agree with, so I can sympathize with his frustration.
The point about transferring files seems interesting. I haven't used gnu parallel in remote/ssh mode aside from simple tests, but that seems wicked useful for some situations. Is there a real and simple alternative to running gnu parallel on multiple remote hosts? Is that something the moreutils version can also do?
The pypi page for pssh links to an old (dead?) Google Code repository. Do you know if the project is still being maintained?
For most uses I find that ansible ad-hoc commands give a nice balance -- just a tiny bit of learning curve, and only a little bit different than typing the single SSH command that you want to run.
I use Linux but I've never used parallels. At first, I thought you were making that up.
But, I looked at your username and realized I have seen you post before - and I couldn't imagine you being dishonest about this. You just don't seem like the type to make this sort of stuff up.
So, curious, I headed to Google.
Sure enough, and for those who don't know, the author wants you to cite parallels when you use it in conjunction with something that is going to be published. Now, to be fair, I've always included citation of the tools involved when I published. However, nobody has ever asked me to directly and nobody has nagged me to do so.
It's a bit taboo, I guess. You can pay money and not have to cite it. You can also feed it the --will-cite argument and it won't bug you. Still, it is probably in poor taste. If nothing else, it is unconventional.
I didn't find a whole lot of complaints about it. The only hyperbolic outrage was, interestingly enough, someone on HN who was quite uptight about it.
I guess Ole is within his rights to do that, but it is certainly unconventional. I wonder if anyone takes his inflated number of citations as a negative against him?
From what I've now read, there don't seem to be a whole lot of complaints. Even the mailing list thread was short and didn't contain any drama. The only drama I found was the post on HN.
I have discussed this before, I really hope the hyperbolic outrage wasn't me though! ;) I like parallel, and this issue is relatively minor. But I admit the language of the nag bothers me just a smidge.
> You can pay money and not have to cite it.
You can also use it without paying. The nag is neither a license nor contract, and the proposed money isn't a required fee. The language of the notice makes it sound like you have to pay, and makes it sound like part of the license.
> I guess Ole is within his rights
I agree completely. And I'm in favor of him getting what he wants too! He's right that PR is helpful for his cause, and I appreciate his cause. I just wish he'd ask without the nag, and use softer language.
Using parallel isn't a reason to cite it, any more than using LaTeX or emacs would be a reason to cite LaTeX or emacs. And if gnu parallel counts as previous research, it's very unlikely he has to ask.
> I didn't find a whole lot of complaints about it.
There are some, and I'd agree it's not a huge issue, by and large. The main discussions I've seen have been with package maintainers. It is a small issue, and the few times I have brought it up, I've had a ton of confirmation that it bothers other people similarly: a little bit.
No, it wasn't you. I didn't mention their username as I don't want to make it look like I'm calling someone out in a thread and I don't want to appear to be harassing other site members.
They haven't posted in a couple of years. Instead of calling them out by name, I'll just go ahead and link to the thread:
At first, I was thinking it couldn't be true - but then I noticed your username. As I mentioned, I've seen you post before and I didn't think you'd be dishonest. So, I figured I'd go investigate.
It is definitely unconventional. Another poster mentioned it violated GNU's guidelines, but I doubt that is true. The GNU folks are, shall we say, really big on remaining ethically consistent. If it violated their guidelines, I'm really certain they'd remove it from their site.
I am not sure I like the precedent that it sets, though I am not seeing it copied by other projects.
Like you, I get why they might do it but, now that I know about it, I don't like the idea of it. I'm not irate, or even perturbed really. It absolutely wouldn't stop me from using the software if I needed it.
It does make me curious as to when people stopped listing their tools. I have academic publications with my name on them, so I know you should cite your tools so that others can reproduce your work.
That is the whole reason you cite your tools: reproducibility. It absolutely shouldn't be about academic fame for the creator of the tools. It shouldn't be about the toolmaker at all. You cite tools, and versions of said tools, so that others can reproduce your work and, if they can't, they can see if it isn't reproducible due to part of the tool chain being different.
In fact, that's one of the great reasons for preferring permissively licensed software - so that you have permission to share the exact version of the tools you used to do your research. Citing the toolset just to inflate the author's citation count is less than optimal.
I realize this veers off-topic but I wonder where things went wrong? I finished my dissertation in the early 1990s and haven't published anything since. I took my newly minted doctorate and headed for the private sector. Perhaps someone who remained in academia knows?
I am really curious as to when this changed and why it changed? Begging for citations, for something that should give very little credit, shouldn't be a thing. Worse, the situation should never be that someone feels pressured to do so. How did it even get to this point? What did I miss?
I have to assume that gnu parallel is currently meeting gnu's guidelines here, perhaps by allowing users to mute the nag. I do wonder if this particular guideline was written specifically for parallel.
> It does make me curious as to when people stopped listing their tools.
That's an interesting question! Certainly some people still do list their tools.
My story isn't entirely different than yours, but I've continued to publish occasionally since my thesis, as well as edit and review papers for several journals.
For me, I think it comes down to methodology. If a program used represents previous research, then it should be cited. If a program used does factor into the methodology of the paper, then citing it for reproducibility is a great idea. If either of those are true with parallel, I will absolutely cite gnu parallel.
But parallel is generally used only to speed things up, and does not affect the methodology at all. I guess it could be nice for reproducibility, in case there are bugs. But if parallel is only used incidentally like that, and the publication isn't about parallelism, and parallelism doesn't affect the output, it's not something you should cite. The journals I've submitted to would normally ask you to remove noise like that if you included it. (And as an editor, I've asked people to remove non-academic references, or at the very least footnote them instead.)
I realize it's not a widespread problem, but take the idea to a logical extreme -- do I cite all my tools? Should I cite my version of Linux, and include that I used zsh instead of bash? I process my output using sed, Perl, awk, Python, numpy, and I did my user experiments using Chrome 55 with JavaScript and Angular. The list of tools I use is very long. As a paper reviewer or editor, I'd be annoyed if I had to wade through that. And the number of tools that affect the methodology is small; those are the ones I care about.
An author should simply include their entire source code as a single citation, rather than individually cite any tools. That satisfies reproducibility without adding any unnecessary noise to the appendix, or treating gnu parallel as a part of the research when it's used only incidentally.
Also, Ole's really asking for PR more than he's asking for academic citations. There are lots of other ways to give him PR and help him. I feel like the emphasis on academics in his citation nag is slightly separate from the overall goal.
If you look there, he says, "... please cite as per ..." So, it isn't required as a part of the license or condition of use. It's just begging.
I suspect that's how it's not in violation of the GNU terms and GPLv3, but I'm not an expert.
And you should cite what version of Perl you used, for example. You should also ensure the source for Perl is available for future researchers. That's why open source is so valuable in academia.
Obviously there is a reasonable limit. If it potentially had an impact, cite it. The key word is reasonable.
I don't know enough about parallel to comment about the viability of it impacting the output. I still find it alarming that they feel compelled to beg for citations.
Also, yeah, when I cited software, it went into the acknowledgments section. This being a different era, I included my email address (the Internet was not world wide back then, so to speak) so that people could contact me and I could mail them a copy of software that I wrote, both compiled and the source.
I'd cite any software that was reasonable to consider as relevant. Where possible, I'd cite a scientific article. A couple of times, the software wasn't necessarily all that important for the science, but I'd found it so useful that I'd cite it - though that was more to draw attention to it.
I do now wonder if it is a generational thing. Namely, when I was still in academia, there wasn't as much software as there is now. The use of computers was still fairly new. Citing our software tools was a bit more unique and citing COTS software was probably even more rare.
That may have something to do with it. While I still read a lot of papers, I'm completely removed from academia. I suspect I missed something along the way. It has been nearly 30 years - that's eons in the world of computers.
I would totally recommend giving it a try anyway, I don't mean to turn anyone away. Gnu parallel is really nice. I use it a ton for things like batch image resizing. I wish the citation request would take another form instead, but it really is minor.
You would then not be allowed to call it GNU Parallel due to possible trademark confusion. This is also why we have names like CentOS (not RedHat Free) and IceCat (not Firefox Free).
There was a court case in Germany about this (for some CMS tool - IIRC) where a forker used a similar name. The verdict was pretty clear: Forking was OK (copyright law - permission by GPL), but keeping the name was not (trademark law - no permission by GPL).
Good Q! I don't know, but I also feel like that could be a bit dirty, without having other reasons to fork. I might not want to encourage or support that.
There are some (non-fork) projects that provide the same functionality, with the stated motivation being in part because of the citation thing.
Nothing homebrew can do about it, of course, but this illustrates one of the implications of parallel doing something unexpected; package managers have to field the complaints.
I’ll look again later today. I tried to find one I saw earlier quickly when I replied above, but I didn’t see it. I think it was also called “parallel” and it mentioned the citation being a factor.
If the author wants to extract ten seconds of your time once a year, that's just the price of admission. They're not uptight; they're trying to advance their career.
This is a very important problem. We devs love to make tools, and we love to do it for free. But you can't eat idealism. Parallel is a nice tool, but it's probably not as impressive as it seems when it comes to landing a job.
You're free to decry the author and to point out alternatives. It's practically tradition. But I identify with where the author is coming from. How would you feel if you spent a bunch of time on a tool and the top comment blasts it for reasons unrelated to its merits?
> How would you feel if you spent a bunch of time on a tool and the top comment blasts it for reasons unrelated to its merits?
I love parallel, and I spread the word and participate in the PR that Ole is asking for.
> If the author wants to extract ten seconds of your time once a year, that's just the price of admission.
The ten seconds is not the problem. The problems are:
- The citation nag notice is confusing people on their legal responsibilities. It has prevented people from using it due to the fear of liability it causes. (RE: "If you pay 10000 EUR you should feel free to use GNU Parallel without citing." ... "If you use '--will-cite' in scripts you are expected to pay the 10000 EUR, because you are making it harder to see the citation notice.")
- Ole's citation is not from a scientific or peer-reviewed publication. It simply cannot be used in some contexts. In many contexts, a citation of Parallel is highly inappropriate.
- The small amount of time it takes to either cite Ole or read and understand this thread is largely irrelevant to the question of whether a citation is either warranted or appropriate or allowed (by the actual GNU license, by a user's employer, and/or by the publication they're submitting to). Understanding the license is a one-time event, and it's very important for the license to be legally clear. Parallel's citation notice is causing license confusion. The question is whether Parallel can and should be used without having to worry (forever) about the legal consequences of the contract I've agreed to by using the software.
- This approach isn't scalable. If other GNU tools, or other free software started using the same language as Ole, it would cause widespread problems. Mr. Hess is correct, this is very antithetical to Unix.
"Tut-tutting" is accurate. Your criticisms are mostly mistaken: He isn't demanding a citation. Let's look at what he's actually doing.
The best idea I have come up with so far is printing a citation notice
on STDERR if output is to a terminal when GNU Parallel starts. The
notice will not be printed if STDERR is redirected (to a pipe or a
file), it will also not be printed if --no-notice is given and it can
be disabled completely by running --bibtex once:
"""
When using GNU Parallel to process data for publication please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.
This helps funding further development.
To get rid of this notice run 'parallel --bibtex' once or use '--no-notice'.
"""
That's just about the most straightforward nag screen I've ever seen. It includes both the rationale and how to disable it permanently. Run --bibtex once!
--
> The citation nag notice is confusing people on their legal responsibilities. It has prevented people from using it due to the fear of liability it causes.
Whoever decided not to use parallel due to this was being silly. There are no legal concerns.
> Ole's citation is not from a scientific or peer-reviewed publication. It simply cannot be used in some contexts. In many contexts, a citation of Parallel is highly inappropriate.
He is asking for a citation where possible, to help him out.
> This approach isn't scalable. If other GNU tools, or other free software started using the same language as Ole, it would cause widespread problems.
But other tools don't do that. You can accept this one case to help this one author.
> Mr. Hess is correct, this is very antithetical to Unix.
Y'know what's more antithetical to Unix? When the authors can't build tools because it makes no sense to spend their limited time working on projects that further label them as "not a webdev, so why would I hire them?"
There are still plenty of jobs for non-webdevs, but millions of new programmers have popped up in recent years. Those non-webdev jobs are getting sparser and more competitive.
This argument is full of holes, but the overall point is that Unix as an ideology has been steadily losing ground for the last decade. It's not financially lucrative to be a Unix ideologue. You could argue that that's just the price of adhering to the philosophy. But when a project is actively shunned merely for making a polite citation request, what are onlookers supposed to think?
This is just about the least evil type of hustling that the author could do, yet it's still being treated as some kind of offense. How dare he ask you to run --bibtex.
I disagree with the article that you linked. Filenames are variable names that you get to choose. Half of the game is won by choosing them wisely, so that they are convenient to use.
The POSIX definition of a "line" is "A sequence of zero or more non-<newline> characters plus a terminating <newline> character"[0]. I don't think it's counterintuitive at all for POSIX utilities to respect that definition.
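A quick illustration of how utilities apply that definition, using coreutils wc:

```shell
# A terminating newline is part of the POSIX definition of a line,
# so text without one doesn't count as a complete line.
printf 'one line\n' | wc -l          # counts 1
printf 'no trailing newline' | wc -l # counts 0
```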
I realize it's somewhat off topic, but I feel like Joyent Manta deserves an honorable mention. It's an S3 style object store, but you can spin up containers on top of objects and do massively parallel computations with Unix tools.
Edit: here's a quick sample (file is read from tmpfs on a 4-core i7-5500U):
    $ time lbzip2 < linux-4.13.3.tar > /dev/null
    real    0m17.410s
    user    1m7.799s
    sys     0m0.283s

    $ time pbzip2 < linux-4.13.3.tar > /dev/null
    real    0m30.556s
    user    1m57.557s
    sys     0m2.169s
I have seen situations where the compressed output of pbzip2 was not readable by some .NET library a customer was using (I'm afraid I can't remember which one). Fortunately an alternative multithreaded bzip2 implementation, http://lbzip2.org/, did not suffer from this problem, so it's worth keeping in mind just in case.
By compressing multiple (large) files sequentially you'd be able to gradually free some disk space much sooner, and less free space will be needed to write compressed data to.
With large files / sets of files, 7zip can use all the files as one dictionary for the compression (across-file compression, called "solid compression").
However, zip does not support solid compression, which creates the oddity that zipping a zip can reduce your file size: multiple duplicate files compress to the same bytes, but they are stored separately, and the second pass then notices the similarity.
The downside of solid compression is that extracting one file means decompressing the block that contains it. But with modern computers that's not as bad as it used to be, and modern 7zip doesn't decompress all files, only the blocks the requested files are in.
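The effect is easy to demonstrate with plain gzip standing in for a solid compressor (a sketch; exact sizes will vary). Compressing two identical files independently costs the full size twice, while compressing their concatenation as one stream lets the second copy back-reference the first:

```shell
# Two identical, incompressible 20 KB files
head -c 20000 /dev/urandom > a.bin
cp a.bin b.bin

# "zip-style": each file compressed independently, duplicates cost full price
gzip -c a.bin > a.gz
gzip -c b.bin > b.gz
per_file=$(( $(wc -c < a.gz) + $(wc -c < b.gz) ))

# "solid-style": one stream over the concatenation; the duplicate falls
# inside deflate's 32 KB window and compresses to almost nothing
solid=$(cat a.bin b.bin | gzip -c | wc -c)

echo "per-file: $per_file bytes, solid: $solid bytes"
```

The solid result comes out roughly half the size of the per-file result here.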
True! This could be important if you're running low. I've been curious - does doing something like parallel bzip2 cause more fragmentation than a serial approach, or are the file systems and drives pretty good at dealing with heavy parallel write?
The performance is apparently close to xargs (if speed is key, why not just use xargs?). Rust parallel does, however, have some issues that would rule it out for me:
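For what it's worth, plain GNU xargs already parallelizes with -P, which is why the performance comparison is natural (a minimal sketch):

```shell
# -P 4 runs up to four commands concurrently; -n 1 passes one argument each.
# Output order across jobs is not guaranteed -- which is exactly the gap
# that parallel's -k option fills.
seq 8 | xargs -P 4 -n 1 echo
```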
And make -j will execute the rules & build targets in parallel. Once I built a general batch parallel system using make that could pause & resume parallel jobs by using each job's log file as the make target. Then I discovered gnu parallel and scrapped my project.
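A minimal sketch of that pattern (hypothetical item names): each job's log file is its make target, so `make -j` provides the parallelism, and re-running after an interruption skips jobs whose logs already exist:

```shell
# Generate a makefile where each item's log file is the target of a
# pattern rule; the recipe sits after ';' so no literal tab is needed.
cat > jobs.mk <<'EOF'
ITEMS := a b c d
LOGS  := $(ITEMS:%=%.log)
all: $(LOGS)
%.log: ; echo "processing $*" > $@
EOF

# Run up to 4 jobs in parallel; already-built logs are left alone on rerun.
make -f jobs.mk -j 4
```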
Ha! Crazy, I didn't know that, thanks for the pointer. I recall having some issues with scaling to very large jobs, when I got lists of targets too long for make to deal with. (I think... I could be mis-remembering the details, but something in my pipeline would fail with really big batches.) I wonder if parallel split away from make when it ran into similar issues.
Dependencies in batch resource managers for distributed parallel jobs (or simpler ones) are standard practice. https://arc.liv.ac.uk/SGE/htmlman/htmlman1/qmake.html is an old adaptation of GNU Make which works within a job.
Gah, story of my life - hours/days of grueling work trying to solve a problem, then someone says 'why don't you just use $tool_that_does_exactly_what_you_want_but_better?'. Code goes in the bin.
Very good point. I added a whole section to the article, implementing the counting lines example with `make -j`,
which performs just as well as `xargs -P`
The simplest use case for xargs is that you have a list of filenames and you want to feed that list to a command that only accepts filenames as CLI parameters:
I.e., 'find' generates a list of filenames.
But 'stat' (for example) only accepts file names as parameters on the command line.
You combine these two using 'xargs':
find -type f -print0 | xargs -0 stat --format="%y\t%n"
And you get an output of human readable modification times, a tab, the filename and a newline.
Note, the "-print0" to find instructs find to output null terminated filenames, and the -0 to xargs instructs xargs to expect to receive null terminated filenames. These switches (they may be specific to GNU findutils) allow for handling filenames with any possible character in them correctly (i.e., spaces cause no troubles).
For this use case, remembering how/when to use xargs is easy. The more esoteric usages are the ones where "man xargs" is often handy.
You and me, both of us. I once combined rpm -e with xargs rpm -ql to the tune of getting all packages uninstalled from the system in one swift command.
It was a test system so all good. But I'm still not sure what I was thinking.
The mentioned tool 'turbo-linecount' is, well, odd. It targets what I'd think is a very niche application of counting lines very fast (a domain usually limited by I/O speed), using a rather complex design... and then manages to throw much of the gained advantage away by using what is perhaps one of the slowest ways of counting newlines in a buffer.
Need to be aware of the buffer size on the pipe. It can be one of those issues that never comes up until you just cross the threshold, then everything fails ungracefully.
;calculate primes in a range
(define (primes from to)
(local (plist)
(for (i from to)
(if (= 1 (length (factor i)))
(push i plist -1)))
plist))
(set 'start (time-of-day))
; start child processes
(spawn 'p1 (primes 1 1000000))
(spawn 'p2 (primes 1000001 2000000))
(spawn 'p3 (primes 2000001 3000000))
(spawn 'p4 (primes 3000001 4000000))
; wait for a maximum of 60 seconds for all tasks to finish
; returns true if all finished in time
(sync 60000)
; p1, p2, p3 and p4 now each contain a list of primes
(println "time spawn: " (- (time-of-day) start))
(println "time simple: " (time (primes 1 4000000)))
(exit)
My biggest gripe with UNIX is that a command like:
(sleep 5 && echo hello) > test.txt
Causes a race condition where test.txt is created empty, then 5 seconds later written with "hello". Try it without parentheses to see it wait. UNIX doesn't use our intuitive notion of order of operations on its piping, it just runs everything simultaneously. This allowed for tremendous efficiency and concurrency but it's hard to fathom how much this has cost us in bugs and lost development time.
I feel like this was a lost opportunity because it prevented the Actor model (as seen in Erlang and Go) from really taking off decades ago. Perhaps this bug/feature was one of the motivations for commands like "parallel".
Does anyone have a general workaround for this problem? Some command that we could insert in the chain to force a wait, without having to install any external tools? Thanx!
Edit: I'm having a hard time explaining how insidious this race condition is for people who haven't encountered it yet. The gist of it is that file descriptors aren't opened when a pipe sends its first byte, they are opened when the shell command is interpreted. I'm also having a hard time finding examples, here is one I think, though there are many, many others: https://unix.stackexchange.com/questions/174788/am-i-hitting...
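One workaround that needs no external tools, if the goal is for test.txt to never be observed empty or half-written: redirect to a temporary name and rename when done. rename is atomic within a filesystem, so the file only appears once it's complete. (Hypothetical filenames; sleep 1 stands in for the slow command.)

```shell
# Readers never see an empty/partial test.txt: the output accumulates
# in test.txt.tmp, and mv atomically renames it into place at the end.
(sleep 1 && echo hello) > test.txt.tmp && mv test.txt.tmp test.txt
```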
"our intuitive notion of order of operations on its piping" -- I don't know what your intuitive notion is, and there is no piping in your example. Piping would be:
sleep 5 | echo hello (this is an odd example, but it works)
I also don't see any race condition in your example. A race would be two processes accessing the same data.
They are redirects, and the way redirects like > and < work is to:
1. open a file
2. change the global process state
3. run arbitrary commands, either builtin (echo) or external (ls). External commands require a new process and inherit their process state (descriptor table, where stdout and stdin are connected to) from the parent shell.
4. restore the global process state
5. close the file
If you understand that, it might be clear why the file test.txt is opened before sleep. There's really no other way it can work, given the fact that you can execute arbitrary commands inside the redirect.
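You can watch the open-before-run behavior directly (timings are illustrative; the background job and short sleeps just make the window observable):

```shell
# The shell opens (and truncates) test.txt before sleep even starts.
(sleep 1 && echo hello) > test.txt &
sleep 0.2
wc -c < test.txt   # 0 bytes: the file already exists, still empty
wait
wc -c < test.txt   # 6 bytes: "hello\n" was written after the sleep
rm test.txt
```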
I address some related things about redirects here: "Avoid Directly Manipulating File Descriptors in Shell"
The more unintuitive thing about shell is that what you wrote is technically better as:
{ sleep 5 && echo hello; } > test.txt
() doesn't mean "grouping" like in other languages, it means subshell. {} means grouping, and it has an odd way of parsing, where ; is required before }, but it isn't required before ). This is because () are operators and {} aren't.
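Side by side (both redirect the group's combined output, but the subshell form forks, so variable assignments inside it don't survive):

```shell
{ x=1; echo group; }  > g.txt  # grouping: current shell; ';' before '}' required
( x=2; echo subshell ) > s.txt # subshell: forked copy; no ';' needed before ')'
echo "$x"                      # prints 1: the subshell's x=2 was lost on exit
rm g.txt s.txt
```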
I wrote about this here, "the subshell construct () is confused with the grouping construct { }":
EDIT: I will say that I was very confused by the syntax of > and < when I started learning shell. It does seem like ">" means "this goes into that". In fact I think I internalized that from MS-DOS.
But that doesn't explain what these mean, and why they are equivalent:
tac >out.txt <in.txt
tac <in.txt >out.txt
They are best read as prefix operators that modify the state of the process invocation.
What's even more confusing is that you can do:
tac 1>out.txt 0<in.txt
tac 0<in.txt 1>out.txt
Then they look like binary operators again. But the descriptor is "stuck on", it's actually part of the operator lexically.
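That reading also explains why redirections can sit anywhere in a simple command, even before the command name (hypothetical filenames):

```shell
printf 'a\nb\nc\n' > in.txt
>out.txt tac <in.txt   # same command as: tac <in.txt >out.txt
cat out.txt            # lines come back reversed: c, b, a
rm in.txt out.txt
```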
It's a capable tool, if somewhat... complicated.