For GNU Parallel, if you pass it -k it will buffer each subcommand's output and print it in input order, as if the jobs had all run serially, removing the need for those subcommands to write their output atomically -- a guarantee that usually can't be made.
This is useful if you are solving embarrassingly parallel problems that generate lots of log data (e.g. timings for benchmarks) and write it to stdout/stderr.
AFAIK writes of up to PIPE_BUF bytes to a pipe are guaranteed to be atomic, and POSIX requires PIPE_BUF to be at least 512 -- but then your subprograms shouldn't write more than that much log data in a single write, I would guess.
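For reference, the actual limit is easy to check. POSIX requires writes of up to PIPE_BUF bytes to a pipe to be atomic, with a guaranteed floor of 512; on Linux PIPE_BUF is typically 4096:

```shell
# Print the PIPE_BUF limit for pipes on the root filesystem.
# POSIX guarantees this is at least 512 (_POSIX_PIPE_BUF);
# Linux typically reports 4096.
getconf PIPE_BUF /
```

Any single write() no larger than this value will not be interleaved with writes from other processes on the same pipe.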
I love gnu parallel, but that citation thing is a bit of a drag on parallel. You only have to do this once, but I see why you'd write a blog post to include the flag and not have to discuss the issue.
I wish Ole would relax on this. It's not really that appropriate to demand that anyone using parallel for academics cite a magazine article. It's only appropriate for people doing research in parallel algorithms, and in any case it would make citing easier if there was a journal article to reference.
There's nothing wrong with asking nicely, and I try to spread the word about gnu parallel. But the heavy handed demand is annoying, even if I only deal with it like once a year.
Oh, I've seen the --tollef option in the help doc, but I still didn't realize there was a complete alternative to gnu parallel. Definitely will have to try.
Mr. Hess makes a lot of good points, most I think I agree with, so I can sympathize with his frustration.
The point about transferring files seems interesting. I haven't used gnu parallel in remote/ssh mode aside from simple tests, but that seems wicked useful for some situations. Is there a real and simple alternative to running gnu parallel on multiple remote hosts? Is that something the moreutils version can also do?
The pypi page for pssh links to an old (dead?) Google Code repository. Do you know if the project is still being maintained?
For most uses I find that ansible ad-hoc commands give a nice balance -- just a tiny bit of learning curve, and only a little bit different than typing the single SSH command that you want to run.
I use Linux but I've never used parallels. At first, I thought you were making that up.
But, I looked at your username and realized I have seen you post before - and I couldn't imagine you being dishonest about this. You just don't seem like the type to make this sort of stuff up.
So, curious, I headed to Google.
Sure enough, and for those who don't know, the author wants you to cite parallels when you use it in conjunction with something that is going to be published. Now, to be fair, I've always included citation of the tools involved when I published. However, nobody has ever asked me to directly and nobody has nagged me to do so.
It's a bit taboo, I guess. You can pay money and not have to cite it. You can also feed it the --will-cite argument and it won't bug you. Still, it is probably in poor taste. If nothing else, it is unconventional.
I didn't find a whole lot of complaints about it. The only hyperbolic outrage was, interestingly enough, someone on HN who was quite uptight about it.
I guess Ole is within his rights to do that, but it is certainly unconventional. I wonder if anyone takes his inflated number of citations as a negative against him?
From what I've now read, there don't seem to be a whole lot of complaints. Even the mailing list thread was short and didn't contain any drama. The only drama I found was the post on HN.
I have discussed this before, I really hope the hyperbolic outrage wasn't me though! ;) I like parallel, and this issue is relatively minor. But I admit the language of the nag bothers me just a smidge.
> You can pay money and not have to cite it.
You can also use it without paying. The nag is neither a license nor contract, and the proposed money isn't a required fee. The language of the notice makes it sound like you have to pay, and makes it sound like part of the license.
> I guess Ole is within his rights
I agree completely. And I'm in favor of him getting what he wants too! He's right that PR is helpful for his cause, and I appreciate his cause. I just wish he'd ask without the nag, and use softer language.
Using parallel isn't a reason to cite it, any more than using LaTeX or emacs would be a reason to cite LaTeX or emacs. And if gnu parallel counts as previous research, it's very unlikely he has to ask.
> I didn't find a whole lot of complaints about it.
There are some, and I'd agree it's not a huge issue, by and large. The main discussions I've seen have been with package maintainers. It is a small issue, and the few times I have brought it up, I've had a ton of confirmation that it bothers other people similarly: a little bit.
No, it wasn't you. I didn't mention their username as I don't want to make it look like I'm calling someone out in a thread and I don't want to appear to be harassing other site members.
They haven't posted in a couple of years. Instead of calling them out by name, I'll just go ahead and link to the thread:
At first, I was thinking it couldn't be true - but then I noticed your username. As I mentioned, I've seen you post before and I didn't think you'd be dishonest. So, I figured I'd go investigate.
It is definitely unconventional. Another poster mentioned it violated GNU's guidelines, but I doubt that is true. The GNU folks are, shall we say, really big on remaining ethically consistent. If it violated their guidelines, I'm really certain they'd remove it from their site.
I am not sure I like the precedent that it sets, though I am not seeing it copied by other projects.
Like you, I get why they might do it but, now that I know about it, I don't like the idea of it. I'm not irate, or even perturbed really. It absolutely wouldn't stop me from using the software if I needed it.
It does make me curious as to when people stopped listing their tools. I have academic publications with my name on them, so I know you should cite your tools so that others can reproduce your work.
That is the whole reason you cite your tools: reproducibility. It absolutely shouldn't be about academic fame for the creator of the tools. It shouldn't be about the toolmaker at all. You cite tools, and versions of said tools, so that others can reproduce your work and, if they can't, they can see if it isn't reproducible due to part of the tool chain being different.
In fact, that's one of the great reasons for preferring permissively licensed software - so that you have permission to share the exact version of the tools you used to do your research. Citing the toolset just to inflate the author's citation count is less than optimal.
I realize this veers off-topic but I wonder where things went wrong? I finished my dissertation in the early 1990s and haven't published anything since. I took my newly minted doctorate and headed for the private sector. Perhaps someone who remained in academia knows?
I am really curious as to when this changed and why it changed? Begging for citations, for something that should give very little credit, shouldn't be a thing. Worse, the situation should never be that someone feels pressured to do so. How did it even get to this point? What did I miss?
I have to assume that gnu parallel is currently meeting gnu's guidelines here, perhaps by allowing users to mute the nag. I do wonder if this particular guideline was written specifically for parallel.
> It does make me curious as to when people stopped listing their tools.
That's an interesting question! Certainly some people still do list their tools.
My story isn't entirely different than yours, but I've continued to publish occasionally since my thesis, as well as edit and review papers for several journals.
For me, I think it comes down to methodology. If a program used represents previous research, then it should be cited. If a program used does factor into the methodology of the paper, then citing it for reproducibility is a great idea. If either of those are true with parallel, I will absolutely cite gnu parallel.
But parallel is generally used only to speed things up, and does not affect the methodology at all. I guess it could be nice for reproducibility, in case there are bugs. But if parallel is only used incidentally like that, and the publication isn't about parallelism, and parallelism doesn't affect the output, it's not something you should cite. The journals I've submitted to would normally ask you to remove noise like that if you included it. (And as an editor, I've asked people to remove non-academic references, or at the very least footnote them instead.)
I realize it's not a widespread problem, but take the idea to a logical extreme -- do I cite all my tools? Should I cite my version of Linux, and include that I used zsh instead of bash? I process my output using sed, Perl, awk, Python, numpy, and I did my user experiments using Chrome 55 with JavaScript and Angular. The list of tools I use is very long. As a paper reviewer or editor, I'd be annoyed if I had to wade through that. And the number of tools that affect the methodology is small; those are the ones I care about.
An author should simply include their entire source code as a single citation, rather than individually cite any tools. That satisfies reproducibility without adding any unnecessary noise to the appendix, or treating gnu parallel as a part of the research when it's used only incidentally.
Also, Ole's really asking for PR more than he's asking for academic citations. There are lots of other ways to give him PR and help him. I feel like the emphasis on academics in his citation nag is slightly separate from the overall goal.
If you look there, he says, "... please cite as per ..." So, it isn't required as a part of the license or condition of use. It's just begging.
I suspect that's how it's not in violation of the GNU terms and GPLv3, but I'm not an expert.
And you should cite what version of Perl you used, for example. You should also ensure the source for Perl is available for future researchers. That's why open source is so valuable in academia.
Obviously there is a reasonable limit. If it potentially had an impact, cite it. The key word is reasonable.
I don't know enough about parallel to comment about the viability of it impacting the output. I still find it alarming that they feel compelled to beg for citations.
Also, yeah, when I cited software, it went into the acknowledgments section. This being a different era, I included my email address (the Internet was not world wide back then, so to speak) so that people could contact me and I could mail them a copy of software that I wrote, both compiled and the source.
I'd cite any software that was reasonable to consider as relevant. Where possible, I'd cite a scientific article. A couple of times, the software wasn't necessarily all that important for the science, but I'd found it so useful that I'd cite it - though that was more to draw attention to it.
I do now wonder if it is a generational thing. Namely, when I was still in academia, there wasn't as much software as there is now. The use of computers was still fairly new. Citing our software tools was a bit more unique and citing COTS software was probably even more rare.
That may have something to do with it. While I still read a lot of papers, I'm completely removed from academia. I suspect I missed something along the way. It has been nearly 30 years - that's eons in the world of computers.
I would totally recommend giving it a try anyway, I don't mean to turn anyone away. Gnu parallel is really nice. I use it a ton for things like batch image resizing. I wish the citation request would take another form instead, but it really is minor.
You would then not be allowed to call it GNU Parallel due to possible trademark confusion. This is also why we have names like CentOS (not RedHat Free) and IceCat (not Firefox Free).
There was a court case in Germany about this (for some CMS tool - IIRC) where a forker used a similar name. The verdict was pretty clear: Forking was OK (copyright law - permission by GPL), but keeping the name was not (trademark law - no permission by GPL).
Good Q! I don't know, but I also feel like that could be a bit dirty, without having other reasons to fork. I might not want to encourage or support that.
There are some (non-fork) projects that provide the same functionality, with the stated motivation being in part because of the citation thing.
Nothing homebrew can do about it, of course, but this illustrates one of the implications of parallel doing something unexpected; package managers have to field the complaints.
I’ll look again later today. I tried to find one I saw earlier quickly when I replied above, but I didn’t see it. I think it was also called “parallel” and it mentioned the citation being a factor.
If the author wants to extract ten seconds of your time once a year, that's just the price of admission. They're not uptight; they're trying to advance their career.
This is a very important problem. We devs love to make tools, and we love to do it for free. But you can't eat idealism. Parallel is a nice tool, but it's probably not as impressive as it seems when it comes to landing a job.
You're free to decry the author and to point out alternatives. It's practically tradition. But I identify with where the author is coming from. How would you feel if you spent a bunch of time on a tool and the top comment blasts it for reasons unrelated to its merits?
> How would you feel if you spent a bunch of time on a tool and the top comment blasts it for reasons unrelated to its merits?
I love parallel, and I spread the word and participate in the PR that Ole is asking for.
> If the author wants to extract ten seconds of your time once a year, that's just the price of admission.
The ten seconds is not the problem. The problems are:
- The citation nag notice is confusing people on their legal responsibilities. It has prevented people from using it due to the fear of liability it causes. (RE: "If you pay 10000 EUR you should feel free to use GNU Parallel without citing." ... "If you use '--will-cite' in scripts you are expected to pay the 10000 EUR, because you are making it harder to see the citation notice.")
- Ole's citation is not from a scientific or peer-reviewed publication. It simply cannot be used in some contexts. In many contexts, a citation of Parallel is highly inappropriate.
- The small amount of time it takes to either cite Ole or read and understand this thread is largely irrelevant to the question of whether a citation is either warranted or appropriate or allowed (by the actual GNU license, by a user's employer, and/or by the publication they're submitting to). Understanding the license is a one-time event, and it's very important for the license to be legally clear. Parallel's citation notice is causing license confusion. The question is whether Parallel can and should be used without having to worry (forever) about the legal consequences of the contract I've agreed to by using the software.
- This approach isn't scalable. If other GNU tools, or other free software started using the same language as Ole, it would cause widespread problems. Mr. Hess is correct, this is very antithetical to Unix.
"Tut-tutting" is accurate. Your criticisms are mostly mistaken: He isn't demanding a citation. Let's look at what he's actually doing.
The best idea I have come up with so far is printing a citation notice
on STDERR if output is to a terminal when GNU Parallel starts. The
notice will not be printed if STDERR is redirected (to a pipe or a
file), it will also not be printed if --no-notice is given and it can
be disabled completely by running --bibtex once:
"""
When using GNU Parallel to process data for publication please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.
This helps funding further development.
To get rid of this notice run 'parallel --bibtex' once or use '--no-notice'.
"""
That's just about the most straightforward nag screen I've ever seen. It includes both the rationale and how to disable it permanently. Run --bibtex once!
--
> The citation nag notice is confusing people on their legal responsibilities. It has prevented people from using it due to the fear of liability it causes.
Whoever decided not to use parallel due to this was being silly. There are no legal concerns.
> Ole's citation is not from a scientific or peer-reviewed publication. It simply cannot be used in some contexts. In many contexts, a citation of Parallel is highly inappropriate.
He is asking for a citation where possible, to help him out.
> This approach isn't scalable. If other GNU tools, or other free software started using the same language as Ole, it would cause widespread problems.
But other tools don't do that. You can accept this one case to help this one author.
> Mr. Hess is correct, this is very antithetical to Unix.
Y'know what's more antithetical to Unix? When the authors can't build tools because it makes no sense to spend their limited time working on projects that further label them as "not a webdev, so why would I hire them?"
There are still plenty of jobs for non-webdevs, but millions of new programmers have popped up in recent years. Those non-webdev jobs are getting sparser and more competitive.
This argument is full of holes, but the overall point is that Unix as an ideology has been steadily losing ground for the last decade. It's not financially lucrative to be a Unix ideologue. You could argue that that's just the price of adhering to the philosophy. But when a project is actively shunned merely for making a polite citation request, what are onlookers supposed to think?
This is just about the least evil type of hustling that the author could do, yet it's still being treated as some kind of offense. How dare he ask you to run --bibtex.
I disagree with the article that you linked. Filenames are variable names that you get to choose. Half of the game is won by choosing them wisely, so that they are convenient to use.
The POSIX definition of a "line" is "A sequence of zero or more non-<newline> characters plus a terminating <newline> character"[0]. I don't think it's counterintuitive at all for POSIX utilities to respect that definition.
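A quick illustration of how utilities apply that definition, using coreutils wc:

```shell
# A terminating newline is part of the POSIX definition of a line,
# so text without one doesn't count as a complete line.
printf 'one line\n' | wc -l          # counts 1
printf 'no trailing newline' | wc -l # counts 0
```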
I realize it's somewhat off topic, but I feel like Joyent Manta deserves an honorable mention. It's an S3 style object store, but you can spin up containers on top of objects and do massively parallel computations with Unix tools.
Edit: here's a quick sample (file is read from tmpfs on a 4-core i7-5500U):
    $ time lbzip2 < linux-4.13.3.tar > /dev/null
    real    0m17.410s
    user    1m7.799s
    sys     0m0.283s

    $ time pbzip2 < linux-4.13.3.tar > /dev/null
    real    0m30.556s
    user    1m57.557s
    sys     0m2.169s
I have seen situations where the compressed output of pbzip2 was not readable by some .NET library a customer was using (I'm afraid I can't remember which one). Fortunately an alternative multithreaded bzip2 implementation, http://lbzip2.org/, did not suffer from this problem, so it's worth keeping in mind just in case.
By compressing multiple (large) files sequentially you'd be able to gradually free some disk space much sooner, and less free space will be needed to write compressed data to.
With large files / sets of files, 7zip can use all the files as one dictionary for the compression (across-file compression, called "solid compression").
However, zip does not support solid compression, which creates the oddity that zipping a zip can reduce your file size: multiple duplicate files compress to the same bytes, but they are stored separately, and the second pass then notices the similarity.
The downside of solid compression is that extracting one file means decompressing the block that contains it. But with modern computers that's not as bad as it used to be, and modern 7zip doesn't decompress all files, only the blocks the requested files are in.
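The effect is easy to demonstrate with plain gzip standing in for a solid compressor (a sketch; exact sizes will vary). Compressing two identical files independently costs the full size twice, while compressing their concatenation as one stream lets the second copy back-reference the first:

```shell
# Two identical, incompressible 20 KB files
head -c 20000 /dev/urandom > a.bin
cp a.bin b.bin

# "zip-style": each file compressed independently, duplicates cost full price
gzip -c a.bin > a.gz
gzip -c b.bin > b.gz
per_file=$(( $(wc -c < a.gz) + $(wc -c < b.gz) ))

# "solid-style": one stream over the concatenation; the duplicate falls
# inside deflate's 32 KB window and compresses to almost nothing
solid=$(cat a.bin b.bin | gzip -c | wc -c)

echo "per-file: $per_file bytes, solid: $solid bytes"
```

The solid result comes out roughly half the size of the per-file result here.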
True! This could be important if you're running low. I've been curious - does doing something like parallel bzip2 cause more fragmentation than a serial approach, or are the file systems and drives pretty good at dealing with heavy parallel write?
The performance is apparently close to xargs (if speed is key, why not just use xargs?). Rust parallel does, however, have some issues that would rule it out for me:
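For what it's worth, plain GNU xargs already parallelizes with -P, which is why the performance comparison is natural (a minimal sketch):

```shell
# -P 4 runs up to four commands concurrently; -n 1 passes one argument each.
# Output order across jobs is not guaranteed -- which is exactly the gap
# that parallel's -k option fills.
seq 8 | xargs -P 4 -n 1 echo
```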
And make -j will execute the rules & build targets in parallel. Once I built a general batch parallel system using make that could pause & resume parallel jobs by using each job's log file as the make target. Then I discovered gnu parallel and scrapped my project.
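A minimal sketch of that pattern (hypothetical item names): each job's log file is its make target, so `make -j` provides the parallelism, and re-running after an interruption skips jobs whose logs already exist:

```shell
# Generate a makefile where each item's log file is the target of a
# pattern rule; the recipe sits after ';' so no literal tab is needed.
cat > jobs.mk <<'EOF'
ITEMS := a b c d
LOGS  := $(ITEMS:%=%.log)
all: $(LOGS)
%.log: ; echo "processing $*" > $@
EOF

# Run up to 4 jobs in parallel; already-built logs are left alone on rerun.
make -f jobs.mk -j 4
```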
Ha! Crazy, I didn't know that, thanks for the pointer. I recall having some issues with scaling to very large jobs, when I got lists of targets too long for make to deal with. (I think... I could be mis-remembering the details, but something in my pipeline would fail with really big batches.) I wonder if parallel split away from make when it ran into similar issues.
Dependencies in batch resource managers for distributed parallel jobs (or simpler ones) are standard practice. https://arc.liv.ac.uk/SGE/htmlman/htmlman1/qmake.html is an old adaptation of GNU Make which works within a job.
Gah, story of my life - hours/days of grueling work trying to solve a problem, then someone says 'why don't you just use $tool_that_does_exactly_what_you_want_but_better?'. Code goes in the bin.
Very good point. I added a whole section to the article, implementing the counting lines example with `make -j`,
which performs just as well as `xargs -P`
The simplest use case for xargs is that you have a list of filenames and you want to feed that list to a command that only accepts filenames as CLI parameters:
I.e., 'find' generates a list of filenames.
But 'stat' (for example) only accepts file names as parameters on the command line.
You combine these two using 'xargs':
find -type f -print0 | xargs -0 stat --format="%y\t%n"
And you get an output of human readable modification times, a tab, the filename and a newline.
Note, the "-print0" to find instructs find to output null terminated filenames, and the -0 to xargs instructs xargs to expect to receive null terminated filenames. These switches (they may be specific to GNU findutils) allow for handling filenames with any possible character in them correctly (i.e., spaces cause no troubles).
For this use case, remembering how/when to use xargs is easy. The more esoteric usages are the ones where "man xargs" is often handy.
You and me, both of us. I once combined rpm -e with xargs rpm -ql to the tune of getting all packages uninstalled from the system in one swift command.
It was a test system so all good. But I'm still not sure what I was thinking.
The mentioned tool 'turbo-linecount' is, well, odd. It targets what I'd think is a very niche application of counting lines very fast (a domain usually limited by I/O speed), using a rather complex design... and then manages to throw much of the gained advantage away by using what is perhaps one of the slowest ways of counting newlines in a buffer.
Need to be aware of the buffer size on the pipe. It can be one of those issues that never comes up until you just cross the threshold, then everything fails ungracefully.
;calculate primes in a range
(define (primes from to)
(local (plist)
(for (i from to)
(if (= 1 (length (factor i)))
(push i plist -1)))
plist))
(set 'start (time-of-day))
; start child processes
(spawn 'p1 (primes 1 1000000))
(spawn 'p2 (primes 1000001 2000000))
(spawn 'p3 (primes 2000001 3000000))
(spawn 'p4 (primes 3000001 4000000))
; wait for a maximum of 60 seconds for all tasks to finish
; returns true if all finished in time
(sync 60000)
; p1, p2, p3 and p4 now each contain a list of primes
(println "time spawn: " (- (time-of-day) start))
(println "time simple: " (time (primes 1 4000000)))
(exit)
My biggest gripe with UNIX is that a command like:
(sleep 5 && echo hello) > test.txt
Causes a race condition where test.txt is created empty, then 5 seconds later written with "hello". Try it without parentheses to see it wait. UNIX doesn't use our intuitive notion of order of operations on its piping, it just runs everything simultaneously. This allowed for tremendous efficiency and concurrency but it's hard to fathom how much this has cost us in bugs and lost development time.
I feel like this was a lost opportunity because it prevented the Actor model (as seen in Erlang and Go) from really taking off decades ago. Perhaps this bug/feature was one of the motivations for commands like "parallel".
Does anyone have a general workaround for this problem? Some command that we could insert in the chain to force a wait, without having to install any external tools? Thanx!
Edit: I'm having a hard time explaining how insidious this race condition is for people who haven't encountered it yet. The gist of it is that file descriptors aren't opened when a pipe sends its first byte, they are opened when the shell command is interpreted. I'm also having a hard time finding examples, here is one I think, though there are many, many others: https://unix.stackexchange.com/questions/174788/am-i-hitting...
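One workaround that needs no external tools, if the goal is for test.txt to never be observed empty or half-written: redirect to a temporary name and rename when done. rename is atomic within a filesystem, so the file only appears once it's complete. (Hypothetical filenames; sleep 1 stands in for the slow command.)

```shell
# Readers never see an empty/partial test.txt: the output accumulates
# in test.txt.tmp, and mv atomically renames it into place at the end.
(sleep 1 && echo hello) > test.txt.tmp && mv test.txt.tmp test.txt
```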
"our intuitive notion of order of operations on its piping" -- I don't know what your intuitive notion is, and there is no piping in your example. Piping would be:
sleep 5 | echo hello (this is an odd example, but it works)
I also don't see any race condition in your example. A race would be two processes accessing the same data.
They are redirects, and the way redirects like > and < work is to:
1. open a file
2. change the global process state
3. run arbitrary commands, either builtin (echo) or external (ls). External commands require a new process and inherit their process state (descriptor table, where stdout and stdin are connected to) from the parent shell.
4. restore the global process state
5. close the file
If you understand that, it might be clear why the file test.txt is opened before sleep. There's really no other way it can work, given the fact that you can execute arbitrary commands inside the redirect.
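You can watch the open-before-run behavior directly (timings are illustrative; the background job and short sleeps just make the window observable):

```shell
# The shell opens (and truncates) test.txt before sleep even starts.
(sleep 1 && echo hello) > test.txt &
sleep 0.2
wc -c < test.txt   # 0 bytes: the file already exists, still empty
wait
wc -c < test.txt   # 6 bytes: "hello\n" was written after the sleep
rm test.txt
```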
I address some related things about redirects here: "Avoid Directly Manipulating File Descriptors in Shell"
The more unintuitive thing about shell is that what you wrote is technically better as:
{ sleep 5 && echo hello; } > test.txt
() doesn't mean "grouping" like in other languages, it means subshell. {} means grouping, and it has an odd way of parsing, where ; is required before }, but it isn't required before ). This is because () are operators and {} aren't.
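Side by side (both redirect the group's combined output, but the subshell form forks, so variable assignments inside it don't survive):

```shell
{ x=1; echo group; }  > g.txt  # grouping: current shell; ';' before '}' required
( x=2; echo subshell ) > s.txt # subshell: forked copy; no ';' needed before ')'
echo "$x"                      # prints 1: the subshell's x=2 was lost on exit
rm g.txt s.txt
```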
I wrote about this here, "the subshell construct () is confused with the grouping construct { }":
EDIT: I will say that I was very confused by the syntax of > and < when I started learning shell. It does seem like ">" means "this goes into that". In fact I think I internalized that from MS-DOS.
But that doesn't explain what these mean, and why they are equivalent:
tac >out.txt <in.txt
tac <in.txt >out.txt
They are best read as prefix operators that modify the state of the process invocation.
What's even more confusing is that you can do:
tac 1>out.txt 0<in.txt
tac 0<in.txt 1>out.txt
Then they look like binary operators again. But the descriptor is "stuck on", it's actually part of the operator lexically.
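That reading also explains why redirections can sit anywhere in a simple command, even before the command name (hypothetical filenames):

```shell
printf 'a\nb\nc\n' > in.txt
>out.txt tac <in.txt   # same command as: tac <in.txt >out.txt
cat out.txt            # lines come back reversed: c, b, a
rm in.txt out.txt
```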
It's a capable tool, if somewhat... complicated.