If you find yourself searching lots of haystacks, and your needles are just text and not a regex, a better approach is to stuff all the needles into some kind of index, then chop up the haystack into overlapping tiles (of variable width, from the length of the smallest needle to the largest), then look up each tile in the index of needles. This effectively searches for all the needles at once and turns the operation from O(n*m), scanning the haystack once per needle, into O(m) index lookups, where n is the number of needles and m is the number of tiles in haystack.txt.
It may seem a trivial difference, but then you can search multiple haystacks at once fairly easily, and this approach scales to hundreds of millions of needles at once. The code for it isn't very difficult either; heck, you can just use an in-memory SQLite DB to get a searchable, temporary index and rely on some of the most tested software in history.
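A rough sketch of the idea (the file names, tile widths, and tab-separated staging file are all made up here, and it assumes neither needles nor haystack lines contain tabs):

# emit (line number, tile) pairs for every tile width between the shortest and longest needle
awk -v min=4 -v max=20 '{ for (w = min; w <= max; w++) for (i = 1; i + w - 1 <= length($0); i++) print NR "\t" substr($0, i, w) }' haystack.txt > /tmp/tiles.tsv

# load the needles into an in-memory SQLite DB, index them, and join the tiles against them
sqlite3 :memory: <<'SQL'
.mode tabs
CREATE TABLE needles(s TEXT);
CREATE TABLE tiles(lineno INTEGER, s TEXT);
.import needles.txt needles
.import /tmp/tiles.tsv tiles
CREATE INDEX needles_s ON needles(s);
SELECT DISTINCT tiles.lineno, needles.s FROM tiles JOIN needles ON needles.s = tiles.s ORDER BY tiles.lineno;
SQL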
How much faster is a plain text search really than a regexp without special characters? You'd think this would be quite easy to optimise for a regexp engine.
I admit I try to use -f all the time but your post suddenly made me realise I'd never actually measured the effect. :/
There are a few things to clear up here. Namely, fgrep (equivalent to grep -F) is orthogonal to -f. fgrep/`grep -F` is for "fixed string" search, whereas -f is for reading multiple patterns from a file. So `fgrep -f file` means "do a simple literal (non-regexp) search for each string in `file`."
There are two possible reasons why one might want to explicitly use fgrep/`grep -F`: 1) to avoid needing to escape a string that otherwise contains special regex characters and 2) because it permits the search tool to avoid the regexp engine entirely.
(1) is always a valid reason and is actually quite useful because escaping regexes can be a bother. But whether (2) is valid or not depends on whether your search tool is smart enough to recognize a simple literal search and automatically avoid the regex engine. Another layer to this is, of course, whether the regex engine itself is smart enough to handle this case for you automatically. Whether these optimizations are actually applied or not is difficult for a casual user to know. I don't actually know of any tool that doesn't optimize the simplest case (when no special regex features are required and it's just a simple literal search), so it seems to me that one should never use fgrep/`grep -F` for performance reasons alone.
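To make the distinction concrete (file names here are just for illustration):

grep -F 'foo.bar' log.txt        # literal search: only lines containing the exact string foo.bar
grep 'foo.bar' log.txt           # regex search: the dot matches any character, so fooXbar matches too
grep -F -f patterns.txt log.txt  # literal search for every string listed, one per line, in patterns.txt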
However, if you use the `-f` flag, then you've asked the tool to do a multiple-string search. Perhaps in this case, the search tool doesn't try as hard to do simple literal optimizations. Indeed, I can actually see evidence in favor of this guess. The first command takes 15s and the second command takes 10s:
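(Reconstructed from the description that follows; the timing wrapper and output handling are illustrative, not the exact original invocation.)

$ time grep -f queries /tmp/OpenSubtitles2016.raw.en > /dev/null      # ~15s
$ time grep -F -f queries /tmp/OpenSubtitles2016.raw.en > /dev/null   # ~10s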
$ cat queries
Sherlock Holmes
John Watson
Professor Moriarty
Irene Adler
grep in this case is GNU grep 2.26. The size of /tmp/OpenSubtitles2016.raw.en is 9.3GB. The only difference between the commands is the presence of the -F switch in the second command. My /tmp is a ramdisk, so the file was already in memory and therefore isn't benchmarking the speed of my disk. The corpus can be downloaded here (warning, multiple GB): http://opus.lingfil.uu.se/OpenSubtitles2016/mono/OpenSubtitl...
Interestingly, performing a similar test using BSD grep shows no differences in the execution time, which suggests BSD grep isn't doing anything smart even when it knows it has only literals (and I say this because BSD grep is outrageously slow).
As a small plug, ripgrep is four times faster than GNU grep on this test and has no difference whether you pass -F or not.
(This is only scratching the surface of literal optimizations that a search tool can do. For example, a good search tool will search for `foo` when matching the regex `\w+foo\d+` before ever entering the regex engine itself.)
Oh dear, you appear to be correct. Adding additional queries to `queries` (while being careful not to increase total match count by much) appears to increase search time linearly. From that, it looks like BSD grep is just running an additional pass over each line for each query.
(Sorry, this is mildly off-topic.) Not sure if this fits your usecase, but you should check out codesearch if you haven't already: https://github.com/google/codesearch
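Usage is roughly as follows (going from memory of its README, so treat the exact flags as approximate): build the trigram index once with cindex, then grep the indexed tree with csearch.

cindex ~/src                 # build or refresh the index for a source tree
csearch -n 'some.*regexp'    # search only the indexed files, printing file and line matches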
I've also always gotten everything to run with just xargs and minimal scripting.
For my taste GNU parallel gets recommended a bit too quickly, e.g. on Stack Overflow, when the standard tools would do just fine. Your linked SO question is a prime example of that: it's an xargs-specific question, yet there's a response that dismisses xargs entirely and suggests parallel, when xargs can easily do the task at hand.
I've taken to using GNU parallel over xargs at every opportunity because I find it much easier to use: the parallel commands are much shorter and easier to understand. Add to that the fact that it's more featureful and that I find it pretty much everywhere already, and why not?
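For example, here's the same (made-up) fan-out written with both tools; to me the second reads more naturally:

find . -name '*.log' -print0 | xargs -0 -n1 -P8 gzip
find . -name '*.log' | parallel -j8 gzip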
>I'd bet on GNU xargs being more likely to be installed than GNU Parallel.
Your bet is irrelevant in the face of widely implemented standards. POSIX will outlive GNU coreutils. I'd bet your system also probably has a package manager that makes it easy to mark parallel as a dependency of your software.
>many non-Linux systems
Linux is pretty much the only system that typically ships GNU coreutils. The only other one that comes to mind is Hurd.
For use within scripts, I think it's a much easier matter to determine that parallel is on $PATH than to try to do feature detection on xargs. Granted, it's always possible that the parallel on $PATH might not be the one you're thinking of, but... Whatevs.
> Granted, it's always possible that the parallel on $PATH might not be the one you're thinking of, but... Whatevs.
For exactly this reason, GNU Parallel has the option --minversion.
So in your script you put:
parallel --minversion 20140722 || exit
if the rest of your script depends on functionality only present from version 20140722.
If the parallel in the $PATH is not GNU Parallel it will fail (and thus exit). If it _is_ GNU Parallel it will fail if the version is < 20140722 and succeed otherwise.
xargs is fucked up when it comes to handling spaces in filenames and such. In fact, this is pretty much the only reason I use GNU parallel — I rarely need to actually run stuff in parallel (but it usually doesn't hurt either), but I need to do something pretty complicated (list | grep | sort | uniq | feed it to feh / whatever) over multiple arguments, without having to worry if these arguments contain spaces, quotes, unicode symbols, etc.
Oh yeah, great idea. Now insert sorting somewhere between the pipes and use at least one filename with a '"' character in it. Good luck with your xargs. parallel handles all this automatically; you don't even have to know there might be any problems with escaping here.
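A contrived sketch of the difference (file names made up, GNU tools assumed):

ls | grep 2016 | sort -u | parallel -X feh                 # parallel re-quotes each name, so spaces and quotes survive
find . -name '*2016*' -print0 | sort -zu | xargs -0 feh    # with xargs you have to keep NUL delimiters through every stage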
I guess this is very much a case of YMMV. Virtually all my parallel use cases involve copying files to and from remote machines and starting processes there.
To be honest, the same can be said about many GNU tools. I myself still experience that moment of being totally lost in front of the man screen from time to time. Parallel has a nice tutorial: https://www.gnu.org/software/parallel/parallel_tutorial.html Have you seen it?
I always get the feeling that most man pages are written for people who already know how to use the commands, rather than people new to them. For example, if I already understand how tar works and have a general idea of how to use it, man tar is great to drill down and find specific options and switches that I need or don't remember, but if I have no idea what tar files are or why would I want them, the man page really doesn't help much in explaining things.
I guess you could argue this is supposed to be so and man pages are doing their job, since they are documentation and not tutorials, but still.
> I always get the feeling that most man pages are written for people who already know how to use the commands, rather than people new to them.
This is actually true. In some of the early Unix documentation or an interview (I can't remember which), one of the core Unix developers said pretty much this, i.e., that man pages were more like a collection of rough notes to help the developers of the Unix system itself not to forget how the commands should be invoked.
I don't get it. The info pages usually have way more information available, and are proper software manuals with plenty of worked out examples, voluminous descriptive text, and hyperlinks to related resources. Try `info sed` and compare that with the bare breakdown of command line options and basic syntax of `man sed`.
I'm not sure what that comic has to do with anything.
Man pages are (in my experience) almost always viewed on a text console or terminal. Info pages work poorly there, and it's just frustrating to be looking for some information in a man page and have to start up some other program with a different UX to complete the task.
Man page standards have places for "See also" and "Examples" and that is good enough.
I'm sorry if you got the idea that the EXAMPLE section is lacking in the parallel man page.
I wasn't checking that; I was ranting about the general state of man pages. I checked ls, curl, wget, mv - none of which had EXAMPLE sections - and that's when I wrote the comment. I didn't even have parallel on my system.
The page with differences to other tools will be moved to 'man parallel_alternatives' in the next version. It made sense when the section was short, but it has grown so big that it makes the man page harder to navigate.
Thanks for the input.
(The trick to find the examples: LESS=+/EXAMPLE: man parallel )
It's a GNU project. It may have a texinfo manual from which the manpage is generated; if so, the info version will have hyperlinks and the like. (But you're probably better off reading the HTML version which should be, as you say, in /usr/share/doc.)
pinfo is better than info for viewing info pages, but I agree: nobody looks at info pages. It's much easier to google documentation than to navigate to it in info.
I try to convince my teams to use wiki for the first draft and then export to whatever tool you prefer for delivery versions. Sometimes it works, sometimes it doesn't.
A lot of effort has gone into the documentation of GNU Parallel: Intro videos, a one hour tutorial, tons of examples, a man page on the design decisions behind the code, and even a "Reader's guide" in the man-page.
Can you be a bit more specific why you believe it is a "brick-wall-in-your-face"? Did you follow the "Reader's guide", which is before even the first option is introduced?
Except that is not entirely true: For instance according to the author the '{= perl expression =}' will probably never be supported.
(Full disclosure: I am the author of GNU Parallel. I fully support building other parallelizing tools, but to avoid user confusion, I would recommend calling them something other than 'parallel' if they are not actually compatible with GNU Parallel).
Being written in Rust is not its main feature (and I think you're being down-voted because people don't like fanboys).
This project is cool because it has really low overhead, which matters if you want to parallelize tasks that are not very CPU-intensive (but is mostly irrelevant if the CPU usage of each task is high).
Rust-parallel _is_ fast, and there is clearly a niche here, that GNU Parallel is unlikely to fill: By design GNU Parallel will never require a compiler; this is so you can use GNU Parallel on old systems with no compilers (Think an old, dusty AIX-box that people have forgotten the root password to). This design decision limits how fast GNU Parallel can be compared to compiled alternatives.
But the main problem with rust-parallel is that it is not compatible with GNU Parallel (and according to the author, it probably never will be 100% compatible). If you use rust-parallel to walk through GNU Parallel's tutorial (man parallel_tutorial) you will see it fails very quickly.
(Full disclosure: I am the author of GNU Parallel. I fully support building other parallelizing tools, but to avoid user confusion, I would recommend calling them something other than 'parallel' if they are not actually compatible with GNU Parallel. History has shown that using the same name will lead to a lot of unnecessary grief: e.g. GNU Parallel vs. Parallel from moreutils).
Developer here. Noticed significant traffic coming from this domain. Not sure if this specific comment is the cause of that though. Kind of difficult to search this site.
Anyway, I did and am continuing to develop this with Rust as one of its biggest points. It's a cool project for showcasing Rust to both users and other software developers. I can use it to teach others Rust by example with well-documented code, and possibly gain some buzz for this great language.
That the application has low overhead is mainly a side effect of choosing to develop it in a low-level language like Rust, but the speed is indeed my biggest selling point in general, for now at least. I'm continuing to put development effort into reducing those costs as much as possible.
I can also point out that there is a benefit to using this over the Perl-based GNU Parallel for large-scale computational tasks, i.e. tasks where you are processing hundreds of thousands to hundreds of millions of inputs. This will keep memory consumption low and reduce the number of CPU cycles required to process those inputs. You could save anywhere from a few minutes to an hour.
Does anyone else remember Slashdot and the endless threads about Beowulf clusters? That was back when "parallel computing" was overly complicated and rather opaque. And most of us had no idea how to take advantage of multiple machines.
parallel is nice. In the past I would run a small script like https://gist.github.com/CMAD/3077918 because I had to migrate email accounts from a list, and it was all very old servers with broken package managers. Good old times; now I will have to run some super convoluted orchestration.
The parallel example will definitely not work as you expect, and will likely leave most of the duplicate lines in place. When you use --pipe this way, unless you set --block to the size of the whole file (in which case there's no benefit to using parallel at all), each parallel job runs on a separate 1MB chunk of the file (the default --block size) and deduplicates only within that chunk; the per-chunk results are then written out one group after another to the output file, so duplicates that span chunks survive.
If you're looking to spread work across CPUs and correctly get the desired output, I'd do something like:
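(A sketch with made-up file names and block size: dedupe each chunk in parallel, then one cheap final pass dedupes across the already-reduced chunks.)

cat big.txt | parallel --pipe --block 100M sort -u | sort -u > deduped.txt
# or skip parallel entirely and let GNU sort spread the work itself:
sort -u --parallel=$(nproc) big.txt > deduped.txt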
Yeah, but to me, this "misuse" actually makes the input and output clearer than using "<" and ">" together. After all, it places input on the left side of the command, and output on the right side.
parallel -j 1000 -a urls.txt 'wget -qO- {} | lynx -dump -stdin | (other unix shell power tools) > {#}.txt'
gets you a powerful web scraper that maxes out your internet connection.
Throw some machine learning on top of it and you can do a lot with that. Can't tell you exactly what I did with the ML, that being a trade secret that's about to get a patent, but that part doesn't involve parallel anyway.
Am I the only person who finds GNU parallel way too complicated? I tried to perform a very easy parallel task with it and spent hours reading the documentation and various tutorials. If a person with Unix command-line skills can't easily pick it up, what's the point of having it?
xargs -n1 -P4
That runs the command with at most one argument from the list per invocation, with up to 4 jobs in parallel. http://stackoverflow.com/questions/28357997/running-programs...
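Concretely, something like this (made-up task) runs gzip on one file per invocation, four at a time:

xargs -n1 -P4 gzip < files.txt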