Hacker News new | past | comments | ask | show | jobs | submit login

There are a few things to clear up here. Namely, fgrep (equivalent to grep -F) is orthogonal to -f. fgrep/`grep -F` is for "fixed string" search, where as -f is for reading multiple patterns from a file. So `fgrep -f file` means "do a simple literal (non-regexp) search for each string in `file`."

There are two possible reasons why one might want to explicitly use fgrep/`grep -F`: 1) to avoid needing to escape a string that otherwise contains special regex characters and 2) because it permits the search tool to avoid the regexp engine entirely.

(1) is always a valid reason and is actually quite useful because escaping regexes can be a bother. But whether (2) is valid or not depends on whether your search tool is smart enough to recognize a simple literal search and automatically avoid the regex engine. Another layer to this is, of course, whether the regex engine itself is smart enough to handle this case for you automatically. Whether these optimizations are actually applied or not is difficult for a casual user to know. I don't actually know of any tool that doesn't optimize the simplest case (when no special regex features are required and it's just a simple literal search), so it seems to me that one should never use fgrep/`grep -F` for performance reasons alone.

However, if you use the `-f` flag, then you've asked the tool to do multiple string search. Perhaps in this case, the search tool doesn't try as hard to do simple literal optimizations. Indeed, I can actually witness evidence in favor of this guess. The first command takes 15s and the second command takes 10s:

    LC_ALL=C grep -c -f queries /tmp/OpenSubtitles2016.raw.en
    LC_ALL=C grep -c -F -f queries /tmp/OpenSubtitles2016.raw.en
The contents of `queries`:

    $ cat queries 
    Sherlock Holmes
    John Watson
    Professor Moriarty
    Irene Adler
grep in this case is GNU grep 2.26. The size of /tmp/OpenSubtitles2016.raw.en is 9.3GB. The only difference between the commands is the presence of the -F switch in the second command. My /tmp is a ramdisk, so the file was already in memory and therefore isn't benchmarking the speed of my disk. The corpus can be downloaded here (warning, multiple GB): http://opus.lingfil.uu.se/OpenSubtitles2016/mono/OpenSubtitl...

Interestingly, performing a similar test using BSD grep shows no differences in the execution time, which suggests BSD grep isn't doing anything smart even when it knows it has only literals (and I say this because BSD grep is outrageously slow).

As a small plug, ripgrep is four times faster than GNU grep on this test and has no difference whether you pass -F or not.

(This is only scratching the surface of literal optimizations that a search tool can do. For example, a good search tool will search for `foo` when matching the regex `\w+foo\d+` before ever entering the regex engine itself.)




This is a great comment and ripgrep deserves a more prominent plug, clickable: https://github.com/BurntSushi/ripgrep


BSD grep decides on a pattern by pattern basis which match engine to use. -F is unlikely to affect performance.


Oh dear, you appear to be correct. Adding additional queries to `queries` (while being careful not to increase total match count by much) appears to increase search time linearly. From that, it looks like BSD grep is just running an additional pass over each line for each query.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: