Why GNU grep is fast (2010) (freebsd.org)
209 points by goranmoomin on July 24, 2022 | 62 comments



See also https://blog.burntsushi.net/ripgrep which contrasts ripgrep with GNU grep and others (from 2016)


Ripgrep has always been interesting to me as I don't ever find myself bothered by the speed of GNU grep, even when working with large files. Additionally, grep is a standard utility included on most Unix-like OSes, so it is not super risky to write a script that relies on grep -- in contrast to writing a script that relies on a not-usually-installed-by-default tool like ripgrep. For me, I just don't have issues with grep!

I'd love to hear people's experiences on how grep wasn't adequate and why they use ripgrep instead.

(This is not a criticism of Ripgrep: I'm glad it exists and that other people find it useful.)


For me the big advantage of ripgrep is it defaults to searching recursively so I can just do "rg term".

And the plugins support, which enables something like ripgrep-all, which can then search PDFs, etc.

If I'm scripting, though, I try to stick to common denominator grep.
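As a rough sketch (file and directory names invented for illustration), the two invocations line up like this:

```shell
# ripgrep searches the current directory recursively by default:
#   rg term
# The closest plain-grep spelling needs explicit flags:
#   grep -rn term .
# Tiny demo tree to show the grep form in action:
mkdir -p demo/src
printf 'term appears here\n' > demo/src/a.txt
grep -rn 'term' demo
# prints: demo/src/a.txt:1:term appears here
```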


This, and also it ignores irrelevant files. It has sane defaults but you can tweak this with a .rgignore file, which is like .gitignore but for rg. By the way, it will use .gitignore files in a git directory.

That means that by default, it will take a lot less time and won't ruin your terminal when lines of some generated files (especially minified ones that are all on one line) match your search.

This is the main reason I use ripgrep.
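For illustration, a hypothetical .rgignore (it uses the same glob syntax as .gitignore; the patterns here are invented examples):

```shell
# Write an example .rgignore into the current directory; rg reads it
# in addition to any .gitignore when walking the tree.
cat > .rgignore <<'EOF'
# skip build output and minified bundles when searching with rg
dist/
*.min.js
*.log
EOF
```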


> By the way, it will use .gitignore files in a git directory.

If I'm in a repo, I'm using `git grep`.

That makes `rg` a mostly redundant tool for me since it's optimized for searching source code. I can't really use it as a general purpose replacement for grep since if it doesn't find anything I'm left wondering whether what I'm searching is not really there or whether `rg` just didn't bother to check. Even with `--no-ignore --all`, I'm still not sure whether it searches everything. It's one of those tools that I find is too clever for my own good.

So when `git grep` doesn't cover my use case, my fall back is `find | grep` which contains no magic and I know exactly what it's searching.
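That fallback can be sketched like this (pattern and paths invented; -print0/-0 keep unusual filenames safe, and -Hn forces filename:line prefixes, though -H is a GNU/BSD extension rather than strict POSIX):

```shell
# Enumerate every regular file with find, then grep each one.
mkdir -p tree/sub
printf 'needle\n' > tree/sub/notes.txt
find tree -type f -print0 | xargs -0 grep -Hn 'needle'
# prints: tree/sub/notes.txt:1:needle
```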


In particular, `rg -uuu` should search the exact same content as `grep -r`.


Thank you. This inspired me to read the ripgrep man page. I think I must have been confusing ripgrep's behavior with the earlier ack (or maybe ag) tool which only searched known extensions by default. I see now that ripgrep only does that if explicitly given a --type option.


Whoa, I didn't even know about `git grep`. Sure enough, `man git-grep` has a bunch of relevant info! Thanks for sharing this.

I feel like there are a billion features in git (like this) that I don't know about.


It's handy if you want to search more than one repo at once. I believe you can do `rg -uuu` as a shorthand to disable all ignores.


This. Also you can do stuff like `rg -t py myvariable` to search Python files only. Neat for multi-language directory trees. (Works with many other languages.)


Do you know about “git grep”? It’s like grep but only greps files tracked by git. Very useful.
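A quick sketch of the difference (repo and file names invented): git grep only looks at files git knows about.

```shell
# Set up a throwaway repo with one tracked and one untracked file.
git init -q demo-repo && cd demo-repo
printf 'needle\n' > tracked.txt
printf 'needle\n' > untracked.txt
git add tracked.txt
git grep -n 'needle'   # only the tracked file matches
# prints: tracked.txt:1:needle
cd ..
```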


Why use a git-specific tool when a generic one (rg) works just as well, even from a parent directory?

I often use rg to search code across repositories so it's useful to have that regardless of grep and git


Often a directory contains a mix of source files (in git) and output files (usually in .gitignore). But maybe rg automatically handles this?


It does


I didn't, seems worth giving a look.


> It has sane defaults but you can tweak this with a .rgignore file, which is like .gitignore but for rg. By the way, it will use .gitignore files in a git directory.

Fwiw, there's also a `.ignore` semi-standard that works with several tools, not just greps; e.g. fd also respects it by default.


I use ripgrep all the time, but I sometimes don't trust the results, and I have been bitten too many times by the heuristics it automatically uses to detect binary files and skip searching them. These days I run ripgrep with the alias `rg --color always --no-mmap -a`, but I still think its binary file detection is wonky. I might be missing some relevant option, but out of the box, grep is slower yet always does what it's told (and hence more reliable, IMO).


Can you give a specific example of where the binary file detection is wonky? It should be basically the same as what GNU grep does. It just looks for a NUL byte. If it exists, it's classified as binary data. Otherwise, text.

GNU grep also does binary detection by default. You have to opt into -a/--text there too. So maybe GNU grep doesn't always do what it's told either. :-)


Ah, that makes sense. I'm glad you brought that up because I didn't even notice that I'm just used to adding -R to my grep commands when I need recursive searching.

I can totally see how that would be a small, but impactful difference.


Whenever you normally use some programs with other options than their defaults, it is simpler to define aliases for those programs.

There are many common programs that I never use with their standard default options (which are very bad, IMO), e.g. cp, mv, ln, rm, rsync, date and many others, so I always define aliases for them, which include those options that I want to use by default.

So for grep, the recursive search should be included in the grep alias. There is no need for a new program in order to have this feature.
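A sketch of that approach (bash/zsh; the flags are GNU grep's). One caveat: with GNU grep, -r with no file operand searches the working directory and ignores stdin, so folding -r into the `grep` alias itself would break pipes like `cmd | grep foo`; a separate name is safer:

```shell
# in ~/.bashrc or ~/.zshrc -- recursive, line numbers, skip .git
alias rgrep='grep -rn --exclude-dir=.git'
```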


> Whenever you normally use some programs with other options than their defaults, it is simpler to define aliases for those programs.

I don't buy into this. These aliases come at the cost, or at least the risk, that your workflow breaks when you are at another computer or working in a shell on some server that doesn't have the alias. That's why I like additional aliases, like l for ls with your favorite options. But I dislike aliases that change default behavior - and often in an opaque way.


ripgrep adds more than "recursive search by default." And it specifically adds things that you cannot easily put into an alias.


Unicode support. It might have been the Windows ports of grep that were the problem, but ripgrep shines with Unicode files. And it handles a mix of Unicode and ASCII files without issue.

And I totally agree that having grep installed everywhere matters, and it's usually fast enough. But I had a few ripgrep searches that were genuinely eyeblink fast. Like my finger hadn't fully lifted off the enter key and it was done. On 10K+ files, about 1 GB, with 1.5M+ LOC. And the default folder recursion and .gitignore handling is a plus.


On the file sizes I was working with (log files), both the silver searcher and GNU grep would lock up and crash. Ripgrep handled the same thing in _seconds_. The difference in speed is staggering; you may not think the speed bothers you, but I can't go back after installing ripgrep. It's the difference between my mind wandering while waiting for a search to complete, versus instantly seeing the results and not losing a train of thought.


Have you worked in a monorepo? There can be 100kloc or more, easily, as well as tens or hundreds of gigabytes for build artifacts/ compilation artifacts, etc that you'll want to skip over.

For scripts I'll still use grep sometimes for the portability reason, naturally.


I have not worked in a monorepo. But that seems to be a great place to use `ripgrep`.


That's probably the #2 thing for me. The other is that I have a `~/workspace` where I put all of my projects and sometimes I ripgrep through there.


It doesn't even have to be a monorepo to see the speed difference. In Emacs I frequently invoke a thing where it searches my codebase as I type. With ripgrep, the results update almost instantaneously. ag, the silver searcher, is the second fastest thing I've used, but there would be a noticeable lag in updating the results as I typed, even for smaller repos.


What's the emacs thing? I use deadgrep but your thing sounds better.


My hunch is consult-ripgrep or counsel-rg.


consult-ripgrep as the other commenter said. Before that it was helm-rg.


ripgrep absolutely tears through some large monorepos we have at work, far faster than GNU grep.

I imagine the performance difference is even more startling for anyone on a Mac who hasn't replaced BSD grep with GNU grep (install the gnu tools from homebrew and alias "grep" to "ggrep", the performance difference is huge).


I have not felt the need to use ripgrep either. If there was some set of directories and files that I had to search recursively and grep was not fast enough, then I would find a way to reduce the size/number of directories and files that I am searching. That is not always an easy problem to solve, but I believe it is the problem most worthy of solving. To me, size matters. Small has its advantages.

I use computers with resource constraints.^1 I have grep in multicall/crunched binaries. To use ripgrep I would have to include it as a separate binary. Is there a solution similar to crunchgen for making crunched Rust binaries?

I am not sure if the "ripgrep" name is a joke or the author is serious. Assuming the latter, I am content to wait for the BSD and Linux projects I use to switch from C to Rust and from BSD/GNU grep to ripgrep, at which point I would imagine it will simply be called "grep". For portability.

1. This may be why I have less need for ripgrep. I try to keep things small. Keeping things small routinely has the desirable side effect of making things relatively fast.


Can ripgrep replace grep? Nope. See: https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#pos...

What does the "rip" in ripgrep mean? Not "rest in peace." See: https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#wha...

If you only search small corpora, then ripgrep's speed benefits obviously don't matter. There's unlikely to be material differentiation among grep tools in those cases. So it's a priori not a concern for you.

ripgrep has other benefits, but they aren't quite as universally compelling as its speed benefits. For example, when searching large corpora, pretty much everyone is going to appreciate a search taking 1 second vs 10 seconds. But many fewer people are going to appreciate, say, automatic transcoding from UTF-16 in order to search data.

Other than that, ripgrep's "smart" filtering by default would be its main benefit. My first link above addresses that.


The rip in ripgrep is meant to mean fast (as in ripping through the files), not rest in peace, FWIW. It's unlikely ripgrep will ever replace grep, as it has no intention of implementing 100% compatibility.


The main use of ripgrep for me is that it's not POSIX, and therefore its interface is not terrible.


Haha, I actually like the interfaces of most POSIX-compliant tools. But maybe I've not looked too critically at the interfaces. Do you have any examples that are particularly bad in your opinion?


`find`. Every part of that interface is terrible. Thank goodness for fd


I find I'm installing third-party CLI tools anyway, with xsv and jq as the truly irreplaceable but not-yet-standard Unix tools. Once I'm having to do that, stuff like ripgrep, fzf or fdfind gets added for having a nice UI, even if it wouldn't cross the essential threshold otherwise.


Author of ripgrep here.

There are two ways to talk about "speed." On the one hand, we have the "speed" that is associated with the user experience. That is, given some problem the user cares about, how fast can you solve it? On the other hand, we have the "speed" that is associated with doing precisely the same task and measuring which tool does it faster.

ripgrep is generally faster at both of those things, but it's the former where it really shines:

    $ git remote -v
    origin  git@github.com:nwjs/chromium.src (fetch)
    origin  git@github.com:nwjs/chromium.src (push)

    $ git rev-parse HEAD
    5d32cab40f738932eddc017980e2e409c5abef2c

    $ time rg 'Xvfb and Openbox' | wc -l
    1

    real    0.289
    user    1.526
    sys     1.731
    maxmem  87 MB
    faults  0

    $ time grep -r 'Xvfb and Openbox' ./ | wc -l
    1

    real    5.405
    user    3.489
    sys     1.890
    maxmem  11 MB
    faults  0
(I ran these commands multiple times each until the times stabilized. i.e., The directory tree is in cache.)

We're talking about an order of magnitude improvement here to get the same results. And not just in a "ripgrep took 10ms and grep took 100ms, but both are fast enough" sense. This is the difference between "near instant results" and "this is taking annoyingly long."

Now of course, from the perspective of the second idea of speed, this isn't an apple-to-apples comparison. GNU grep is actually searching a lot more data here. We can make ripgrep search the same amount of data as GNU grep quite easily:

    $ time rg -uuu 'Xvfb and Openbox' | wc -l
    1

    real    2.538
    user    2.570
    sys     3.017
    maxmem  72 MB
    faults  0
So, still a big improvement over GNU grep, but it's not quite as jaw dropping.

The important bit here is that a lot of people care about the improvement at the UX level. You can, for example, get a fair bit of improvement with GNU grep with some extra flags:

    $ time grep -r --exclude-dir='.git' 'Xvfb and Openbox' ./ | wc -l
    1

    real    1.630
    user    0.781
    sys     0.826
    maxmem  11 MB
    faults  0
And now you've got to shove that stuff into an alias or a wrapper script. Which... is fine. I did it for a very long time before I wrote ripgrep. I had a whole bunch of aliases and wrapper scripts, many of which were specific to certain types of projects. But once I built ripgrep, all of those aliases and wrapper scripts went away. Because ripgrep's heuristics for smart filtering by default subsumed all of them.

Finally, it's worth pointing out that ripgrep isn't intended to replace grep. It literally can't. It's not POSIX compatible. So if you're writing shell scripts and care more about portability, then 'grep' is a fine choice. Indeed, I still use 'grep' for precisely that purpose. See: https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#pos...


This is an awesome explanation! Along with other comments I feel like I now understand where I would (and would not) want to use ripgrep.

And thanks for making an awesome, open source tool. :)


Yeah, it's actually an interesting case study in performance optimization - GNU grep puts all this effort into optimizing the performance characteristics of the system calls it uses based on deep kernel knowledge, but ripgrep is orders of magnitude faster for many users via the simple trick of "completely ignore a lot of files by default".


That's not really what's happening in the article. If you read through the single file benchmark, you'll see several clever algorithmic improvements (like rarest byte guessing, building a set of variants for Unicode-aware multiple pattern matching, etc...).

The author literally concedes that the .gitignore feature was not done for performance, and actually carries a significant overhead in large directory trees. For the sake of comparability, the study was controlled for the .gitignore overhead.


> simple trick of "completely ignore a lot of files by default"

The author of rg wrote a blog post about this. As I recall, he ran performance comparisons with the same limitations and scope, so in that benchmark the difference isn't due to something as obvious as this.


This is very very very wrong. GNU grep is not doing any optimizations based on "deep kernel knowledge" that ripgrep doesn't do. I'm honestly not even sure what you're referring to. GNU grep uses standard 'read' syscalls. ripgrep does that too (but also uses memory maps in some cases). There is some buffer size tuning, but otherwise, nothing particularly interesting there.

ripgrep's speed might come from ignoring files in any given use case, and it might even be the biggest reason why a search completes faster. But in my linked blog post, I control for all of that. Yes, while ripgrep might be faster in some cases because of its "smart" filtering, it's also faster in cases where "smart" filtering isn't enabled.


Sounds like both techniques could be used together.


I'd forgotten to upgrade grep on my latest MacBook. In a large source-code repo, looking for a fixed string with no wildcards:

osx grep -r:

  real 2m37.786s
  user 2m27.034s
  sys 0m3.958s
ggrep -r:

  real 0m12.842s
  user 0m5.754s
  sys 0m2.825s
which is the difference between something I'd avoid and something I'll use.


Did you try setting the locale for OSX grep?

    LC_CTYPE=C
https://news.ycombinator.com/item?id=4841168
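i.e., something along these lines (paths and pattern invented for illustration); forcing the C locale disables multibyte character handling, which historically made BSD grep much faster:

```shell
# Run the same recursive search with the C locale for the current command only.
mkdir -p locale-demo
printf 'pattern here\n' > locale-demo/a.c
LC_CTYPE=C grep -rn 'pattern' locale-demo
# prints: locale-demo/a.c:1:pattern here
```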


Did you make sure to clear the filesystem cache between each run?


The cache was warm for both measurements, which is more typical of my dev environment.


A CS professor of mine said, “You can’t make a computer run faster; you can only make it do less.”

That’s not entirely true these days due to things like thermal throttling, but it’s still a great way to think about performance.


If it does less, then it won’t thermal throttle.


Plus, if you drive it into thermal throttling then it will do less, like good old prof said.



grep still runs a regex processor though, which I think is by default a deterministic finite-state machine. I had the impression that even when only fixed strings are used, one would still need to pass -F to get full speed. For fixed strings, Boyer-Moore is the obvious choice.


You can pretty easily empirically test this and conclude that your impression is not true, at least for GNU grep. :-)

    $ time grep 'ZQZQZQZQZQ' OpenSubtitles2018.raw.en | wc -l
    0

    real    1.089
    user    0.230
    sys     0.858
    maxmem  5 MB
    faults  0

    $ time grep -E 'ZQZQZQZQZQ' OpenSubtitles2018.raw.en | wc -l
    0

    real    1.094
    user    0.210
    sys     0.883
    maxmem  5 MB
    faults  0

    $ time grep -F 'ZQZQZQZQZQ' OpenSubtitles2018.raw.en | wc -l
    0

    real    1.096
    user    0.223
    sys     0.872
    maxmem  5 MB
    faults  0


Love his summary.

> The key to making programs fast is to make them do practically nothing. ;-)

It is a great way to think about performance.

Is there any way that I can write code such that I avoid this work altogether...?


Btw, GNU awk is also considerably faster (due to its bytecode interpreter) than the awk bundled with BSD & macOS (usually "one true awk").


Depends on the task...sometimes nawk ("one true awk") is faster than gawk. Mawk is almost always faster than either.

An older, but good article on that: https://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-...
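A throwaway micro-benchmark along those lines (input generated on the fly; which awk wins depends on the task and on which implementations you have installed):

```shell
# Build a million-line input file, then time the same field-counting
# script; repeat with gawk/mawk/nawk on the same file to compare.
yes 'alpha beta gamma' | head -n 1000000 > awk-input.txt
time awk '{ s += NF } END { print s }' awk-input.txt
# prints 3000000 (3 fields x 1,000,000 lines), plus the timing
```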


> The key to making programs fast is to make them do practically nothing. ;-)

This is going straight to my Anki quotes collection :)


This is a good article, but note that some recent implementations skip the whole Boyer-Moore machinery and don't seem to suffer for it https://lobste.rs/s/ycydmd/why_gnu_grep_is_fast_2010#c_gpim7.... Minor self-promotion, I grabbed that link from my page on string-matching, https://justinblank.com/notebooks/stringmatching.html.


Wow, lobste.rs has a mod log... :)

https://lobste.rs/moderations



