If you use grep in OS X, and you use it a lot (even called through other programs/libraries) it pays in time to install GNU grep vs. the default OS X provided one. It's several times faster. Ran into this when benchmarking grep + GNU Parallel on OS X vs. Linux in my thesis (raftlib modeling) work. Results on my small MacPro cluster didn't match those on Linux. The OS X runs were orders of magnitude slower despite almost identical hardware configurations. The culprit as it turned out was largely the implementation of grep. I compiled with the GNU version, re-ran and the rest of the mis-match was simply due to the lack of POSIX features like thread pinning, and NUMA control.
<stepping up to soap box/> So yeah...as usual, algorithm matters. Implementation matters, and well....knowledge of the architecture/data-movement matter. These days the architecture/data-movement knowledge matters almost as much or more than the algorithm. Picking an algorithmically faster implementation that moves data more (e.g., tree traversals usually have poor cache behavior, more bursts required, etc.) is often the case b/c programmers don't really understand how the computer works, but are great at coding O(n logn) algorithms. We shouldn't stop teaching algorithms, but data-movement/access likely needs to be emphasized along with the understanding of algorithms.
Absolutely, more or less grep -> ack -> ag. Be warned tho, sometimes ag (and IIRC ack as well, but haven't used it in a while since switching to ag) won't find some results due to its default filtering rules even tho certain results should be included. For critical things, I end up double-checking results w/ grep -r which always finds everything, no questions asked.
Which in this case is ridiculous. There are valid reasons for businesses to avoid GPL, but if they use GNU grep, the only thing they have to open source is their copy of GNU grep. It's not like doing so will force them to make their kernel GPL
> There are valid reasons for businesses to avoid GPL, but if they use GNU grep, the only thing they have to open source is their copy of GNU grep.
If they use GNU grep as part of a distribution that might be included with hardware, it means the GPLv3 anti-tivoization clauses affect any hardware that that distribution is on. Apple being first and foremost a consumer hardware manufacturer whose software exists to add value to that hardware, and wanting to maintain its ability to do what it wants with that hardware including not being constrained by the GPLv3's anti-tivoization clause (even if they aren't doing anything now that would fall afoul of it, they might want to in the future, and don't want to tie their hands in advance), has a very good reason not to touch GPLv3 at all.
(There's other GPLv3 provisions that might be problematic, too, GPLv3 involves more than just a requirement to provide source also licensed under the GPLv3.)
Companies that want to remain in ownership control of property after they have sold it can't use GPLv3 licensed software, that is true. They could lease out the hardware to achieve the same effect, but for some reason some companies prefer to disguise the lease as a sale.
> They could lease out the hardware to achieve the same effect
No, they could not. The GPLv3 requires modification information be provided to the "user" by whom a consumer product is "received", not only the "owner" by whom it is "purchased", so lease vs. sell has no effect on the obligations (at least no effect clear from the text of the license.)
Reading the sources to that, it seems the issue is extremly unclear. The big question is if you need additional copyright permission to lease out hardware that include software.
For example. A car rental service buys a modern car that has software in it. Does they need additional copyright permission on top of the purchase in order to lease the car out?
A taxi driver buys a modern car with the intention to rent out his service with the car. Is additional copyright permission required?
A school provides students with access to the school computers. Is that conveying a copy of software? What if its "open door day", where the public is invited?
A public library want to have public available computers. Are they conveying copies of software through that?
If you looked at WIPO, it explicitly say no. If you looked at Information Society Directive 2001/29/EC, they explicitly say yes, under the condition that the offer is directed to the public. If you looked at the court cases that has interpreted those texts, then they are inconclusive.
> Reading the sources to that, it seems the issue is extremly unclear. The big question is if you need additional copyright permission to lease out hardware that include software.
If its intended as a "Consumer Product" as defined in the license, you need copyright permission to put the copy of the software in the product; at that point you either except the condition of providing the required information and functionality to the user to whom you will eventually provide the product -- whether by rental or sale -- or you violate the GPLv3 by putting the copy in the device.
Once you have accepted that condition, you are bound by it.
> For example. A car rental service buys a modern car that has software in it. Does they need additional copyright permission on top of the purchase in order to lease the car out?
They don't need "additional copyright permission", but if the initial copyright license accepted when acquiring the car has restrictions that apply to the use of the software in rentals, then they are bound by those restrictions.
I'd assume their policy is because of the Tivoization clauses, which (as far as I understand them) would become an issue if they lock the OS down more, or want to share the environment with e.g. iOS devices.
If you couldn't, it would conflict with the "anti-tivoization" rules applicable to "User Products" in the GPLv3, which requires that any GPLv3 code incorporated into a "User Product" is replaceable with modified versions, and that the manufacturer of the product provide all information necessary to the user to make that replacement.
Is there anything stopping you from booting into Linux to rename it, is there? Like how I can't delete files owned by SYSTEM on Windows, but can on Linux?
Yes. Linux has no support for Core Storage, which is now used in default macOS layouts. So any hfsplus volume on Core Storage is rendered effectively not readable on an OS other than macOS, even if it's not encrypted.
Oh, I certainly could (and do for bash), but I believe the GPL v3 requires that you can replace any copies of programs with your own, which possibly Apple might worry they fall foul of, which might be why they don't want GPL v3 software (I'm guessing).
Actually, a strict "no GPLv3" policy makes a lot of sense because it side steps the whole issue and means they don't have to fork over a ton of money for legal reviews of something 95% of OSX users won't care about.
It's easy enough for people who want GNU grep to install it themselves.
grep was already there from BSD, so no reason to add something new. However, speaking as a guy who started out on Sun BSD machines, the very first thing we did when we got a new box was to replace all the BSD utilities with GNU ones. Virtually all of them are better.
FYI: OS X's grep is FreeBSD's bsdgrep. However, FreeBSD's default grep is still an ancient version of GNU grep (bsdgrep is only available as bsdgrep.).
Also, older versions of OS X used to have GNU grep [1]. So the more likely explanation is Apple's purge of GPL code from OS X. Since later grep versions use GPLv3, GNU grep was a dead end for them.
Unfortunately, FreeBSD's bsdgrep has some very annoying bugs (that are also present in OS X). No one seems to care about them, even when a patch is provided [2].
The patch in [2] was added a bit over a month ago as a comment in a bug report about the GNU grep in the base system, so I don't think it's a good example of no one caring about bsdgrep bugs.
On occasion I search for bsdgrep bugs, and did not find the comment / patch in [2]. Thank you for pointing it out, I'll look at it when I can.
The patch in [2] was added a bit over a month ago as a comment in a bug report about the GNU grep in the base system, so I don't think it's a good example of no one caring about bsdgrep bugs.
I FreeBSD developer requested that I added a report to that bug [1]. After adding the patch to the report, I also reported this to the list [2]. I have also tried to contact the maintainer of bsdgrep with no luck. At some point you just give up ;).
i've seen this getting worse with your typical developer in the age of "push the button to get another virtual machine and someone else will pay the bill."
1842 days is over 5 years. At some point you have to consider the set of users here now is different to (if not distinct from) the set of users here 5 years ago.
Correct. I've read the original post years ago, I do find it it fascinating, and I would like every new generation to read it. I don't have a quarrel with it making HN front page every now and then.
That's what separates good and bad teams (or individual developers): the wisdom of how to make a program that does the bare minimum while actually solving the problem.
He is saying make the program spend less time computing. Thats what he means by not looking at every byte. The way he does that is by using more complicated algorithms. It's not simple.
Quote is often credited to Einstein; being "simple" does not mean something is not complex, it means that there's no unnecessary complexity; being "too simple" means necessary complexity was remove.
Basically means only do exactly what's required, nothing more, nothing less.
I used to work at a place that had an intentionally slow dsl line, and a ~5 year old, beat up pc to test their code on. If it worked there, it would work for their customers.
I wish more places would try this. Sure, your app works in downtown SF, but if your customer is in Nowheresville, AL, you should probably spend more time optimizing.
The difference between 3G & 4G both in speed and latency is profound, and nobody seems to design with 3G in mind, so I find from a data standpoint 3G is practically worse than nothing these days.
All of your apps will say "I have data access!" and send a million requests for tens of megabytes, which will never finish, melting your battery as it waits on timeouts and completely choking the access you do have preventing you from doing anything useful.
Not really necessary in most cases. There's enough profiling tools available that can give response times in fractions of a second and in many cases you can add repetition to inflate the few processes that are still too small to measure. These will give you far more meaningful figures about the performance of your code than anecdotal evidence using a slow PC as a benchmark.
I agree with both of you. Once I optimized a major Python project to avoid some bad disk access patterns. Nobody noticed them because everyone was using SSDs and those patterns didn't hurt (too much) performance on those, but I was running on an HDD and I had run times of hours (vs a couple of minutes for everyone else).
But my optimizations were guided by the profiler and a (more or less) correct analysis of the results.
Which is one of a hundred reasons why the machine you write the code on, the machine you build it on, and the machine you test it on, should be three different machines (actually probably more than that because your test environment should be more varied than a single OS/environment).
If we really want to adhere to best working practices, then it should be a separate team performing the tests based on the same specifications that the developers used to write the code (ie the testing team wouldn't perform tests based on the developers specifications but rather the project owners specs to mitigate the risk of the developers misunderstanding the specs).
However that requires a level of process that is either too costly, or too rigid for many IT companies to follow.
Yes and we underestimate how often this kind of rule can apply.
To have fast network, reduces network utilisation (for example using cache). To have fast database, reduce number of queries (for example using more complex queries). To have fast memory allocation, avoid dynamic memory or preallocate...
Somehow this reminds me of this quote from Antoine de Saint-Exupéry:
"Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away."
This is actually why I find myself writing fewer bugs when I'm writing Java than Python. Thanks to static typing, I don't need to endlessly repeat myself writing and updating boilerplate unit tests and boilerplate documentation (and inevitably messing up)
I've heard about PEP484 and hope it catches on, but unfortunately it's for python3 and a lot of people are still stuck on python2 for many reasons.
Too true. I read an account of some product being ported to F# (from C#) a few years back. The line count was substantially decreased (at least one order of magnitude if I'm remembering correctly) with a correlated decrease in defects. I can't find the original post, but there are plenty of supporting testimonials[1].
Yeah, as someone who gets to use F# fairly regularly, actually, it's a substantial improvement in the LOC department compared to Java or C#. I don't find it quite as reliable as something like Haskell (do to the fact that you don't have to be explicit about side-effects), but it's definitely better than most languages.
I am not so sure the 70/8 tradeoff is worth it but it should probably be tested with code samples as well as english corpuses. Is grep primarily made for ascii?
Really? My completely anecdotal experience is that ag kills grep. Stuff like searching usages of variables in Linux kernel seems to take ages with grep, whereas ag search on the whole repository doesn't take annoyingly long.
I'll need to test this, just to see how wrong my "feeling" on the matter may be.
I'm the author of ag. On a single file, ag is slower than GNU grep. The main reason is that ag prints line numbers when showing matches. Printing line numbers means going through the whole file and searching for \n's. For large files, this can be very resource-intensive. Here are the numbers (median of 5 runs, caches hot) for an 8GB file:
% time ag hello4294967296 big_file.txt
134217728:hello4294967296
ag hello4294967296 big_file.txt 7.83s user 0.44s system 99% cpu 8.275 total
% time grep hello4294967296 big_file.txt
hello4294967296
grep hello4294967296 big_file.txt 2.17s user 0.91s system 99% cpu 3.085 total
If you enable line numbers with grep -n, grep still wins, probably because ag's printing code is ridiculously inefficient:
% time grep -n hello4294967296 big_file.txt
134217728:hello4294967296
grep -n hello4294967296 big_file.txt 3.37s user 0.97s system 99% cpu 4.338 total
Unfortunately, I didn't write ag's printing code very sanely. It grew organically from users asking for small features here and there, and now it's a giant hairball. I've added some tests to reduce the chance of breaking changes, but it'll probably be a while before I fix this performance issue.
Despite these limitations, ag tends to be much faster than grep -r in real-world use. There are several reasons why:
1. Ag has a multi-threaded architecture that can search multiple files at once. Grep is single-threaded.
2. Ag ignores binary and hidden files by default. Grep doesn't.
3. Ag obeys .gitignore/agignore/hgignore. Grep doesn't.
These things really add up. For example, when searching my 20GB ~/code directory:
ggreer@lithium:~/code% time ag instantiationname
...
ag instantiationname 3.82s user 2.16s system 157% cpu 3.797 total
ggreer@lithium:~/code% time grep -n -i -r instantiationname
...
grep -n -i -r instantiationname 42.77s user 2.62s system 99% cpu 45.403 total
Both find the same matches (some tests in the Node.js source), but ag does it 12x faster.
I guess that explains my experience really well, since I'm always searching from a big amount of small files like huge codebases and config files, and almost never from files bigger than several megabytes.
ag has tremendously improved my quality of life in that regard, so please accept my sincere Thank You for making ag :)
> Ag has a multi-threaded architecture that can search multiple files
Interesting. I would have guessed you were i/o bound there, and file systems usually give higher throughput on sequential reads. Tried it on my bog standard Linux box but couldn't get it close to grep performance wise. Can you share how you did it?
> Ag ignores binary and hidden files by default
Like grep -I? That could easily cause unexpected results. Many files are treated as binary just because they contain parts of a non native character set (such as latin-1 files on a utf-8 system).
I've documented most of ag's tricks in various blog posts.[1] You're right that sequential reads are faster than random reads, but searching a bunch of source code is going to be a bunch of short reads no matter what. Each file is on the order of 50kB, or about 0.5ms if read at 100MB/sec. That's far less than the seek time of a hard drive. I've experimented a lot, and found it best to read ASAP, then let the OS & device figure out the best order to service the requests.
Also, modern SSDs (and even HDDs) have tricks like NCQ. With caches cleared (cold reads), multi-threaded ag is 2x faster on SSDs. Disabling NCQ makes multi-threaded perform the same as single-threaded.[2] (Again, with caches cleared.)
> … That could easily cause unexpected results. …
I've received very few complaints about ag's binary detection. It does a nice job of balancing false positives, false negatives, and performance. See is_binary() in src/utils.c for the exact code[3].
> The main reason is that ag prints line numbers when showing matches.
Wouldn't it be more accurate to say that ag is slower than grep because it does line-by-line searching where as grep does not? What would happen if ag adopted grep's method of searching where it doesn't do line-by-line? Is that possible with PCRE?
I think it would be very interesting to combine the tricks used by grep with the tricks used by ag. I don't think it has been done yet!
Ag doesn't search line-by-line. It's only if a match is found that it goes back and looks for newlines. This optimizes the typical case (most files have no matches) at the expense of the rare case (file contains a match).
Also, ag only uses PCRE if the query looks like a regex. Otherwise, it uses Boyer-Moore strstr.
You didn't check to see what value opts.multiline has. It's initialized to TRUE in src/options.c. It's only set to FALSE if a user passes the undocumented --nomultiline option to ag. Default behavior with a regex query is https://github.com/ggreer/the_silver_searcher/blob/master/sr...
Also, unrolling loops may actually slow things down in modern x86 CPUs since they will automatically decode and cache very small loops. Some interesting discussion on that here:
Processor competition in the desktop market has been lackluster since that time. Intel has been sitting on their laurels mostly. The vectored primitives and bigger cache and core counts are nice but honestly you have been fooled by the branding and continual product lines.
Loop unrolling still helps even with brand spankin new CPUs. Some intent cannot be inferred.
> Processor competition in the desktop market has been lackluster since that time. Intel has been sitting on their laurels mostly. The vectored primitives and bigger cache and core counts are nice but honestly you have been fooled by the branding and continual product lines.
I might be inclined to agree with you... if I didn't know that Haswell:
* implemented the computed goto transformation (for interpreter loops) in hardware
* made unaligned cache hits crossing cache line boundaries have no latency penalty
Both of those things are things that I would have told you were impossible(/infeasible) to do in hardware before finding out that Intel did them.
There's actually quite a lot of amazing microarchitectural changes going on in Intel processors, you just don't hear a lot about them.
OT, but along the same lines of how GNU has brought considerable performance improvements to what (in my naive layperson's mentality) seem like straightforward tools:
Does anyone have an insight as to why GNU tail is, in at least one situation, considerably faster than FreeBSD's tail (at least the one that comes with OS X)?
I'm referring to the use of: `tail -n +2` to skip the first line, versus using `sed '1d'`. To me, it makes intuitive sense how the latter could be faster, and on FreeBSD, it is, by at least one order of magnitude. However GNU tail is one order faster than sed. Is GNU tail using some heuristic to handle "skip the first line in a huge file" differently?
I haven't read the code of neither sed nor tail, but I'd imagine sed is still reading line by line, while tail skips the number of lines it needs to skip, and then just goes on full throttle reading the remainder of the file. It might even be using sendfile(2) or splice(2).
GNU grep is fast, but there are certain alternatives that are pretty fast (probably faster):
- ack [0]
- ag (also known as the silver searcher) [1]
You could make `grep` faster by telling it to interpret the pattern as a fixed string (-F or --fixed-strings) is you're not using a regex. Setting to LANG to C would also probably make it faster.
I think it would be interesting to compare those utilities with GNU grep on a single file, because I understand them to be faster on many files because they (1) search files in parallel and (2) skip many uninteresting files automatically.
I also don't think either of those tools are DFA based like GNU grep. One thing I'd like to see is combine many of the performance tricks found in GNU grep with the aforementioned (1)+(2) techniques.
> You could make `grep` faster by telling it to interpret the pattern as a fixed string (-F or --fixed-strings) is you're not using a regex.
That doesn't really make much sense. GNU grep can tell whether a pattern is just a literal or not. If it's a literal, it should be able to stay out of the main DFA engine entirely.
It's pretty easy to use grep and search in parallel while skipping uninteresting files. `git ls-files | xargs -P4 -n8 grep whatever` is a good way to do this; or `git grep`, but that's using PCRE and not parallelized.
> That doesn't really make much sense. GNU grep can tell whether a pattern is just a literal or not. If it's a literal, it should be able to stay out of the main DFA engine entirely.
Really? Pretty sure it does. Have you read the source code? There is a specific part that looks to see if it has a literal and will use that. If it has an exact match, then it will skip the DFA completely.
The regex is compiled anyhow to a C-call subroutine using a lib. So the overhead is only a function call. I highly doubt a single inlined proc is going to give much speed gain.
I use grep for single files (mostly because it is burned in my muscle memory). The silver searcher (from emacs) for bulk searches in a large directory tree (like a codebase).
It is also because it is multithreaded (IIRC). There is another one written in Go that is for sure. GNU grep is faster one file (at least the last time I tested).
It searches code as fast as the_silver_searcher(ag).
It ignores file patterns from your .gitignore.
It searches UTF-8, EUC-JP and Shift_JIS files.
It provides binaries for multi platform (Mac OS X, Windows, Linux).
But even more so, sensible defaults that work nicely for code search use cases. And written in Go, without accumulated cruft for 200 platforms and POSIX subtleties like grep, so even someone like me can hack it to do something special if I want to.
It's also faster (30% to 50% for my codebase), written in Go, and easier to extend.
$ time ag zmq | wc -l
63
real 0m0.017s
user 0m0.022s
sys 0m0.016s
$ time pt zmq | wc -l
296
real 0m0.013s
user 0m0.014s
sys 0m0.017s
(Timing differences consistent over multiple runs of both on the same codebase -- no disc cache effect) -- the wc difference is because of different "surrounding context" setting.
I'm the author of ag. Pt has drawbacks that you may not be aware of. For example: pt is case-sensitive by default. (Ag defaults to smart-casing. If your query is all lowercase, it's case-insensitive.) If you tell pt to do a case-insensitive search, it kills performance. Here's an example:
ggreer@lithium:~/code% time ag instantiationname
...
ag instantiationname 3.82s user 2.16s system 157% cpu 3.797 total
ggreer@lithium:~/code% time pt -i instantiationname
...
pt -i instantiationname 136.65s user 0.77s system 773% cpu 17.761 total
That's to search a 20GB code directory. Both find the same matches. (Instances of "InstantiationName" in node/deps/gtest/include/gtest/gtest-param-test.h)
Another issue is that pt tends to bail on errors. If you tell it to follow symlinks (-f) and it encounters a single broken symlink, it will exit without finishing the search. Every other search tool I know of (grep, ag, ack) will keep on truckin'.
That said, pt is faster than ag for case-sensitive matches. It looks like much of that comes from a parallelized directory traversal. The architecture of ag is different. It has many worker threads, but only one thread going through dirs. Looks like I'll have to step-up my game. :)
I found another pt performance issue. Searching an 8GB file:
% time pt -i hello4294967296 big_file.txt
big_file.txt
134217728:hello4294967296
pt -i hello4294967296 big_file.txt 535.86s user 1.35s system 100% cpu 8:56.97 total
That's right: pt takes 9 minutes to do a case-insensitive match on an 8GB file. On the same machine doing the same search, ag takes 7 seconds and grep takes 4s.
Also, it looks like pt was beating ag because it defaults to case-sensitive search. If I make ag do the same case-sensitive search, it's slightly faster in my benchmarks (0.63s to search my ~/code directory vs pt's 0.65s).
Thanks, original parent, never occurred to me to do case-insensitive searches, but I can see them being useful in some cases.
My main beef with ag was probably due to some older distribution or worse packager (tried it a few years ago), because I tried it again now from "brew" and it's pretty much just as convenient as pt -- and with some more features.
I wonder if all of these optimizations are still as helpful. For example, skipping bytes only saves memory bandwidth if you skip whole cachelines.
Also, for --mmap, there are a couple of conflicting considerations. On newer CPUs, large memory copies are very fast. With mmap, though, unless you use MAP_POPULATE, you can get a page fault every few pages, and on CPUs with exception mechanisms as bad as x86's, page faults are very slow (probably 20 times as bad as a syscall).
Memory bandwidth is not the main concern here since they explicitly mention that executing more CPU instructions kill performance far quicker - in this case. As for mmap, perhaps you would like to present your own benchmark that contradicts the article.
But anyway, I highly doubt that either of the basic things you mentioned are news to the implementors. The implementation is supposed to run on all kinds of processors with varying architectures so making the implementation portable is bound to leave performance on the table. Not to mention other constraints like maintainability and meeting the performance criteria for the common use case/workload.
The bottleneck is not reading the byte, it is the CPU clocks needed to figure out that it is safe to skip.
Assuming cache prefetching works well thanks to the sequential search, you can try to max out the memory bandwidth. But even DDR3 has best-case transfer rates from 6-17 GB/second. So to max it out on a 3GHz CPU, you need to eliminate 2-6 bytes on every tick of the clock.
Does it use SIMD instructions? Some newer cpus can do up to 16 comparisons at once. That also might mean skipping will be branchy and not exactly divisible on word boundaries for best vectorization.
* The Boyer-Moore algorithm it uses has a skip loop, which is used to quickly search for the last byte in the needle before trying to do more checking. Specifically, the skip loop is implemented with memchr, which can be found in your libc and probably compiles down to SIMD instructions. (The trick here isn't actually skipping bytes, but processing many bytes in a single loop iteration.) This optimization works well in the common case, but can slow things down a bit if your skip byte is really common in the haystack.
* When a pattern consists of an alternation of literals, like, `abc|mno|xyz`, then it uses an algorithm based on Commentz-Walter, which you can think of as a hybrid between Aho-Corasick and Boyer-Moore. That is, you get the byte skipping of Boyer-Moore even though you're searching for multiple patterns. The performance of Commentz-Walter degrades as the number of patterns increases because there's fewer opportunities for meaningful skips. Nevertheless, small/short alternations are probably the common case with GNU grep. (The state of the art has actually improved for this specific case and GNU grep could probably benefit. Specifically, the Hyperscan folks over at Intel have some really cool SIMD algorithms for matching multiple patterns. I implemented one here (the comments include a description with examples): https://github.com/rust-lang-nursery/regex/blob/master/src/s...)
* The actual regex engine is a lazy DFA (similar to RE2) in that its states are computed on-demand as needed. That is, some parts of the DFA that aren't used are never computed at all. If the DFA gets too big in memory, the existing states are dropped and re-computed as needed. In this way, GNU grep gets the performance of a DFA while preserving linear time matching. (At most one new DFA state is computed for each byte of input.) The inner DFA loop is unrolled.
* To re-emphazie a point from the post: avoiding line-by-line matching is critical. This makes it much easier to extract literals from a pattern. For example, if your pattern is `\w+foobar\d+`, then GNU grep can still search for `foobar` because it can make an assumption that its haystack is generally a very small slice of the entire search space (i.e., a single line). This type of optimization is much harder to pull off in a general purpose regex engine. A general purpose regex engine might limit itself to looking for prefix or suffix literals only.
* GNU grep really doesn't do all that well for multi-byte encodings. e.g., The time difference between `LC_ALL=C egrep -c '\w{5}'` and `LC_ALL=en_US.UTF-8 egrep -c '\w{5}'` on a large file is quite dramatic. (~3 seconds vs ~52 seconds for a ~2GB mostly-ASCII file that is warm in my page cache.) There's really no need for this slow down if you can bake UTF-8 decoding into your DFA (RE2 does this). However, you then lose the ability to `grep` over other character encodings without transcoding them to UTF-8 first. (In fact, GNU grep may be doing this transcoding step today anyway, so maybe baking UTF-8 into your DFA would be a strictly better solution since it would be much faster on a much larger number of inputs, but probably not slower than any inputs. However, I'm not terribly familiar with this part of GNU grep, so I could have this wrong.)
> GNU grep really doesn't do all that well for multi-byte encodings.
I might be wrong, but I have a recollection that this was massively improved somewhat recently, maybe a year or so back. Are you using a recent-ish version?
This might also be to do with the needle - I think I was testing with a fixed ASCII string, which should work the same in either the UTF-8 or C locales (assuming a UTF-8 file). But \w I guess should behave differently depending on locale, e.g. I'd expect it to match ß in UTF-8 but not C.
> I might be wrong, but I have a recollection that this was massively improved somewhat recently, maybe a year or so back.
I seem to recall hearing that too. A brief skim of the commit log shows multiple possible improvements, but I don't have time to drill down further into that.
It could just be that it used to be worse. :-)
(If you look in the core DFA loop, which is in the `dfaexec_main` function inside `dfa.c`, then you can see that there is a completely different path through the DFA if it needs to use multibyte encodings.)
> But \w I guess should behave differently depending on locale, e.g. I'd expect it to match ß in UTF-8 but not C.
Exactly. But this doesn't have to have a measurable performance impact, assuming the number of matches doesn't radically change. (Well, this is a bit of a lie, because a Unicode aware \w is much bigger than an ASCII-only \w, which implies the DFA needs more memory, which could lead to a more general slowdown if the state cache needs to get emptied more frequently.)
When searching for a literal string, the bottleneck tends to be memory bandwidth. When doing regex searches, the bottleneck is usually CPU. If caches are cold, then disk I/O is the limiting factor. Even in that case, technologies like NCQ allow some degree of concurrency.
If you have ag[1], you can play around with the --workers option to see how various numbers of threads change performance. (The default is for ag to use #CPUs-1 workers.)
Even when disk caches are involved (ie. data in RAM, not cache), a typical grep application (short "needle" to search for) should saturate the memory bandwidth before CPU cores.
Running a few threads/processes in parallel could improve throughput with latency hiding, but adding more shouldn't give any benefit.
I believed busybox's grep to be faster as it was very small, with GNU version slower due to the many features added over the time, but it is quite not true! On a slow device as some evoked said, the difference is highly noticeable.
That's actually an important point. It's easy to misinterpret
"The key to making programs fast is to make them do practically nothing."
Busybox's grep is small and simple. GNU grep is big and complex, which is actually why it's faster. It does a lot of work to avoid doing a lot of work.
Wait, I always tried to make my pattern as short as possible and I thought it would speed up the searches. So I guess this means I'm actually better off searching for the longest possible match then?
Pretty much, yes. Longer strings should also match fewer times than short strings, which should also speed things up (because reporting a match has its own overhead associated with it, like printing text to the terminal).
<stepping up to soap box/> So yeah...as usual, algorithm matters. Implementation matters, and well....knowledge of the architecture/data-movement matter. These days the architecture/data-movement knowledge matters almost as much or more than the algorithm. Picking an algorithmically faster implementation that moves data more (e.g., tree traversals usually have poor cache behavior, more bursts required, etc.) is often the case b/c programmers don't really understand how the computer works, but are great at coding O(n logn) algorithms. We shouldn't stop teaching algorithms, but data-movement/access likely needs to be emphasized along with the understanding of algorithms.