Hacker News

Google Code Search does do regexes against an impressively large set of documents nearly instantly, though it's clearly much smaller than the set of all webpages. It'd be interesting to know how much Google could scale it; could they handle 100x the number of documents in the current code search? 10,000x?


One thing to note is that Google Code Search, as far as I know, only supports regular expressions that are actually regular. This means you can't use an expression with a backreference, like /(ab..)\1/, since backreferences make the language non-regular.

All in all, re2, the regular expression engine that Google Code Search uses, is a very interesting project; you can read about it on its Google Code page: http://code.google.com/p/re2/.


The issue is not so much how much CPU time the regex evaluation takes; it's the I/O time of loading every byte of every page we've crawled.

That being said, re2 does look pretty cool... having a guarantee that nothing in a regex can blow up is pretty nice, on top of the overall speed improvement.



