Oh Good Grep Web Grepper: A New Web Intelligence Feature From Blekko

acabal · on Sept 14, 2011

I was super excited until I read that you have to submit requests to be voted on. While I understand the difficulty (impossibility?) of having this kind of service on-demand, I really would rather not have to submit my obscure and possibly business-intelligence related queries to a community for voting. This could be a game-changing service if they could somehow make it on-demand.

krishna2 · on Sept 14, 2011

It is on-demand, and the number of votes is not a prerequisite (though votes do help). However, we do want to keep an eye on the number of grep jobs we can run per day without affecting other services' performance. You should submit your grep!

PS: I am one of the engineers who built webgrep

badclient · on Sept 14, 2011

I second the OP. I had the "no f'ing way!" expression as I was reading the post until I read the part about making a request. Big downer :(

krishna2 · on Sept 15, 2011

Thanks for the feedback, both you and the OP. We will make this clearer.

randomstring · on Sept 14, 2011

Here is an example web grep that I ran to find the top-ranked sites that use kissmetrics.com for user tracking. Last month there was a huge blow-up over kissmetrics possibly using ETags and other hacks to track users across multiple sites.

https://blekko.com/webgrep?page=view&id=3f469c08de300c12...

pyre · on Sept 14, 2011

That grep is too generic to imply that someone was using it vs just linking to kissmetrics.com.

krishna2 · on Sept 15, 2011

Agreed. If you know the exact JS file or code snippet that kissmetrics requires, then it would be an even better grep. Wanna take a stab?
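
To illustrate the point about generic versus specific greps, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the two pages, their HTML, and the `SNIPPET` string are hypothetical, and the real kissmetrics embed code is not shown in this thread.

```python
# Hypothetical pages: one merely links to the site, one embeds a tracking script.
PAGES = {
    "blog.example": '<a href="http://kissmetrics.com">a post about kissmetrics.com</a>',
    "shop.example": '<script src="//cdn.example/km-hypothetical.js"></script>'
                    "<!-- analytics by kissmetrics.com -->",
}

# Hypothetical stand-in for the exact snippet a site would have to include.
SNIPPET = '<script src="//cdn.example/km-hypothetical.js"'


def fgrep(pages, needle):
    """Return the sorted URLs whose HTML contains the fixed string `needle`."""
    return sorted(url for url, html in pages.items() if needle in html)


generic_hits = fgrep(PAGES, "kissmetrics.com")   # matches links and mentions too
specific_hits = fgrep(PAGES, SNIPPET)            # matches only the embed
```

The generic string matches both pages, while the snippet matches only the page that actually includes the script, which is the distinction pyre is asking for.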

binarymax · on Sept 14, 2011

This looks very cool, but what I was really hoping for is a way to finally enter a regular expression in a search box and get results back.

Looks like I'll need to wait a bit longer.

wumpus · on Sept 14, 2011

Grep is a mapjob which takes hours to run. You'll be waiting quite a while until anyone can afford to quickly run regex queries against billions of documents! And by then, there will be hundreds of billions of documents.
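
As a rough sketch of what "grep as a map job" means, the Python below runs a fixed-string match independently over each shard of a corpus and concatenates the results. This is not Blekko's implementation: the shard layout, the function names, and the toy corpus are all assumptions, and a real job would stream billions of pages from disk across many machines.

```python
from concurrent.futures import ThreadPoolExecutor


def grep_shard(shard, needle):
    """Map step: return URLs in one shard whose body contains the fixed string."""
    return [url for url, body in shard if needle in body]


def web_grep(shards, needle):
    """Apply the map step to every shard and concatenate the matches in shard order."""
    with ThreadPoolExecutor() as pool:
        per_shard = pool.map(lambda shard: grep_shard(shard, needle), shards)
        return [url for matches in per_shard for url in matches]


# Hypothetical toy corpus: each shard is a list of (url, page body) pairs.
SHARDS = [
    [("a.example", "<script src='tracker.js'></script>"), ("b.example", "hello")],
    [("c.example", "see the tracker.js docs")],
]

hits = web_grep(SHARDS, "tracker.js")
```

Every page body has to be read once per job, so the running time grows with the size of the crawl no matter how cheap the match itself is.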

_delirium · on Sept 14, 2011

Google Code Search does do regexes against an impressively large set of documents nearly instantly, though it's clearly much smaller than the set of all webpages. It'd be interesting to know how much Google could scale it; could they handle 100x the number of documents in the current code search? 10,000x?

tikhonj · on Sept 14, 2011

One thing you should note is that Google Code Search, as far as I know, supports regular expressions that are actually regular. This means you can't have an expression like /(ab..)\1/, for example.

In all, re2, the regular expression engine that Google Code Search uses, is a very interesting project; you should read about it on its Google Code page: http://code.google.com/p/re2/.
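
For anyone unsure what the backreference in /(ab..)\1/ does, here is a small sketch using Python's standard `re` module, which is a backtracking engine and therefore accepts it. The helper name `has_repeat` is made up for illustration. RE2 does not support backreferences, so it would reject this pattern.

```python
import re

# "(ab..)" captures four characters starting with "ab"; "\1" then requires
# the exact same four characters to appear again immediately afterwards.
BACKREF = re.compile(r"(ab..)\1")


def has_repeat(text):
    """Return True if text contains some 'ab??' chunk immediately repeated."""
    return BACKREF.search(text) is not None


repeated = has_repeat("xxabcdabcd")   # "abcd" followed by "abcd"
different = has_repeat("abcdabxy")    # "abcd" followed by "abxy"
```

The match depends on remembering arbitrary captured text, which is why a pattern like this falls outside the regular expressions that a purely automaton-based engine handles.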

wumpus · on Sept 15, 2011

The issue is not so much how much CPU time the regex evaluation takes up; it's the I/O time of loading every byte of every page we've crawled.

That being said, re2 does look pretty cool... having a guarantee that nothing in an re can blow up is pretty nice, on top of the overall speed improvement.

krishna2 · on Sept 14, 2011

We do plan to eventually support regexes, but we will introduce them slowly with limited support rather than jumping in head-first with PCRE. Of course, I am only talking about the scope of webgrepper here.

PS: I am one of the engineers who built webgrep

ez77 · on Sept 14, 2011

I guess the more accurate motto "Fgrep the Web" would be off-putting even for geeks!

diegogomes · on Sept 14, 2011

Being able to see how many sites have <script> X installed is amazing! Very cool feature.

alukasiewicz · on Sept 14, 2011

Cool, wondering what people are going to grep for

krishna2 · on Sept 14, 2011

Here is a list of Greps that have already completed: http://blekko.com/webgrep?status=completed