Oh Good Grep Web Grepper: A New Web Intelligence Feature From Blekko (searchengineland.com)
58 points by krishna2 on Sept 14, 2011 | 17 comments


I was super excited until I read that you have to submit requests to be voted on. While I understand the difficulty (impossibility?) of having this kind of service on-demand, I really would rather not have to submit my obscure and possibly business-intelligence related queries to a community for voting. This could be a game-changing service if they could somehow make it on-demand.


It is on-demand, and the number of votes is not a prerequisite (though votes do help). However, we do want to keep an eye on the number of grep jobs we can run per day without affecting other services' performance. You should submit your grep!

PS: I am one of the engineers who built webgrep


I second the OP. I had the "no f'ing way!" expression as I was reading the post until I read the part about making a request. Big downer :(


Thanks for the feedback, both you and the OP. We will make this clearer.


Here is an example web grep that I ran to find the top-ranked sites that used kissmetrics.com for user tracking. Last month there was a huge blow-up over kissmetrics possibly using ETags and other hacks to track users across multiple sites.

https://blekko.com/webgrep?page=view&id=3f469c08de300c12...


That grep is too generic to imply that someone was using it vs just linking to kissmetrics.com.


Agreed. If you know the exact JS file or code snippet that kissmetrics requires, that would make an even better grep. Wanna take a stab?


This looks very cool, but what I was really hoping for is a way to finally enter a regular expression in a search box and get results back.

Looks like I'll need to wait a bit longer.


Grep is a mapjob which takes hours to run. You'll be waiting quite a while until anyone can afford to quickly run regex queries against billions of documents! And by then, there will be hundreds of billions of documents.
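To make the "mapjob" shape concrete, here is a minimal sketch of a grep written as a map step over crawled documents. It is an illustration only, not Blekko's implementation: `grep_map`, `sample_docs`, and the `*.example` URLs are hypothetical names made up for this example, and a real job would run the same function over many shards of the crawl in parallel.

```python
from typing import Iterable, Iterator, Tuple


def grep_map(docs: Iterable[Tuple[str, str]], needle: str) -> Iterator[str]:
    """Map step: yield the URL of every document whose body contains needle."""
    for url, body in docs:
        # Each body is scanned in full; there is no index to skip documents.
        if needle in body:
            yield url


# Hypothetical stand-in for one shard of a crawl: (url, page source) pairs.
sample_docs = [
    ("http://a.example/", '<script src="http://tracker.example/t.js"></script>'),
    ("http://b.example/", "<p>no scripts here</p>"),
    ("http://c.example/", '<a href="http://tracker.example/">a plain link</a>'),
]

matches = list(grep_map(sample_docs, "tracker.example"))
print(matches)
```

Because the function has to touch every document, run time grows with the size of the crawl rather than with the number of hits.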


Google Code Search does do regexes against an impressively large set of documents nearly instantly, though it's clearly much smaller than the set of all webpages. It'd be interesting to know how much Google could scale it; could they handle 100x the number of documents in the current code search? 10,000x?


One thing you should note is that Google Code Search, as far as I know, supports regular expressions that are actually regular. This means you can't have an expression like /(ab..)\1/, for example.

All in all, re2, the regular expression engine that Google Code Search uses, is a very interesting project; you should read about it on its Google Code page: http://code.google.com/p/re2/.
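For anyone wondering what the backreference in /(ab..)\1/ actually does, here is a small sketch using Python's built-in `re` module, which is a backtracking engine that does accept backreferences (RE2 does not). The variable names and test strings are made up for illustration. The `\1` forces the second half to repeat whatever the first group captured, which is the kind of "remember what you saw" matching that goes beyond regular languages in general.

```python
import re

# Backreference: \1 must repeat exactly what group 1 captured.
backref = re.compile(r"(ab..)\1")
# No backreference: two independent four-character groups.
independent = re.compile(r"(ab..)(ab..)")

repeated = "abcdabcd"   # second half repeats the first
different = "abcdabxy"  # second half starts with "ab" but differs

backref_hits = (bool(backref.search(repeated)), bool(backref.search(different)))
independent_hits = (
    bool(independent.search(repeated)),
    bool(independent.search(different)),
)

print("with \\1:   ", backref_hits)
print("without \\1:", independent_hits)
```

The pattern without `\1` accepts both strings, while the backreference version accepts only the one whose halves are identical.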


The issue is not so much how much CPU time the regex evaluation takes; it's the I/O time of loading every byte of every page we've crawled.

That being said, re2 does look pretty cool... having a guarantee that nothing in an re can blow up is pretty nice, on top of the overall speed improvement.


We do plan to eventually support regexes, but we will introduce them slowly with limited support rather than jumping head-first into PCRE. Of course, I am only talking about the scope of webgrepper here.

PS: I am one of the engineers who built webgrep


I guess the more accurate motto "Fgrep the Web" would be off-putting even for geeks!
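For readers who don't have the distinction at their fingertips: fgrep (grep -F) matches fixed strings, while grep treats the pattern as a regular expression. A rough Python sketch of the difference follows; the helper names `fgrep` and `grep` here are hypothetical stand-ins, not the real tools.

```python
import re


def fgrep(needle: str, text: str) -> bool:
    """Fixed-string match: every character in needle is taken literally."""
    return needle in text


def grep(pattern: str, text: str) -> bool:
    """Regex match: characters such as '.' have special meaning."""
    return re.search(pattern, text) is not None


text = "see kissmetricsXcom for details"

fixed = fgrep("kissmetrics.com", text)              # '.' must be a literal dot
regex = grep("kissmetrics.com", text)               # '.' matches any character
escaped = grep(re.escape("kissmetrics.com"), text)  # escaping restores literal match

print(fixed, regex, escaped)
```

Escaping the pattern makes the regex version behave like the fixed-string one, which is roughly what a fixed-string-only web grep offers.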


Being able to see how many sites have <script> X installed is amazing! Very cool feature.


Cool, wondering what people are going to grep for


Here is a list of Greps that have already completed: http://blekko.com/webgrep?status=completed



