IndexTank / 80legs Crawlathon (developer contest)

va_coder · on June 15, 2011

IndexTank, unlike WebSolr, has a proprietary API. I don't think you can easily move the data to your own servers like you can with WebSolr.

diego · on June 15, 2011

Yes you can! Our indexes are persisted in Lucene format internally. Nobody has asked to download an index but if they asked we'd make it available.

personalcompute · on June 15, 2011

Another problem is 80legs will only crawl pages <100kb with the free/contest plan. This means it is useless for all but the smallest pages.

jdrock · on June 15, 2011

Er.. not true, the contest plan (which is different from the free plan) allows up to 10 MB download. Registered contestants should have access to their plans within an hour of registering.

And if you don't have it, just contact us: http://www.80legs.com/contact.html.

personalcompute · on June 16, 2011

Ah, awesome, thank you for the correction; this lets me continue with my plan. I posted that within an hour of registering, but I had asked on their support page as well.

revorad · on June 15, 2011

If I'm building a search engine, say focused on telescopes, can I use 80legs to crawl YouTube videos (or results from Google video search)? I mean, can I add this URL to my list of URLs to crawl? - http://www.youtube.com/results?search_query=telescope

jdrock · on June 15, 2011

Our default crawler obeys robots.txt and it looks like the /results URLs are not allowed. However.. I think you could start from a URL like http://www.youtube.com/watch?v=sAzhOSbxMiI and then crawl to the linked videos from there...
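For anyone who wants to check a seed URL before submitting it, here is a minimal sketch of that kind of robots.txt test using Python's standard `urllib.robotparser`. The rules in `RULES` are an illustrative assumption based on the comment above, not a copy of any site's real robots.txt, and `allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only (an assumption, not a real site's robots.txt).
RULES = """\
User-agent: *
Disallow: /results
"""


def allowed(url, rules=RULES, agent="*"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines, so no network
    # request is made here.
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.youtube.com/results?search_query=telescope",
        "http://www.youtube.com/watch?v=sAzhOSbxMiI",
    ):
        print(url, "allowed" if allowed(url) else "disallowed")
```

Against a live site you would load the real file with `set_url()` and `read()` instead of passing the rules in as text.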

revorad · on June 15, 2011

Oh I didn't know about the robots.txt rule for /results. Thanks for pointing it out or I would have built my own crawler and got banned. I think I'll go with playlists.

orborde · on June 15, 2011

What's the goal of the contest? "Build the most awesome app on our platform"?

It's not clear to me reading the post.

diego · on June 15, 2011

Create a small web search engine. Crawl stuff with 80legs, index it with IndexTank. Example:

- Crawl everything you can find about bitcoin and do a bitcoin search engine (just popped up in my mind, may not be the most interesting idea).

personalcompute · on June 15, 2011

Then post the bitcoin-related whatever to HN. Instant winner.

btucker · on June 15, 2011

Is there any documentation on how to get started using these two products together?

diego · on June 15, 2011

It should be pretty straightforward:

- run a crawl at 80legs
- download the results as a CSV or XML
- feed them to IndexTank using your client of choice

Over the weekend I'll put on my (rusty) hacker hat, do an example and blog about it.
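In the meantime, a rough sketch of the last two steps, assuming the crawl results come back as a CSV. The column names `url` and `text` are hypothetical, and `add_document` is a stand-in callable for whatever "add a document" call your IndexTank client provides; nothing here uses the real client library.

```python
import csv
import io


def rows_to_documents(csv_text, id_field="url", text_field="text"):
    """Turn CSV crawl results into (docid, fields) pairs.

    The column names are assumptions; adjust them to the actual export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row in reader:
        docid = (row.get(id_field) or "").strip()
        text = (row.get(text_field) or "").strip()
        # Skip rows with no identifier or no content to index.
        if docid and text:
            docs.append((docid, {"text": text}))
    return docs


def feed(docs, add_document):
    """Send every document to the index via the supplied callable."""
    count = 0
    for docid, fields in docs:
        add_document(docid, fields)
        count += 1
    return count


if __name__ == "__main__":
    sample = (
        "url,text\n"
        "http://example.com/a,Bitcoin basics\n"
        "http://example.com/b,\n"
    )
    indexed = {}  # a dict stands in for the real index
    sent = feed(rows_to_documents(sample), indexed.__setitem__)
    print(sent, indexed)
```

Passing the indexing call in as a function keeps the parsing testable without network access; swap the dict for your client's method when you wire it up.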

jdrock · on June 15, 2011

Specific documentation for 80legs is available at http://wiki.80legs.com. To get fancy, you may want to check out http://wiki.80legs.com/80apps to make your own custom extractors. Note that any custom extractors you write will output files in binary in 80legs, so you'll need to convert the byte array to .txt, .csv or .xml or whatever format you want.

(We post custom extractor results as binary because you can also return stuff like images!)
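As a small illustration of that conversion step, here is a sketch that decodes a result payload to text and saves it. It assumes the extractor wrote UTF-8 text; `decode_payload` and `save_as_text` are hypothetical helper names, and the way you obtain the bytes from 80legs is not shown.

```python
from pathlib import Path


def decode_payload(payload, encoding="utf-8"):
    """Decode raw result bytes to text.

    Undecodable bytes become U+FFFD instead of raising, which is handy when
    the encoding is only a guess.
    """
    return payload.decode(encoding, errors="replace")


def save_as_text(payload, out_path, encoding="utf-8"):
    """Decode the payload and write it to out_path (.txt, .csv, .xml, ...)."""
    text = decode_payload(payload, encoding)
    Path(out_path).write_text(text, encoding="utf-8")
    return text


if __name__ == "__main__":
    raw = b"url,title\nhttp://example.com/,Example\n"
    print(decode_payload(raw))
```

If the extractor returned an image or another non-text format, write the bytes out unchanged with `Path(out_path).write_bytes(payload)` instead of decoding them.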

Karhan · on June 16, 2011

Is there a good reason that reddit has banned 80legs crawling?

jdrock · on June 16, 2011

True story: We crawled them a while back (before they expanded their engineering team) and because of our distributed system, "alarms" were going off. Rather than take time to tell their system we were not a DDoS attack, they put us in robots.txt. I imagine their small team had other stuff to work on.

xtacy · on June 17, 2011

So, 80legs caused more traffic volume to reddit than Google crawler? That seems interesting; why would it be the case?

jdrock · on June 17, 2011

Not necessarily more traffic, though that's possible. More likely it was the fact that it was coming from multiple IP addresses.

Omnipresent · on June 15, 2011

Can't sign up yet. Seems to be down:

80legs is currently down for maintenance.

We're working on upgrades, new releases, and other fun stuff! Be back soon!

jdrock · on June 15, 2011

Hm.. can you try again? Seems to be working from our end!