IndexTank / 80legs Crawlathon (developer contest) (indextank.com)
41 points by diego on June 15, 2011 | 20 comments



IndexTank, as opposed to WebSolr, has a proprietary API. I don't think you can easily move the data to your own servers like you can with WebSolr.


Yes you can! Our indexes are persisted in Lucene format internally. Nobody has asked to download an index but if they asked we'd make it available.


Another problem is that 80legs will only crawl pages under 100 KB with the free/contest plan. This means it is useless for all but the smallest pages.


Er... not true: the contest plan (which is different from the free plan) allows downloads of up to 10 MB. Registered contestants should have access to their plans within an hour of registering.

And if you don't have it, just contact us: http://www.80legs.com/contact.html.


Ah, awesome, thank you for the correction, this allows me to continue with my plan. I said that within an hour of registering, but had asked on their support page as well.


If I'm building a search engine, say focused on telescopes, can I use 80legs to crawl YouTube videos (or results from Google video search)? I mean, can I add this URL to my list of URLs to crawl? - http://www.youtube.com/results?search_query=telescope


Our default crawler obeys robots.txt and it looks like the /results URLs are not allowed. However.. I think you could start from a URL like http://www.youtube.com/watch?v=sAzhOSbxMiI and then crawl to the linked videos from there...
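For anyone who wants to check a seed list before submitting it, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `SAMPLE_RULES` are a made-up stand-in that mirrors the situation described above (they are not a copy of YouTube's actual robots.txt), and `examplebot` is a hypothetical user agent name, not the 80legs crawler's real one.

```python
from urllib.robotparser import RobotFileParser

# Assumption: illustrative rules only, not the real robots.txt of any site.
SAMPLE_RULES = """\
User-agent: *
Disallow: /results
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list) -> list:
    """Keep only the URLs the given user agent may fetch under these rules."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    seeds = [
        "http://www.youtube.com/results?search_query=telescope",
        "http://www.youtube.com/watch?v=sAzhOSbxMiI",
    ]
    rules = build_parser(SAMPLE_RULES)
    for url in allowed_urls(rules, "examplebot", seeds):
        print("ok to crawl:", url)
```

In real use you would point the parser at the site's live file with `set_url()` and `read()` instead of a hard-coded string.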


Oh I didn't know about the robots.txt rule for /results. Thanks for pointing it out or I would have built my own crawler and got banned. I think I'll go with playlists.


What's the goal of the contest? "Build the most awesome app on our platform"?

It's not clear to me reading the post.


Create a small web search engine. Crawl stuff with 80 legs, index it with IndexTank. Example:

- Crawl everything you can find about bitcoin and do a bitcoin search engine (just popped up in my mind, may not be the most interesting idea).


Then post the bitcoin-related whatever to HN. Instant winner.


Is there any documentation on how to get started using these two products together?


It should be pretty straightforward:

- run a crawl at 80legs
- download the results as a csv or xml
- feed them to IndexTank using your client of choice
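The last two steps might look roughly like the sketch below. It is not an official example: the `url` and `text` column names are assumptions about the CSV export, and `add_document` is a placeholder for whatever "add a document" call your IndexTank client exposes, passed in as a plain function so nothing here depends on a specific client library.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, Tuple


def rows_to_documents(csv_file: Iterable[str]) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (docid, fields) pairs from a crawl-results CSV.

    Assumption: the export has 'url' and 'text' columns; the real
    80legs result layout may differ.
    """
    for row in csv.DictReader(csv_file):
        url = (row.get("url") or "").strip()
        if not url:
            continue  # skip rows that have no usable document id
        yield url, {"text": row.get("text") or "", "url": url}


def feed_index(csv_file: Iterable[str],
               add_document: Callable[[str, Dict[str, str]], None]) -> int:
    """Send every row to the index and return how many were sent."""
    count = 0
    for docid, fields in rows_to_documents(csv_file):
        add_document(docid, fields)  # hypothetical hook into your client
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in for a downloaded results file.
    sample = io.StringIO(
        "url,text\n"
        "http://example.com/a,bitcoin basics\n"
        "http://example.com/b,mining hardware\n"
    )
    sent = feed_index(sample, lambda docid, fields: print("indexing", docid))
    print(sent, "documents sent")
```

Passing the indexing call in as a function also makes it easy to dry-run the pipeline against a sample file before touching a real index.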

Over the weekend I'll put on my (rusty) hacker hat, do an example and blog about it.


Specific documentation for 80legs is available at http://wiki.80legs.com. To get fancy, you may want to check out http://wiki.80legs.com/80apps to make your own custom extractors. Note that any custom extractors you write will output files in binary in 80legs, so you'll need to convert the byte array to .txt, .csv or .xml or whatever format you want.

(We post custom extractor results as binary because you can also return stuff like images!)
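If your extractor returns text, the conversion step could be as small as the sketch below. It assumes you already have the raw bytes of one result in memory and that they are UTF-8 text; it does not attempt to parse the 80legs result file format itself, and the function names are made up for illustration.

```python
from pathlib import Path


def bytes_to_text(payload: bytes, encoding: str = "utf-8") -> str:
    """Decode a raw byte array into a string.

    Assumption: the payload is text in the given encoding. Undecodable
    bytes are replaced rather than raising, so one bad record does not
    stop a batch. Do not use this on image payloads.
    """
    return payload.decode(encoding, errors="replace")


def save_as_text(payload: bytes, out_path: str, encoding: str = "utf-8") -> int:
    """Decode the payload, write it as a UTF-8 .txt file, return its length."""
    text = bytes_to_text(payload, encoding)
    Path(out_path).write_text(text, encoding="utf-8")
    return len(text)


if __name__ == "__main__":
    raw = "telescope review, 8 inch Dobsonian".encode("utf-8")
    print(bytes_to_text(raw))
```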


Is there a good reason that reddit has banned 80legs crawling?


True story: We crawled them a while back (before they expanded their engineering team) and because of our distributed system, "alarms" were going off. Rather than take time to tell their system we were not a DDoS attack, they put us in robots.txt. I imagine their small team had other stuff to work on.


So, 80legs caused more traffic volume to reddit than Google's crawler? That seems interesting; why would that be the case?


Not necessarily more traffic, though that's possible. More likely it was the fact that it was coming from multiple IP addresses.


Can't sign up yet. Seems to be down:

80legs is currently down for maintenance.

We're working on upgrades, new releases, and other fun stuff! Be back soon!


Hm.. can you try again? Seems to be working from our end!



