Er... not true. The contest plan (which is different from the free plan) allows up to 10 MB download. Registered contestants should have access to their plans within an hour of registering.
Ah, awesome, thank you for the correction. This allows me to continue with my plan. I said that within an hour of registering, but had asked on their support page as well.
If I'm building a search engine, say one focused on telescopes, can I use 80legs to crawl YouTube videos (or results from Google video search)? I mean, can I add this URL to my list of URLs to crawl? http://www.youtube.com/results?search_query=telescope
Our default crawler obeys robots.txt, and it looks like the /results URLs are not allowed. However, I think you could start from a URL like http://www.youtube.com/watch?v=sAzhOSbxMiI and then crawl to the linked videos from there.
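For anyone who wants to check a seed list against robots.txt rules before submitting it, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `HYPOTHETICAL_RULES` and the `example-crawler` user agent are assumptions made up for illustration; they are not a copy of YouTube's actual robots.txt, and this is not how the 80legs crawler itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; a real site's robots.txt
# will differ and should be fetched from the site itself.
HYPOTHETICAL_RULES = [
    "User-agent: *",
    "Disallow: /results",
]


def build_parser(robots_lines):
    """Build a parser from robots.txt lines without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def allowed_urls(parser, urls, user_agent="example-crawler"):
    """Return only the URLs that the rules allow for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    seeds = [
        "http://www.youtube.com/results?search_query=telescope",
        "http://www.youtube.com/watch?v=sAzhOSbxMiI",
    ]
    parser = build_parser(HYPOTHETICAL_RULES)
    for url in allowed_urls(parser, seeds):
        print(url)
```

Under these assumed rules, the /results URL is filtered out and the /watch URL is kept. To check a live site, `RobotFileParser.set_url()` followed by `read()` fetches the real file instead.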
Oh, I didn't know about the robots.txt rule for /results. Thanks for pointing it out; otherwise I would have built my own crawler and gotten banned. I think I'll go with playlists.
Specific documentation for 80legs is available at http://wiki.80legs.com. To get fancy, you may want to check out http://wiki.80legs.com/80apps to make your own custom extractors. Note that any custom extractors you write will output binary files in 80legs, so you'll need to convert the byte array to .txt, .csv, .xml, or whatever format you want.
(We post custom extractor results as binary because you can also return stuff like images!)
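As a rough sketch of the byte-array conversion mentioned above, the Python below guesses whether a blob is an image or text and writes it out with a matching extension. It assumes you have already pulled one result's raw bytes out of the result file. The layout of 80legs result files is not covered here, and `guess_extension`, `decode_text`, and `save_result` are hypothetical helper names, not part of any 80legs API.

```python
from pathlib import Path

# Well-known file signatures ("magic bytes") for two common image formats.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"


def guess_extension(data: bytes) -> str:
    """Pick a file extension from the leading bytes; default to text."""
    if data.startswith(PNG_MAGIC):
        return ".png"
    if data.startswith(JPEG_MAGIC):
        return ".jpg"
    return ".txt"


def decode_text(data: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes to a string, replacing undecodable bytes.

    UTF-8 is an assumption; use whatever encoding your extractor wrote.
    """
    return data.decode(encoding, errors="replace")


def save_result(data: bytes, stem: str, out_dir: str = ".") -> Path:
    """Write one result to disk: images as raw bytes, everything else as text."""
    ext = guess_extension(data)
    path = Path(out_dir) / (stem + ext)
    if ext == ".txt":
        path.write_text(decode_text(data), encoding="utf-8")
    else:
        path.write_bytes(data)
    return path
```

For .csv or .xml output, the same `decode_text` step applies; only the extension and any parsing you do afterwards would change.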
True story: We crawled them a while back (before they expanded their engineering team) and, because of our distributed system, "alarms" were going off. Rather than take time to tell their system we were not a DDoS attack, they put us in robots.txt. I imagine their small team had other stuff to work on.