Hi guys. We were actually about to do an "Ask HN: Review our startup" post, but I guess someone beat us to it.
So, please review our startup. :)
We are launching the beta today to a handful of users and will be letting in more and more users over time.
One other note: We don't just offer crawling. Our model is actually to allow you to analyze the web content that you discover. Using your own custom code that you push into 80legs, you can do sophisticated text processing, image processing, look inside PDFs, etc.
Sorry for hijacking your plan to post here first, but I found the idea incredibly cool and useful, and didn't know you guys were around here on HN. 80legs can potentially save a lot of effort for people/companies who need to crawl the web for data and analyze it.
Hope this really works out well for all of you.
P.S. Just in case you are curious, I got a reference note about your application from someone who was following the Web 2.0 Expo.
Thanks for the nice comments, luckystrike. We had a great time at the Web 2.0 Expo and we've been overwhelmed by all the interest in our service. We're pretty hopeful that we've built something that people want. :)
It looks pretty cool and even something I might be interested in using.
However, my initial feedback relates to clarity. On the About page you say 'crawl 2bn pages' and then 'pay $2 per million pages crawled', but you don't actually say that - as I understand it - customers set up custom searches matching a regexp and only pay for the hits (crawls) that match.
My immediate reaction on seeing '2bn'/'$2 per million' was to firstly think 'wtf, $4k per day then?' and secondly 'I hope that's an American not a British billion'. (Though we seem to have adopted yours, now.)
It's really just a wording/clarity thing though, and I might be alone in this.
Thanks, we will work on the wording. Just for clarity, we actually do custom crawls based on your needs. If you need to access one million pages, you tell us how to get to them and pay us $2 plus any time you spend processing those pages. You can do a generic crawl from http://dir.yahoo.com or you can give us a very customized seed list and just read those pages or crawl only a few levels deep from there. Your choice.
You certainly don't need to crawl two billion (2,000,000,000) pages per day. In fact, that's our total estimated capacity right now.
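Just to make the "your choice" part concrete, here's a rough sketch of the kind of crawl specification we're talking about - the field names below are purely illustrative, not our actual API:

    import java.util.Arrays;
    import java.util.List;

    // Purely illustrative crawl spec: seed URLs, a depth limit, and a page cap.
    // These names are hypothetical and not the real 80legs API.
    public class CrawlSpec {
        final List<String> seedUrls; // where the crawl starts
        final int maxDepth;          // how many link levels to follow from the seeds
        final long maxPages;         // stop after this many pages

        CrawlSpec(List<String> seedUrls, int maxDepth, long maxPages) {
            this.seedUrls = seedUrls;
            this.maxDepth = maxDepth;
            this.maxPages = maxPages;
        }

        public static void main(String[] args) {
            // A focused crawl: custom seed list, two levels deep, one-million-page cap.
            CrawlSpec spec = new CrawlSpec(
                    Arrays.asList("http://dir.yahoo.com"), 2, 1_000_000L);
            System.out.println(spec.seedUrls + ", depth " + spec.maxDepth
                    + ", cap " + spec.maxPages + " pages");
        }
    }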
2,000,000 * 40 KB (average, compressed) / (1024 * 1024) * $0.10 per GB ≈ $7.60. That alone is the cost to transfer it to your datacenter. I can't reliably access the data on remote clients.
I guess the $2 price tag is just marketing blah blah.
Our service actually allows you to push your code into the system rather than trying to pull back all of the page contents. So, you end up running your semantic analysis, image analysis, or whatever you want to do on our grid. Very specifically, you implement a processPage() function of the following form:
    byte[] processPage(String url, byte[] pageContents, Object userData);
We run your function on the contents of the pages/images/objects you want to analyze and give you back your results from the millions or billions of pages you want to analyze.
The results from the processPage() function are completely free form. You serialize your results into a byte array and that's what you get back (except you get it back for all of your urls).
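As a concrete (purely illustrative) example, here's a processPage() sketch that pulls out each page's <title> and returns it as UTF-8 bytes - the extraction logic is just an example, and the grid-side harness that calls it isn't shown:

    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative only: one possible processPage() implementation.
    public class TitleExtractor {
        private static final Pattern TITLE = Pattern.compile(
                "<title[^>]*>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        // Extracts the <title> of a crawled page and serializes the result as UTF-8 bytes.
        public static byte[] processPage(String url, byte[] pageContents, Object userData) {
            String html = new String(pageContents, StandardCharsets.UTF_8);
            Matcher m = TITLE.matcher(html);
            String title = m.find() ? m.group(1).trim() : "";
            // Free-form result: any serialization you like, as long as it's a byte array.
            return (url + "\t" + title).getBytes(StandardCharsets.UTF_8);
        }
    }

You could just as easily run a tokenizer, an image decoder, or a PDF parser in there - the only contract is bytes in, bytes out.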
Now, since the processPage() function is free form, you can just turn around and "return pageContents;" from your function. That will give you all of the page contents from your crawl. That's not an ideal case for us, but we can handle it. We might eventually charge a small bandwidth or storage cost for this type of usage, but we do not intend to do so for our normal use case.
The bigger cost to the customer, if they try to pull back all of the contents, will be their own bandwidth charges. They would need to pull all of those pages' contents down to their own servers, which will cost quite a lot unless they have their own fat pipe.
In summary, $2/million-pages-crawled is our real price and is not just marketing.
That's pretty cool. Thinking aloud then, if I wanted to, say, pull out all the adjectives from results matching $foo, I'd end up getting that data back and then have to pipe that into storage myself - costing me both bandwidth in and bandwidth out. Have you thought about cutting out the middleman and letting people write to S3 directly? (Yes, I have no idea how complicated this might be.)
Hey - I work for 80legs as well so thought I'd chime in and answer this question (westside is grabbing some food). We have thought about offering easy integration with AWS, but we'd probably implement this at a later time if we decided to go that route.
How do you (and/or Plura) deal with the problem of running code on other people's machines? How do you know that the data being sent back is valid, or that a competitor can't start a node and reverse-engineer your code? This may be less of an issue than I imagine, but I'm sure it's something you've thought about so I'd be interested in hearing your thoughts.
Great question. We've actually done a lot of work on this to ensure that there isn't a problem with running the code on various people's machines.
First, Plura actually runs the processPage() function in a restricted Java sandbox, so there is no way to see any data on the user's computer or do anything bad to their computer. Also, the code goes through a short verification process before it is deployed.
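To give a rough idea of what "restricted sandbox" means in practice (this is a generic Java illustration, not Plura's actual implementation), you can deny untrusted code access to the local filesystem with a SecurityManager:

    import java.io.FileInputStream;
    import java.io.FilePermission;
    import java.security.Permission;

    // Generic sandboxing illustration, not Plura's actual code:
    // a SecurityManager that denies all file access to code running under it.
    public class SandboxDemo {
        public static void main(String[] args) {
            System.setSecurityManager(new SecurityManager() {
                @Override
                public void checkPermission(Permission perm) {
                    if (perm instanceof FilePermission) {
                        throw new SecurityException("file access denied: " + perm.getName());
                    }
                    // Everything else is allowed in this toy example.
                }
            });
            try {
                new FileInputStream("/etc/passwd"); // untrusted code trying to read local data
            } catch (SecurityException | java.io.FileNotFoundException e) {
                System.out.println("Blocked by sandbox: " + e);
            }
        }
    }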
For the results, we do have a reasonably sophisticated validation process as well. For someone to change results from one node, they would have to do quite a bit of work.
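For a sense of what result validation can look like in general (redundant assignment with majority voting is a standard trick in distributed/volunteer computing - this is not a description of our exact scheme):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Generic validation illustration, not 80legs/Plura's actual scheme:
    // give the same work unit to several nodes and accept the majority answer.
    public class MajorityVote {
        static String accept(List<String> resultsFromNodes) {
            Map<String, Integer> counts = new HashMap<>();
            for (String r : resultsFromNodes) {
                counts.merge(r, 1, Integer::sum);
            }
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            // Only accept a result backed by a strict majority; otherwise re-issue the work.
            return bestCount * 2 > resultsFromNodes.size() ? best : null;
        }

        public static void main(String[] args) {
            // Two honest nodes agree, one tampered node disagrees: the honest result wins.
            System.out.println(accept(Arrays.asList("42", "42", "bogus")));
        }
    }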
Interesting, it's a botnet! From the FAQ: "How can the prices be so low?" "Plura pays developers to embed lightweight widgets in their desktop applications or websites. These widgets harness the idle and excess bandwidth and computing power on the computers of people using the applications and websites."
Plura affiliates actually accept responsibility for getting the permission of their users. Plura encourages disclosure and has found that it is actually very well received by users once it is explained. It always works out better for Plura affiliates when they disclose. To that end, Plura has actually changed its TOS with affiliates so that they directly take responsibility for getting user acceptance.
Most Plura apps/websites give users opt-in/opt-out capabilities. Rather than anything ill-intentioned, the actual model is that Plura gives application developers a means of offering their application at a discount (or free) to users who don't mind trading their excess computer resources for the app. For those who don't want Plura+free, the application developer can give them other options (pay, ads, whatever).
Once the users really understand it, they are almost always happy that the developer has a new means of monetization so that the developer will continue to improve the software they are using.
BTW, this all runs in a secure Java sandbox where nothing can actually see the user's data, disk, what programs are running, or anything else about the computer. Plura has gone to great lengths to sanitize the entire process and be good guys.
I prefer shady botnets being used for stuff like this rather than for sending and posting spam, but I'd guess that most of the 50k users who have installed this have no idea that they have, or what it does.
I didn't look very hard to find it, but is there a list of places/apps that install this?
I looked at the Plura website, and it seems they mostly target institutions (Schools and such) that want to make money off of their spare cycles. So it's not like they're directly installing stuff on the machines of unwitting fools; they pay for processor time.
EDIT: wow, never mind, I totally misunderstood their model.