Hi guys. We were actually about to do an "Ask HN: Review our startup" post, but I guess someone beat us to it.
So, please review our startup. :)
We are launching the beta today to a handful of users and will be letting in more and more users over time.
One other note: We don't just offer crawling. Our model is actually to allow you to analyze the web content that you discover. Using your own custom code that you push into 80legs, you can do sophisticated text processing, image processing, look inside PDFs, etc.
Sorry for hijacking your plan to post here first, but I found the idea incredibly cool and useful, and didn't know you guys were around here on HN. 80legs can potentially save a lot of effort for people/companies who need to crawl the web for data and analyze it.
Hope this really works out well for all of you.
P.S. Just in case you are curious, I got a reference note about your application from someone who was following the Web 2.0 Expo.
Thanks for the nice comments, luckystrike. We had a great time at the Web 2.0 Expo and we've been overwhelmed by all the interest in our service. We're pretty hopeful that we've built something that people want. :)
It looks pretty cool and even something I might be interested in using.
However, my initial feedback relates to clarity. On the About page you say 'crawl 2bn pages' and then 'pay $2 per million pages crawled', but you don't actually say that - as I understand it - customers set up custom searches matching a regexp and only pay for the hits (crawls) that match.
My immediate reaction on seeing '2bn'/'$2 per million' was to firstly think 'wtf, $4k per day then?' and secondly 'I hope that's an American not a British billion'. (Though we seem to have adopted yours, now.)
It's really just a wording/clarity thing though, and I might be alone in this.
Thanks, we will work on the wording. Just for clarity, we actually do custom crawls based on your needs. If you need to access one million pages, you tell us how to get to them and pay us $2 plus any time you spend processing those pages. You can do a generic crawl from http://dir.yahoo.com or you can give us a very customized seed list and just read those pages or crawl only a few levels deep from there. Your choice.
You certainly don't need to crawl two billion (2,000,000,000) pages per day. In fact, that's our total estimated capacity right now.
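Just to make the "your choice" part concrete, here's a rough sketch of the kind of crawl specification we're talking about - the field names below are purely illustrative, not our actual API:

    import java.util.Arrays;
    import java.util.List;

    // Purely illustrative crawl spec: seed URLs, a depth limit, and a page cap.
    // These names are hypothetical and not the real 80legs API.
    public class CrawlSpec {
        final List<String> seedUrls; // where the crawl starts
        final int maxDepth;          // how many link levels to follow from the seeds
        final long maxPages;         // stop after this many pages

        CrawlSpec(List<String> seedUrls, int maxDepth, long maxPages) {
            this.seedUrls = seedUrls;
            this.maxDepth = maxDepth;
            this.maxPages = maxPages;
        }

        public static void main(String[] args) {
            // A focused crawl: custom seed list, two levels deep, one-million-page cap.
            CrawlSpec spec = new CrawlSpec(
                    Arrays.asList("http://dir.yahoo.com"), 2, 1_000_000L);
            System.out.println(spec.seedUrls + ", depth " + spec.maxDepth
                    + ", cap " + spec.maxPages + " pages");
        }
    }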
2,000,000 * 40 KB (average, compressed) / (1024 * 1024) * $0.10 per GB ≈ $7.60. That alone is the cost to transfer it to your datacenter. I can't reliably access the data on remote clients.
I guess the $2 price tag is just marketing blah blah.
Our service actually allows you to push your code into the system rather than trying to pull back all of the page contents. So, you end up running your semantic analysis, image analysis, or whatever you want to do on our grid. Very specifically, you implement a processPage() function of the following form:
    byte[] processPage(String url, byte[] pageContents, Object userData);
We run your function on the contents of the pages/images/objects you want to analyze and give you back your results from the millions or billions of pages you want to analyze.
The results from the processPage() function are completely free form. You serialize your results into a byte array and that's what you get back (except you get it back for all of your urls).
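As a concrete (purely illustrative) example, here's a processPage() sketch that pulls out each page's <title> and returns it as UTF-8 bytes - the extraction logic is just an example, and the grid-side harness that calls it isn't shown:

    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative only: one possible processPage() implementation.
    public class TitleExtractor {
        private static final Pattern TITLE = Pattern.compile(
                "<title[^>]*>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        // Extracts the <title> of a crawled page and serializes the result as UTF-8 bytes.
        public static byte[] processPage(String url, byte[] pageContents, Object userData) {
            String html = new String(pageContents, StandardCharsets.UTF_8);
            Matcher m = TITLE.matcher(html);
            String title = m.find() ? m.group(1).trim() : "";
            // Free-form result: any serialization you like, as long as it's a byte array.
            return (url + "\t" + title).getBytes(StandardCharsets.UTF_8);
        }
    }

You could just as easily run a tokenizer, an image decoder, or a PDF parser in there - the only contract is bytes in, bytes out.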
Now, since the processPage() function is free form, you can just turn around and "return pageContents;" from your function. That will give you all of the page contents from your crawl. That's not an ideal case for us, but we can handle it. We might eventually charge a small bandwidth or storage cost for this type of usage, but we do not intend to do so for our normal use case.
The bigger cost to the customer, if they try to pull back all of the contents, will be their own bandwidth charges. They would need to pull all of those pages' contents down to their own servers, which will cost quite a lot unless they have their own fat pipe.
In summary, $2/million-pages-crawled is our real price and is not just marketing.
That's pretty cool. Thinking aloud then, if I wanted to, say, pull out all the adjectives from results matching $foo, I'd end up getting that data back and then have to pipe that into storage myself - costing me both bandwidth in and bandwidth out. Have you thought about cutting out the middleman and letting people write to S3 directly? (Yes, I have no idea how complicated this might be.)
Hey - I work for 80legs as well so thought I'd chime in and answer this question (westside is grabbing some food). We have thought about offering easy integration with AWS, but we'd probably implement this at a later time if we decided to go that route.
How do you (and/or Plura) deal with the problem of running code on other people's machines? How do you know that the data being sent back is valid, or that a competitor can't start a node and reverse-engineer your code? This may be less of an issue than I imagine, but I'm sure it's something you've thought about so I'd be interested in hearing your thoughts.
Great question. We've actually done a lot of work on this to ensure that there isn't a problem with running the code on various people's machines.
First, Plura actually runs the processPage() function in a restricted Java sandbox, so there is no way to see any data on the user's computer or do anything bad to their computer. Also, the code goes through a short verification process before it is deployed.
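To give a rough idea of what "restricted sandbox" means in practice (this is a generic Java illustration, not Plura's actual implementation), you can deny untrusted code access to the local filesystem with a SecurityManager:

    import java.io.FileInputStream;
    import java.io.FilePermission;
    import java.security.Permission;

    // Generic sandboxing illustration, not Plura's actual code:
    // a SecurityManager that denies all file access to code running under it.
    public class SandboxDemo {
        public static void main(String[] args) {
            System.setSecurityManager(new SecurityManager() {
                @Override
                public void checkPermission(Permission perm) {
                    if (perm instanceof FilePermission) {
                        throw new SecurityException("file access denied: " + perm.getName());
                    }
                    // Everything else is allowed in this toy example.
                }
            });
            try {
                new FileInputStream("/etc/passwd"); // untrusted code trying to read local data
            } catch (SecurityException | java.io.FileNotFoundException e) {
                System.out.println("Blocked by sandbox: " + e);
            }
        }
    }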
For the results, we do have a reasonably sophisticated validation process as well. For someone to change results from one node, they would have to do quite a bit of work.
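For a sense of what result validation can look like in general (redundant assignment with majority voting is a standard trick in distributed/volunteer computing - this is not a description of our exact scheme):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Generic validation illustration, not 80legs/Plura's actual scheme:
    // give the same work unit to several nodes and accept the majority answer.
    public class MajorityVote {
        static String accept(List<String> resultsFromNodes) {
            Map<String, Integer> counts = new HashMap<>();
            for (String r : resultsFromNodes) {
                counts.merge(r, 1, Integer::sum);
            }
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            // Only accept a result backed by a strict majority; otherwise re-issue the work.
            return bestCount * 2 > resultsFromNodes.size() ? best : null;
        }

        public static void main(String[] args) {
            // Two honest nodes agree, one tampered node disagrees: the honest result wins.
            System.out.println(accept(Arrays.asList("42", "42", "bogus")));
        }
    }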
Interesting, it's a botnet! From the FAQ: "How can the prices be so low?" "Plura pays developers to embed lightweight widgets in their desktop applications or websites. These widgets harness the idle and excess bandwidth and computing power on the computers of people using the applications and websites."
Plura affiliates actually accept responsibility for getting the permission of their users. Plura encourages disclosure and has found that it is actually very well received by users once it is explained. It always works out better for Plura affiliates when they disclose. To that end, Plura has actually changed its TOS with affiliates so that they directly take responsibility for getting user acceptance.
Most Plura apps/websites give users opt-in/opt-out capabilities. Rather than anything ill-intentioned, the actual model is that Plura gives application developers a means of offering their application at a discount (or free) to users who don't mind trading their excess computer resources for the app. For those who don't want Plura+free, the application developer can give them other options (pay, ads, whatever).
Once the users really understand it, they are almost always happy that the developer has a new means of monetization so that the developer will continue to improve the software they are using.
BTW, this all runs in a secure Java sandbox where nothing can actually see the user's data, disk, what programs are running, or anything else about the computer. Plura has gone to great lengths to sanitize the entire process and be good guys.
I prefer shady botnets being used for stuff like this rather than for sending and posting spam, but I'd guess that most of the 50k users who have installed this have no idea that they have, or what it does.
I didn't look very hard to find it, but is there a list of places/apps that install this?
I looked at the Plura website, and it seems they mostly target institutions (Schools and such) that want to make money off of their spare cycles. So it's not like they're directly installing stuff on the machines of unwitting fools; they pay for processor time.
EDIT: wow, never mind, I totally misunderstood their model.