
It looks pretty cool and even something I might be interested in using.

However, my initial feedback relates to clarity. On the About page you say 'crawl 2bn pages' and then 'pay $2 per million pages crawled', but you don't actually say that - as I understand it - customers set up custom searches matching a regexp and only pay for the hits (crawls) that match.

My immediate reaction on seeing '2bn'/'$2 per million' was to firstly think 'wtf, $4k per day then?' and secondly 'I hope that's an American not a British billion'. (Though we seem to have adopted yours, now.)

It's really just a wording/clarity thing though, and I might be alone in this.




Thanks, we will work on the wording. Just for clarity, we actually do custom crawls based on your needs. If you need to access one million pages, you tell us how to get to them and pay us $2 plus any time you spend processing those pages. You can do a generic crawl from http://dir.yahoo.com or you can give us a very customized seed list and just read those pages or crawl only a few levels deep from there. Your choice.

You certainly don't need to crawl two billion (2,000,000,000) pages per day. In fact, that's our total estimated capacity right now.
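To make that arithmetic concrete, here is a minimal sketch of the pricing described above. The $2-per-million rate and the 2bn figure come from this thread; the other page counts are made up purely for illustration.

  // Back-of-the-envelope cost at $2 per million pages crawled.
  // The smaller page counts are illustrative examples, not quotes.
  public class CrawlCostEstimate {
      static final double DOLLARS_PER_MILLION_PAGES = 2.0;

      static double crawlCost(long pages) {
          return pages / 1_000_000.0 * DOLLARS_PER_MILLION_PAGES;
      }

      public static void main(String[] args) {
          System.out.printf("1M pages:  $%.2f%n", crawlCost(1_000_000));       // $2.00
          System.out.printf("50M pages: $%.2f%n", crawlCost(50_000_000));      // $100.00
          System.out.printf("2bn pages: $%.2f%n", crawlCost(2_000_000_000L));  // $4000.00 - the '$4k per day' reaction above
      }
  }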


What about traffic?

2,000,000 pages * 40 KB (average, compressed) / (1024 * 1024) * $0.10 per GB ≈ $7.6. That alone is the cost to transfer the data to your datacenter. I can't reliably access the data on remote clients.
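A minimal sketch of that bandwidth arithmetic follows; the 40 KB average compressed page size and the $0.10/GB transfer price are this commenter's assumptions, not published figures.

  // Rough transfer cost for pulling crawled pages back to your own datacenter.
  // Average page size and per-GB price are assumptions from the comment above.
  public class TransferCostEstimate {
      public static void main(String[] args) {
          long pages = 2_000_000;        // pages pulled back
          double avgPageKb = 40.0;       // average compressed page size (assumed)
          double dollarsPerGb = 0.10;    // outbound transfer price (assumed)

          double gigabytes = pages * avgPageKb / (1024.0 * 1024.0);
          double cost = gigabytes * dollarsPerGb;

          System.out.printf("%.1f GB transferred, ~$%.1f in bandwidth%n", gigabytes, cost);
          // Prints roughly: 76.3 GB transferred, ~$7.6 in bandwidth
      }
  }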

I guess the $2 price tag is just marketing blah blah.


Our service actually allows you to push your code into the system rather than trying to pull back all of the page contents. So, you end up running your semantic analysis, image analysis, or whatever you want to do on our grid. Very specifically, you implement a processPage() function of the following form:

byte[] processPage(String url, byte[] pageContents, Object userData);

We run your function on the contents of the pages/images/objects you want to analyze and give you back your results from the millions or billions of pages you want to analyze.

The results from the processPage() function are completely free form. You serialize your results into a byte array and that's what you get back (except you get it back for all of your urls).
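As a purely illustrative example of that contract, here is a hypothetical processPage() implementation that pulls every href off a page and serializes the list as newline-separated UTF-8 bytes. Only the processPage() signature comes from this post; the regex extraction, class name, and everything else are my own sketch, not 80legs' actual API.

  import java.nio.charset.StandardCharsets;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // Hypothetical processPage() implementation: extract outbound links and
  // serialize them as free-form bytes, one result per crawled URL.
  public class LinkExtractor {
      private static final Pattern HREF =
          Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

      public byte[] processPage(String url, byte[] pageContents, Object userData) {
          String html = new String(pageContents, StandardCharsets.UTF_8);
          List<String> links = new ArrayList<>();
          Matcher m = HREF.matcher(html);
          while (m.find()) {
              links.add(m.group(1));
          }
          // Free-form result: this byte array is what comes back for this URL.
          return String.join("\n", links).getBytes(StandardCharsets.UTF_8);
      }
  }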

Now, since the processPage() function is free form, you can just turn around and "return pageContents;" from your function. That will give you all of the page contents from your crawl. That's not an ideal case for us, but we can handle it. We might eventually charge a small bandwidth or storage cost for this type of usage, but we do not intend to do so for our normal use case.

The bigger cost to the customer, if they try to pull back all of the contents, will be their own bandwidth. They would need to pull all of those pages' contents down to their own servers, which will cost quite a lot of bandwidth unless they have their own fat pipe.

In summary, $2/million-pages-crawled is our real price and is not just marketing.


That's pretty cool. Thinking aloud then, if I wanted to, say, pull out all the adjectives from results matching $foo, I'd end up getting that data back and then having to pipe it into storage myself - costing me both bandwidth in and bandwidth out. Have you thought about cutting out the middleman and letting people write to S3 directly? (Yes, I have no idea how complicated this might be.)


Hey - I work for 80legs as well, so I thought I'd chime in and answer this question (westside is grabbing some food). We have thought about offering easy integration with AWS, but we'd probably implement that later if we decide to go that route.



