80legs sets its web crawler free (venturebeat.com)
74 points by raghus on Dec 22, 2009 | 21 comments



I love the idea behind 80legs, and their Plura program is a great way to help monetize webgames, but I just can't get past their interface.

Too many of the things that I want to do require custom code. And while it's great that they support uploading custom code right in the window, the implementation makes it pretty difficult.

For one, since all the code needs to be manually reviewed, I can't rapidly iterate. Granted, I'm only a mediocre coder, but the way I tend to program with a new API is to write a Hello World, then expand it outward. Add in one feature, test, expand it to another, until it does most of what I want. Then, I send it through a sample-set of data, and if it works, I'm good to go. There's no way to do this with the 80legs program.

The second problem is that the code has to be in Java. I like Java. I use Java on my own projects. But Java isn't a very fun language. I don't know many people who wake up in the morning excited because they get to write Java apps. The thing is, the JVM supports a multitude of scripting languages. Give us access to them. Let me upload Ruby code for an idea, click-and-run it against some sample data, and see what happens.

I have an old test project I worked on a year ago that identifies images and tracks their sources. Something like 80legs would be perfect to "seed" it and fill it up with data. But figuring out how to make it work is too much of a hassle versus doing the crawling myself, and the speed difference isn't worth it to me. Whether it gets done in a week or in 5 weeks doesn't change very much in the end.

In any event, I do wish them well. It's a very innovative program, and with some implementation work, I think it could do very well.


While the JVM does support many languages, as far as we know, these languages all require reflection to work. Unfortunately, reflection is prohibited by the security policy Plura has in place to protect the home PCs Plura-based apps use.

If anyone knows how to make JRuby, Jython, etc. work in the JVM without reflection, we would love to hear it.


>> "But java isn't a very fun language."

No language is "fun". It's what you create with the language that is fun :)

The worst thing about the tech industry is the constant language wars.


Sure, that's a fair point, but ensuring type safety is something I do on projects because I'm afraid of the long-term consequences if I don't, not because it makes me all giddy inside.


Some languages are inherently fun, take Logo for instance :)


I think it's a confusing title: it's now free under 100K pages. But that cost $0.20 before, hardly a hurdle. If you wanted to spider 110K pages, before it would have cost $0.21. Now it suddenly is $100/month. That doesn't seem free to me, or a better deal.


I really enjoyed 80legs and used them a lot, until they changed their pricing structure.

Crawling-as-a-service makes sense. Thousands upon thousands of people need crawling as a service.

Crawling on a subscription basis? Not so much. How many organizations are just crawling crawling crawling, and need to do so all the time?

Regardless, I wish that I had been grandfathered in to the old pricing structure. I've been using 80legs since the beginning, and have been an advocate for it from day one. It really sucks that, having helped promote the service, I am now forced to get a cut-back in affordable service.


They have acknowledged this and agreed to add a lower-entry subscription plan. Nonetheless, it still seems like a step backwards.


Usually when you're going to do a price hike like this you want to have some other big news to offset the fact that if readers read between the lines they realize they are paying ridiculously more today than yesterday for your service. The fact that they're wrapping it into this "free" banner just makes it even more disingenuous.

Better would have been "80legs is changing our pricing structure. But, as a gift to everyone who has been using our original (albeit short-sighted) pricing structure, anyone on the old pricing structure will continue to be able to use it until Q1 2011." or something. Along with a feature release, this would have avoided the impending blowback once people understand what's really changed.


I really wish they had grandfathered me in to the old pricing.


Their service would take off much more if they offered a Python, Ruby, or JavaScript API.


No.

The service would take off much more if, instead of defining search patterns as regular expressions, they were defined as jQuery-style expressions that understand the DOM and let you find all <title> tags that exist in the <head>. Yes, you can do this with regexps, but parsing HTML shouldn't be a regexp task.
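To make the difference concrete, here is a rough sketch in Python using only the standard library's `html.parser`. It is not how 80legs works or anything from their API; `HeadTitleFinder`, `head_titles` and the sample `page` are names made up for illustration. It behaves roughly like the selector "head title": it only collects a <title> when a <head> is among its open ancestors.

```python
from html.parser import HTMLParser


class HeadTitleFinder(HTMLParser):
    """Collect the text of every <title> element nested inside <head>.

    Hypothetical helper for illustration only; not part of any 80legs API.
    """

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.titles = []     # text of each matching <title>
        self._buffer = None  # text parts of the <title> being read, if any

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        # Roughly the selector "head title": a <title> with a <head> ancestor.
        if tag == "title" and "head" in self.stack[:-1]:
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None
        # Pop back to the matching open tag, tolerating unclosed tags
        # such as <meta> or <br>.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def head_titles(html):
    """Return the text of each <title> found inside <head>."""
    finder = HeadTitleFinder()
    finder.feed(html)
    finder.close()
    return finder.titles


# Made-up sample page: one real title, plus an SVG <title> in the body.
page = """<html><head><meta charset="utf-8"><title>Real title</title></head>
<body><svg><title>Icon label</title></svg><p>Hello</p></body></html>"""

print(head_titles(page))
```

A plain pattern like `<title>(.*?)</title>` would also pick up the SVG title in the body, because it has no idea where in the document a match sits. The structure-aware version skips it.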

Oh, I'd like to see email gateways too... point a stream of emails at it and parse those. I'm thinking of scenarios like tripit.com taking in tons of different emails and parsing them to extract travel info.


I'm building something right now that includes page parsing, and so far I've only been building in regex support. I like your jQuery selector idea as well, are there any other ways that you can think of that would make searching the contents of a page programmatically easier for you?


May I suggest taking a look at Parsely? It's the syntax they use on www.parselets.com. The documentation for implementing it in your own apps is a little sparse, but the data format is awesome. Here's one that describes scraping HN:

http://parselets.com/parselets/yc/14

Might not be a fit for your project, but in terms of describing parsing instructions to a crawler it's the best format I've ever seen.


I'm not crawling, but that is pretty interesting looking. I'll bookmark it and take a look at it later for sure - thanks!


Hpricot for Ruby is great.

For instance, parsing a Google search results page:

        # doc is an Hpricot document, e.g. doc = Hpricot(html)
        (doc/"a.l").each do |link|
            label = link.inner_text
            href = link.attributes['href']
            ...
        end


This is something we can include as an 80App :) Thanks for the suggestion!


I like how the registration page includes Hacker News as an option for the "where did you hear about us?" question...


We actually get the most traffic and signups when an article about us is posted on HN.


I think the title is confusing. I was expecting to see an article about 80legs releasing open source code.


Actually, a lot of the code for our 80Apps is open source.



