Hacker News

I'd be more interested in seeing an infrastructure for scraping where the API functions are fixed, but the actual scraping functions are dynamically loaded so that the API user doesn't have to maintain it or re-pull/re-fork/re-compile when the website design changes.

Bonus points if one can make an ORM out of it, e.g.

    for article in get_api('reddit.com').todayilearned.filter('new').limit(100):
        ... do something ...
Where a call to get_api() dynamically fetches the latest scraping functions, in case reddit's page design has changed.
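A minimal sketch of what such an interface could look like in Python. Everything here is hypothetical: `SCRAPER_REGISTRY` stands in for a remote index of the latest-published scraping functions, which a real `get_api()` would fetch at call time so that site-specific parsing logic can be updated without clients re-pulling or re-compiling anything.

```python
# Hypothetical registry: domain -> the latest scraping function for that
# site. A real system would download/refresh these from a remote (or
# decentralized) index inside get_api().
SCRAPER_REGISTRY = {
    "reddit.com": lambda section, sort: (
        {"section": section, "sort": sort, "rank": i} for i in range(1000)
    ),
}

class Query:
    """Lazy, chainable query over a site's scraping function."""
    def __init__(self, scrape, section):
        self._scrape, self._section = scrape, section
        self._sort, self._limit = "hot", None

    def filter(self, sort):
        self._sort = sort
        return self

    def limit(self, n):
        self._limit = n
        return self

    def __iter__(self):
        items = self._scrape(self._section, self._sort)
        for i, item in enumerate(items):
            if self._limit is not None and i >= self._limit:
                break
            yield item

class Api:
    def __init__(self, scrape):
        self._scrape = scrape

    def __getattr__(self, section):
        # Attribute access becomes a section/subreddit query,
        # e.g. api.todayilearned
        return Query(self._scrape, section)

def get_api(domain):
    # Real version: fetch the current scraper for `domain` here,
    # so clients always run the latest parsing logic.
    return Api(SCRAPER_REGISTRY[domain])

articles = list(get_api("reddit.com").todayilearned.filter("new").limit(100))
```

The key design choice is that `Query` is lazy: nothing is scraped until iteration begins, so `filter` and `limit` can shape the request before any network work happens.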

Triple bonus points if the system can be designed in a decentralized fashion to defend against ToSes that try to discriminate between human eyes and machine eyes.




Hey dheera, that's super interesting. Since it's about web scraping specifically, I bet the guys over at Kimonolabs.com would be the experts on that!

Or if someone on here knows how to do this, and wants to throw up an API on Blockspring for it, we'd all love it :)


I created http://scrape.ly with some of those points in mind. It's been a bit of a journey getting there, and I'd love to hear in more detail what you have in mind.

The API generation with Blockspring and Kimonolabs is very nicely done, but I'd like to focus solely on the web scraping itself, as it represents a very difficult set of challenges.



