I'd be more interested in seeing a scraping infrastructure where the API is fixed but the actual scraping functions are dynamically loaded, so that the API user doesn't have to maintain them or re-pull/re-fork/re-compile when the website's design changes.
Bonus points if one can make an ORM out of it, e.g.
for article in get_api('reddit.com').todayilearned.filter('new').limit(100):
    ... do something ...
Where a call to get_api() dynamically fetches the latest scraping functions, in case reddit's page design has changed.
Triple bonus points if the system can be designed in a decentralized fashion to defend against ToSes that try to discriminate between human eyes and machine eyes.
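A minimal sketch of what that ORM-style interface could look like. Everything here is an assumption built from the pseudocode above: `get_api`, `filter`, and `limit` are hypothetical names, and the "dynamically fetched" scraper is faked with a local registry so the example runs offline; a real system would download the latest scraping code over the network instead.

```python
# Stand-in for scraping functions that would normally be fetched fresh
# from a central service whenever the target site's design changes.
# (Hypothetical: a real implementation would pull these remotely.)
SCRAPER_REGISTRY = {
    "reddit.com": lambda section, sort, n: [
        {"site": "reddit.com", "section": section, "sort": sort, "rank": i}
        for i in range(n)
    ]
}

class Query:
    """Chainable query object mimicking an ORM queryset."""
    def __init__(self, scraper, section):
        self._scraper = scraper
        self._section = section
        self._sort = "hot"
        self._limit = 25

    def filter(self, sort):
        self._sort = sort
        return self

    def limit(self, n):
        self._limit = n
        return self

    def __iter__(self):
        # The scraper itself was loaded dynamically, so a site redesign
        # only requires updating the registry, not the client code.
        return iter(self._scraper(self._section, self._sort, self._limit))

class Api:
    def __init__(self, scraper):
        self._scraper = scraper

    def __getattr__(self, section):
        # Sections (e.g. subreddits) become attributes, ORM-style.
        return Query(self._scraper, section)

def get_api(domain):
    # In a real system this call would fetch the newest scraper code
    # for the domain; here it just looks up the local stand-in.
    return Api(SCRAPER_REGISTRY[domain])

articles = list(get_api("reddit.com").todayilearned.filter("new").limit(100))
```

The design choice worth noting is that the client only depends on the stable query interface; all site-specific fragility lives behind the dynamically loaded scraper.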
I created http://scrape.ly with some of those points in mind. It's been a bit of a journey getting there, and I'd love to hear in more detail what you have in mind.
The API generation with Blockspring and Kimonolab is very nicely done, but I'd like to focus solely on the web scraping itself, as it presents a very difficult set of challenges.