We've invested very heavily in building out a solid infrastructure for extractin...

spacefight · on Sept 24, 2014

"Our goal here is to be able to distinguish between the good type and bad type of scraping and give webmasters full transparency. Obviously this is a hard problem. If you have any feedback on any of this we'd love to hear it."

Yes as said before plus:

- Obey robots.txt to the full extend

- Name your access, i.e. label your bot

- Don't use shady tactics such as IP rotation

- Provide web site owners the option to fully block access of your bots (yes, communicate your full IP ranges)

Again - this is from a content owner who paid for his content.

jrochkind1 · on Sept 24, 2014

Neither of us are lawyers (as far as I know), and I assume you have legal counsel for a business like this, and I wish you luck in your business and hope it doesn't come to anything legal.

Actually I hope even more it does come to something legal and you win, because I'd love to expand and make concrete fair use rights for scraping. I like scraping, scraping is both fun and very useful for the business domain I work in, and very frustrating when content providers don't allow it by either terms of service (which may or may not be legally enforceable if you haven't agreed to them, it's unclear, but scary enough with all the CFAA over-enforcement) or technological protections.

But I think you're being disingenous about the difference between a bot and an interactive web browser, I think it's pretty straightforward to most people and will be to the courts if it comes to that.

Interestingly, the latest enhanced Amazon anti-bot protections I ran into say "To discuss automated access to Amazon data please contact...", but don't explicitly try to say "you are forbidden from automated access."

jahewson · on Sept 24, 2014

It's fair to say that robots.txt is a balancing act in this case, given it's intended use. However, a website's terms of use are non-negotiable. Clauses banning any form of automated access or data gathering (especially for non-personal use) are fairly popular amongst sites with "deny everything" robots.txt files. There's a very real risk here for both you and your customers.

In the long run it'd be nice to see some sort of "fair access" to websites introduced into law, unfortunately we don't let live in that world.