Really cool library. I'm excited to take it for a spin! I liked that there was some work done already for Wikipedia. But as a note to people who want to work with Wikipedia data: it's not very hard to abstract your code so that it works with most wikis built on the MediaWiki platform. I've submitted a pull request to this project that adds support for the hundreds of thousands of wikis on Wikia. ( https://github.com/clips/pattern/pull/17 )
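To illustrate the kind of abstraction I mean, here is a minimal sketch. It is not the code from the pull request. The `MediaWikiSite` class and the `wikipedia()` / `wikia()` helpers are hypothetical names, and the Wikia endpoint path is an assumption you should check against the wiki you target. The idea is only that every MediaWiki site exposes an `api.php` endpoint that accepts the same query parameters, so the site-specific part reduces to one URL.

```python
from urllib.parse import urlencode


class MediaWikiSite:
    """Hypothetical helper: a wiki is identified only by its api.php URL."""

    def __init__(self, api_url):
        self.api_url = api_url

    def page_query_url(self, title):
        # Standard MediaWiki API parameters for fetching the wikitext
        # of the latest revision of a page.
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "titles": title,
            "format": "json",
        }
        return self.api_url + "?" + urlencode(params)


def wikipedia(language="en"):
    # Wikipedia serves its API under /w/api.php.
    return MediaWikiSite("https://%s.wikipedia.org/w/api.php" % language)


def wikia(name):
    # Assumption: a Wikia wiki serves its API at the site root.
    return MediaWikiSite("http://%s.wikia.com/api.php" % name)


if __name__ == "__main__":
    print(wikipedia("en").page_query_url("Star Wars"))
    print(wikia("starwars").page_query_url("Star Wars"))
```

Everything after building the URL (fetching, parsing the JSON, cleaning the wikitext) can then be shared across all sites.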
Well, Google "screen scrapes" millions of websites. They bet their business on it and they seem to be doing OK.
If you upload something to an HTTP server connected to the public internet on TCP port 80, and you don't exclude the path to it in robots.txt, then should anyone be surprised if it is copied? HTTP clients don't read TOS.
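For what it's worth, robots.txt is the one thing a client can actually check mechanically. A minimal sketch using Python's standard `urllib.robotparser` follows. The robots.txt content, the `allowed()` helper, the "MyCrawler" user agent, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration: everything is crawlable
# except paths under /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed(ROBOTS_TXT, "MyCrawler", "http://example.com/public/page.html"))
    print(allowed(ROBOTS_TXT, "MyCrawler", "http://example.com/private/page.html"))
```

A real crawler would download the file from the site's /robots.txt first (the same class has `set_url()` and `read()` for that) instead of parsing a string.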
If Google had to read and interpret every website's TOS, I doubt they could easily, if at all, produce an index the size of the one they have. It seems that by ignoring a "fear of scraping" they managed to produce something valuable, one that the courts seem to side with in spite of offended copyright holders.
Moreover, there's no requirement for them to make their "cache" publicly accessible. But they do. And again, this has held up in court quite well. I doubt anyone would be surprised that people are using it. Or "scraping" it, if you want to play word games.
The source code has a base SearchEngine class and a Result class, which standardize the input and output parameters across different web services, so it is not difficult to add a new service (each one is a subclass of SearchEngine). There are also some general developer docs: http://www.clips.ua.ac.be/pages/pattern-dev
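As a rough illustration of that design (not the actual code or signatures from the library), the sketch below uses simplified stand-in classes. The `search()` signature, the `Result` fields, and the `InMemoryEngine` subclass are all assumptions made up for this example; see the developer docs for the real interface.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Stand-in for a uniform result record shared by all services."""
    title: str
    url: str
    text: str = ""


class SearchEngine:
    """Stand-in base class: every service takes the same input
    and returns a list of Result objects."""

    def search(self, query, count=10):
        raise NotImplementedError


class InMemoryEngine(SearchEngine):
    """Hypothetical new service that searches a local list of documents."""

    def __init__(self, documents):
        # documents: list of (title, url, text) tuples
        self.documents = documents

    def search(self, query, count=10):
        q = query.lower()
        hits = [
            Result(title, url, text)
            for (title, url, text) in self.documents
            if q in title.lower() or q in text.lower()
        ]
        return hits[:count]


if __name__ == "__main__":
    engine = InMemoryEngine([
        ("Pattern", "http://example.com/pattern", "web mining module"),
        ("Other", "http://example.com/other", "unrelated"),
    ])
    for result in engine.search("mining"):
        print(result.title, result.url)
```

Code that consumes results only ever sees `Result` objects, so swapping one service for another does not change the calling code.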