Pattern - Web Mining Python lib

languagehacker · on Oct 14, 2012

Really cool library. I'm excited to take it for a spin! I liked that there was some work done already for Wikipedia. But as a note to people who want to work with Wikipedia data, it's not very hard to abstract your stuff to work with most wikis based on the MediaWiki platform. I've added a pull request to this project that also supports using the hundreds of thousands of wikis on Wikia. ( https://github.com/clips/pattern/pull/17 )
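As a rough illustration of that abstraction (a minimal sketch, not the code from the pull request): MediaWiki sites expose an `api.php` endpoint, so in principle only the endpoint URL has to change from wiki to wiki. The helper name `mediawiki_query_url` and the `wiki.example.org` host are made up for this example, and the exact path to `api.php` varies by installation.

```python
from urllib.parse import urlencode


def mediawiki_query_url(api_endpoint, title):
    """Build a MediaWiki API URL asking for the wikitext of one page.

    api_endpoint is the wiki's api.php URL; its location differs per
    installation, so it is passed in rather than hard-coded.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return api_endpoint + "?" + urlencode(params)


# Same helper, two different wikis (the second host is a placeholder).
wikipedia_url = mediawiki_query_url("https://en.wikipedia.org/w/api.php", "Web scraping")
other_url = mediawiki_query_url("https://wiki.example.org/api.php", "Main Page")
print(wikipedia_url)
print(other_url)
```

The sketch only builds the request URL; fetching it and parsing the JSON response is left out.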

interro · on Oct 14, 2012

More info: http://www.clips.ua.ac.be/pages/pattern

knes · on Oct 14, 2012

I'm a big fan of data mining, so I'll make sure to take it out for a spin :) And from fellow Belgians, nice!

salimmadjd · on Oct 14, 2012

This is awesome! Any plans to add other sites, like Amazon, Yelp, TripAdvisor, etc.?

irskep · on Oct 14, 2012

NB: screen-scraping Yelp is against the TOS and you'll get shut down pretty fast if you try it.

Terretta · on Oct 14, 2012

Exactly. Screen scrape Google search results instead, I've heard that works great. Bet your business model on it, I've heard. ;-)

bigthingnext · on Oct 15, 2012

Well, Google "screen scrapes" millions of websites. They bet their business on it and they seem to be doing OK.

If you upload something to an http server connected to the public internet on tcp/80, and you don't exclude the path to it in robots.txt, then should anyone be surprised if it is copied? HTTP clients don't read TOS.

If Google had to read and interpret every website's TOS, I doubt they could easily, if at all, produce an index the size of the one they have. It seems that by ignoring a "fear of scraping" they managed to produce something valuable that the courts seem to side with, in spite of offended copyright holders.

Moreover there's no requirement for them to make their "cache" publicly accessible. But they do. And again this has held up in court quite well. I doubt anyone would be surprised that people are using it. Or "scraping" it if you want to play word games.
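For the robots.txt point, here is a minimal sketch of how a client could check a path before fetching it, using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `example-bot` user agent are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that excludes one directory for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file's lines directly, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-bot"):
    """Return True if the parsed robots.txt permits user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("http://example.com/public/index.html"))
print(allowed("http://example.com/private/data.html"))
```

This only covers robots.txt; it says nothing about a site's TOS.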

salimmadjd · on Oct 16, 2012

I see you worked at Yelp before, so I won't elaborate, but it's not too hard to scrape Yelp if you know what you are doing.

enjo · on Oct 14, 2012

Better yet: is there a well-defined structure for other folks to add that stuff?

fooandbarify · on Oct 14, 2012

I'm not one of the authors, but the code in question is all in one file: https://github.com/clips/pattern/blob/master/pattern/web/__i...

It would be fairly straightforward to add your own class.

a235 · on Oct 15, 2012

Whoa, a module with dozens of classes and 2450 lines of code! They definitely don't rely on file navigation.

The project is great, but I get the feeling that many of the smaller components already exist elsewhere and were rewritten from scratch here.

rabidsnail · on Oct 14, 2012

Is it just me, or are they rerolling requests and lxml for no good reason?

tomdesmedt · on Oct 15, 2012

The source code has a base SearchEngine class and a Result class, which streamline the input and output parameters across different web services, so it is not difficult to add new services (= subclass of SearchEngine). There are also some general developer docs: http://www.clips.ua.ac.be/pages/pattern-dev
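To give a feel for that subclassing pattern, here is a self-contained sketch. The `SearchEngine` and `Result` classes below are simplified stand-ins written for this example, not Pattern's actual classes or signatures, and `InMemoryEngine` is a hypothetical service that searches a local list instead of calling a web API.

```python
class Result(dict):
    """Stand-in result record: every service returns the same fields."""

    def __init__(self, url, title="", text=""):
        super().__init__(url=url, title=title, text=text)


class SearchEngine:
    """Stand-in base class: a new service subclasses it and implements search()."""

    def search(self, query, count=10):
        raise NotImplementedError


class InMemoryEngine(SearchEngine):
    """Hypothetical service backed by a local list of documents."""

    def __init__(self, documents):
        self.documents = documents  # list of (url, title, text) tuples

    def search(self, query, count=10):
        # Case-insensitive substring match on the document text.
        q = query.lower()
        hits = [Result(u, t, x) for (u, t, x) in self.documents if q in x.lower()]
        return hits[:count]


engine = InMemoryEngine([
    ("http://example.com/a", "A", "Web mining with Python"),
    ("http://example.com/b", "B", "Cooking pasta"),
])
results = engine.search("python")
for r in results:
    print(r["url"], r["title"])
```

Because callers only see `search()` and `Result`, swapping one service for another should not require changes elsewhere. For the real base classes, see the developer docs linked above.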

mkumm · on Oct 14, 2012

This looks pretty interesting, I will give it a go