Pattern - Web Mining Python lib (github.com/clips)
183 points by interro on Oct 14, 2012 | 14 comments



Really cool library. I'm excited to take it for a spin! I liked that there was some work done already for Wikipedia. But as a note to people who want to work with Wikipedia data, it's not very hard to abstract your stuff to work with most wikis based on the MediaWiki platform. I've added a pull request to this project that also supports using the hundreds of thousands of wikis on Wikia. ( https://github.com/clips/pattern/pull/17 )
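To illustrate the abstraction point, here is a minimal sketch, not the code from the pull request or from Pattern itself. It assumes the only thing that differs between MediaWiki installs is the location of `api.php`; the `MediaWikiSite` class and the `example.wikia.com` host are hypothetical names made up for this example. The query parameters are standard MediaWiki API ones.

```python
from urllib.parse import urlencode


class MediaWikiSite:
    """Hypothetical helper: one class, parameterized by the api.php location."""

    def __init__(self, api_url):
        self.api_url = api_url

    def page_query_url(self, title):
        # Standard MediaWiki API parameters for fetching a page's wikitext.
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "titles": title,
            "format": "json",
        }
        return self.api_url + "?" + urlencode(params)


# Wikipedia serves its API under /w/api.php.
wikipedia = MediaWikiSite("https://en.wikipedia.org/w/api.php")
# Hypothetical Wikia-style host; assumed here to expose api.php at the root.
wikia = MediaWikiSite("http://example.wikia.com/api.php")

print(wikipedia.page_query_url("Web mining"))
print(wikia.page_query_url("Web mining"))
```

Only the constructor argument changes between the two sites; everything that builds or parses requests can stay shared.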



I'm a big fan of data mining, so I'll make sure to take it out for a spin :) And from fellow Belgians, nice!


This is awesome! Any plans to add other sites, like Amazon, Yelp, TripAdvisor, etc.?


NB: screen-scraping Yelp is against the TOS and you'll get shut down pretty fast if you try it.


Exactly. Screen scrape Google search results instead, I've heard that works great. Bet your business model on it, I've heard. ;-)


Well, Google "screen scrapes" millions of websites. They bet their business on it and they seem to be doing OK.

If you upload something to an HTTP server connected to the public internet on tcp/80, and you don't exclude its path in robots.txt, should anyone be surprised if it is copied? HTTP clients don't read TOS.

If Google had to read and interpret every website's TOS, I doubt they could easily, if at all, produce an index the size of the one they have. It seems that by ignoring a "fear of scraping" they managed to produce something valuable that the courts seem to side with in spite of offended copyright holders.

Moreover there's no requirement for them to make their "cache" publicly accessible. But they do. And again this has held up in court quite well. I doubt anyone would be surprised that people are using it. Or "scraping" it if you want to play word games.
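For anyone who wants to see what the robots.txt check mentioned above looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the `example.com` URLs are all made up for illustration; the rules are parsed from a string so nothing touches the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "example-bot", "http://example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "example-bot", "http://example.com/private/data.html"))
```

Note that this only covers the robots.txt convention; it says nothing about a site's TOS, which is the separate question being argued in this thread.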


I see you worked at Yelp before, so I won't elaborate, but it's not too hard to scrape Yelp if you know what you are doing.


Better yet: Is there a well defined structure for other folks to add that stuff?


I'm not one of the authors, but the code in question is all in one file: https://github.com/clips/pattern/blob/master/pattern/web/__i...

It would be fairly straightforward to add your own class.


Whoa, a module with dozens of classes and 2450 lines of code! They definitely don't rely on file navigation.

The project is great, but I've got a feeling that many of the smallish components already existed elsewhere and were rewritten from scratch here.


Is it just me, or are they rerolling requests and lxml for no good reason?


The source code has a base SearchEngine class and a Result class, which streamline the input and output parameters across different web services, so it is not difficult to add new services (= subclass of SearchEngine). There are also some general developer docs: http://www.clips.ua.ac.be/pages/pattern-dev
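As a rough picture of that design, here is a self-contained sketch. The class names `SearchEngine` and `Result` are borrowed from the description above, but these are stand-ins written for this example, not Pattern's actual classes or signatures; `StaticEngine` and its in-memory document list are hypothetical and replace a real web service so the example runs offline.

```python
class Result(dict):
    """Stand-in for a uniform result record (not Pattern's actual class)."""


class SearchEngine:
    """Stand-in base class: every service exposes the same search() call."""

    def search(self, query, count=10):
        raise NotImplementedError


class StaticEngine(SearchEngine):
    """Hypothetical service backed by an in-memory list instead of a web API."""

    def __init__(self, documents):
        self.documents = documents

    def search(self, query, count=10):
        # Case-insensitive title match, truncated to the requested count.
        hits = [d for d in self.documents if query.lower() in d["title"].lower()]
        return [Result(url=d["url"], title=d["title"]) for d in hits[:count]]


engine = StaticEngine([
    {"url": "http://example.com/a", "title": "Web mining basics"},
    {"url": "http://example.com/b", "title": "Cooking pasta"},
])
for result in engine.search("mining"):
    print(result["url"], result["title"])
```

Callers only ever see `search()` and `Result`, so a new service is a new subclass and nothing else has to change. For the real constructor arguments and method signatures, check the developer docs linked above.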


This looks pretty interesting. I will give it a go.



