
The chapter contains HTML processing, but that's a small subset of what it covers. You don't need to learn word stems to do really interesting things with structured HTML. Also, web scraping involves more than text processing: there's the programmatic navigation of websites too, which adds some complexity but is pretty manageable with the libraries out there.

Edit: I'm obviously not saying NLP isn't useful, just that web scraping is more immediately useful. With NLP, besides learning the concepts, you have to find a source of raw, unprocessed text that still contains something of real-world value. With web text, you just have to collect what someone already thought was valuable enough to publish and find insights through the aggregation. It seems to me that the latter scenario is easier to grasp, with NLP being useful for going beyond what others have gathered and published.



Thing is, this is a book about NLP, not a book about web scraping, so while what you say may be true (although personally I find more value in learning about NLP than web scraping), it seems a little misplaced.

But there is value in both, depending on your objectives. I find web scraping trivial, and mining the text hard, hence my interest in NLP and machine learning.



