
Here are a couple of suggestions:

* Use tagsoup only for small projects, or for sites so broken that other packages fail to parse the DOM properly.

* The Shpider[1] package on Hackage (which I maintain these days) makes the crawling bit somewhat easier. It has an intuitive API, and we are always open to suggestions/new functionality there.

* Instead of tagsoup, learn hxt (and arrows along the way). It is really, really hard to get used to, but once you're there it's amazing for extracting information from the DOM with combinators. Perhaps you could make it a back-burner learning project. Make sure to look into the arrow proc/do notation, as that's pretty much the key to scraping with it.

* Alternatively, you can use one of the XML parsing libraries and their combinators. Some that come to mind: haxml, hexpat, xml, xml-basic, xml-conduit. I'm sure these would be great too, although I haven't used any of them in great capacity.

[1] http://hackage.haskell.org/package/shpider-0.2.1.1
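For reference, a minimal sketch of the tagsoup style suggested above (the function name and sample HTML are just illustrative):

```haskell
import Text.HTML.TagSoup

-- Collect the href attribute of every <a> tag in a page.
-- tagsoup happily tokenizes even badly broken markup.
links :: String -> [String]
links = map (fromAttrib "href")
      . filter (isTagOpenName "a")
      . parseTags
```

Note that parseTags never fails: it produces a flat [Tag String] for any input, which is why tagsoup is a reasonable fallback when stricter parsers choke on broken markup.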




Thanks! I've heard of Shpider but haven't had a need for it yet, since I haven't been doing programmatic web browsing (just downloading the page and extracting info).

I'll see if I can clean up the markup enough to get it to parse with hxt; otherwise, Shpider provides a good reference for how to use Tagsoup correctly. The Shpider codebase is also really clean and well documented.
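If the markup does parse, a small hxt sketch of the arrow proc/do notation mentioned above might look like this (the function name and sample input are assumptions, not from any particular codebase):

```haskell
{-# LANGUAGE Arrows #-}
import Text.XML.HXT.Core

-- Extract (link text, href) pairs from an HTML string.
-- withParseHTML tells hxt to use its lenient HTML parser.
linkPairs :: String -> IO [(String, String)]
linkPairs doc = runX $
  readString [withParseHTML yes, withWarnings no] doc
    >>> deep (isElem >>> hasName "a")
    >>> proc a -> do
          href <- getAttrValue "href" -< a
          txt  <- deep getText        -< a
          returnA -< (txt, href)
```

The proc/do block binds intermediate arrow results to names (href, txt), which is much more readable than threading everything through &&& and >>> once a scrape pulls several fields per element.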



