
What do you use for scraping? I may have a scraping project later this year and would love recommendations.



I've written a couple of "polite" crawlers in Go (i.e. they obey robots.txt and delay between requests to the same host).

- Fetchbot: https://github.com/PuerkitoBio/fetchbot

Flexible, with an API similar to net/http (it uses a Handler interface, ships a simple mux, supports middleware, etc.)

- gocrawl: https://github.com/PuerkitoBio/gocrawl

Higher-level, more framework than library.

Coupled with goquery (https://github.com/PuerkitoBio/goquery) to scrape the DOM (well, the net/html nodes), this makes custom scrapers trivial to write (see the sketch below).

(sorry for the self-promoting comment, but this is quite on topic)

edit: polite crawlers, not scrapers.
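To give a sense of the goquery side, here's a minimal sketch of the fetch-and-parse step using plain net/http plus goquery. It doesn't use fetchbot or gocrawl, the seed URLs are made up, and the fixed sleep is just a stand-in for proper per-host rate limiting and robots.txt handling:

  package main

  import (
      "fmt"
      "log"
      "net/http"
      "time"

      "github.com/PuerkitoBio/goquery"
  )

  // scrapeLinks fetches one page and returns the href of every anchor,
  // using goquery to query the parsed net/html nodes.
  func scrapeLinks(url string) ([]string, error) {
      res, err := http.Get(url)
      if err != nil {
          return nil, err
      }
      defer res.Body.Close()
      if res.StatusCode != http.StatusOK {
          return nil, fmt.Errorf("GET %s: %s", url, res.Status)
      }

      doc, err := goquery.NewDocumentFromReader(res.Body)
      if err != nil {
          return nil, err
      }

      var links []string
      doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
          if href, ok := s.Attr("href"); ok {
              links = append(links, href)
          }
      })
      return links, nil
  }

  func main() {
      // Hypothetical seed URLs; a real polite crawler would also check
      // robots.txt before fetching.
      seeds := []string{"https://example.com/", "https://example.com/about"}

      for _, u := range seeds {
          links, err := scrapeLinks(u)
          if err != nil {
              log.Println(err)
              continue
          }
          fmt.Println(u, "->", len(links), "links")
          // Crude fixed delay between requests; a real crawler would
          // rate-limit per host (which is what the libraries above are for).
          time.Sleep(2 * time.Second)
      }
  }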


Scrapy gets a solid recommendation from me. http://scrapy.org/


We've got quite an old mailing list full of geeks hand-coding web scrapers, if you want somewhere to ask questions:

https://groups.google.com/forum/#!forum/scraperwiki


I use custom node.js scripts with these libraries:

* request - https://github.com/mikeal/request

* async - https://github.com/caolan/async

* cheerio - https://github.com/cheeriojs/cheerio

* nedb - https://github.com/louischatriot/nedb



