If you're in the .NET world, HtmlAgilityPack does a great job of producing proper XML from broken HTML. CSS selectors don't always suffice, particularly with sites that don't use CSS :) Sometimes you have sites where the best you can do is e.g. get the text of the 2nd h2 header following the span with text 'X'. With some utility functions I can just write:
From which my code then selects the HtmlAgilityPack InnerText (i.e. less formatting, etc.). (Practically speaking, my code also does some case-insensitive translation in there, which is an area where XPath is a bit annoying, plus string trimming, checking and propagating nulls, etc.)
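For the kind of query described above (the 2nd h2 following the span with text 'X'), a plain XPath expression along the lines of `//span[text()='X']/following::h2[2]` should do it. The sketch below is not the utility functions mentioned above. It is a rough stand-in that emulates the same selection with Python's standard library on already well-formed markup, with the case-insensitive match and trimming folded in. The function name `nth_following` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

def nth_following(root, anchor_tag, anchor_text, target_tag, n):
    """Return the text of the n-th <target_tag> after the first <anchor_tag>
    whose trimmed text equals anchor_text (case-insensitive), or None.

    Simplification: unlike XPath's following:: axis, this does not exclude
    descendants of the anchor element itself.
    """
    wanted = anchor_text.strip().lower()
    seen_anchor = False
    count = 0
    for el in root.iter():  # iter() walks elements in document order
        if not seen_anchor:
            if el.tag == anchor_tag and (el.text or "").strip().lower() == wanted:
                seen_anchor = True
            continue
        if el.tag == target_tag:
            count += 1
            if count == n:
                # itertext() flattens nested markup, similar in spirit to InnerText
                return "".join(el.itertext()).strip()
    return None  # propagate "not found" as None rather than raising

# Hypothetical sample page, already repaired into well-formed XML
doc = ET.fromstring(
    "<html><body>"
    "<h2>Before</h2><p><span> x </span></p>"
    "<h2>First</h2><h2>Second <b>header</b></h2>"
    "</body></html>"
)
print(nth_following(doc, "span", "X", "h2", 2))
```

Returning None when the anchor or target is missing keeps the null checking in one place, so callers can chain lookups without wrapping each one in a try block.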
In my experience the greater challenge with scraping lots of data is dealing with stuff like:
- caching disabled in the response headers, when you've already scraped 10K pages and just discovered a page with e.g. a malformed href in an anchor (e.g. "<a href+'....'>"). After giving up trying to understand how the hell they managed that, it's not long before you're writing a crawl repository so you can selectively ignore the caching rules your proxy cache happily abides by, and quickly restart your debug session for the next weird thing you discover. (Unfortunately the nature of the site has forced you to do in-memory preprocessing of 50K pages before you can do the real processing for the rest of the site, because they have done some OTHER weird stuff.)
- sites that treat EVERYTHING as dynamic content even though you could easily cache it... now you get to do the job of the webmaster because you have many data sources you're feeding from and don't want to hammer servers
- sites with bad links but no 404 responses (just redirects); easy to detect, but still a nuisance
- proper request throttling (i.e. throttling on the basis of requests actually serviced, not merely requests issued)
- dynamically adjusting the above throttling, because sites can be weird :)
- efficiently issuing millions of requests/week to a bunch of sites and scraping data from the responses in custom formats for each site
- site layout changing and breaking your scraping logic. I'm not sure how common this is today, but I was scraping hundreds of commerce sites in 2001, each having several (often 5 but sometimes 50) different product page layouts for different sections, each with its own field names, fields, and craziness, for a total of a few thousand different "scraping logics" (each just 5-10 lines long, but each had to be individually maintained). Now, every day just two (out of a few thousand) broke, but to keep everything robust, you had to (a) be able to tell which one broke, and (b) fix it within a reasonable time frame. Neither of these is simple.
- sites that depend nontrivially on JavaScript. That gives you the choice of either (a) reverse engineering the JavaScript and making your scraper figure out all the details the same way the JavaScript would, or (b) using something like PhantomJS or e.g. a controlled IE session to let the JavaScript run and then taking the data from the DOM. (a) is more efficient and more work, but was (unexpectedly for me) much more stable. (b) is less work upfront, more maintenance, and a LOT more resource intensive.
- sites whose traffic management system you trip while scraping. Many will block you: some actively (with an error message, so you know what is happening), some will just keep you hanging or throttle you down to a few hundred bytes/second all of a sudden, with no explanation and no one to contact. Amazon contacted us when they figured out we were scraping (we weren't hiding anything and were doing it with a logged-in user that had contact details), and were cool about it.
- sites that randomly break and stop in the middle of a page. This happens much more than you'd think; when using the site yourself, you just reload or interact with a half-loaded page. You could, of course, still scrape a half-loaded page, but what if only 20/23 of the items you need are there? What if the site is stateful, and reloading that page would cause a state change you do not want?
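The two throttling points in the list (throttle on requests serviced, and adjust dynamically) can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not a description of any particular crawler: the class name `AdaptiveThrottle`, the 0.9 and 2x factors, and the default delays are all invented for the example. The gap is measured from when the previous response finished, so a slow server automatically gets fewer requests, and it doubles on failure and shrinks slowly on success.

```python
import time

class AdaptiveThrottle:
    """Spaces requests by the time since the previous response *finished*,
    and widens or narrows that gap based on how the server is behaving."""

    def __init__(self, min_delay=1.0, max_delay=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.delay = min_delay      # current gap between requests, in seconds
        self._clock = clock         # injectable so the logic can be tested
        self._sleep = sleep
        self._last_done = None      # when the previous response completed

    def wait(self):
        """Block until `delay` seconds have passed since the last completion."""
        if self._last_done is None:
            return  # nothing has been serviced yet, so go ahead
        remaining = self.delay - (self._clock() - self._last_done)
        if remaining > 0:
            self._sleep(remaining)

    def record(self, ok):
        """Call once the response has been fully serviced (or has failed)."""
        self._last_done = self._clock()
        if ok:
            # recover slowly after a successful response
            self.delay = max(self.min_delay, self.delay * 0.9)
        else:
            # back off quickly after an error, timeout or block page
            self.delay = min(self.max_delay, self.delay * 2)
```

In use, you would keep one throttle per site, call `wait()` before each request, and call `record(ok)` only after the body has been read, with `ok` decided by whatever that site's idea of failure is (an error status, a timeout, a redirect to a block page).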
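For the problems of telling which scraping logic broke and of half-loaded pages with only some of the items present, one cheap defence is to validate every extraction before storing it. The sketch below is only an assumption about how such a check might look: `check_extraction`, the layout id string, and the field names are hypothetical, and it assumes each scraped record is a dict of string values.

```python
def check_extraction(layout_id, records, required_fields, expected_count=None):
    """Return a list of human-readable problems for one scraped page.

    An empty list means the page looks complete. `layout_id` names the
    scraping logic that produced the records, so a failure points at it.
    """
    problems = []
    # A truncated page often shows up as fewer items than the page claims
    if expected_count is not None and len(records) != expected_count:
        problems.append(f"{layout_id}: got {len(records)} of {expected_count} items")
    # A layout change often shows up as required fields coming back empty
    for i, rec in enumerate(records):
        missing = [f for f in required_fields if not (rec.get(f) or "").strip()]
        if missing:
            problems.append(f"{layout_id}: item {i} missing {', '.join(missing)}")
    return problems

# Hypothetical usage: the page said it lists 3 items, we extracted 2
records = [{"name": "A", "price": "1.00"}, {"name": "B", "price": " "}]
for problem in check_extraction("shop/books", records, ["name", "price"], 3):
    print(problem)
```

Logging these per layout makes it easier to see which of a few thousand extraction rules started failing, though it does not help with deciding whether a reload is safe on a stateful site.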