Ask HN: How to Web Scrape in 2020?

marcell · on March 29, 2020

Scrapy for Python is pretty good; check it out.

In most cases, getting banned is the big issue. The bigger the site, the more advanced its bot detection. You can use luminati.io to get residential and mobile IPs, but it's pricey.

Some sites will also obfuscate the DOM, i.e. remove class names and IDs, which complicates data extraction.

http://scrapinghub.com/ has a paid "do it for me" service, which may be an option depending on your budget.

austincheney · on March 29, 2020

Here is a tiny DOM-walking script that evaluates all text semantics in a page, demonstrating that you don't need identifiers in the code.

https://github.com/prettydiff/semanticText
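To illustrate the general idea in Python (this is not the linked script, just a minimal sketch using only the standard library), the walker below collects every visible text node along with its enclosing tag, without looking at any class name or ID. The names `TextWalker`, `extract_text`, `SKIP_TAGS` and `VOID_TAGS` are made up for this example.

```python
from html.parser import HTMLParser

# Tags whose contents are not visible text (assumed list, extend as needed).
SKIP_TAGS = {"script", "style", "noscript", "template"}
# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "source", "wbr"}


class TextWalker(HTMLParser):
    """Collect (enclosing tag, text) pairs for every visible text node."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open tags, outermost first
        self.chunks = []  # (enclosing tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Attributes (class, id, ...) are deliberately ignored.
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not any(t in SKIP_TAGS for t in self.stack):
            parent = self.stack[-1] if self.stack else None
            self.chunks.append((parent, text))


def extract_text(html):
    """Return all visible text nodes of `html` as (tag, text) pairs."""
    walker = TextWalker()
    walker.feed(html)
    walker.close()
    return walker.chunks


sample = (
    '<div class="a1x"><h1>Title</h1><p>Hello<br><b>world</b></p>'
    "<script>var x = 1;</script></div>"
)
chunks = extract_text(sample)
```

From here you can pick out what you need by tag and position (first `h1`, every `p` after it, and so on) rather than by identifiers the site may rename or strip.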

mtmail · on March 28, 2020

Related, from 2 months ago: "Ask HN: What's state of the art for screen scraping these days?" https://news.ycombinator.com/item?id=22148803 where https://simplescraper.io/ was recommended.

krageon · on March 30, 2020

For avoiding bans, having a large IPv6 range can help (e.g. one you might get with a VPS at a proper hosting company). As for grabbing the content itself, I've used a lot of frameworks, but I usually end up back at some combination of simple string search and regex.
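A rough sketch of what "simple string search and regex" can look like in Python: narrow the page down to one region with plain `str.find`, then run a small regex over just that region. The helper `between`, the pattern `PRICE_RE` and the sample markup are all invented for illustration.

```python
import re


def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return None if j == -1 else text[i:j]


# Matches "$5" or "$ 19.99"; captures only the numeric part.
PRICE_RE = re.compile(r"\$\s*(\d+(?:\.\d{2})?)")

sample = (
    '<ul id="x9"><li>Widget <span>$ 19.99</span></li>'
    "<li>Gadget <span>$5</span></li></ul><p>Shipping: $3</p>"
)

# Step 1: cut out the region of interest with plain string search.
block = between(sample, "<ul", "</ul>")
# Step 2: run the regex only inside that region (the shipping price is excluded).
prices = [float(p) for p in PRICE_RE.findall(block)]
```

Cutting the region first keeps the regex short and makes it less likely to match something elsewhere on the page. It is brittle if the markup changes, which is the usual trade-off against a real parser.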

jamil7 · on March 30, 2020

Depending on what you're scraping, you might run into a fair few JS-only websites that are a pain to scrape. On top of all the things mentioned here, you will need to run those pages through a headless browser like Puppeteer. For these sites you may be able to reverse engineer their APIs and scrape those rather than the pages themselves.
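If the browser's network tab shows the page loading its data from a JSON endpoint, scraping that endpoint can be as simple as the sketch below. Everything site-specific here is an assumption: the `page` query parameter, the `items` key in the response, and the function names `fetch_page` and `fetch_all` are placeholders for whatever the real site uses.

```python
import json
import urllib.request
from urllib.parse import urlencode


def fetch_page(base_url, page, opener=urllib.request.urlopen):
    """Fetch one page from a JSON endpoint and return the decoded body."""
    # "page" is a hypothetical parameter name; check the real requests.
    url = f"{base_url}?{urlencode({'page': page})}"
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with opener(request) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_all(base_url, opener=urllib.request.urlopen, max_pages=100):
    """Walk pages 1..max_pages until the endpoint returns no more items."""
    items = []
    for page in range(1, max_pages + 1):
        payload = fetch_page(base_url, page, opener)
        batch = payload.get("items", [])  # "items" is an assumed key
        if not batch:
            break
        items.extend(batch)
    return items
```

The `opener` argument is only there so the network call can be swapped out (for a stub in tests, or for an opener that goes through a proxy). Real endpoints often also want cookies, tokens or extra headers copied from the browser session.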

ariosto · on March 30, 2020

Python + Beautiful Soup