Ask HN: How to Web Scrape in 2020?
23 points by alephnan on March 28, 2020 | 6 comments
Are there particular libraries or scraping-as-a-service UIs you would recommend?

I'm particularly interested in a restaurant review website that has become an increasingly detestable company over the years.




Scrapy for Python is pretty good; check it out.

In most cases getting banned is the big issue. The bigger the site, the more advanced its bot detection is. You can use luminato.io to get residential and mobile IPs, but it's pricey.

Some sites will also obfuscate the DOM, i.e. strip or randomize class names and IDs, which complicates the data extraction.

http://scrapinghub.com/ has a paid "do it for me" service, which may be an option depending on your budget.


Here is a tiny DOM-walking script that evaluates all text semantics in a page, demonstrating that you don't need identifiers in the code.

https://github.com/prettydiff/semanticText
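
To illustrate the idea in Python (this is not the linked script, just a minimal sketch using only the standard library's html.parser), the walker below records each piece of visible text together with its tag path and never looks at class names or IDs. The names TextWalker and walk_text and the sample markup are made up for the example.

```python
from html.parser import HTMLParser

# Tags whose text content is never shown to the reader.
SKIPPED_TAGS = {"script", "style", "noscript", "template"}
# Void elements have no closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextWalker(HTMLParser):
    """Collect (tag path, text) pairs, ignoring class and id attributes."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.found = []  # (path, text) tuples in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched closers are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not SKIPPED_TAGS.intersection(self.stack):
            self.found.append(("/".join(self.stack), text))


def walk_text(html):
    walker = TextWalker()
    walker.feed(html)
    walker.close()
    return walker.found


# Hypothetical page fragment with obfuscated class names and IDs.
sample = """
<div class="x9f2"><h2 id="a1">Luigi's Pizza</h2>
<p class="zz">Great crust, slow service.</p>
<script>var tracking = 1;</script></div>
"""

for path, text in walk_text(sample):
    print(path, "->", text)
```

The tag path (for example div/h2 versus div/p) is usually enough to tell a heading from body text, which is the part that survives when a site randomizes its class names.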


Related, from 2 months ago: "Ask HN: What's state of the art for screen scraping these days?" https://news.ycombinator.com/item?id=22148803 where https://simplescraper.io/ was recommended.


For avoiding bans, having a large IPv6 range can help (e.g. one you might get with a VPS at a proper hosting company). As for grabbing the content itself, I've used a lot of frameworks, but I usually end up back at some combination of simple string search and regex.
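
A rough sketch of what "string search plus regex" can look like in Python, using only the standard library. The markup, the pattern, and the helper names (between, extract_reviews) are invented for illustration; a real page will need its own pattern, and patterns like this break whenever the site changes its layout.

```python
import re

# Hypothetical page fragment; class names are deliberately meaningless.
page = """
<div class="a81x"><span>Luigi's Pizza</span> rated 4.5 out of 5 (212 reviews)</div>
<div class="q07c"><span>Noodle Bar</span> rated 3.8 out of 5 (57 reviews)</div>
"""

# Match on the visible wording rather than on class names or IDs.
REVIEW_RE = re.compile(
    r"<span>(?P<name>[^<]+)</span>\s*rated\s+(?P<rating>\d+(?:\.\d+)?)\s+out of 5"
    r"\s+\((?P<count>\d+) reviews\)"
)


def between(text, start, end):
    """Plain string search: return the text between two markers, or None."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return None if j == -1 else text[i:j]


def extract_reviews(html):
    """Return (name, rating, review count) tuples found in the HTML."""
    return [
        (m.group("name").strip(), float(m.group("rating")), int(m.group("count")))
        for m in REVIEW_RE.finditer(html)
    ]


print(between("<title>Best ramen in town</title>", "<title>", "</title>"))
print(extract_reviews(page))
```

The usual pattern is to cut the page down with find() first and only then run a regex over the smaller slice, which keeps the patterns short.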


Depending on what you're scraping, you might run into a fair few JS-only websites that are a pain to scrape. On top of all the things mentioned here, you will need to run those pages through a headless browser like Puppeteer. For these sites you may be able to reverse engineer their APIs and scrape those rather than the pages themselves.
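
For the API route, a minimal standard-library sketch might look like the following. Everything specific here is an assumption: the endpoint URL, the page parameter, and the JSON field names (reviews, author, rating, text) are placeholders for whatever you actually see in the browser's network tab.

```python
import json
import urllib.request

# Hypothetical endpoint; substitute the one observed in the network tab.
API_URL = "https://example.com/api/v1/restaurants/123/reviews?page={page}"


def fetch_page(page, timeout=10):
    """Download one page of the (assumed) JSON API and return the raw body."""
    request = urllib.request.Request(
        API_URL.format(page=page),
        headers={"Accept": "application/json", "User-Agent": "Mozilla/5.0"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_reviews(body):
    """Pick out the fields of interest from an (assumed) response shape."""
    data = json.loads(body)
    return [
        {
            "author": review.get("author"),
            "rating": review.get("rating"),
            "text": review.get("text", "").strip(),
        }
        for review in data.get("reviews", [])
    ]


# Offline example with a made-up response body, so nothing is fetched here.
sample_body = '{"reviews": [{"author": "sam", "rating": 4, "text": " Solid ramen. "}]}'
print(parse_reviews(sample_body))
```

Keeping fetch_page and parse_reviews separate means the parsing can be tested against saved responses without hitting the site again.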


Python + Beautiful Soup



