It's been quite a while since I last did web-scraping (I used to use BeautifulSoup, more than a decade ago).
I'm just wondering: since a lot of people now use fairly advanced cloud-hosting solutions with, I assume, anti-spam and anti-bot tooling offered by their hosting provider, is web scraping a lot different from what it was about a decade ago? What steps do you guys take to avoid being identified as a bad actor by the site you're scraping?
And on the other end, if you have a data-rich website, what are your feelings toward aggressive scrapers?
Bot-mitigation services like Distil Networks and CDNs like Cloudflare make scraping more difficult than it used to be. If you get caught by them, you can end up blocked from all of the sites they protect, not just the one you were scraping.
Writing some scrapers this week, I noticed it's also common for the origin server to simply check whether the request is coming from a VPN/VPS IP address range.
For example, the exact same request will work from your home connection but fail from EC2.
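For anyone curious what that check looks like on the server side, here's a minimal sketch. The CIDR blocks and the `looks_like_datacenter` helper are illustrative assumptions, not any particular provider's implementation; real setups usually pull published cloud ranges (e.g. AWS's ip-ranges.json) or do ASN lookups.

```python
import ipaddress

# Illustrative datacenter/VPS CIDR blocks (hypothetical selection; real
# deployments load published cloud ranges or use an ASN lookup service).
DATACENTER_RANGES = [
    ipaddress.ip_network("3.0.0.0/8"),      # example EC2-style block
    ipaddress.ip_network("34.192.0.0/12"),  # example EC2-style block
]

def looks_like_datacenter(client_ip: str) -> bool:
    """Return True if the client IP falls inside a known VPS/cloud range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in DATACENTER_RANGES)

# The same request gets flagged from EC2 but passes from a home ISP address.
print(looks_like_datacenter("3.15.20.7"))   # True  (inside an example block)
print(looks_like_datacenter("68.44.12.9"))  # False (residential-looking IP)
```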
It's gotten a lot more challenging than it used to be.
A lot of small things... but basically if you load pages through an actual (headless) browser and cycle IPs, it's pretty hard for a site to pinpoint you as a bot rather than a user. A sketch of that setup is below.
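To make that concrete, here's a minimal sketch of the headless-browser-plus-rotating-IP approach using Playwright. The proxy URLs are placeholders I made up; in practice they'd come from a rotating/residential proxy provider, and this isn't presented as the commenter's actual setup.

```python
import random
from playwright.sync_api import sync_playwright

# Hypothetical proxy pool; swap in endpoints from your proxy provider.
PROXIES = [
    "http://proxy-a.example.com:8000",
    "http://proxy-b.example.com:8000",
]

def fetch(url: str) -> str:
    """Load a page in headless Chromium through a randomly chosen proxy."""
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={"server": random.choice(PROXIES)},
        )
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html

if __name__ == "__main__":
    # Each call picks a fresh proxy, so repeated requests come from different IPs.
    print(len(fetch("https://example.com")))
```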