Fundamentally, what most scrapers learn is that the more their scraper can behave like a human browsing the site, the less likely they are to get detected and blocked.
This does limit how quickly they can crawl, of course, but scrapers find ways around it, such as changing IP address and user agent (the IP is probably the main one, because you can then pretend to be multiple humans browsing the site normally).
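To make the speed trade-off concrete, here is a minimal sketch of irregular, human-like pacing between requests. The function name `human_like_delays` and its parameters are my own illustration, not something from the article, and the exponential distribution is just one plausible way to avoid a fixed interval.

```python
import random


def human_like_delays(n_requests, mean_extra_seconds=8.0, min_seconds=2.0, seed=None):
    """Return n_requests pause lengths (in seconds) with irregular spacing.

    Each pause is a fixed floor plus an exponentially distributed extra,
    so the gaps never fall below min_seconds and never repeat on a fixed
    beat. All parameter values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [
        min_seconds + rng.expovariate(1.0 / mean_extra_seconds)
        for _ in range(n_requests)
    ]
```

You would sleep for each value (for example with `time.sleep`) between page fetches. With these defaults the average gap is around ten seconds, which shows how much throughput a single "human" identity gives up.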
Even changing IPs won't always work against an adversary with a global view of the Internet, such as Cloudflare.
Cloudflare sees a significant chunk of internet traffic across many sites and feeds that into some kind of heuristics or machine learning. Even if we assume that your behavior on the scraped website looks human-like, you may still get blocked or challenged because of your lack of traffic on other sites.
The IPs you'd get from a typical proxy service would only be used for bot activity and would've been classified as such a long time ago, and there's no "human activity" on them to compensate and muddy the waters, so to speak.
The best solution is to use IPs that carry a chunk of legitimate residential traffic, and to keep each scraping session pinned to its own IP. Don't rotate your requests among all these IPs; instead, every IP should be its own instance of a human-like scraper, using its own user account, browser cookies, etc.
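A rough sketch of what "every IP is its own scraper" could look like, using only the Python standard library. The `Identity` class and `build_identities` helper are hypothetical names for illustration, and the proxy URLs and user agent strings are whatever you supply; nothing here is a specific proxy provider's API.

```python
import urllib.request
from dataclasses import dataclass, field
from http.cookiejar import CookieJar


@dataclass
class Identity:
    """One human-like persona: an account pinned to one proxy and one cookie jar."""
    account: str
    proxy_url: str
    user_agent: str
    cookies: CookieJar = field(default_factory=CookieJar)

    def opener(self):
        # Every request made through this opener uses this identity's proxy,
        # cookie jar, and user agent, and nothing from any other identity.
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler(
                {"http": self.proxy_url, "https": self.proxy_url}
            ),
            urllib.request.HTTPCookieProcessor(self.cookies),
        )
        opener.addheaders = [("User-Agent", self.user_agent)]
        return opener


def build_identities(accounts, proxies, user_agents):
    """Pair each account with exactly one proxy; no rotation across identities."""
    if len(proxies) < len(accounts):
        raise ValueError("need at least one proxy per account")
    return {
        account: Identity(account, proxy, user_agents[i % len(user_agents)])
        for i, (account, proxy) in enumerate(zip(accounts, proxies))
    }
```

The point of the design is that the account-to-IP mapping is fixed once, so the site never sees one account's cookies arriving from another identity's address.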
You nailed it! I've also faced issues in the past with captchas and elaborate bot detection mechanisms. It would also be helpful to mention that there are automatic captcha solvers that can get past a challenge once the scraper has been detected. I am wondering if it is worthwhile to add a section to this post on how to improve the efficacy of scraping despite these roadblocks. The article is geared towards beginner scrapers who are just starting out, so maybe it would be overkill? What do you think?