That's a very low-quality article, in my opinion. It takes an entire article to show how to use a simple tool and fetch a list of proxies, uses a Makefile where a shell script would do just fine, and the title oversells the content.
In addition, it does nothing to mitigate anything beyond the most trivial scrape detection. It doesn't talk about request fingerprints, access patterns, user agents and headers, multi-region access, etc.
This is a hobbyists' guide to scraping under the radar. Fine at that scale, but quite incomplete for anything remotely mature or wide-reaching.
Same thoughts. Sadly, we live in a world where that is good enough to get recruiters' attention. Some time ago there was a post on HN about some bullshit like "everyone should have a blog".
Do you know of any professionals with exceptional experience who share it on their blogs? For example, if someone is interested in .NET I can recommend this one: https://www.wiktorzychla.com
I came here to make this same comment. This article barely begins to scratch the surface on this topic. Effective, high volume scraping involves more than just rotating through a list of proxies.
On a website I'd written we had pseudo-randomly generated URLs to show dynamic content (it was a game, the URL contained parameters). On each page we had this little widget that included five random configurations people might like to try.
A few times our website went down due to the load average going above 30. Eventually I discovered Google was doing something funky; adding the dynamic domains to the robots.txt rules fixed the issue. Then some other search engines / scrapers seemed to run into the same problem and started requesting hundreds of thousands of URLs per day (these pages were dynamically generated and took a moderate amount of compute power).
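For what it's worth, the robots.txt side of that kind of fix is just a Disallow rule; a minimal sketch, with a made-up /game/ path standing in for the dynamic URLs:

```
# Sketch only: keep all crawlers out of the dynamically generated
# game pages ("/game/" is a made-up example path).
User-agent: *
Disallow: /game/
```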
We eventually did have to implement basic anti-scraper rules because it was degrading the user experience.
Careful. Using open proxies could be considered unauthorized access in some jurisdictions. Some of these proxies were installed without the user's permission.
Some time ago I was looking for an apartment to buy. Sites in my country are bloated and terribly slow. Checking several offers took minutes. Moreover I live in a city where good offers are sold the same day they are published.
I decided to write a scraper to fetch all the data about available apartments in my city. Thanks to that, I was able to browse offers at the speed of Tinder. It took me a few hours to write, and it probably saved me weeks.
To avoid getting caught I decided to set up Tor on my Raspberry Pi and use it as a proxy. It was extremely easy and reliable. The sites were so slow that I didn't notice a significant performance drop. I didn't have to care about changing proxies because Tor handled that for me.
Except that it is a good idea to change User-Agents and add some random delays between calls. Luckily, for this case that was enough.
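A minimal sketch of that combination with requests: route traffic through Tor's local SOCKS port and pick a random User-Agent and delay per request. The port is Tor's default, and the example URLs and UA strings are just placeholders:

```python
import random
import time

import requests

# Tor's default SOCKS port; requests needs the socks extra installed
# (pip install requests[socks]). "socks5h" makes Tor do the DNS lookups too.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

# A few example desktop User-Agent strings to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def fetch(url):
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    resp = requests.get(url, headers=headers, proxies=TOR_PROXIES, timeout=30)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    for url in ["https://example.com/offers?page=1",
                "https://example.com/offers?page=2"]:
        print(len(fetch(url)))
        # Random delay between calls so requests don't arrive in lockstep.
        time.sleep(random.uniform(2.0, 8.0))
```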
I had to scrape past a per-IP access limit once, and just went with IPv6 addresses, of which IPv6 users have plenty. (This was almost five years ago, so it's conceivable that some services have since wised up a bit and now block the entire IPv6 block.)
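A rough sketch of the idea, assuming you have a routed /64 and its addresses are actually bindable on your machine; the prefix below is documentation space, not a real allocation:

```python
import ipaddress
import random
from http.client import HTTPSConnection

# Documentation prefix used as a stand-in for your own routed /64.
PREFIX = ipaddress.IPv6Network("2001:db8:1234:5678::/64")

def random_address(net):
    """Pick a random host address inside the prefix."""
    offset = random.getrandbits(128 - net.prefixlen)
    return str(net.network_address + offset)

def fetch(host, path="/"):
    # Bind the outgoing connection to a random source address. This only
    # works if the address is actually assigned/routable on your machine
    # (e.g. via Linux AnyIP: "ip route add local 2001:db8:1234:5678::/64 dev lo").
    conn = HTTPSConnection(host, source_address=(random_address(PREFIX), 0))
    conn.request("GET", path)
    resp = conn.getresponse()
    resp.read()
    conn.close()
    return resp.status

if __name__ == "__main__":
    print(fetch("example.com"))
```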
>after introducing proxies my crawl times grew by an order of magnitude from minutes to hours
Yeah, same experience. Right now I use luminati.io datacenter IPs that work ok, anyone know of a cheaper option that works well? Scraping tens of millions of pages a month.
I suppose the economics of it comes into play in a similar vein to Mailchimp: the lower the pricing, the scammier the clients and the more IPs they lose to blacklists.
Not really. Mail is bad by default; you need to build up trust just to get a tiny amount of deliverability. Fetching webpages is good by default, until you're detected as bad.
The thing is, a lot of scraping goes unnoticed. Maybe you get an extra thousand hits here and there. But every spam campaign gets noticed and results in some percentage of spam complaints from users.
You could use https://oxylabs.io/ or buy some regular VPN accounts and build an HTTP proxy wrapper around them. Extremely cheap and it works well. There are a lot of existing projects on GitHub for that too.
Depends on your scale and what you negotiate. The shared nature of VPNs is usually not a huge problem, but it depends on your use case. Most VPN providers have a better deal for bigger customers, so you can just buy multiple accounts in bulk, each with, for example, 5 connections, and use that. For the US specifically, they often have hundreds of servers and VPN configs, so you can build something out of that.
Depending on what you scrape, we can help. We have proxies in 170+ countries; please get in touch at www.speedchecker.xyz. The pricing is cheaper than what you mention above.
I assume simple BFS and DFS traversal behavior shows up brightly in access logs, making detection more likely. Does it help to use Random First Search[1]? Or is it better to attempt emulating human actors (which requires much more development effort)?
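For what it's worth, the change from BFS to a random-first traversal is tiny; a sketch of the frontier logic (link extraction omitted, and the names here are made up):

```python
import random

def crawl(seed_urls, get_links, max_pages=1000):
    """Random-first traversal: pop a *random* URL from the frontier
    instead of FIFO (BFS) or LIFO (DFS), so access logs don't show a
    neat level-by-level or branch-by-branch pattern."""
    frontier = list(seed_urls)
    seen = set(seed_urls)
    visited = []
    while frontier and len(visited) < max_pages:
        # The only difference from BFS/DFS is this random pop.
        url = frontier.pop(random.randrange(len(frontier)))
        visited.append(url)
        for link in get_links(url):  # get_links: fetch the page, return its URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```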
I don't know if scrapy handles this, but I've run into issues with sites fingerprinting my browser. Proxies help, but there are other ways to identify site visitors aside from IP addresses.
Statistics, I'd say, is one of the more useful applications of scraping.
Back in 2013, a guy scraped the results of about 150,000 students giving their 10th grade finals for a particular examination board in India. He showed that not only was there no privacy for students' marks, because the roll numbers were all linearly incremented, but that there was also mass-scale manipulation of marks going on.
The concept is simple but it's a very interesting read.
Fascinating read. I can't find out more about what happened to him after he was accused of 'hacking' the govt systems. Do you have any more sources that shed some light on that?
Scraping websites like AliExpress and Ebay for trademark and brand infringement. There are many vendors who sell fake tat on the internet, and there are companies out there who track it down on behalf of the brand owners.
These companies tend to have armies of lawyers who can swat away even the likes of eBay when it comes to justifying web scraping. Nevertheless, the work required is still the same tedium that others deal with: CAPTCHAs; throttling; IP bans; etc.
I'm kind of interested in an answer to this as well.
I know the typical "travel site" or "comparison shop" use case.
There is also the "darn it, I want this" use case.
However, automated, periodic web scraping that mutates state (i.e. ticket- or reservation-grabbing bots) has always felt a bit squicky to me, in the same way DNS squatting does.
A website you like that invites discussion has a very shitty comment system that doesn't notify you of responses to your comments.
Or perhaps you want to monitor mentions of your name, and join the discussion.
Or you want to not lose your past thoughts, because discussions were deep and some may be interesting to re-read in the future.
Or a website is known to allow users to delete their comments, or the website itself bans users and hides all their content from others for flimsy reasons like "ISIS" in the title, no matter whether it's pro/against/neutral/irrelevant.
Or you want to organize the information differently than the site does.
Or the website is bloated and slow as hell, and you want to use it over gprs, so you create a lightweight/fast/better organized/filtered mirror.
Or the website doesn't have search or it is crap/slooow, and is not indexed in google, like some IRC logs out there.
I've used it to convert publicly available data to a more useful format. A local government agency made the data freely available as a series of thousands of HTML pages, but not in any other form. I didn't need to use any evasive tactics, but it was still wholesale repeated scraping.
I have no idea what tools are available for denying access to web scrapers. This is something I should know, given that I have built a few websites and know what to do to get pages serving quickly. Somehow I missed the memo on how to set your site up not to be scraped. Is there an nginx setting for that?
This could be interesting for people who do scrape sites to know too: what basic, reasonable measures can one take beyond looking at logs and doing IP bans?
The cheap technique is rate limiting, but there are also "canary pages" (linked invisibly, so they should never be accessed by humans) and other techniques for checking whether the flow through the site looks as expected.
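To the nginx question above: the built-in limit_req module covers the basic rate-limiting part. The numbers here are arbitrary examples to tune against your own traffic:

```nginx
# In the http {} block: track clients by IP, allow roughly 10 requests/second.
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    location / {
        # Allow short bursts, reject the rest with 429 instead of the default 503.
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 429;
    }
}
```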
IPs are naive and too broad. Force users to authenticate to see your content. Unauthenticated viewers get to see only a scrambled version. Now you can track usage by username.
Establish a baseline as to what constitutes "normal" browsing.
Any user exceeding a threshold of page requests gets rate limited/banned.
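A toy sketch of that kind of per-user threshold, using a sliding one-hour window; the numbers and names are arbitrary and the baseline would come from your real logs:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600   # look at the last hour of activity
MAX_REQUESTS = 500      # "normal" browsing baseline; tune from real logs

_requests = defaultdict(deque)  # username -> timestamps of recent requests

def should_block(username, now=None):
    """Record one request and report whether the user is over the threshold."""
    now = time.time() if now is None else now
    q = _requests[username]
    q.append(now)
    # Drop timestamps that have fallen out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_REQUESTS
```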
Stopping people from recording public information is unrealistic, but you can certainly make it more of a pain in the ass.
I'm sort of on the other side of this problem, but I think an interesting counter-measure came up in a recent Show HN[0].
IP bans alone are probably not enough to stop a motivated scraper, or to work long-term, so checking whether or not traffic is coming from a cloud provider and throwing up a CAPTCHA would be a significant hurdle, while still allowing humans on cloud-hosted VPNs to pass through without much trouble.
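A sketch of that cloud-provider check, assuming you've already pulled the providers' published ranges (e.g. AWS's ip-ranges.json) into a list of CIDRs; the two networks below are documentation placeholders, not real provider ranges:

```python
import ipaddress

# CIDR blocks loaded from the providers' published range files
# (AWS ip-ranges.json, GCP cloud.json, etc.); placeholders shown here.
CLOUD_CIDRS = [
    ipaddress.ip_network("203.0.113.0/24"),   # placeholder, replace with real ranges
    ipaddress.ip_network("198.51.100.0/24"),  # placeholder
]

def is_cloud_ip(remote_addr):
    """True if the client address falls inside a known datacenter range,
    in which case the app would serve a CAPTCHA instead of content."""
    ip = ipaddress.ip_address(remote_addr)
    return any(ip in net for net in CLOUD_CIDRS)
```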
I can't speak for that particular service, but in other similar ones I've looked at and used in the past, you're paying for a certain number of IPs _at a given time_.
So, for instance, they have a pool of servers that have 1000 IPs available. Your account allows connections to go out over 2 of those at a time. If something happens (like one gets banned by whatever service you're scraping), you can get a different set of 2 IPs and keep moving.
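In code, that workflow tends to look something like this sketch: keep a small active set out of the larger pool and swap a proxy out when it gets banned. The class and pool contents are made up; a real provider hands you the pool and the swap mechanism:

```python
import itertools
import random

class ProxySlots:
    """Keep N proxies 'active' at a time out of a larger pool, and swap
    one out whenever the target site bans it."""

    def __init__(self, pool, active_size=2):
        self.pool = list(pool)
        self.active = random.sample(self.pool, active_size)
        self._rr = itertools.cycle(range(active_size))

    def next_proxy(self):
        # Round-robin over the currently active slots.
        return self.active[next(self._rr)]

    def replace(self, banned_proxy):
        # Swap a banned proxy for a fresh one that isn't already in use.
        candidates = [p for p in self.pool if p not in self.active]
        if banned_proxy in self.active and candidates:
            idx = self.active.index(banned_proxy)
            self.active[idx] = random.choice(candidates)
```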
While you're still paying a relatively high price for what you're consuming (predominantly bandwidth in this case), you're paying for the flexibility.