Even though I only do it for hobby projects, crawling pages is becoming increasingly difficult unless you are a big player like Google or Microsoft with a whitelisted IP range.
I've had some success in scraping lately with a similar project called FlareSolverr(1).
Its purpose is to get you access to sites which won't let you crawl unless you are using a real browser (e.g. Amazon, Instagram). It doesn't hide your IP, but uses Puppeteer with stealth mode to get you access to otherwise restricted URLs.
For one pet project I had to crawl a rather popular site. While that worked in general I would frequently get internal server errors. Turns out that this was the response of their CDN when it detected bot-like behavior. That left me wondering how they prevent search engine crawlers from being detected as bots and getting throttled this way as well. Turns out they just check the user agent for that. As soon as I put "Googlebot" in my user agent the frequent errors vanished. So sometimes it's not about using the right IP addresses, but just the right keywords in your user agent. ;-)
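Roughly what that trick looks like, as a sketch in Python with requests (the URL is just a placeholder, not the site in question):

    # Sketch: pretend to be Googlebot via the User-Agent header.
    import requests

    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    }
    resp = requests.get("https://example.com/some/page", headers=headers, timeout=30)
    print(resp.status_code)  # the frequent "internal server errors" stopped once the UA changed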
This probably means incompetence on the CDN's part - Cloudflare has a detection rule for fake Googlebots, and it checks by doing a reverse DNS lookup to see if it's really Google. Doing this trick is more likely to get your IP marked for spam, at least if you crawl CF sites with it.
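For reference, the check Google itself recommends (and which that CDN apparently skipped) is a reverse DNS lookup on the client IP followed by a forward confirmation. A rough sketch using only the Python standard library; the sample IP is just a commonly cited Googlebot address:

    # Sketch: verify a "Googlebot" claim via reverse DNS + forward confirmation.
    import socket

    def is_real_googlebot(ip):
        try:
            host, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            # Forward-confirm: the hostname must resolve back to the same IP.
            return ip in {info[4][0] for info in socket.getaddrinfo(host, None)}
        except socket.gaierror:
            return False

    print(is_real_googlebot("66.249.66.1"))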
How does Google get around cloaking? Don't they need to visit a site from time to time without coming in as Googlebot to make sure they're getting presented the same page as Googlebot?
They might pretend to be on other networks they own when they do things like ad/policy reviews (eg. the Google Fi or Google Fiber ASNs[0]), but I don't know of anyone confirming that this happens.
Amazon's marketplace APIs are available to developers who register through a seller account. Pricing is $40 per month.
The main benefit of using the API is that you can request a LOT of data without hitting their rate limit. Unless you need to get dozens of results per second, you are usually better off with a spider (or just use Huginn). And if you are hit with a 503 and a captcha, gluing a free captcha solver with some middleware is a trivial task.
But other comments say scraping Amazon is kind of complicated because they ban IPs? I'm not sure: if you have a seller/affiliate account and then use your home IP to do scraping, will that impact your seller/affiliate account?
You shouldn't use the same IP to continuously scrape Amazon. Personally I use an $8/month rotating proxy service that gives me a new proxy list every hour (webshare.io if it piques your interest; I'm in no way associated with them).
Also, in my experience Amazon's "ban" comes down to solving a captcha on every request, so it's more like some mild throttling than a real ban.
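Roughly how the rotation looks in practice, as a sketch (Python + requests; the proxy addresses, credentials and product URL are placeholders):

    # Sketch: cycle through a proxy list so consecutive requests use different IPs.
    import itertools
    import requests

    PROXIES = itertools.cycle([
        "http://user:pass@198.51.100.10:8080",
        "http://user:pass@198.51.100.11:8080",
        "http://user:pass@198.51.100.12:8080",
    ])

    def fetch(url):
        proxy = next(PROXIES)
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

    resp = fetch("https://www.amazon.com/dp/B000000000")
    print(resp.status_code)

Refresh the list however often your provider rotates it (hourly, in my case).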
I'm using a ML-based captcha solver that is free software available on Github. So far it has solved 100% of the Amazon captchas it has encountered.
The reason Amazon has a reputation for being good at blocking IPs is that their responses are (purposefully?) obscure. The way it works, it filters out script kiddies and lets engineers through, who are probably a small minority of the people scraping Amazon.
I also haven't looked into this space outside of a quick DDG search after reading your comment. It looks like the big hole is that Amazon rolled their own captcha a while ago and haven't kept up with what automation can do now.
There's a bot in an IRC channel I've been on for over a decade that announces the <title> of any link mentioned in the chan. It's becoming less and less useful: it runs on someone's VPS, and a lot of sites behind Cloudflare don't yield anything, as they return the "checking your browser" page to the bot. Then there are pages that are pure JavaScript and don't even deliver a title tag, and others try to show a GDPR banner or paywall and thus yield some generic title and not whatever article the link is supposed to show.
I guess it's time for a user-side script that sends the HTTP request through their daily-driver browser to see what it is, but then they're getting their home computer to visit any and every link... Maybe only when it sees Cloudflare DNS...
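For what it's worth, the title-announcing part itself is trivial; it's the blocking that kills it. Something along these lines, as a sketch in Python with requests, minus the IRC plumbing:

    # Sketch: fetch a page and pull out its <title>, the way such bots do.
    import re
    import requests

    def page_title(url):
        resp = requests.get(url, timeout=10, headers={"User-Agent": "Mozilla/5.0"})
        match = re.search(r"<title[^>]*>(.*?)</title>", resp.text, re.IGNORECASE | re.DOTALL)
        return match.group(1).strip() if match else None

    print(page_title("https://news.ycombinator.com/"))

Against a Cloudflare challenge page this just returns the title of the "checking your browser" boilerplate, which is exactly the failure mode described above.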
I have my crawlers running on a 24/7 notebook at my house because of that. No blocks, whatever rate I use. Deployed to DigitalOcean, they run for 5 minutes, then get blocked forever.
Essentially. All you're doing here is making it easy for the target sites to mass block a bunch of scrapers. Running it off your device is generally better.
This is relatively easy to block, given that these companies publish their IP ranges. Similar to the Tor block, blocking "datacenter IPs" simply becomes the norm. This is why companies have been offering "resis", or residential proxies, to bypass such blocks for some years now.
There have been a number of Chrome extensions which monetize by selling your bandwidth to provide this kind of service. Not ethical but probably not a compromise.
This is not allegedly but very much true. I was in the proxy business for a few years and was in talks with the Luminati people to white-label their product for a very specific type of proxy.
It’s such a weird field to be in. It’s not illegal by definition of law, but you’re definitely in shady territory, with most of the customers being of the “get rich quick” persuasion. Or at the very least trying to cut corners. One way or another, they were not playing by the rules ;-)
A significant portion are grey market "ISPs" that purport to sell residential services but actually never do.
They sell these "residential" IPs to Amazon, other ecommerce retailers and shady people for an extreme price.
In the e-commerce world, scraping is necessary to stay in business. Amazon has armies of scrapers constantly monitoring their competitors, and in some cases automatically undercutting price updates.
Same on the consumer side. I like setting up alerts on camelcamelcamel and seeing price history (to make sure I'm not getting screwed), and to buy buy buy when prices hit a certain threshold.
I've heard of services that offer free proxy bandwidth in exchange for the user acting as a residential proxy node for other users and/or paying clients. It's generally marketed as a way for the user to avoid geoblocks and such. If this is clearly stated to the user upfront rather than buried behind half a dozen dark patterns and fine print, it seems like this business model could be conducted ethically. When it comes to whether the companies in this space are currently acting ethically though, I have serious doubts.
You have autobuying bots (sneakers or GPUs or other limited edition stuff) that need to do tricks like this. I wouldn't necessarily call them unethical.
In my opinion it's unethical if the website explicitly states in their terms of service that bot purchasing is not allowed. As far as I know, that is pretty common for websites that sell frequently scalped products, and near-ubiquitous among those that implement technical countermeasures.
I'd expect cybercriminals and fraudsters also find a pool of disposable residential IPs to be very useful.
A Terms of Service is not a proxy for morals though. Companies will put whatever is convenient for them in there, I'm sure you can come up with several examples that you wouldn't consider ethical.
I worked on a fairly large web-scraping project (around 2 million pages per day) and we used luminati. Amongst other things, they offer genuine residential proxies with user consent.
Reading that page, the “user consent” is dependent on third parties who are monetizing their app through this service to inform their users. I … um … doubt the third party app developers give a crap to accurately describe the traffic that will subsequently emanate from their users’ devices.
Just like everything else in this industry, the retort will be “but it’s the users fault! They didn’t scroll through the 2,375 page privacy policy, user agreement, hold harmless indemnification agreement, and terms of use when they agreed to an ad free experience for their mahjong app! What a stupid user! Ha!”
To that I say, enjoy the coming oppressive regulations. It’s already started. As a kid, I always wondered why we required stupid laws to regulate common sense. Now I know why.
We don't need oppressive regulations; we simply need courts to adopt a sane definition of "agree".
If you're foisting an adhesion contract on more than 1,000 people, they are not deemed to have "agreed" unless a majority of a random sampling (say, ten) of them actually read and understood the entire document. Otherwise it's void. "Read and understood" is decided by a jury as part of any litigation involving the contract. "Random sampling" is made by court evidentiary procedures.
If the contract is negotiated or it was presented to less than 1,000 people the rules stay the way they currently are, since those are the kinds of contracts that English common law was developed for.
Luminati runs on Hola VPN, which installs backdoor proxies on people's machines. That "user consent" is not actually known to the users but is buried in some terms & conditions they clicked through when naively installing the "free VPN" software.
On a related note, the US Supreme Court just 10 days ago vacated a previous ruling against LinkedIn blocking a scraper service. So the issue is back at the 9th Circuit for a new determination that might, if reversed, change the landscape for Cloudflare and anyone else these operators try to sit behind.
You are basically correct, although it seems to hinge on the definition of an authorized use(r); also, they only vacated the appeals court decision, so Microsoft will continue with the same case, now in the lower court.
There is a good discussion of the more nuanced situation here:
I indirectly worked on a project for a company trying to create an in house web indexer to track user generated content to find trends. Like 80% of the work was making it look like the web scrapers weren't coming from an IP in the cloud.
The IPs for the different cloud providers are super well known, and the big guys either put you behind captcha hell, or just flat out block you if you're coming from one of them.
As someone who has to deal with a lot of bots, bot networks and other weird scraper apps people use: the biggest issue is that most of these tools are not very well behaved. This tool is clearly designed to circumvent protections against scraping - rate limits, mostly - that might be essential to keep things running.
They follow links that are explicitly marked as do-not-follow, they don't even try to limit their rate, they spoof their user agent strings, etc. These bots cause real problems and cost real money. I do not think that kind of misuse is ethical. In fact, using this tool to circumvent protections can turn your scraping into a DDoS attack, which I do not feel is ethical either.
If your bot behaves itself though, public information is public imo. Just don't take down websites, respect rate limits and do not follow 'no-follow' links.
To give an idea of the size of the issue, we have websites for customers that have maybe 5 hits per minute from actual users. Then _suddenly_ you go to 500 hits/minute for a couple of hours, because some bot is trying to scrape a calendar and is now looking for events in 1850 or whatever. (Not the greatest software that these links are still there tbh, but that is out of my control.)
Or another situation, not entirely related, but interesting I think:
A few years back, for days on end, 80% of our total traffic came from random IPs across China. Requests could be traced through HTTP referrers, and the 'user' had apparently opened a page in one province, then traveled to the other side of China and clicked a link 2 hours later.
All these things are relatively easy to mitigate, but that doesn't make it ethical.
Just so you know, adding rel="nofollow" has never been intended to prevent bots from following those links. Even famous crawlers like Bingbot will sometimes follow and index pages linked to by "nofollow" links.
The only thing that rel="nofollow" does is tell search engines not to use that link in their PageRank computation.
If you do want to block well-behaved crawlers from crawling parts of your site, the proper way to do that is to use robots.txt rules.
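For the crawler side, this is roughly what honouring those rules looks like, as a sketch using only the Python standard library (the URLs are placeholders):

    # Sketch: a well-behaved crawler consults robots.txt before fetching.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    url = "https://example.com/calendar/1850/01"
    if rp.can_fetch("MyCrawler/1.0", url):
        print("allowed to fetch", url)
    else:
        print("robots.txt disallows", url)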
Exactly. We'd get customers whose sites would drown in bot traffic because they pulled the same shit: changing user agents, different IPs, etc. I had to build custom ModSecurity rules to block the patterns these bots would pull. What's funny is that the bots would have a site where you could control the crawl rates, but it's just a placebo. They would crawl even if you requested them to stop.
The issue with bots hosted on AWS or any cloud for that matter is that as a web host you can't just block the IPs because legitimate traffic comes from them in the form of CMS plugins, backups, etc.
"... because some bot is trying to scrape a calendar and is now looking for events in 1850 or whatever."
Been there, done that - at least on the side of fixing it. Anyone who implements a calendar: don't make previous and next links that let someone travel through time forever.
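One way to close that hole, sketched with Flask purely for illustration (the year bounds are arbitrary):

    # Sketch: refuse calendar pages outside a sensible window, so crawlers
    # can't walk "previous month" links back to 1850.
    from datetime import date
    from flask import Flask, abort

    app = Flask(__name__)
    MIN_YEAR, MAX_YEAR = 2000, date.today().year + 2

    @app.route("/calendar/<int:year>/<int:month>")
    def calendar(year, month):
        if not (MIN_YEAR <= year <= MAX_YEAR and 1 <= month <= 12):
            abort(404)
        return f"events for {year}-{month:02d}"

(And don't render the previous/next links at the boundary either, or at least disallow the calendar paths in robots.txt as mentioned above.)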
I've always wondered how much bot traffic costs us, but never actually tried to figure it out. It is a good portion of our traffic, even when we block a lot of it.
I think it depends on the industry. I've worked at a few ecommerce companies and they all either bought data collected with scrapers or had a scraper team. They also paid for bot protection on their website to stop other companies from scraping their data. I have no problem with this, as ultimately consumers get more competitive, and usually lower, prices. That is assuming that they are scraping at a reasonable rate and not sending 500 requests a second.
What does concern me is the other uses that scraping tools have. For example, what's to stop me from writing a bot specifically to fuck with a competitor's analytics and a/b testing?
It doesn't really matter what's ethical or not, or what the sites wish - what matters is what the law says [1]. I don't want my neighbours smoking on the balcony below me, as I can smell their smoke, but the law doesn't allow my building to ban it without consensus. Alas...
> I don't want my neighbours smoking on the balcony below me, as I can smell their smoke but the law doesn't allow my building to ban it without consensus.
That's simply because you're in multi-family housing, it would be a different story if the neighbors smoked and put up a fan to blow the smoke into your yard/window/etc, and that's probably a more apt comparison to bots that literally fork bomb themselves to crawl your site at 500 requests per second.
Why can't we talk about ethics? There are a lot of things in this world that are legal, yet I choose not to do them because they are not ethical in my opinion.
We are allowed to talk about ethics separate from the law.
I agree, it would be impossible for sites to block individual customers behind carrier grade NAT without blocking other customers sharing that NAT endpoint.
Of course, the websites can raise an abuse complaint with the ISP, and they may take further action.
A similar approach I use is the Google Cloud SDK + an SSH SOCKS5 proxy.
Create some preemptible instances on Google Cloud first, then connect to them with commands like "gcloud compute ssh instance-name -- -D localhost:port".
The last step is to connect the scraper to those proxy ports over localhost.
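A sketch of that last step, assuming requests with SOCKS support (pip install requests[socks]); port 1080 stands in for whatever you passed to -D:

    # Sketch: route the scraper's traffic through the local SSH SOCKS5 tunnel.
    import requests

    proxies = {
        "http": "socks5h://localhost:1080",
        "https": "socks5h://localhost:1080",
    }
    resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30)
    print(resp.text)  # should show the GCP instance's IP, not yours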
I've been using Lambda to scrape an API that technically is only available in their app. The API uses AWS, so running the scraper on Lambda seems to have hidden it well enough. I doubt they want to start blocking Amazon IPs.
Hi all - I'm the creator of CloudProxy and only just came across this. Appreciate all the comments and stars (over 600 in less than two days). All the points shared are valid: you won't get the same effectiveness as residential IPs, and you may face issues with the proxies being blocked. That being said, if you find they're not being blocked, this solution is a lot cheaper and very quick. I've used it to scrape some major websites extensively without issue. I created it for my own use but then decided to open source it. Just hope you guys find it useful!
Hiding behind any of the larger clouds or VPNs looks futile. If there's one place to hide a scraper nowadays, it's Apple's iCloud+ or whatever stupid name they've picked - two proxies to mimic Tor.
I wonder if there's some internal APIs the cloud providers share to let them know when customers release ownership of an IP so that its reputation can be reset.
I have a webshare.io subscription for stuff like this, it's something extremely negligible like $3/mo. for 100 addresses and some liberal bandwidth cap. Always happy to self-build, but sometimes it just doesn't make any sense
I’ve wanted to build something similar with Lambda. I ran an experiment recently and there is pretty good distribution of IP space on AWS Lambda… you could fan out to 20+ pretty easily.
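Something like that fan-out, sketched with boto3 ("fetch-url" is a hypothetical Lambda function that takes a URL and returns the page body; concurrent invocations tend to land on separate execution environments, and often separate egress IPs):

    # Sketch: fan a URL list out across concurrent Lambda invocations.
    import json
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    lam = boto3.client("lambda")

    def fetch_via_lambda(url):
        resp = lam.invoke(FunctionName="fetch-url", Payload=json.dumps({"url": url}))
        return json.loads(resp["Payload"].read())

    urls = [f"https://example.com/page/{i}" for i in range(20)]
    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(fetch_via_lambda, urls))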
Unpopular opinion around here, but scraping Web pages is often unethical, and the scourge of many Web developers and infrastructure managers.
For Christ sakes people, develop your own products!
If you have to develop and use a cloud based tool to "get around" businesses blocking your scraping app, maybe reevaluate your business and life choices.
No business is under any obligation to provide you access to their data and services for you to build a business on top of.
If you feel that the owner of data or services maintains an illegal monopoly over that data, work to lobby government to resolve that issue.
It's not your individual place to unilaterally decide that a business owes you anything other than what is described in their terms and conditions of use. Period.
I used to have a similar attitude - most typically applied to rules & laws, etc. “If you don’t like the rule, go run for office and get the law changed.”
But then I realized that’s just a cop out. A lot of structures and systems are in place specifically to make it difficult for people to change things and specifically to maintain monopolies.
Nobody’s going to lobby government to resolve this issue. That’s the point. Real estate companies are super happy to keep the status quo.
Sometimes you have to break some rules to innovate. There is a line of course (where you draw it depends on your own ethical code). But I certainly wouldn’t put “scraping real estate data” behind the line.
Scraping a website in your analog analogy would be a store where those index cards would be plastered to the shop’s windows, visible from the outside. The “scraper” would come by every day and manually copy (as in, write in their own notebook) what was on those index cards that are visible from the street.
So the question is: would you consider that stealing?
It's more like you going to the public library to borrow every single book they have. Thereby forcing the library staff to handle all your requests. You would probably not expect to be allowed to do this in real life.
Stop trying to lawyer this through analogies, people; think about the actual situation at hand instead of drawing broken parallels to some hypothetical.
> no business is under any obligation to provide you access to their data and services for you to build a business on top of.
I mean, they're the ones who gave it to me, I'm just using it. Not my problem if they don't like it. I never saw any terms and conditions when my browser loaded the page, no reason I'm going to go looking for them before I scrape it. I have just as much right to access it as my web browser does or my phone does.
I'm not saying they owe me anything, they're offering it to me, I'm just taking it.
> I'm not saying they owe me anything, they're offering it to me, I'm just taking it.
No, you are not. If you are deploying containerized proxy services on cloud computing hosts, orchestrated by yourself, to circumvent mechanisms put in place to prevent such behavior, you are deliberately and unethically acting in a manner directly not intended by the owner of that data.
How so? They're still offering it to me, just on a different IP. I fail to see how that becomes unethical. I'm accessing information they publish to the public at large.
Or if you don't want info public, don't make it public and then complain that people are using it. If you want to stop scrapers, offer an API. Nobody scrapes by choice if there's a supported alternative for getting the same data.
Users are also victims: captcha walls are basically an anti-scraper countermeasure, which everybody pays for in reduced web usability.
"Proof of work" style countermeasures (where data has to be decoded browser-side in expensive ways that are bearable for a regular user but onerous for a mass scraper) is another externality everybody pays for (the total amount of CPU wasted by regular users is at the system level a total waste).
Most use cases for web scraping are related to data analysis, creating machine learning datasets, etc. Google and other search engine services scrape the web too.
Maybe it's a better idea to start or subscribe to an ethical scraping initiative, with a user agent header that references some kind of policy and a verifiable membership ID.
That's the existing default: if a service doesn't mind scraping, they need not take any countermeasure (other than simple rate limiting with a clear status code). Any site that does go beyond that is telling data thieves that they're not welcome.
(1) https://github.com/FlareSolverr/FlareSolverr