How to scrape anything on the web and not get caught (tinyendian.com)
184 points by karolmajta on April 21, 2018 | 55 comments


That's a very low-quality article, in my opinion. It takes an entire article to show how to use a simple tool and fetch a list of proxies, it uses a Makefile where a shell script would do just fine, and the title oversells it.


In addition, it does nothing to mitigate even remotely serious scrape detection. It doesn't talk about request fingerprints, patterns, user agents and headers, multi-region access, etc.

This is a hobbyists' guide to scraping under the radar. Fine at that scale but quite incomplete for anything remotely mature or wide reaching.


Same thoughts. Sadly, we live in a world where it is good enough to get recruiters' attention. Some time ago there was a post on HN about some bullshit like "everyone should have a blog".

Do you know of any professionals with exceptional experience who share it on their blogs? For example, if someone is interested in .NET I can recommend this one: https://www.wiktorzychla.com


I came here to make this same comment. This article barely begins to scratch the surface on this topic. Effective, high volume scraping involves more than just rotating through a list of proxies.


I do web crawling for a living; the method mentioned in the article does not work for most sites.


I've been working on a project that requires scraping from a large number of sites. Do you have any recommendations on better resources?


Do most sites have scraping detection at all? Are they even opposed to scraping?


On a website I'd written we had pseudo-randomly generated URLs to show dynamic content (it was a game, the URL contained parameters). On each page we had this little widget that included five random configurations people might like to try.

A few times our website went down due to the load going >30. Eventually I discovered Google was doing something funky; adding the dynamic domains to the "robots.txt" file fixed the issue. Then some other search engines / scrapers seemed to run into the same issue and started requesting hundreds of thousands of URLs per day (these pages were dynamically generated and took a moderate amount of compute power).

We eventually did have to implement basic anti-scraper rules because it was degrading the user experience.


Careful. Using open proxies could be considered unauthorized access in some jurisdictions. Some of these proxies were installed without the user's permission.

This is my favorite consensual alternative: https://github.com/mattes/rotating-proxy


This solution will not work with HTTPS. There are other alternatives and it's easy to roll your own.


For example?


Some time ago I was looking for an apartment to buy. Sites in my country are bloated and terribly slow. Checking several offers took minutes. Moreover I live in a city where good offers are sold the same day they are published.

I decided to write a scraper to fetch all the data about available apartments in my city. Thanks to that I was able to browse offers at the speed of Tinder. It took me a few hours to write all the stuff, and it probably saved me weeks.

To avoid getting caught I decided to set up Tor on my Raspberry Pi and use it as a proxy. It was extremely easy and reliable. The sites were so slow that I didn't notice a significant performance drop. I didn't have to worry about changing proxies because Tor did it for me.

That said, it is a good idea to change User-Agents and add some random delays between calls. Luckily, in this case that was enough.
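
For anyone curious, a minimal sketch of that setup in Python with requests (the socks5h scheme needs requests[socks] installed; the Pi address, URLs and user agents below are placeholders):

    import random
    import time

    import requests

    # Assumes Tor is running on the Pi with its SOCKS proxy on the default
    # port 9050; replace the address and URLs with your own.
    TOR_PROXY = {
        "http": "socks5h://192.168.1.50:9050",
        "https": "socks5h://192.168.1.50:9050",
    }

    # Small pool of common desktop user agents to rotate through (placeholders).
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) ...",
        "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
    ]

    def fetch(url):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        resp = requests.get(url, proxies=TOR_PROXY, headers=headers, timeout=30)
        resp.raise_for_status()
        return resp.text

    for page in range(1, 4):
        html = fetch("https://example.com/apartments?page=%d" % page)
        # ... parse the listing page here ...
        time.sleep(random.uniform(2, 8))  # random delay between calls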


I had to scrape past an access limit per IP once, and just went with IPv6 addresses, of which IPv6 users have plenty. (This was almost five years ago, so it's conceivable that some services would have wised up a bit and would block the entire IPv6 prefix now.)
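
If anyone wants to try the same trick, a rough sketch of how it can be done with requests (the /64 prefix is a documentation placeholder, and each picked address has to actually be routable to your machine, e.g. assigned to the interface or covered by a Linux AnyIP local route, for the bind to work):

    import random
    from ipaddress import IPv6Network

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.poolmanager import PoolManager

    # Placeholder /64 prefix.
    PREFIX = IPv6Network("2001:db8:1234:5678::/64")

    class SourceAddressAdapter(HTTPAdapter):
        """Bind outgoing connections to a specific local (IPv6) address."""
        def __init__(self, source_address, **kwargs):
            self.source_address = (source_address, 0)
            super().__init__(**kwargs)

        def init_poolmanager(self, connections, maxsize, block=False, **kwargs):
            self.poolmanager = PoolManager(num_pools=connections,
                                           maxsize=maxsize,
                                           block=block,
                                           source_address=self.source_address,
                                           **kwargs)

    def session_with_random_ipv6():
        addr = str(PREFIX[random.getrandbits(64)])  # random host in the /64
        session = requests.Session()
        session.mount("http://", SourceAddressAdapter(addr))
        session.mount("https://", SourceAddressAdapter(addr))
        return session

    print(session_with_random_ipv6().get("https://example.com/").status_code)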


>after introducing proxies my crawl times grew by an order of magnitude from minutes to hours

Yeah, same experience. Right now I use luminati.io datacenter IPs that work ok, anyone know of a cheaper option that works well? Scraping tens of millions of pages a month.


I suppose the economics of it comes into play in a similar vein to Mailchimp: the lower the pricing, the scammier the clients, and the more IPs they lose to blacklists.


Not really. Mail is default bad, you need to build up trust just to get a tiny amount of deliverability. Fetching webpages is default good until you're detected as bad.

The thing is, a lot of scraping goes unnoticed. Maybe you get an extra thousand hits here and there. But every spam campaign gets noticed and results in some percentage of spam complaints from users.


You could use https://oxylabs.io/ or buy some regular VPN accounts and build an HTTP proxy wrapper around them. Extremely cheap and works well. There are a lot of existing projects on GitHub for that too.


oxylabs.io seems more expensive than luminati, minimum of $178/month. Not clear if they charge for bandwidth.

The problem with VPNs is that they're shared and it's hard to get a lot of IPs. Any specific ones where I could get, say, 100 dedicated US IPs for a reasonable price?


Depends on your scale and what you negotiate. The shared nature of VPNs is usually not a huge problem, but it depends on your use case. Most VPN providers have better deals for bigger customers, so you can just buy multiple accounts in bulk, each allowing for example 5 connections, and use that. For the US specifically, providers often have hundreds of servers and VPN configs, so you can build something out of that.


Depending on what you scrape, we can help. We have proxies in 170+ countries; please get in touch at www.speedchecker.xyz. The pricing is cheaper than what you mention above.


I assume simple BFS and DFS traversal behavior shows up brightly in access logs, making detection more likely. Does it help to use Random First Search[1]? (A sketch follows below.) Or is it better to attempt emulating human actors (which requires much more development effort)?

[1] https://bl.ocks.org/mbostock/11161648
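
For what it's worth, randomizing the crawl order is cheap: instead of popping the frontier FIFO (BFS) or LIFO (DFS), pop a random element. A sketch, with the HTTP layer and link extraction left abstract:

    import random

    def crawl_random_first(seed_urls, fetch, extract_links, limit=1000):
        """Visit pages by popping a random URL from the frontier, so the
        access pattern doesn't mirror the site's link structure as obviously
        as strict BFS/DFS does."""
        frontier = list(seed_urls)
        seen = set(seed_urls)
        visited = 0
        while frontier and visited < limit:
            url = frontier.pop(random.randrange(len(frontier)))
            html = fetch(url)                 # your HTTP layer (proxies, delays, ...)
            visited += 1
            for link in extract_links(html):  # your link extraction
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)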


This article is among the lowest quality I've ever seen on HN. The title is hugely exaggerated.

Using a list of proxies and hoping that is enough to scrape _anything_ on the web?


I don't know if scrapy handles this, but I've run into issues with sites fingerprinting my browser. Proxies help, but there are other ways to identify site visitors aside from IP addresses.


Can't you use a service like Panopticlick to 'tune' the browser fingerprint to a majority percentile? Particularly if running in clean VM images.


Rotating user agents?


Minimally effective, or outright detrimental, in my own limited testing.


Some proxies will also include your IP address in the form of an "X-Forwarded-For" header.
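
Easy to check before trusting a proxy: request an echo endpoint like httpbin.org/headers through it and see what arrives (over plain HTTP, since with HTTPS the proxy only tunnels and can't inject headers). The proxy address below is a placeholder:

    import requests

    # Placeholder open proxy to test.
    proxy = {"http": "http://203.0.113.10:8080"}

    resp = requests.get("http://httpbin.org/headers", proxies=proxy, timeout=15)
    headers = resp.json()["headers"]

    if "X-Forwarded-For" in headers:
        print("proxy leaks your IP:", headers["X-Forwarded-For"])
    else:
        print("no X-Forwarded-For seen; check Via/Forwarded too")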


What sorts of real-world and legitimate/ethical use cases are there for wholesale repeated scraping?


Statistics, I'd say, is one of the useful cases for scraping.

Back in 2013, a guy scraped the results of about 150,000 students giving their 10th grade finals of a particular examination board in India. He showed that not only was there no privacy for students' marks, because the roll numbers were all linearly incremented, but there was also mass-scale manipulation of marks going on.

The concept is simple but it's a very interesting read.

https://deedy.quora.com/Hacking-into-the-Indian-Education-Sy...

I was one of the 150,000 kids that gave those exams back in 2013 :)


Fascinating read. I can't find out more about what happened to him after he was accused of 'hacking' the govt systems; do you have any more sources that shed some light on that?



Scraping websites like AliExpress and Ebay for trademark and brand infringement. There are many vendors who sell fake tat on the internet, and there are companies out there who track it down on behalf of the brand owners.

These companies tend to have armies of lawyers who can swat away even the likes of eBay when it comes to justifying web scraping. Nevertheless, the work required is still the same tedium that others deal with: CAPTCHAs; throttling; IP bans; etc.


I'm kind of interested in an answer to this as well.

I know the typical "travel site" or "comparison shop" use case.

There is also the "darn it, I want this" use case.

However, automated, periodic web scraping that mutates (i.e. ticket or reservation grabbing bots) has always felt a bit squicky to me, in the same way DNS squatting does.


A website you like that invites discussion has a very shitty comment system that doesn't notify you of responses to your comments.

Or perhaps you want to monitor mentions of your name, and join the discussion.

Or you want to not lose your past thoughts, because discussions were deep and some may be interesting to re-read in the future.

Or a website is known to allow users to delete their comments, or the website itself bans users and hides all their content from others for flimsy reasons, like "ISIS" in the title, no matter whether it's pro/against/neutral/irrelevant.

Or you want to organize the information differently than the site does.

Or the website is bloated and slow as hell, and you want to use it over GPRS, so you create a lightweight/fast/better organized/filtered mirror.

Or the website doesn't have search, or its search is crap/slow, and it's not indexed in Google, like some IRC logs out there.

Or you want to have content available offline.

...


I've used it to convert publicly available data to a more useful format. A local government agency made the data freely available as a series of thousands of HTML pages, but not in any other form. I didn't need to use any evasive tactics, but it was still wholesale repeated scraping.
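
For anyone facing the same kind of job, it usually boils down to parsing each saved page and writing rows out. A generic sketch with BeautifulSoup, with the selectors and field names made up for illustration:

    import csv
    import glob

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    # Hypothetical layout: each downloaded page holds one record in elements
    # with known CSS classes; adjust the selectors to the real markup.
    with open("records.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["title", "date", "amount"])
        for path in sorted(glob.glob("pages/*.html")):
            with open(path, encoding="utf-8") as f:
                soup = BeautifulSoup(f, "html.parser")
            writer.writerow([
                soup.select_one(".record-title").get_text(strip=True),
                soup.select_one(".record-date").get_text(strip=True),
                soup.select_one(".record-amount").get_text(strip=True),
            ])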


Scrape government data... it’s all free but try to find it in any downloadable and consistent format.


Internet archival initiatives are a legitimate use case, as well as things like sneakernet in Cuba.


I have no idea what tools are available for denying access to web scrapers, which I should know, given that I have built a few websites and know what to do to get pages served quickly. Somehow I missed the memo on how to set your site up not to be scraped. Is there an nginx setting for that?

This could be interesting for people who do scrape sites to know, too: what basic, reasonable measures can one take beyond looking at logs and doing IP bans?


The cheap technique is rate limiting, but there are also "canary pages" (linked invisibly, so they should never be requested by humans) and other techniques that check whether the flow through the site is as expected.
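
A canary page is cheap to implement: link it invisibly (and disallow it in robots.txt so polite bots stay away), then flag any client that requests it. A sketch in Flask; the URL and in-memory storage are made up for illustration:

    from flask import Flask, abort, request

    app = Flask(__name__)
    flagged_ips = set()  # in production this would live in Redis or similar

    @app.before_request
    def block_flagged():
        # Anything that previously hit the canary gets blocked/throttled.
        if request.remote_addr in flagged_ips:
            abort(429)

    # Linked with display:none (or similar) and disallowed in robots.txt,
    # so humans and well-behaved crawlers should never request it.
    @app.route("/catalogue-full-index")
    def canary():
        flagged_ips.add(request.remote_addr)
        abort(404)  # look like an ordinary dead link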


IPs are naive and too broad. Force users to authenticate to see your content. Unauthenticated viewers get to see only a scrambled version. Now you can track usage by username.

Establish a baseline as to what constitutes "normal" browsing.

Any user exceeding a threshold of page requests gets rate limited/banned.

Stopping people from recording public information is unrealistic, but you can certainly make it more of a pain in the ass.
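
The per-username threshold is only a few lines if you keep a sliding window of request timestamps. A minimal in-memory sketch (the numbers are arbitrary, and a real deployment would keep this in Redis or the like):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_REQUESTS = 120  # arbitrary "normal browsing" baseline per window

    _history = defaultdict(deque)  # username -> timestamps of recent requests

    def allow(username):
        now = time.monotonic()
        q = _history[username]
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()                # drop requests outside the window
        if len(q) >= MAX_REQUESTS:
            return False               # rate limit (or escalate to a ban)
        q.append(now)
        return True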


I'm sort of on the other side of this problem, but I think an interesting counter-measure came up in a recent Show HN[0].

IP bans alone are probably not enough to stop a motivated scraper, or to work long-term, so checking whether traffic is coming from a cloud provider and throwing up a captcha would be a significant hurdle for bots, while still letting humans on cloud-hosted VPNs pass through without much trouble (see the sketch after the link).

[0] https://news.ycombinator.com/item?id=16868012
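
The check itself is just matching the client address against published datacenter ranges (AWS, GCP and others publish theirs). A sketch with placeholder CIDRs:

    from ipaddress import ip_address, ip_network

    # Placeholder CIDRs; in practice load the published lists, e.g. AWS's
    # ip-ranges.json, and refresh them periodically.
    CLOUD_RANGES = [ip_network(c) for c in ("203.0.113.0/24", "198.51.100.0/24")]

    def is_cloud_ip(addr):
        ip = ip_address(addr)
        return any(ip in net for net in CLOUD_RANGES)

    def handle_request(client_ip):
        if is_cloud_ip(client_ip):
            return "serve a CAPTCHA first"   # datacenter traffic gets challenged
        return "serve the page normally"     # everyone else passes through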


Captchas are one (drastic) option.


Any image-based captcha where one needs to identify words from an image can be easily broken by algorithms now.

Google would be the leader with reCAPTCHA, but as a human I fail a good number of them. They are a very annoying experience for your users.


> proxy-lists getProxies --sources-white-list="gatherproxy,sockslist"

Is there a reason for using only gatherproxy and sockslist? There are more lists [0] available.

[0] https://github.com/chill117/proxy-lists/blob/4bb8064703b09ee...


I use the paid service Proxy Bonanza ($12/mo for 2 IPs), and I build my own as well using Squid ($5/mo on DigitalOcean).


You pay more per IP than per server!? Why don't you get 3 DO instances then?


I can't speak for that particular service, but in other similar ones I've looked at and used in the past, you're paying for a certain number of IPs _at a given time_.

So, for instance, they have a pool of servers that have 1000 IPs available. Your account allows connections to go out over 2 of those at a time. If something happens (like one gets banned by whatever service you're scraping), you can get a different set of 2 IPs and keep moving.

While you're still paying a relatively high price for what you're consuming (predominantly bandwidth in this case), you're paying for the flexibility.


The paid service has IPs from a lot of other countries not covered by DO :(


Contrary to popular belief, a lot of high-traffic sites can be scraped from a single IP without hitting access limits.


In what way is the article not scummy as hell? You shouldn't waste Jenkins server time with this...


Could this be a solution: run a website and let your visitors do the crawling?


You download some popular iOS game, but in the background it's fetching a list of URLs, crawling them and sending the results back to the mothership.

How would an average Joe even know?


Won’t work. Adversaries can taint your data by sending back fake results.


Could multiple downloads and a "consensus" algorithm solve this problem?


Kinda, but it’s far from trivial as you would need some sort of tolerance when comparing sites with dynamic content.
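
One rough way to do it: normalize each client's copy before hashing (strip scripts, numbers, whitespace), then take a majority vote. A sketch, with the normalization deliberately crude:

    import hashlib
    import re
    from collections import Counter

    def normalize(html):
        # Strip parts that legitimately differ between fetches:
        # scripts, numbers (timestamps, counters) and whitespace.
        html = re.sub(r"<script.*?</script>", "", html, flags=re.S | re.I)
        html = re.sub(r"\d+", "#", html)
        return re.sub(r"\s+", " ", html).strip()

    def consensus(copies, quorum=0.5):
        # Return the copy whose normalized form most clients agree on,
        # or None if no version clears the quorum.
        digests = [hashlib.sha256(normalize(c).encode()).hexdigest() for c in copies]
        digest, votes = Counter(digests).most_common(1)[0]
        if votes / len(copies) <= quorum:
            return None
        return copies[digests.index(digest)]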



