
While I have sympathy for what the scrapers are trying to do in many cases, it bothers me that this doesn't seem to address what happens when badly-behaved scrapers cause, in effect, a DOS on the site.

For the family of sites I'm responsible for, bot traffic comprises a majority of traffic - that is, to a first approximation, the lion's share of our operational costs are from needing to scale to handle the huge amount of bot traffic. Even when it's not as big as a DOS, it doesn't seem right to me that I can't tell people they're not welcome to cause this additional system load.

Or even if there were some standardized way for us to provide a dumb API, just giving them raw data, so we don't incur the additional processing expense of the creature comforts on the page that are designed to make our users happier but that the bots won't notice.




I've told this story before, but it was fun, so I'm sharing it again:

I'll skip the details, but a previous employer dealt with a large, then-new .mil website. Our customers would log into the site to check on the status of their invoices, and each page load would take approximately 1 minute. Seriously. It took about 10 minutes to log in and get to the list of invoices available to be checked, then another minute to look at one of them, then another minute to get out of it and back into the list, and so on.

My job was to write a scraper for that website. It ran all night to fetch data into our DB, and then our website could show the same information to our customers in a matter of milliseconds (or all at once if they wanted one big aggregate report). Our customers loved this. The .mil website's developer hated it, and blamed all sorts of their tech problems on us, although:

- While optimizing, I figured out how to skip lots of intermediate page loads and go directly to the invoices we wanted to see.

- We ran our scraper at night so that it wouldn't interfere with their site during the day.

- Because each of our customers had to check each one of their invoices every day if they wanted to get paid, and we were doing it more efficiently, our total load on their site was lower than the total load our customers would have generated on their own.

Their site kept crashing, and we were their scapegoat. It was great fun when they blamed us in a public meeting, and we responded that we'd actually disabled our crawler for the past week, so the problem was still on their end.

Eventually, they threatened to cut off all our access to the site. We helpfully pointed out that their brand new site wasn't ADA compliant, and we had vision-impaired customers who weren't able to use it. We offered to allow our customers to run the same reports from our website, for free, at no cost to the .mil agency, so that they wouldn't have to rebuild their website from the ground up. They saw it our way and begrudgingly allowed us to keep scraping.


This sounds like exactly what a 'data ownership' law would solve. Allow the user, via some official OAuth flow to their service providers, to authorize even a competitor to access their account so the competitor can bear the burden of interfacing with the API to port their new user's data over; but it should be a one-time-every-year thing, so that the law doesn't require it to the point of forcing companies to scale their service to handle bots like the main OP is experiencing.


I have worked with .mil customers who paid us to scrape and index their website because they didn't have a better way to access their official, public documents.


This is not .mil specific: I've been told of a case where an airline first legally attacked a flight search engine (Skyscanner) for scraping, and then told them to continue when they realized that their own search engine couldn't handle all the traffic, and even if it could, it was more expensive per query than routing via Skyscanner.


Michael Lewis' podcast had an episode recently where the Athena Health people related a (self-promotional) anecdote that, after they had essentially reverse-engineered the insurers' medical billing systems and were marketing it as software to providers, a major insurance company called them up and asked to license information about their own billing system because their internal systems were too complicated to understand.


Yep. Have seen similar things.


Me too but for a private company

In reality it was probably more like org sub group A wanted to leverage org sub group B’s data but they didn’t cooperate


Amazing story :) Though I am left wondering if there are ever any circumstances where minorities don't get used as leverage somehow


Yeah, that was unfortunate. We had precious few Federal-strength levers at our disposal, though, and sometimes you have to go with what's available.


I have sympathy for your operational issues and costs, but isn't this kind of complaint the same as a shopping mall/center complaining about people who go in, check some info, and go out without buying?

I understand that bots have leverage and automation, but so do you, to reach a larger audience. Should we continue to benefit from one side of the leverage, while complaining about the other side?


It's more like a mall complaining that while they're trying to serve 1000 customers, someone has gone and dumped 10000000 roombas throughout the stores which are going around scanning all the price tags.


No. When I say that bots exceed the amount of real traffic, I'm including people "window shopping" on the good side.

My complaint is more like: somebody wants to know the prices of all our products, and we have roughly X products (where X is a very large number). They get X friends to all go into the store almost simultaneously, each writing down the price of the particular product they've been assigned to research. When they do this, there's scant space left in the store for even the browsing kind of customers to walk in. (Of course I exaggerate a bit, but that's the idea.)


I’m sympathetic to the complaints about “rude” scraping behavior but there’s an easy solution. Rather than make people consume boatloads of resources they don’t want (individual page views, images, scripts, etc.) just build good interoperability tools that give the people what they want. In the physical example above that would be a product catalog that’s easily replicated with a CSV product listing or an API.


You don't know why any random scraper is scraping you, and thus you don't know what API to build that would keep them from scraping. Also, it's likely easier for them to continue scraping than to write a bunch of code to integrate with your API, so there's no incentive for them to do so either.


Just advertise the API in the headers. Or better yet, set the buttons/links only to be accessible via a .usetheapi-dammit selector. Lastly, provide an API and a “developers.whatever.com” domain to report issues with the API, get API keys, and pay for more requests. It should be pretty easy to set up, especially if there’s an internal API available behind the frontend already. I’d venture a dev team could devote 20% of their time for a few sprints and have an MVP thing up and running.
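
For illustration, here's a rough sketch of "advertise the API in the headers" using Flask. Everything in it is hypothetical: developers.example.com, /api/v1/products, and the product fields are placeholders, not anyone's real API.

    # Hypothetical sketch: advertise a machine-readable API alongside the HTML site
    # so well-behaved scrapers can discover it instead of parsing pages.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.after_request
    def advertise_api(response):
        # RFC 8631-style link relation pointing at an API description document
        response.headers["Link"] = '<https://developers.example.com/openapi.json>; rel="service-desc"'
        return response

    @app.route("/api/v1/products")
    def products():
        # A plain, cheap-to-serve listing that replaces thousands of page scrapes
        return jsonify([{"sku": "ABC-123", "name": "Example widget", "list_price": 19.99}])

A scraper that bothers to read response headers then has an obvious, cheaper path to the same data.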


For the 2nd part: I have done scraping and would always opt for an API, if the price is reasonable, over paying nosebleed amounts for residential proxies.


I think lots of website owners know exactly where the value in their content exists. Whether or not they want to share that in a convenient way, especially to competitors etc is another story.

That said if scraping is inevitable, it’s immensely wasteful effort to both the scraper and the content owner that’s often avoidable.


Writing a scraper for a webpage is typically far more development effort than writing an API wrapper


Yes, but the scraper in this context is already built. A bird in hand and all that.


In this case, yes, obviously. But as far as "it's likely easier for them to continue scraping than write a bunch of code to integrate with your API", that presupposes no existing integration.


Yes, exactly. Nobody is standing up and saying "we're the ones doing this, and here's what we wish you'd put in an API".

Also, I'm a big Jenson Button fan.


No, because those are people going to the mall. Not robots 100x the quantity of real people.


Unlike the mall a website can scale up to serve more users at relatively low cost. Also those robots may bring more people to the website. Potentially a lot more even.


Reading your comment my impression is that this is either an exaggeration or a very unique type of site if bots make up the majority of traffic to the point that scrapers are anywhere near the primary load factor.

Would someone let me know if I’m just plain wrong in this assumption? I’ve run many types of sites and scrapers have never been anywhere close to the main source of traffic or even particularly noticeable compared to regular users.

Even considering a very commonly scraped site like LinkedIn or Craigslist - for any site of any magnitude like this public pages are going to be cached so additional scrapers are going to have negligible impact. And a rate limit is probably one line of config.

I’m not saying you are necessarily wrong, but I can’t imagine a scenario that you’re describing and would love to hear of one.


Bots are an incredibly large source of traffic on the non-profit academic cultural heritage site I work on. It gets very little human traffic compared to a successful for-profit site.

But the bots on my site -- at least the obvious ones that lead me to say they are a large source of traffic -- are all well-behaved, with good clear user-agents, and they respect robots.txt, so I could keep them out if I wanted.

I haven't wanted to because, why? I have modified the robots.txt to keep the bots out of some mindless loops of trying every combination of search criteria to access a combinatorial expansion of every possible search results page. That was doing neither of us any good and was exceeding the capacity of our Papertrail plan (which is what brought it to our attention). Every actual data page is available in a sitemap that's there for them if they want it; they don't need to tree-search every possible search results page!

In some cases I've done extra work to change URL patterns so I could keep them out of such useless things with a robots.txt more easily, without banning them altogether. Because... why not? The more exposure the better, all our info is public. We like our pretty good organic Google SEO, and while I don't think anyone else is seriously competing with google, I don't want to privilege google and block them out either.
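
As a concrete (made-up) illustration of that kind of robots.txt change: the path and domain below are placeholders for whatever the site's real URL scheme is.

    # Placeholder paths; the real ones depend on the site's URL scheme
    User-agent: *
    Disallow: /search

    Sitemap: https://example.org/sitemap.xml

Crawlers that honor robots.txt stay out of the combinatorial search pages but can still reach every data page through the sitemap.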


As another example, I used to work on a site that was roughly hotel stays. A regular person might search where to stay for a small set of areas and dates, usually with the same number of people.

Bots would routinely try to scrape pricing for every combination of {property, arrival_date, departure_date, num_guests} in the next several years. The load to serve this would have been vastly higher than real customers, but our frontend was mostly pretty good at filtering them out.

We also served some legitimate partners that wanted basically the same thing via an API... and the load was in fact enormous. But at least then it was a real partner with some kind of business case that would ultimately benefit us, and we could make some attempt to be smart about what they asked for.


It's a B2B ecommerce site. Our annual revenue from the site would put us on the list of top 100 ecommerce sites [1] (we're not listed because ecommerce isn't the only business we do). With that much potential revenue to steal from us, perhaps the stakes are higher.

As described elsewhere, rate limiting doesn't work. The bots come from hundreds to thousands of separate IPs simultaneously, cooperating in a distributed fashion. Any one of them is within reasonable behavioral ranges.

Also, caching, even through a CDN doesn't help. As a B2B site, all our pricing is custom as negotiated with each customer. (What's ironic is that this means that the pricing data that the bots are scraping isn't even representative - it only shows what we offer walkup, non-contract customers.) And because the pricing is dynamic, it also means that the scraping to get these prices is one of the more computationally expensive activities they could do.

To be fair, there is some low-hanging fruit in blocking many of them. Like, it's easy to detect those that are flooding from a single address, or sending SQL injection attacks, or just plain coming from Russia. I assume those are just the script kiddies and stuff. The problem is that it still leaves a whole lot of bad actors once these are skimmed off the top.

[1] https://en.wikipedia.org/wiki/List_of_largest_Internet_compa...


If the queries are expensive because of custom negotiated prices and these bots are scraping the walkup prices, can you not just shard out the walkup prices and cache them?

Being on that list puts the company's revenue at over $1 billion USD. At a certain point it becomes cheaper and easier to fix the system to handle the load.


Indeed - this is one of the strategies we're considering.


> As a B2B site, all our pricing is custom as negotiated with each customer ... the pricing is dynamic

So your company is deliberately trying to frustrate the market, and doesn't like the result of third parties attempting to help market efficiency? It seems like this is the exact kind of scraping that we generally want more of! I'm sorry about your personal technical predicament, but it doesn't sound like your perspective is really coming from the moral high ground here.


> So your company is deliberately trying to frustrate the market, and doesn't like the result of third parties attempting to help market efficiency?

No. First, we as a middleman reseller MUST provide custom prices, at least to a certain degree. Consider that it's typical for manufacturers to offer different prices to, e.g., schools. This is reflected by offering to us (the middleman) a lower cost, which we pass on to applicable customers. Further, the costs and prices vary similarly from one country to another. Less obviously, many manufacturers (e.g., Microsoft, Adobe, HP) offer licensing programs that entitle those enrolled to purchase their products at a lower cost. So if nothing else, the business terms of the manufacturers whose products we sell necessitate a certain degree of custom pricing.

Second, it seems strange to characterize as "frustrating the market" what we're doing when we cooperate with customers who want to structure their expenses in different ways - say, getting a better deal on expensive products that can be classified as "capital expenses" while allowing us to recover some of that revenue by charging them somewhat more for the products that they'd classify as operational expenses.


You're just describing a cooperative effort to obfuscate pricing and frustrate a market. So sure, your company could be blameless and the manufacturers are solely responsible for undermining price signals. I've still described the overall dynamic that your company is participating in. It's effectively based around closed world assumptions of information control, and so it's not surprising that it conflicts with open world ethos like web scraping.

> it seems strange to characterize as "frustrating the market" what we're doing when we cooperate with customers who want to structure their expenses in different ways

I'm characterizing the overall dynamic of keeping market price discovery from working as effectively. How you may be helping customers in other ways is irrelevant.


Holy moly doesn’t that sound more like tax evasion or fraudulent accounting than financial planning?

They’re trying to pay less tax by convincing your company to put a different price on products they buy based on their tax strategy.

It sounds illegal.


The majority of tax and other civil laws are basically full of things that are illegal/problematic if you do them individually, but if you can find someone else to cooperate with then it becomes fine.


This makes a lot of sense to me.

What do you think about their other assertion that the search page is getting a gigantic number of hits that a/ cannot be cached and b/ cannot be rate limited because they're using a botnet?


I'm guessing the bots are hitting the search page because it contains the most amount of information per hit, and that the caching problems are exactly due to these dynamically generated prices or other such nonsense. After all, the fundamental goal of scraping is to straightforwardly enumerate the entire dataset.

The scale of the botnet sounds like an awfully determined and entrenched adversary, likely arising because this company has been frustrating the market for quite some time. A good faith API wouldn't make the bots change behavior tomorrow, but they certainly would if there were breaking page format changes containing a comment linking to the API.


Thanks for the explanation!

The thing I still don’t understand is why (edit: server, not CDN) caching doesn’t work - you have to identify customers somehow, and provide everyone else a cached response at the server level. For that matter, rate limit non-customers also.


The pages getting most of the bot action are search and product details.

Search results obviously can't be cached, as they're completely ad hoc.

Product details can't be cached either, or more precisely, there are parts of each product page that can't be cached because

* different customers have different products in the catalog

* different customers have different prices for a given product

* different products have customer-specific aliases

* there's a huge number of products (low millions) and many thousands of distinct catalogs (many customers have effectively identical catalogs, and we've already got logic that collapses those in the backend)

* prices are also based on costs from upstream suppliers, which are themselves changing dynamically.

Putting all this together, the number of times a given [product, customer] tuple will be requested within a reasonable cache TTL isn't very much greater than 1. The exception is walk-up pricing for non-contract users, and we've been talking about how we might optimize that particular case.
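
One possible shape of that optimization, as a sketch only: cache just the walk-up (non-contract) prices, which every anonymous visitor shares, and leave contract pricing fully dynamic. compute_price and the customer model here are placeholders, not the real system.

    # Sketch: cache only the shared walk-up prices; contract pricing stays uncached.
    from cachetools import TTLCache, cached

    walkup_cache = TTLCache(maxsize=100_000, ttl=300)  # 5-minute TTL, illustrative

    @cached(walkup_cache)
    def walkup_price(product_id):
        return compute_price(product_id, customer=None)  # placeholder pricing call

    def price_for(product_id, customer):
        if customer is None or not customer.has_contract:  # hypothetical customer model
            return walkup_price(product_id)
        return compute_price(product_id, customer)  # per-contract path, uncached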


Ahhhhh, search results makes a whole lot more sense! Thank you. Search can't be cached and the people who want to use your search functionality as a high availability API endpoint use different IP addresses to get around rate limiting.

The low millions of products also makes some sense I suppose, but it's hard to imagine why this doesn't simply require a login for customers to see the products if they're unique to each customer.

On the other hand, I suspect the price this company is paying to mitigate scrapers is akin to a drop of water in the ocean, no? As a percent of the development budget it might seem high and therefore seem big to the developer, but I suspect the CEO of the company doesn't even know that scrapers are scraping the site. Maybe I'm wrong.

Thanks again for the multiple explanations in any case, it opened my eyes to a way scrapers could be problematic that I hadn't thought about.


Good explanation, thank you.

I would think that artificially slowing down search results can discourage some of the bots. Humans don't care much if a search finishes in 5 seconds and not 2, AFAIK.

Especially on backends where each request is relatively cheap operations-wise (especially when each request is a green thread, as in Erlang/Elixir), I think you can score a win against the bots.

Have you attempted something like this?


This is really interesting, but they’re using a network of bots already - even if you put a spinner that makes them wait a couple of seconds, wouldn't the scrapers just make more parallel requests?


Yes, they absolutely will, but that's the strength of certain runtimes: green threads (i.e. HTTP request/response sessions in this case) cost almost nothing so you can hold onto 5-10 million of them on a VPS with 16-32 GB RAM, easily.

I haven't had to defend against extensive bot scraping operations -- only against simpler ones -- but I've used such a practice in my admittedly much more limited experience, and it was actually successful. Not that the bots gave up, but their authors realized they couldn't accelerate the process of scraping data, so they dialed down their instances, likely to save money on their own hosting bills. Win-win.

Apologies, I don't mean to lecture you, just sharing a small piece of experience. Granted, that's very specific to the backend tech, but what the heck, maybe you'll find the tidbit valuable.


If you've got a site with a lot of pages, bot traffic can get pretty big. Things like a shopping site with a large number of products, a travel site with pages for hotels and things to do, something to do with movies or tv shows and actors, basically anything with a large catalog will drive a lot of bot traffic.

It's been forever since I worked at Yahoo Travel, but bot traffic was significant then (I'd guess roughly 5-10% of the traffic was declared bots), and Yandex and Baidu weren't aggressive crawlers yet, so I wouldn't be terribly surprised if a site with a large catalog that wasn't top 3 with humans would have a majority of its traffic as bots. For the most part, we didn't have availability issues as a result of bot traffic, but every once in a while a bot would really ramp up traffic and cause issues, and we would have to carefully design our list interfaces to avoid bots crawling through a lot of different views of the same list (while also trying to make sure they saw everything in the list). Humans may very well want all the narrowing options, but it's not really helpful to expose to Google hotels near Las Vegas starting with the letter M that don't have pools.


I appreciate the response but I’m still perplexed. It’s not about the percent of traffic if that traffic is cached. And rate limiting also prevents any problems. It just doesn’t seem plausible that scrapers are going to DDoS a site per the original comment. I suppose you’d get bad traffic reports and other problems like log noise, but claiming it to be a general form of DDoS really does sound like hyperbole.


> a very unique type of site if bots make up the majority of traffic

Pretty much Twitter and the majority of such websites.


Do you really believe bots make up a significant amount of Twitter’s operating cost? Like I said they’re just accessing cached tweets and are rate limited. How can the bot usage possibly be more than a small part of twitter’s operating cost?


Bandwidth isn't free.


I didn’t say it is free, I said that the bandwidth for bots is negligible compared to that of regular users.


Negligible isn’t free either.


I'm sympathetic to this. I built a search engine for my senior project, and my half-baked scraper ended up taking down Duke Law's site during their registration period. I ended up getting a not-so-kindly-worded email from them, but honestly this wasn't an especially hard problem to solve. All of my traffic was coming from a cluster on my university's subnet; it wouldn't have been that hard for them to apply IP address timeouts when my crawler started scraping thousands of pages a second on their site. Not to victim blame, this was totally my fault, but I was a bit surprised that they hadn't experienced this before with how much automated scraping goes on.


I’m honestly more interested in bot detection than anything else at this point.

It seems like it should be perfectly legal to detect and then hold the connection open for a long period of time without giving a useful response. Or even send highly compressed gzip responses designed to fill their drives.

Legal or not, I can’t see any good reason that we can’t make it painful.
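
As a toy illustration of the "hold the connection open" idea, here's a minimal asyncio tarpit. is_suspected_bot is a stand-in for whatever detection you actually have, and the heuristic shown is deliberately naive.

    # Toy tarpit: once a client looks like a bot, trickle out harmless bytes
    # very slowly so its connection stays tied up at almost no cost to us.
    import asyncio

    def is_suspected_bot(raw_request: bytes) -> bool:
        return b"python-requests" in raw_request  # placeholder heuristic only

    async def handle(reader, writer):
        raw = await reader.read(4096)  # request line plus headers, roughly
        if is_suspected_bot(raw):
            writer.write(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
            for _ in range(600):  # ~10 minutes of trickled padding
                writer.write(b"<!-- -->\n")
                await writer.drain()
                await asyncio.sleep(1)
        else:
            writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello")
        await writer.drain()
        writer.close()

    async def main():
        server = await asyncio.start_server(handle, "0.0.0.0", 8080)
        async with server:
            await server.serve_forever()

    # asyncio.run(main())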


Make it painful if they abuse the site.

We all benefit from open data. Polite scrapers are just fine and a natural part of the web ecosystem.

Google has been scraping the web all day every day for decades now.


The court just ruled that scraping on its own isn't a violation of the CFAA. Meaning it doesn't count as the crime of "accessing a protected computer without authorization or exceeding authorized access and obtaining information".

However, presumably all the other provisions of the CFAA still apply, so if your scraping damages the functioning of an internet service, then you still would have committed the crime of "Damaging a protected computer by intentional access". Negligently damaging a protected computer is punishable by 1 year in prison on the first offense. Recklessly damaging a protected computer is punishable by 1-5 years on the first offense. And intentionally damaging a protected computer is punishable by 1-10 years for the first offense. These penalties can go up to 20 years for repeated offenses.


As someone that has been on the other end, I can tell you devs don’t want to use selenium or inspect requests to reverse engineer your UI and wish there were more clean APIs.

Have you tried making your UI more challenging to scrape and adding a simple API that requires free registration?


So much this!

I work in e-commerce and (needless to say) we scrape a lot of websites. Due to our growth and the increase in scrapers we require, I’ve been writing a proposal to a higher-up to talk to our biggest competitors about all setting up a public API that batches the data into a smaller number of requests.

It would save everyone quite some traffic and effort.


That's what rate-limiting is for. Don't be so aggressive with it that you start hitting the faster visitors, however, or they may soon go somewhere else (has happened to me a few times).


Rate limiting isn't an effective defense for us.

First, as a B2B site, many of our users from a given customer (and with huge customers, that can be many) are coming through the same proxy server, effectively presenting to us as the same IP.

Second, the bots years back became much more sophisticated than a single, or even relatively finite, IP. Today they work across AWS, Azure, GCP, and other cloud services. So the IPs that they're assigned today will be different tomorrow. Worse, the IPs that they're assigned today may well be used by a real customer tomorrow.


If your users are logged in you can rate limit by user instead of by IP. This mostly solves this problem. Generally what I do is for logged in users I rate limit by user, then for not-logged-in users I rate limit aggressively by IP. If they hit the limit the message lets them know that they can get around it by logging in. Of course this depends on user accounts having some sort of cost to create. I've never actually implemented it but considered having only users who have made at least one purchase bypass the IP limit or otherwise get a bigger rate limit.
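
A minimal sketch of that split, assuming you can see both the authenticated user (if any) and the client IP; the window and the limit numbers are arbitrary illustration values.

    # Sliding-window limiter: generous per-user budget for logged-in users,
    # much stricter per-IP budget for anonymous traffic.
    import time
    from collections import defaultdict, deque

    WINDOW = 60                       # seconds
    LIMITS = {"user": 600, "ip": 60}  # requests per window (illustrative)
    hits = defaultdict(deque)

    def allow_request(user_id=None, ip=None):
        kind, key = ("user", user_id) if user_id else ("ip", ip)
        now = time.monotonic()
        q = hits[(kind, key)]
        while q and now - q[0] > WINDOW:  # drop hits outside the window
            q.popleft()
        if len(q) >= LIMITS[kind]:
            return False  # reject with 429; the message can suggest logging in
        q.append(now)
        return True

Usage would be allow_request(user_id="alice") for a logged-in request versus allow_request(ip="203.0.113.7") for an anonymous one.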


Have you tried including the recaptcha v3 library and looking at the distribution of scores? -- https://developers.google.com/recaptcha/docs/v3 -- "reCAPTCHA v3 returns a score for each request without user friction"

It obviously depends on how motivated the scrapers are (i.e. whether their headless browsers are actually headless, and/or doing everything they can to not appear headless, whether Google has caught on to their latest tricks etc. etc.) but it would at least be interesting to look at the score distribution and then see whether you can cut off or slow down the < 0.3 scoring requests (or redirect them to your API docs)
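
For reference, checking the v3 score server-side looks roughly like this; the secret key and the 0.3 cut-off are placeholders, and the interesting first step is just logging the score distribution before acting on it.

    # Sketch: verify a reCAPTCHA v3 token and get its score back.
    import requests

    RECAPTCHA_SECRET = "your-secret-key"  # placeholder

    def recaptcha_score(token, remote_ip=None):
        resp = requests.post(
            "https://www.google.com/recaptcha/api/siteverify",
            data={"secret": RECAPTCHA_SECRET, "response": token, "remoteip": remote_ip},
            timeout=5,
        )
        result = resp.json()
        return result.get("score") if result.get("success") else None

    # score = recaptcha_score(token_from_client)
    # if score is not None and score < 0.3:
    #     ...slow the request down or redirect to the API docs...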


For web scraping specifically, I’ve developed key parts of commercial systems to automatically bypass reCAPTCHA, Arkose Labs (Fun Captcha), etc.

If someone dedicated themselves to it, there’s a lot more that these solutions could be doing to distinguish between humans and bots, but it requires true specialized talent and larger expenses.

Also, for a handful of the companies which make the most popular captcha solutions, I don’t think the incentives align properly to fully segregate human and bot traffic at this time.

I think we’re still very much still picking at the very lowest hanging fruit, both for anti-bot countermeasures and anti-anti-bot (counter-countermeasures).

Personally I believe this will finally accelerate once AI’s can play computer games via a camera, keyboard, and mouse. And when successors GPT-3 / PaLM can participate well in niche discussion forums like HackerNews or the Discord server for Rust.

Until then it’s mainly a cost filter or confidence modification. As long as enough bots are blocked so that the ones which remain are technically competent enough to not stress the servers, most companies don’t care. And as long as the businesses deploying reCAPTCHA are reasonably confident that most of the views they get are humans (even if that belief is false), Google doesn’t have a strong incentive to improve the system.

Reddit doesn’t seem to care much either. As long as the bots which participate are “good enough”, it drives engagement metrics and increases revenue.


Scrapers can pay a commercial service to Mechanical Turk their way through reCAPTCHA. It makes a meaningful difference to scraping costs at scale, but sometimes it's still profitable.


I'd pay for a service to do this for me as an ordinary end user, so I never have to solve a captcha myself again.


You would still have to wait for each captcha to be solved, which might be more frustrating than doing it yourself.


It sounds great, until you have Chinese customers. That’s when you’ll figure out Recaptcha just doesn’t really work in China, and have to begrudgingly ditch it altogether…


Do you know if there's a way to rate limit logged-in users differently than visitors of a site?


rate limiting can be a double-edged sword; you can be better off giving a scraper the highest bandwidth so they are gone sooner. Otherwise, something like making a zip or other sort of compilation of the site available may be an option.

just what kind of scraper you have is a concern.

does scraper just want a bunch of stock images;

or does scraper have FOMO on web trinkets;

or does scraper want to mirror/impersonate your site.

the last option is the most concerning because then;

scraper is mirroring bcz your site is cool and local UI/UX is wanted;

or is scraper phishing smishing or otherwise duping your users.


Yeah, good points to consider. I think the sites that would be scraped the most would be those where the data is regularly and reliably up-to-date, and there's a large volume of it at that - so not just one scraper but many different parties may try to scrape every page on a daily or weekly basis.

I feel that ruling should have the caveat that if a fairly priced paid API exists for getting the publicly listed data, then the scrapers must legally use that (say, priced no more than 5% above the CPU/bandwidth/etc. cost of the scraping behaviour); ideally also a rule that, at minimum, there be a delay if they are republishing that data without your permission, so that you, as the platform/source/reason for the data being up-to-date, aren't harmed too - otherwise regular visitors might start going to the competitor publishing the data, which could kill the source platform over time.


Absolutely, you just have to check the session cookie.


nginx can be set up to do that using the session cookie.
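
A hedged sketch of that nginx setup, assuming the session cookie is named sessionid (adjust to whatever your app actually sets); these directives go inside the http {} block.

    # Key the rate limit on the session cookie; anonymous clients fall back to IP.
    map $cookie_sessionid $rl_key {
        ""      $binary_remote_addr;
        default $cookie_sessionid;
    }
    limit_req_zone $rl_key zone=per_user:10m rate=10r/s;

    server {
        location / {
            limit_req zone=per_user burst=20 nodelay;
            proxy_pass http://backend;
        }
    }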


The problem with many sites (and LinkedIn in particular) is that they whitelist a bunch of specific websites, presumably based on the business interests, but disallow everyone else in their robots.txt. You should either allow all scrapers that respect certain load requirements or allow none. Anything that Google is allowed to see and include in their search results should be fair game.

Here's the end of LinkedIn's robots.txt:

    User-agent: *
    Disallow: /

    # Notice: If you would like to crawl LinkedIn,
    # please email whitelist-crawl@linkedin.com to apply
    # for white listing.


And this is what the HiQ case hinged on. LinkedIn were essentially selectively applying the computer fraud and abuse act based on their business interests - that was never going to sit well with judges.


Btw, LinkedIn does have an API for things like Sales Navigator. It uses some weird partnership program (SNAP) to get into it, and it starts at (I think) $1,500/year per user. Still pretty cheap though; I think you’d get the value out of that quite quickly for a >300 person company.


> Even when it's not as big as a DOS, it doesn't seem right to me that I can't tell people they're not welcome to cause this additional system load.

You can tell them. You just can't prosecute them if they don't obey.


> While I have sympathy for what the scrapers are trying to do in many cases, it bothers me that this doesn't seem to address what happens when badly-behaved scrapers cause, in effect, a DOS on the site.

Like when Aaron Swartz spent months hammering JSTOR, causing it to become so slow it was almost unusable, and despite knowing that he was causing widespread problems (including the eventual banning of MIT's entire IP range) actually worked to add additional laptops and improve his scraping speed... all the while going out of his way to subvert MIT's netops group trying to figure out where he was on the network.

JSTOR, by the way, is a non-profit that provides aggregate access to their cataloged archive of journals, for schools and libraries to access journals they would otherwise never be able to afford. In many cases, free access.


The effect on JSTOR's revenue would have been negligible.

I'm surprised to see someone so cold and unfeeling about Aaron Swartz. Especially considering the massive injustice with regards to application of the law and sentencing.

> Federal prosecutors, led by Carmen Ortiz, later charged him with two counts of wire fraud and eleven violations of the Computer Fraud and Abuse Act, carrying a cumulative maximum penalty of $1 million in fines, 35 years in prison, asset forfeiture, restitution, and supervised release.


You probably can, at the protocol level, with JSON-LD or other rich-data packages that generate XML or standardized JSON endpoints. I did this for an open data portal, and this is something most G7 governments do with their federal open data portals, using off-the-shelf packages (that are worth researching a bit first, obviously), particularly in the Python and Flask world. We were still getting hammered by China at our Taiwanese-language subdomain, but that was a different concern.


I don't know what kind of data you serve up, but perhaps you could serve low-quality or inaccurate content from addresses that are guessed from your API. I.e., endpoints not reachable in the normal functioning of your web app should return reasonable junk. A mixture of accurate and inaccurate data becomes worthless for bots, and worthless data is not worth scraping. Just an idea!


But don't you already have countermeasures to deter DoS attacks or malicious human users (what if someone pays or convinces people to open your site and press F5 repeatedly)?

If not, you should, and the badly-behaved scrapers are actually a good wake-up call.


What have you done to protect the site? Most automation libraries are detectable (Puppeteer, even with extra-stealth, Selenium, Playwright...).

The only library that I know that is more or less undetectable is used by a just a few hundred people...


Colour me interested in this library! :)


Curious, what library is that?


When the original ruling in favor of HiQ came out, it still allowed for LinkedIn to block certain kinds of malicious scraping. LinkedIn had been specifically blocking HiQ, and was ordered to stop doing that.


Implement TLS fingerprinting on your server. People can still fake that if they are determined, but it should cut the abuse way down.
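
For the curious, one common approach is a JA3-style hash of the ClientHello parameters. The sketch below only illustrates the hashing step, with made-up field values and a placeholder blocklist; real deployments extract these fields at the edge, e.g. in the TLS-terminating proxy.

    # Illustration of the JA3 idea: hash the ClientHello parameters so clients
    # built on common scraping libraries cluster into a few well-known values.
    import hashlib

    def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
        fields = [
            str(tls_version),
            "-".join(map(str, ciphers)),
            "-".join(map(str, extensions)),
            "-".join(map(str, curves)),
            "-".join(map(str, point_formats)),
        ]
        return hashlib.md5(",".join(fields).encode()).hexdigest()

    KNOWN_BAD = {"0000000000000000000000000000dead"}  # placeholder hash values
    fp = ja3_fingerprint(771, [4865, 4866, 4867], [0, 11, 10], [29, 23, 24], [0])
    if fp in KNOWN_BAD:
        print("likely automated client:", fp)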


No, nor can we just do it by IP. The bots are MUCH more sophisticated than that. More often than not, it's a cooperating distributed net of hundreds of bots, coming from multiple AWS, Azure, and GCP addresses. So they can pop up anywhere, and that IP could wind up being a real customer next week. And they're only recognizable as a botnet with sophisticated logic looking at the gestalt of web logs.

We do use a 3rd party service to help with this - but that on its own is imposing a 5- to 6-digit annual expense on our business.


> Our annual revenue from the site would put us on the list of top 100 ecommerce sites

and you're sweating a 5- to 6-digit annual expense?

> all our pricing is custom as negotiated with each customer.

> there's a huge number of products (low millions) and many thousands of distinct catalogs

Surely the business model where every customer has individually negotiated pricing costs a whole lot to implement; further, it gives each customer plenty of incentive to attempt to learn what other customers are paying for the same products. Given the comparatively tiny costs of fighting bots, your complaints in these threads seem pretty ridiculous.


> More often than not, it's a cooperating distributed net of hundreds of bots, coming from multiple AWS, Azure, and GCP addresses.

those are only the low-effort/cheap ones; the more advanced scraping makes use of residential proxies (people's pwned home routers, or PCs where they've installed shady VPN software that turns them into a proxy) to appear to come from legitimate residential last-mile broadband netblocks belonging to Comcast, Verizon, etc.

google "residential proxies for sale" for the tip of an iceberg of a bunch of shady grey market shit.


There's a lot of metadata available for IPs, and that metadata can be used to aggregate clusters of IPs, and that in turn can be datamined for trending activity, which can be used to sift out abusive activity from normal browsing.

If you're dropping 6 figs annually on this and it's still frustrating, I'd be interested in talking with you. I built an abuse prediction system out of this approach for a small company a few years back, it worked well and it'd be cool to revisit the problem.


Have you considered setting up an API to allow the bots to get what they want without hammering your front-end servers?


Yes. And if I could get the perpetrators to raise their hands so I could work out an API for them, it would be the path of least resistance. But they take great pains to be anonymous, although I know from circumstantial evidence that at least a good chunk of it is various competitors (or services acting on behalf of competitors) scraping price data.

IANAL, but I also wonder if, given that I'd be designing something specifically for competitors to query our prices in order to adjust their own prices, this would constitute some form of illegal collusion.


What seems to actually work is to identify the bots and instead of giving up your hand by blocking them, to quietly poison the data. Critically, it needs to be subtle enough that it's not immediately obvious the data is manipulated. It should look like a plausible response, only with some random changes.
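
A tiny sketch of what that could look like, assuming you already have some bot classification upstream. The ±7% jitter is an arbitrary example, and seeding from the session keeps the wrong numbers self-consistent so they still look plausible.

    # Quietly perturb prices for sessions flagged as bots; real users see real data.
    import hashlib, random

    def displayed_price(real_price, session_id, flagged_as_bot):
        if not flagged_as_bot:
            return real_price
        # Seed from the session so the same bot sees consistent (but wrong) numbers
        seed = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
        jitter = random.Random(seed).uniform(-0.07, 0.07)  # +/- 7%, arbitrary
        return round(real_price * (1 + jitter), 2)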


What makes you think they would use it?


It's in their interest. I've scraped a lot, and it's not easy to build a reliable process on. Why parse a human interface when there's an application interface available?


TLS fingerprinting is one of the ways minority browsers and OS setups get unfairly excluded. I have an intense hatred of Cloudflare for popularising that. Yes, there are ways around it, but I still don't think I should have to fight to use the user-agent I want.


I don't want to say tough cookies, but if OP's characterization isn't hyperbole ("the lion's share of our operational costs are from needing to scale to handle the huge amount of bot traffic"), then it can be a situation where you have to choose between 1) cutting off a huge chunk of bots, upsetting a tiny percent of users, and improving the service for everyone else, or 2) simply not providing the service at all due to costs.


I don't think it's likely to cause issues if implemented properly. Realistically you can't really build a list of "good" TLS fingerprints because there are a lot of different browser/device combinations, so in my experience most sites usually just block "bad" ones known to belong to popular request libraries and such.


Seems like you could sue the scraper for that, then? If they cause you damages by their unapproved actions, you have a tort claim.


Yes, I think working to accommodate the non-humans along with the humans is the right approach here.

Scrapers have a limited range of IPs, so rate-limiting them and stalling (or dropping) request responses is one way to deal with the DoS scenario.

For my sites, I have placed the majority behind HTTP Basic Auth...


You realistically can't. There are services like [0][1] that mean any IP could be a scraper node.

[0] https://brightdata.com/proxy-types/residential-proxies [1] https://oxylabs.io/products/residential-proxy-pool


> How does Bright Data acquire its residential IPs?

> Bright Data has built a unique consumer IP model by which all involved parties are fairly compensated for their voluntary participation. App owners install a unique Software Development Kit (SDK) to their applications and receive monthly remuneration based on the number of users who opt-in. App users can voluntarily opt-in and are compensated through an ad-free user experience or enjoy an upgraded version of the app they are using for free. These consumers or ‘peers’ serve as the basis of our network and can opt-out at any time. This model has brought into existence an unrivaled, first of its kind, ethically sound, and compliant network of real consumers.

I don't know how they can say with a straight face that this is 'ethically sound'. They have, essentially, created a botnet, but apparently because it's "AdTech" and the user "opts-in" (read: they click on random buttons until they hit one that makes the banner/ad go away) it's suddenly not malware.


NordVPN (Tesonet) has another business doing the same thing. They sell the IP addresses/bandwidth of their NordVPN customers to anyone who needs bulk mobile or residential IP addresses. That's right, installing their VPN software adds your IP address to a pool that NordVPN then resells. Xfinity/Comcast sort of pioneered this with their wifi routers that automatically expose an isolated wifi network called 'xfinity' (IIRC) whether you agree or not.


The Comcast access points do, at least, have the saving grace that they're on a separate network segment from the customer's hardware, and don't share an IP address or bandwidth/traffic limit with the customer.

Tesonet and other similar services (e.g. Luminati) don't have that. As far as anyone -- including web services, the ISP, or law enforcement -- are concerned, their traffic is the subscriber's traffic.


> They sell the IP addresses/bandwidth of their NordVPN customers to anyone who needs bulk mobile or residential IP addresses

I would be interested in a reference for this if you have one.


As others have said, (A) there are plenty of countermeasures you can take, but also (B) you are frustrated that you are providing something free to the public and then annoyed that the "wrong" customers are using your product and costing you money. I'm sorry, but this is a failure of your business model.

If we were to analogize this to a non-internet example: (1) A company throws a free concert/event and believes they will make money by alcohol sales. (2) A bunch of sober/non-drinking folks attend the concert but only drink water (3) Company blames the concert attendees for "taking advantage" of them when they really just had poor company policies and a bad business model.

Put things behind authentication and authorization. Add a paywall. Implement DDoS detection and banning approaches for scrapers. Etc etc.

But don't make something public and then get mad at THE PUBLIC for using it. Behind that machine is a person, who happens to be a member of the public.


Alternatively it could be seen that your juice company offers free samples. Then somebody abuses free and takes gallons home with them to bottle and sell as their own.

That’s what it feels like when someone is scraping your network to bootstrap a competitor.


Again, what you call abuse of free samples, someone else calls a savvy strategy tailored to your poorly crafted business plan. Have ways to limit the free samples or else it's your fault...


There are certain classes of websites where the proposed solutions aren’t a great fit. For example, a shopping site hiding their catalog behind paywalls or authentication would raise barriers to entry such that a lot of genuine customers would be lost. I don’t think the business model is in general to be blamed here and it’s ok to acknowledge the unfortunate overhead and costs added by site usage patterns (e.g. scraping) that are counter to the expectation.


Have you considered using a cache service like cloudflare?


You could ban their IPs?


IP bans are equivalent to residential door locks. They’re only deterring the most trivial attacks.

In school I needed to scrape a few hundred thousand pages of a proteomics database website. For some reason you had to view each entry one at a time. There was IP throttling which banned you if you made requests too quickly. But slowing the script to 1 request per second would have taken days to scrape the site. So I paid <$5 for a list of 500 proxy servers and distributed it, completing the task in under half an hour.


I agree it’s not perfect. It’s also significantly better than nothing.


It's also completely insufficient despite being better. There are so many scraper services that it's just a matter of paying a small amount of money to make use of tens of thousands of IPs spread across ISPs and countries.


Can you share where you got such a nice deal on 500 proxies? TIA


Using proxies to hide your identity to get around a denial of access seems to get awfully close to violating the Computer Fraud and Abuse Act (in the USA, at least).

I’m surprised your school was okay with it.


Don't worry, I don't live in the USA. Thanks for your concern though.


Have you considered serving a proof-of-work challenge to clients accessing your website? Minimal cost on legit users, but large costs on large-scale web-scraping operations, and it doesn't matter if they split up their efforts across a bunch of IP addresses - they're still going to have to do those computations.

https://en.wikipedia.org/wiki/Hashcash
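
A toy version of the idea, with an arbitrary difficulty of 20 leading zero bits: the server issues a random challenge, the client burns CPU finding a nonce, and verification is a single hash.

    # Hashcash-style proof of work: cheap to verify, expensive (per request) to solve.
    import hashlib, itertools, os

    DIFFICULTY_BITS = 20  # arbitrary example; tune so honest users barely notice

    def issue_challenge():
        return os.urandom(16).hex()

    def solve(challenge, bits=DIFFICULTY_BITS):
        target = 1 << (256 - bits)
        for nonce in itertools.count():
            digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce

    def verify(challenge, nonce, bits=DIFFICULTY_BITS):
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        return int.from_bytes(digest, "big") < (1 << (256 - bits))

A normal visitor pays this once per page or session; a scraper replaying it millions of times pays it millions of times, which is the whole point.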


No thanks, as a user I would stay far away from such websites. This is akin to crypto miners. I don't need them to drive up my electricity costs and also contribute to global warming in the process. It's not worth the cost.


This is completely absurd - anti-spam PoW is not remotely comparable to crypto miners, and the electricity cost will be so far below the noise floor of owning a computer in the first place that you will literally not notice (and neither will the environment), unless website owners are completely insane and set up multi-second challenges (which, they won't).

And, it's absolutely worth the cost - as a website owner, you get to impose costs on botting operations with minimal penalties for normal users and minimal environmental impact. Bots work because the costs of renting an AWS server and scraping websites (or sending spam, whatever) are extremely tiny - adding PoW challenges to everything that could be spammed suddenly massively changes the cost of running those spam operations, and would result in noticeably less spam if deployed widely.

In fact, the net "environmental impact" would be negative, as botters start to shut down operations due to greatly increased operational costs.


You do PoW every time you send an email.


If most of your traffic is bots, is the site even worth running?

This really is akin to the question, “Should others be allowed to take my photo or try to talk to me in public?”

Of course the answer should be yes, the internet is the digital equivalent of a public space. If you make it accessible, anyone should be able to consume.

If you don’t want it scraped add auth!


Do you also believe someone running a drone to follow and photograph you, personally, wherever you go in public would be fair and legal?



