Reading your comment, my impression is that this is either an exaggeration or a very unique type of site if bots make up the majority of traffic to the point that scrapers are anywhere near the primary load factor.

Would someone let me know if I’m just plain wrong in this assumption? I’ve run many types of sites and scrapers have never been anywhere close to the main source of traffic or even particularly noticeable compared to regular users.

Even considering very commonly scraped sites like LinkedIn or Craigslist: for any site of that magnitude, public pages are going to be cached, so additional scrapers should have negligible impact. And a rate limit is probably one line of config.
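
To be concrete about what I mean by a rate limit: here's a rough sketch (in Python, with made-up limits) of the per-IP throttle that a single web-server or CDN directive gives you.

    # Rough sketch of a per-IP throttle, the kind a single web-server or CDN
    # directive gives you. The window and limit are made-up example numbers.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_REQUESTS_PER_WINDOW = 120

    recent = defaultdict(deque)  # client IP -> timestamps of its recent requests

    def allow_request(client_ip):
        now = time.monotonic()
        q = recent[client_ip]
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()                        # forget requests outside the window
        if len(q) >= MAX_REQUESTS_PER_WINDOW:
            return False                       # over the limit: answer with a 429
        q.append(now)
        return True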

I’m not saying you are necessarily wrong, but I can’t imagine the scenario you’re describing and would love to hear of one.




Bots are an incredibly large source of traffic on the non-profit academic cultural heritage site I work on. It gets very little human traffic compared to a successful for-profit site.

But the bots on my site -- at least the obvious ones that lead me to say they are a large source of traffic -- are all well-behaved, with good clear user-agents, and they respect robots.txt, so I could keep them out if I wanted.

I haven't wanted to, because why would I? I have modified robots.txt to keep the bots out of some mindless loops where they tried every combination of search criteria, crawling a combinatorial expansion of every possible search-results page. That was doing neither of us any good and was exceeding the capacity of our Papertrail plan (which is what brought it to our attention) -- and every actual data page is listed in a sitemap that's available to them if they want it; they don't need to tree-search every possible search-results page!

In some cases I've done extra work to change URL patterns so I could keep them out of such useless loops with robots.txt more easily, without banning them altogether. Because... why not? The more exposure the better; all our info is public. We like our pretty good organic Google SEO, and while I don't think anyone else is seriously competing with Google, I don't want to privilege Google by blocking everyone else out either.
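
For illustration, it amounts to a robots.txt along these lines (the paths here are hypothetical, not our real ones): keep crawlers out of the combinatorial search expansion while leaving the data pages and the sitemap open.

    # Hypothetical example -- our real paths differ.
    User-agent: *
    Disallow: /search        # the combinatorial search-results pages
    Allow: /records/         # the actual data pages
    Sitemap: https://example.org/sitemap.xml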


As another example, I used to work on a site that was roughly hotel stays. A regular person might search for where to stay in a small set of areas and dates, usually with the same number of people.

Bots would routinely try to scrape pricing for every combination of {property, arrival_date, departure_date, num_guests} in the next several years. The load to serve this would have been vastly higher than that from real customers, but our frontend was mostly pretty good at filtering them out.

We also served some legitimate partners that wanted basically the same thing via an API... and the load was in fact enormous. But at least then it was a real partner with some kind of business case that would ultimately benefit us, and we could make some attempt to be smart about what they asked for.


It's a B2B ecommerce site. Our annual revenue from the site would put us on the list of the top 100 ecommerce sites [1] (we're not listed because ecommerce isn't the only business we do). With that much potential revenue to steal from us, perhaps the stakes are higher.

As described elsewhere, rate limiting doesn't work. The bots come from hundreds to thousands of separate IPs simultaneously, cooperating in a distributed fashion. Any one of them is within reasonable behavioral ranges.

Also, caching, even through a CDN, doesn't help. As a B2B site, all our pricing is custom, as negotiated with each customer. (What's ironic is that this means the pricing data the bots are scraping isn't even representative - it only shows what we offer walk-up, non-contract customers.) And because the pricing is dynamic, the scraping to get these prices is one of the more computationally expensive activities they could do.

To be fair, there is some low-hanging fruit in blocking many of them. Like, it's easy to detect those that are flooding from a single address, or sending SQL injection attacks, or just plain coming from Russia. I assume those are just the script kiddies and stuff. The problem is that it still leaves a whole lot of bad actors once these are skimmed off the top.
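
For a sense of what that skimming looks like, a sketch: the threshold, the pattern and the country check here are placeholders rather than our real rules.

    # Sketch of the "low-hanging fruit" checks; thresholds and patterns are
    # placeholders, not what we actually run.
    import re

    SQLI_PATTERN = re.compile(r"union\s+select|;\s*--|'\s*or\s*'1'\s*=\s*'1", re.IGNORECASE)
    BLOCKED_COUNTRIES = {"RU"}      # crude geo block
    FLOOD_THRESHOLD = 600           # requests per minute from one address

    def obviously_bad(query_string, requests_last_minute, country_code):
        if requests_last_minute > FLOOD_THRESHOLD:
            return True             # flooding from a single address
        if SQLI_PATTERN.search(query_string):
            return True             # blatant SQL-injection probing
        if country_code in BLOCKED_COUNTRIES:
            return True             # plain geo block
        return False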

[1] https://en.wikipedia.org/wiki/List_of_largest_Internet_compa...


If the queries are expensive because of custom negotiated prices and these bots are scraping the walkup prices, can you not just shard out the walkup prices and cache them?

Being on that list puts the company's revenue at over $1 billion USD. At a certain point it becomes cheaper and easier to fix the system to handle the load.
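
Something like this, roughly (a sketch; compute_price and the cache size are stand-ins for whatever the real pricing pipeline looks like, and cache invalidation is ignored):

    # Sketch only: one shared, cacheable walk-up price per product for anonymous
    # traffic, with the expensive per-customer pricing reserved for real customers.
    # compute_price is a stand-in for the real dynamic pricing pipeline.
    from functools import lru_cache

    def compute_price(product_id, customer_id):
        ...  # expensive: contract terms, upstream supplier costs, etc.

    @lru_cache(maxsize=1_000_000)
    def walkup_price(product_id):
        return compute_price(product_id, customer_id=None)

    def price_for_request(product_id, customer_id=None):
        if customer_id is None:                  # anonymous visitor (or bot)
            return walkup_price(product_id)      # cheap cache hit
        return compute_price(product_id, customer_id)  # custom negotiated price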


Indeed - this is one of the strategies we're considering.


> As a B2B site, all our pricing is custom as negotiated with each customer ... the pricing is dynamic

So your company is deliberately trying to frustrate the market, and doesn't like the result of third parties attempting to help market efficiency? It seems like this is the exact kind of scraping that we generally want more of! I'm sorry about your personal technical predicament, but it doesn't sound like your perspective is really coming from the moral high ground here.


> So your company is deliberately trying to frustrate the market, and doesn't like the result of third parties attempting to help market efficiency?

No. First, as a middleman reseller we MUST provide custom prices, at least to a certain degree. Consider that it's typical for manufacturers to offer different prices to, e.g., schools. This is reflected in the manufacturer offering us (the middleman) a lower cost, which we pass on to the applicable customers. Further, costs and prices vary similarly from one country to another. Less obviously, many manufacturers (e.g., Microsoft, Adobe, HP) offer licensing programs that entitle those enrolled to purchase their products at a lower cost. So if nothing else, the business terms of the manufacturers whose products we sell necessitate a certain degree of custom pricing.

Second, it seems strange to characterize as "frustrating the market" what we're doing when we cooperate with customers who want to structure their expenses in different ways - say, getting a better deal on expensive products that can be classified as "capital expenses" while allowing us to recover some of that revenue by charging them somewhat more for the products that they'd classify as operational expenses.


You're just describing a cooperative effort to obfuscate pricing and frustrate a market. So sure, your company could be blameless and the manufacturers solely responsible for undermining price signals - I've still described the overall dynamic that your company is participating in. It's effectively based on closed-world assumptions of information control, so it's not surprising that it conflicts with an open-world ethos like web scraping.

> it seems strange to characterize as "frustrating the market" what we're doing when we cooperate with customers who want to structure their expenses in different ways

I'm characterizing the overall dynamic of keeping market price discovery from working as effectively. How you may be helping customers in other ways is irrelevant.


Holy moly, doesn’t that sound more like tax evasion or fraudulent accounting than financial planning?

They’re trying to pay less tax by convincing your company to put a different price on products they buy based on their tax strategy.

It sounds illegal.


The majority of tax and other civil laws are basically full of things that are illegal or problematic if you do them individually, but become fine if you can find someone else to cooperate with.


This makes a lot of sense to me.

What do you think about their other assertion that the search page is getting a gigantic number of hits that a/ cannot be cached and b/ cannot be rate limited because they're using a botnet?


I'm guessing the bots are hitting the search page because it contains the most information per hit, and that the caching problems are exactly due to these dynamically generated prices or other such nonsense. After all, the fundamental goal of scraping is to straightforwardly enumerate the entire dataset.

The scale of the botnet sounds like an awfully determined and entrenched adversary, likely arising because this company has been frustrating the market for quite some time. A good faith API wouldn't make the bots change behavior tomorrow, but they certainly would if there were breaking page format changes containing a comment linking to the API.


Thanks for the explanation!

The thing I still don’t understand is why caching (edit: server-side, not CDN) doesn’t work - you have to identify customers somehow, so provide everyone else a cached response at the server level. For that matter, rate limit non-customers as well.


The pages getting most of the bot action are search and product details.

Search results obviously can't be cached, as they're completely ad hoc.

Product details can't be cached either - or, more precisely, there are parts of each product page that can't be cached - because:

* different customers have different products in the catalog

* different customers have different prices for a given product

* products can have customer-specific aliases

* there's a huge number of products (low millions) and many thousands of distinct catalogs (many customers have effectively identical catalogs, and we've already got logic that collapses those in the backend)

* prices are also based on costs from upstream suppliers, which are themselves changing dynamically.

Putting all this together, the number of times a given [product, customer] tuple will be requested within a reasonable cache TTL isn't much greater than 1. The exception is walk-up pricing for non-contract users, and we've been talking about how we might optimize that particular case.
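
To put rough numbers on that (illustrative assumptions, not our real figures):

    # Back-of-envelope with illustrative numbers, not our real figures.
    products      = 2_000_000              # "low millions" of products
    catalogs      = 20_000                 # "many thousands" of distinct catalogs
    cache_entries = products * catalogs    # ~4e10 possible [product, catalog] keys

    requests_per_ttl = 10_000_000          # assumed product-page hits within one TTL
    hits_per_entry   = requests_per_ttl / cache_entries
    print(hits_per_entry)                  # ~0.00025 -- nearly every request is a cache miss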


Ahhhhh, search results make a whole lot more sense! Thank you. Search can't be cached, and the people who want to use your search functionality as a high-availability API endpoint use different IP addresses to get around rate limiting.

The low millions of products also makes some sense, I suppose, but it's hard to imagine why this doesn't simply require a login for customers to see the products if the catalogs are unique to each customer.

On the other hand, I suspect the price this company is paying to mitigate scrapers is akin to a drop of water in the ocean, no? As a percentage of the development budget it might seem high, and therefore big to the developer, but I suspect the CEO of the company doesn’t even know that scrapers are scraping the site. Maybe I’m wrong.

Thanks again for the multiple explanations in any case, it opened my eyes to a way scrapers could be problematic that I hadn't thought about.


Good explanation, thank you.

I would think that artificially slowing down search results can discourage some of the bots. Humans don't care much if a search finishes in 5 seconds rather than 2, AFAIK.

Especially on backends where each request is relatively cheap operations-wise (especially when each request is a green thread, as in Erlang/Elixir), I think you can score a win against the bots.

Have you attempted something like this?


This is really interesting, but they’re using a network of bots already - even if you put up a spinner that makes them wait a couple of seconds, wouldn’t the scrapers just make more parallel requests?


Yes, they absolutely will, but that's the strength of certain runtimes: green threads (i.e. HTTP request/response sessions in this case) cost almost nothing so you can hold onto 5-10 million of them on a VPS with 16-32 GB RAM, easily.
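
Roughly this idea, translated into Python's asyncio just to illustrate it outside Erlang/Elixir (the bot check here is a toy placeholder):

    # Sketch of the tarpit idea: a delayed response is just a parked coroutine
    # costing a few KB of memory, not a blocked OS thread. looks_like_bot is a
    # placeholder for whatever heuristic you actually use.
    import asyncio

    def looks_like_bot(headers):
        return "python-requests" in headers.get("User-Agent", "").lower()

    async def handle_search(query, headers):
        if looks_like_bot(headers):
            await asyncio.sleep(5)      # suspected bots wait; the event loop doesn't
        return {"results": []}          # stand-in for the real search backend

    async def main():
        # Tens of thousands of these can be in flight at once without issue.
        tasks = [handle_search("chairs", {"User-Agent": "python-requests/2.31"})
                 for _ in range(10_000)]
        await asyncio.gather(*tasks)

    # asyncio.run(main())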

I haven't had to defend against extensive bot scraping operations -- only against simpler ones -- but I've used this practice in my admittedly much more limited experience, and it was actually successful. Not that the bots gave up, but their authors realized they couldn't accelerate the scraping, so they dialed down their instances, likely to save money on their own hosting bills. Win-win.

Apologies, I don't mean to lecture you, just sharing a small piece of experience. Granted, it's very specific to the backend tech, but what the heck, maybe you'll find the tidbit valuable.


If you've got a site with a lot of pages, bot traffic can get pretty big. Things like a shopping site with a large number of products, a travel site with pages for hotels and things to do, or something to do with movies or TV shows and actors - basically, anything with a large catalog will drive a lot of bot traffic.

It's been forever since I worked at Yahoo Travel, but bot traffic was significant then - I'd guess roughly 5-10% of the traffic was declared bots, and Yandex and Baidu weren't aggressive crawlers yet - so I wouldn't be terribly surprised if a site with a large catalog that wasn't top 3 with humans had a majority of its traffic from bots. For the most part, we didn't have availability issues as a result of bot traffic, but every once in a while a bot would really ramp up traffic and cause issues, and we had to carefully design our list interfaces to avoid bots crawling through a lot of different views of the same list (while also trying to make sure they saw everything in the list). Humans may very well want all the narrowing options, but it's not really helpful to expose to Google the hotels near Las Vegas starting with the letter M that don't have pools.


I appreciate the response, but I’m still perplexed. It’s not about the percentage of traffic if that traffic is cached, and rate limiting also prevents any problems. It just doesn’t seem plausible that scrapers are going to DDoS a site, per the original comment. I suppose you’d get skewed traffic reports and other problems like log noise, but claiming it to be a general form of DDoS really does sound like hyperbole.


> a very unique type of site if bots make up the majority of traffic

Pretty much Twitter and the majority of such websites.


Do you really believe bots make up a significant amount of Twitter’s operating cost? Like I said, they’re just accessing cached tweets and are rate limited. How can bot usage possibly be more than a small part of Twitter’s operating cost?


Bandwidth isn't free.


I didn’t say it is free, I said that the bandwidth for bots is negligible compared to that of regular users.


Negligible isn’t free either.



