No, nor can we just do it by IP. The bots are MUCH more sophisticated than that. More often than not, it's a cooperating distributed net of hundreds of bots, coming from multiple AWS, Azure, and GCP addresses. So they can pop up anywhere, and that IP could wind up being a real customer next week. And they're only recognizable as a botnet with sophisticated logic looking at the gestalt of web logs.
We do use a 3rd party service to help with this - but that on its own is imposing a 5- to 6-digit annual expense on our business.
> Our annual revenue from the site would put us on the list of top 100 ecommerce sites
And you're sweating a 5- to 6-digit annual expense?
> all our pricing is custom as negotiated with each customer.
> there's a huge number of products (low millions) and many thousands of distinct catalogs
Surely a business model where every customer has individually negotiated pricing costs a whole lot to implement. Further, it gives each customer plenty of incentive to try to learn what other customers are paying for the same products. Given the comparatively tiny cost of fighting bots, your complaints in these threads seem pretty ridiculous.
> More often than not, it's a cooperating distributed net of hundreds of bots, coming from multiple AWS, Azure, and GCP addresses.
Those are only the low-effort/cheap ones. The more advanced scraping makes use of residential proxies (people's pwned home routers, or PCs where they've installed shady VPN software that turns the machine into a proxy) to appear to come from legitimate residential last-mile broadband netblocks belonging to Comcast, Verizon, etc.
Google "residential proxies for sale" for the tip of the iceberg of shady grey-market shit.
There's a lot of metadata available for IPs, and that metadata can be used to aggregate clusters of IPs, and that in turn can be datamined for trending activity, which can be used to sift out abusive activity from normal browsing.
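To make the clustering idea above a bit more concrete, here's a minimal sketch. It assumes each log record has already been annotated with an ASN from some IP-metadata lookup (not shown), and it uses a crude "current hour vs. prior hourly mean" rule for trending. The record shape, the function names, and the `ratio` / `min_requests` thresholds are all made up for illustration; a real system would look at far more signals than this.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical record shape: (ip, asn, hour_bucket).
# The ASN is assumed to come from an IP-metadata lookup that is not shown here.

def requests_per_cluster(records):
    """Count requests per (asn, hour), so activity is tracked per cluster, not per IP."""
    counts = defaultdict(lambda: defaultdict(int))
    for _ip, asn, hour in records:
        counts[asn][hour] += 1
    return counts

def trending_clusters(counts, current_hour, ratio=3.0, min_requests=20):
    """Flag clusters whose current-hour volume exceeds `ratio` times their prior hourly mean."""
    flagged = []
    for asn, by_hour in counts.items():
        current = by_hour.get(current_hour, 0)
        history = [n for h, n in by_hour.items() if h < current_hour]
        baseline = mean(history) if history else 0.0
        # Thresholds are illustrative assumptions, not tuned values.
        if current >= min_requests and current > ratio * max(baseline, 1.0):
            flagged.append(asn)
    return sorted(flagged)
```

The point of aggregating first is that no single IP in the cluster has to look abusive on its own; the spike only shows up at the cluster level.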
If you're dropping 6 figs annually on this and it's still frustrating, I'd be interested in talking with you. I built an abuse prediction system using this approach for a small company a few years back. It worked well, and it'd be cool to revisit the problem.
Yes. And if I could get the perpetrators to raise their hands so I could work out an API for them, it would be the path of least resistance. But they take great pains to be anonymous, although I know from circumstantial evidence that at least a good chunk of it is various competitors (or services acting on behalf of competitors) scraping price data.
IANAL, but I also wonder if, given that I'd be designing something specifically for competitors to query our prices in order to adjust their own prices, this would constitute some form of illegal collusion.
What seems to actually work is to identify the bots and, instead of tipping your hand by blocking them, quietly poison the data. Critically, it needs to be subtle enough that it's not immediately obvious the data is manipulated. It should look like a plausible response, only with some random changes.
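A rough sketch of what that poisoning could look like for price data. Everything here is hypothetical: the `served_price` name, the `client_id` key, and the 8% cap are my own picks. Seeding the randomness from the client and SKU is also my assumption, on the theory that a bot re-requesting the same page should see the same number, or the noise itself gives the game away.

```python
import hashlib
import random
from decimal import Decimal, ROUND_HALF_UP

def served_price(real_price: Decimal, sku: str, client_id: str,
                 is_suspected_bot: bool, max_skew: float = 0.08) -> Decimal:
    """Return the real price for normal clients, a subtly skewed one for suspected bots."""
    if not is_suspected_bot:
        return real_price
    # Seed from (client, sku) so repeated requests get the same skewed price.
    seed = hashlib.sha256(f"{client_id}:{sku}".encode()).digest()
    rng = random.Random(seed)
    # Stay within +/- max_skew so the result still looks like a plausible price.
    factor = 1.0 + rng.uniform(-max_skew, max_skew)
    skewed = real_price * Decimal(str(factor))
    return skewed.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

How you decide `is_suspected_bot` is the hard part and is out of scope here; this only covers what to do once you've made the call.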
It's in their interest. I've scraped a lot, and it's not easy to build a reliable process on. Why parse a human interface when there's an application interface available?