I’ve been on the other side of this, defending against bots.
Basically: Well-behaved and well-intentioned scraping bots are rare. You’d get a lot of users setting update rates to 60 seconds that did a new login every time and creating as much traffic as 1000 users. Then they’d release the script for integration with something people use, and suddenly you have 1000 people each creating 1000 times as many login requests as a single user.
Another common problem was forgetting to implement reasonable backoff for failures. A lot of newbies write scripts that immediately retry in a tight infinite loop whenever something goes wrong, sending a huge stream of requests to your server if the API changes or when it goes down. Again, multiply this by many users sharing a script and it becomes a problem.
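For anyone wondering what "reasonable backoff" might look like in a script, here is a minimal sketch, assuming a capped exponential delay with random jitter. `fetch_with_backoff` and its parameters are hypothetical names for illustration, and `fetch` stands in for whatever request the script makes.

```python
import random
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure.

    fetch is any zero-argument callable that raises on failure (hypothetical).
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up instead of retrying forever
            # Upper bound on the wait doubles each time: 1s, 2s, 4s, ...
            # and never exceeds max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Random wait up to the cap, so many copies of the same shared
            # script do not all retry at the same instant.
            sleep(random.uniform(0, cap))
```

The two things the tight-loop scripts are missing are the bounded number of attempts and the growing delay. The jitter matters mostly once a script is shared by many users.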
Then of course there are people trying to make a business out of extracting your company’s data, such as putting it in some other website where they can serve ads over your content or whatever (think of all of those StackOverflow scraping websites in Google)
Basically, you can’t investigate the motivations of each individual user. You just block them all.
And unreasonable bot load is a legitimate concern.
What's illegitimate is that "attempting to ban programmatic access" is on the table as a legal redress.
The only way, from a technical moral standpoint, I could see that being remotely reasonable is if there was 1:1 feature and access parity with an API, then being able to legally force agents to use the API.
But critically that's 1:1 feature - if a user can do it, the API offers a method to do it.
And 1:1 access - if an unauthenticated user can do it, then no account is required for API use. And if any user can do it, then any user will be approved for an API key.
Otherwise, it's just ceding more power to companies.
So what we need to do is make a platform for users' bots that completely prevents them from behaving in any un-human-like way. Then get other platforms to trust this platform, rather than trust individuals and their scripts.
Reminds me of a little battle I got into years ago at work. The thermostat covered like 4 or 5 offices, and they had given us a control to change it on an internal website. It would record your name when you changed it, and then make you wait like 10 minutes to change it again. When I first moved into the office, I noticed that there was a battle between two people that had it doing several degree swings all day long. I sent them both an email and proposed a truce with a temperature in the middle, and they agreed. A few weeks later I noticed the guy who preferred it warmer broke the truce. I do not like it warm. So I wrote a script that would reload the page, check if the temperature was above a certain number, hit the down button, wait 10 minutes, then repeat. Some time after that, it became obvious that the other guy had a script too. But his script had no timeouts in the loop. Eventually the people in charge of the internal site emailed me and asked me to stop. They said they only noticed I was using a script because the other guy's script was breaking the website, and they looked at the logs and saw my responsible script reacting to it all night long. My manager laughed and told me to make the script more human-like. The other guy gave up his temperature tyranny and I let it sit at the truce point again.
> think of all of those StackOverflow scraping websites in Google
You don’t need to scrape Stack Overflow, you can just download a .zip.
That’s one reason why people use it: they can’t just gate you off from the content you’ve created. You and others can (and will/do) have a copy of it all.
You beat me to it, was going to say the same. It's always a few bad actors that try and hammer our servers, gets annoying real fast. I'd honestly block them and move on, I don't have time to investigate every single request. Now to sue someone? That seems like a waste of everyone's time.
For the specific use-case of "badly written scrapers", this might be reasonable, but usually by the point when engineering needs to care about scrapers, other people at the company are involved and just view it as a service theft issue. i.e. "Why waste time and money forcing people to scrape fairly when we can just ban all scrapers?"
Not to mention, actually malicious traffic will find any non-Sybil criterion you use to enforce rate limits and work around it. "Enforce rate limits per User-Agent?" I'm now 10,000 different applications. "Enforce rate limits per IP address?" I'm now 10,000 different compromised residential IP addresses. At some point, distinguishing between well-behaved, buggy-but-legitimate, and outright malicious automated traffic is either impossible or too time-consuming. At which point you throw up your hands and say, "Screw it, everyone but Google or a browser is banned."
> "Screw it, everyone but Google or a browser is banned."
Thanks! Why don't malicious actors just spoof browsers?
More generally, I would think that any defense that prevented malicious actors would prevent badly written scrapers, simply because malicious actors can do anything a badly written scraper could do, but can also take more active steps to evade defenses.
(These are honest questions; I have very little knowledge about this.)
Why does it need any investigation of motivation? If you can block, then you can just as well implement backing-off logic on the server side and serve all traffic. One of the other comments is right. There is some sort of discrimination against bots that probably will never be resolved until laws are put in place to give them certain equal access rights.
There’s no winning here. Sending back a bunch of 429s is still part of your API. Sure, it’s less expensive to do than the operation the client was probably requesting, but it’s not free and it’s stateful. For the kinds of bad actors people are talking about in this thread you still want to blackhole them.
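To make the "not free and stateful" point concrete, here is a rough sketch of server-side throttling, assuming a per-client token bucket. `TokenBucketLimiter`, its parameters, and the choice of client key are all hypothetical; the second return value is the wait a server might advertise in a `Retry-After` header.

```python
import time


class TokenBucketLimiter:
    """Per-client token bucket. The buckets dict is state the server must keep."""

    def __init__(self, rate=1.0, burst=5, clock=time.monotonic):
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum tokens a client can accumulate
        self.clock = clock
        self.buckets = {}     # client key -> (tokens, time of last request)

    def check(self, client_key):
        """Return (http_status, retry_after_seconds) for one request."""
        now = self.clock()
        tokens, last = self.buckets.get(client_key, (float(self.burst), now))
        # Refill for the time elapsed since this client's last request.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[client_key] = (tokens - 1.0, now)
            return 200, 0.0
        # Even a rejection costs a lookup and a write to the per-client state.
        self.buckets[client_key] = (tokens, now)
        return 429, (1.0 - tokens) / self.rate
```

Every distinct key gets its own entry, so a client that rotates User-Agents or IP addresses gets a fresh bucket each time and also grows the table. That is the cost being described: the limiter works on buggy-but-honest scripts and not on anyone who changes the key.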
All you're doing is offloading the response to a piece of network hardware. It seems like what you're looking for is a technological solution for load management which you're forsaking in favor of a kludge, then blaming the user.
Does it make sense? Bots rarely make HTTP requests for images, CSS, video clips, large JS files, custom fonts, etc. Real people do. A well-written bot just seeking some specific data can often complete its task with less than 1% of the resources that would be sent to a "real" user.
I am not sure what "bots" you are talking about here. When I wrote a web scraper to get my Air Canada point values I used a script that fetches the web page and parses it. It was the only way I could get it to work. I had to steal the session token from the browser cookie in order to make it auth.
I guarantee that, with some effort, you could write a script to emulate every HTTP call -- logging in, accepting the cookie (just a value in a Set-Cookie header), and requesting the point values, making sure that cookie value is in your Cookie header. Just because you could "only get it to work" one way, does not mean there isn't a far more efficient way.
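The cookie part of that is small. As a sketch only, assuming the server returns ordinary `Set-Cookie` headers, this shows how a script could turn them into the `Cookie` header for the next request using the standard library. `cookie_header`, `build_points_request`, and the URL in the usage note are made-up names, not anything from a real airline site, and nothing here sends a request.

```python
from http.cookies import SimpleCookie
from urllib.request import Request


def cookie_header(set_cookie_values):
    """Build a Cookie header value from raw Set-Cookie header values."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)  # parses "name=value; Path=/; HttpOnly" and similar
    # Only name=value pairs go back to the server; attributes such as Path
    # or HttpOnly are instructions to the client and are not echoed.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


def build_points_request(url, set_cookie_values):
    """Prepare (but do not send) a GET request carrying the session cookie."""
    return Request(url, headers={"Cookie": cookie_header(set_cookie_values)})
```

For example, `build_points_request("https://example.com/points", ["session=abc123; Path=/"])` yields a request whose `Cookie` header is `session=abc123`. In practice `http.cookiejar.CookieJar` with `urllib.request.HTTPCookieProcessor` handles this bookkeeping automatically; the manual version is here only to show how little is involved.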
Dude, this is my story about my experience. The context provided above is relevant to what I was doing. This has nothing to do with the theoretical possibilities that you are speaking about. You are in the wrong thread.
Those are all static files that are easily (and typically) cached in front of the application. Pulling customer-specific data from an authenticated session taxes the application (and DB) directly.
I think you are over-estimating the use of caching in a lot of industries and a lot of companies. Further, a company that aggressively caches static files should also recognize the benefits of caching their most common database queries. Another replier to my comment mentioned his Air Canada point totals. The original post is about American Airlines. Point balances change infrequently. An airline could easily query every active customer (had a point balance change within the past 6 months) every 6 hours and keep all those values in memory, dramatically reducing individual DB queries. Or not, and instead choose to sue a very popular blogger and builder of a tool used by your best customers, pissing everybody off and looking extremely petty and customer-unfriendly in the process.
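A rough sketch of the caching idea, assuming a lazy time-to-live cache instead of the scheduled bulk refresh described above (simpler to show, same effect on repeated lookups). `BalanceCache` and `load_balance` are hypothetical names; `load_balance` stands in for the real database query.

```python
import time


class BalanceCache:
    """Keep point balances in memory and re-query only after they expire."""

    def __init__(self, load_balance, ttl_seconds=6 * 60 * 60,
                 clock=time.monotonic):
        self.load_balance = load_balance  # hypothetical DB lookup function
        self.ttl = ttl_seconds            # 6 hours by default
        self.clock = clock
        self.entries = {}                 # customer_id -> (balance, fetched_at)

    def get(self, customer_id):
        now = self.clock()
        hit = self.entries.get(customer_id)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]                 # served from memory, no DB query
        balance = self.load_balance(customer_id)
        self.entries[customer_id] = (balance, now)
        return balance
```

With this in front of the database, a script polling one account every 60 seconds causes at most one query per TTL window. The trade-off is that a balance can be stale for up to the TTL unless the entry is invalidated when points are earned or spent.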
True, although the more you lock down, obfuscate, and hide your data, the more the bot-writer is going to use the heavy guns to penetrate you. Points Guy and Award Wallet are both attempting to provide a service that people, especially American Airlines's most valuable customers, obviously want -- AA could easily work with them, instead of against them (or provide the same service themselves).
> Then of course there are people trying to make a business out of extracting your company’s data,
If I can do this by hand there's no legal reason I can't do it by machine. You can try to defend against it, I guess, but the second you start impacting your obligations to someone else (like disabling their account after they paid you) you are in the wrong.
Access and privileges given to you do not always extend to agents acting on your behalf.
Having a driver’s license does not grant someone acting on your behalf the right to drive on public roads if they don’t have their own license. And more directly it also doesn’t grant you the ability to use your autonomous driving software on public roads either.
And just because you have a license doesn’t mean you can drive any vehicle on public roads. It has to be street legal and you need a different license for 18 wheelers.