
What exactly is so bad about AI crawlers compared to Google or Bing? Is there more volume or is it just "I don't like AI"?

If you want my help training up your billion dollar model then you should pay me. My content is for humans. If you're not a human you are an unwelcome burden.

Search engines, at least, are designed to index the content, for the purpose of helping humans find it.

Language models are designed to filch content out of my website so it can reproduce it later without telling the humans where it came from or linking them to my site to find the source.

This is exactly the reason for "I just don't like 'AI'." You should ask the bot owners why they "just don't like appropriate copyright attribution."


> copyright attribution

You can't copyright an idea, only a specific expression of an idea. An LLM works at the level of "ideas" (in essence - for example, if you subtract the vector for "man" from "woman" and add the difference to "king", you get a point very close to "queen") and reproduces them in new contexts, making its own connections to other ideas. It would be absurd for you to demand attribution and payment every time someone who read your Python blog said "Python is dynamically type-checked and garbage-collected". Thankfully that's not how the law works. Abusive traffic is a problem, but the world is a better place if humans can learn from these ideas with the help of ChatGPT et al., and to say they shouldn't be allowed to, just because your ego demands credit for every idea someone learns from you, is purely selfish.
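For what it's worth, that analogy is easy to reproduce with off-the-shelf embeddings; a minimal sketch, assuming the gensim library and one of its pretrained GloVe models (the model name here is just an example):

  # Rough sketch of the word-vector analogy; model name is illustrative.
  import gensim.downloader as api

  model = api.load("glove-wiki-gigaword-100")  # pretrained word vectors

  # vec("king") - vec("man") + vec("woman") lands near vec("queen")
  print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
  # typically something like [('queen', 0.78...)]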


LLMs quite literally work at the level of their source material; that's how training works, that's how RAG works, etc.

There is no proof that LLMs work at the level of "ideas"; if you could prove that, you'd solve a whole lot of incredibly expensive problems that are current bottlenecks for training and inference.

It is a bit ironic that you'd call someone wanting to control and be paid for the thing they themselves created "selfish", while at the same time writing apologia for why it's okay for a trillion-dollar private company to steal someone else's work for its own profit.

It isn't some moral imperative that OpenAI gets access to all of humanity's creations so they can turn a profit.


As a reference on the volume aspect: I have a tiny server where I host some of my git repos. After my server's fans spun faster and louder every week, I decided to log the requests [1]. In a single week, ClaudeBot made 2.25M (!) requests (7.55 GiB), whereas GoogleBot made only 24 requests (8.37 MiB). After installing Anubis, the traffic went back down to pre-AI-hype levels.

[1] https://types.pl/@marvin/114394404090478296
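If anyone wants to do the same tally, a rough sketch of the counting (assuming a standard nginx combined access log; the path and bot list are just examples):

  # Count requests per crawler from an access log.
  from collections import Counter
  import re

  counts = Counter()
  with open("/var/log/nginx/access.log") as log:
      for line in log:
          quoted = re.findall(r'"([^"]*)"', line)
          if not quoted:
              continue
          ua = quoted[-1]  # user agent is the last quoted field in the combined format
          for bot in ("ClaudeBot", "GPTBot", "Googlebot", "bingbot"):
              if bot in ua:
                  counts[bot] += 1
                  break
          else:
              counts["other"] += 1

  print(counts.most_common())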


Same, ClaudeBot makes a stupid amount of requests on my git storage. I just blocked them all on Cloudflare.

As others have said, it's definitely volume, but also the lack of respecting robots.txt. Most AI crawlers that I've seen bombarding our sites just relentlessly scrape anything and everything, without even checking to see if anything has changed since the last time they crawled the site.

Yep, AI scrapers have been breaking our open-source project's Gerrit instance hosted at the Linux Network Foundation.

Why this is the case while web crawlers have been scraping the web for the last 30 years is a mystery to me. This should be a solved problem. But it looks like this field is full of badly behaving companies with complete disregard for the common good.


>Why this is the case while web crawlers have been scraping the web for the last 30 years is a mystery to me.

A mix of ignorance, greed, and a bit of the tragedy of the commons. If you don't respect anyone around you, you're not going to care about any rules or etiquette that don't directly punish you. Society has definitely broken down over the decades.


Volume, primarily - the scrapers are running full-tilt, which many dynamic websites aren't designed to handle: https://pod.geraspora.de/posts/17342163

Why not just actually rate-limit everyone, instead of slowing them down with proof-of-work?

My understanding is that AI scrapers rotate IPs to bypass rate-limiting. Anubis requires clients to solve a proof-of-work challenge upon their first visit to the site to obtain a token that is tied to their IP and is valid for some number of requests -- thus forcing impolite scrapers to solve a new PoW challenge each time they rotate IPs, while being unobtrusive for regular users and scrapers that don't try to bypass rate limits.

It's like a secondary rate-limit on the ability of scrapers to rotate IPs, thus allowing your primary IP-based rate-limiting to remain effective.
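For the curious, the proof-of-work part is basically hashcash; a minimal sketch (not Anubis's actual implementation - the challenge string, difficulty, and token handling here are made up):

  # Simplified hashcash-style proof of work, roughly the idea behind the challenge.
  import hashlib
  from itertools import count

  def solve(challenge: str, difficulty_bits: int = 16) -> int:
      # Find a nonce so sha256(challenge + nonce) has `difficulty_bits` leading zero bits.
      target = 1 << (256 - difficulty_bits)
      for nonce in count():
          digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
          if int.from_bytes(digest, "big") < target:
              return nonce  # client sends this back to get its per-IP token

  def verify(challenge: str, nonce: int, difficulty_bits: int = 16) -> bool:
      digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
      return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

  nonce = solve("challenge-tied-to-client-ip")
  assert verify("challenge-tied-to-client-ip", nonce)

The server only needs one cheap hash to verify, while the client (or scraper) has to grind through roughly 2^difficulty hashes per token.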


Earlier today I found we'd served over a million requests to over 500,000 different IPs.

All had the same user agent (current Safari); they seem to be from hacked computers, as the ISPs are all over the world.

The structure of the requests almost certainly means we've been specifically targeted.

But it's also a valid query, reasonable for normal users to make.

From this article, it looks like Proof of Work isn't going to be the solution I'd hoped it would be.


The math in the article assumes scrapers only need one Anubis token per site, whereas a scraper using 500,000 IPs would require 500,000 tokens.

Scaling up the math in the article, which states it would take 6 CPU-minutes to generate enough tokens to scrape 11,508 Anubis-using websites, we're now looking at 4.3 CPU-hours to obtain enough tokens to scrape your website (and 50,000 CPU-hours to scrape the Internet). This still isn't all that much -- looking at cloud VM prices, that's around 10c to crawl your website and $1000 to crawl the Internet, which doesn't seem like a lot but it's much better than "too low to even measure".

However, the article observes Anubis's default difficulty can be solved in 30ms on a single-core server CPU. That seems unreasonably low to me; I would expect something like a second to be a more appropriate difficulty. Perhaps the server is benefiting from hardware accelerated sha256, whereas Anubis has to be fast enough on clients without it? If it's possible to bring the JavaScript PoW implementation closer to parity with a server CPU (maybe using a hash function designed to be expensive and hard to accelerate, rather than one designed to be cheap and easy to accelerate), that would bring the cost of obtaining 500k tokens up to 138 CPU-hours -- about $2-3 to crawl one site, or around $30,000 to crawl all Anubis deployments.
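For concreteness, the back-of-the-envelope arithmetic behind those numbers (per-token solve times from the article; the $/CPU-hour figure is a rough assumption of mine):

  # Back-of-the-envelope cost estimate; the pricing constant is an assumption.
  SITES = 11_508            # Anubis deployments, per the article
  IPS = 500_000             # IPs seen hitting this one site
  USD_PER_CPU_HOUR = 0.02   # rough cloud pricing assumption

  for solve_seconds in (0.03, 1.0):  # the article's 30 ms vs. a ~1 s difficulty
      one_site_hours = IPS * solve_seconds / 3600
      all_sites_hours = IPS * SITES * solve_seconds / 3600
      print(f"{solve_seconds}s/token: "
            f"{one_site_hours:,.0f} CPU-h (${one_site_hours * USD_PER_CPU_HOUR:,.2f}) for one site, "
            f"{all_sites_hours:,.0f} CPU-h (${all_sites_hours * USD_PER_CPU_HOUR:,.0f}) for all of them")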

I'm somewhat skeptical of the idea of Anubis -- that cost still might be way too low, especially given the billions of VC dollars thrown at any company with "AI" in their sales pitch -- but I think the article is overly pessimistic. If your goal is not to stop scrapers, but rather to incentivize scrapers to be respectful by making it cheaper to abide by rate limits than it is to circumvent them, maybe Anubis (or something like it) really is enough.

(Although if it's true that AI companies really are using botnets of hacked computers, then Anubis is totally useless against bots smart enough to solve the challenges since the bots aren't paying for the CPU time.)


If the scraper scrapes from a small number of IPs they're easy to block or rate-limit. Rate-limits against this behaviour are fairly easy to implement, as are limits against non-human user agents, hence the botnet with browser user agents.
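The per-IP limit really is only a few lines; a toy sketch (numbers arbitrary), which is exactly why scrapers bother rotating IPs instead:

  # Toy per-IP token bucket: trivial to implement, trivially bypassed by rotating IPs.
  import time

  RATE = 1.0    # tokens refilled per second
  BURST = 10.0  # bucket capacity
  buckets: dict[str, tuple[float, float]] = {}  # ip -> (tokens, last_seen)

  def allow(ip: str) -> bool:
      tokens, last = buckets.get(ip, (BURST, time.monotonic()))
      now = time.monotonic()
      tokens = min(BURST, tokens + (now - last) * RATE)
      if tokens < 1.0:
          buckets[ip] = (tokens, now)
          return False  # respond 429 Too Many Requests
      buckets[ip] = (tokens - 1.0, now)
      return True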

The Duke University Library analysis posted elsewhere in the discussion is promising.

I'm certain the botnets are using hacked/malwared computers, as the huge majority of requests come from ISPs and small hosting providers. It's probably more common for this to be malware, e.g. a program that streams pirate TV, or a 'free' VPN app, which joins the user's device to a botnet.


Why haven't they been sued and jailed for DDoS, which is a felony?

Criminal convictions in the US require a standard of proof that is "beyond a reasonable doubt", and I suspect cases like this would not pass the required mens rea test, as, in their minds at least (and probably a judge's), there was no ill intent to cause a denial of service... and trying to argue otherwise based on any technical reasoning (e.g. "most servers cannot handle this load and they somehow knew it") is IMO unlikely to sway the court... especially considering web scraping has already been ruled legal, and that a ToS clause against it cannot be legally enforced.

Coming from a different legal system, so please forgive my ignorance: is it necessary in the US to prove ill intent in order to sue for damages? Just wondering, because when I accidentally punch someone's tooth out, I would assume they certainly are entitled to the dentist bill.

>Is it necessary in the US to prove ill intent in order to sue for damages?

As a general rule of thumb: you can sue anyone for anything in the US. There are even a few cases where someone tried to sue God: https://en.wikipedia.org/wiki/Lawsuits_against_supernatural_...

When we say "do we need" or "can we do", we're really talking about how plausible it is to win the case. A lawyer won't take a case with bad odds of winning, even if you offer to pay extra, because part of their reputation rests on taking battles they feel they can win.

>because when I accidentally punch someone's tooth out, I would assume they certainly are entitled to the dentist bill.

IANAL, so the boring answer is "it depends". Damages aren't guaranteed, and there are 50 different state laws to consider on top of federal law.

Generally, they are not automatically obligated to pay for the damages themselves, but they may possibly be charged with battery. Intent will be a strong factor in winning the case.


Manslaughter vs. murder: same act, different intent, different stigma, different punishment.

There's an angle where criminal intent doesn't matter when it comes to negligence and damages. They had to have known that their scrapers would cause denial of service, unauthorized access, increased costs for operators, etc.

That's not a certain outcome. If you're willing to take this case, I can provide access logs and any evidence you want. You can keep any money you win, plus I'll pay a bonus on top! Wanna do it?

Keep in mind I'm in Germany, the server is in another EU country, and the worst scrapers are overseas (in China, the USA, and Singapore). Thanks to these LLMs there is no barrier to having the relevant laws translated in all directions, so I trust that won't be a problem! :P


> criminal intent doesn't matter when it comes to negligence and damages

Are you a criminal defense attorney or prosecutor?

> They had to have known

IMO good luck convincing a judge of that... especially "beyond a reasonable doubt" as would be required for criminal negligence. They could argue lots of other scrapers operate just fine without causing problems, and that they tested theirs on other sites without issue.


I thought only capital crimes (murder, for example) held the standard of beyond a reasonable doubt. Lesser crimes require the standard of either a "Preponderance of Evidence" or "Clear and Convincing Evidence" as burden of proof.

Still, even by those lesser standards, it's hard to build a case.


It's civil cases that have the lower standard of proof. Civil cases arise when one party sues another, typically seeking money, and they are claims in equity, where the defendant is alleged to have harmed the plaintiff in some way.

Criminal cases require proof beyond a reasonable doubt. Most things that can result in jail time are criminal cases. Criminal cases are almost always brought by the government, and criminal acts are considered harm to society rather than to (strictly) an individual. In the US, criminal cases are classified as "misdemeanors" or "felonies," but that language is not universal in other jurisdictions.


Thank you.

No, all criminal convictions require proof beyond a reasonable doubt: https://constitution.congress.gov/browse/essay/amdt14-S1-5-5...

>Absent a guilty plea, the Due Process Clause requires proof beyond a reasonable doubt before a person may be convicted of a crime.


Proof or a guilty plea, which is often extracted from not-guilty parties due to the lopsided environment of the courts.

Thank you.

Many are using botnets, so it's not practical to find out who they are.

Then how do we know they are OpenAI?

High volume and inorganic traffic patterns. Wikimedia wrote about it here: https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-th...

They seem to be written either by idiots or by people who don't give a shit about being good internet citizens.

Either way, the result is the same: they induce massive load.

Well-written crawlers will:

  - not hit a specific ip/host more frequently than say 1 req/5s
  - put newly discovered URLs at the end of a distributed queue (NOT do DFS per domain)
  - limit crawling depth based on crawled page quality and/or response time
  - respect robots.txt
  - make it easy to block them

- wait 2 seconds for a page to load before aborting the connection

- wait for the previous request to finish before requesting the next page, since firing off concurrent requests would only induce more load, get even slower, and eventually take everything down (rough sketch of such a loop below)
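Putting that list together, a rough sketch of a polite crawl loop (hypothetical code; assumes Python's standard urllib.robotparser plus the third-party requests library, and the delays/timeouts are illustrative, not recommendations):

  # Sketch of a polite crawler: robots.txt, per-host delay, short timeout,
  # breadth-first queue, identifiable user agent.
  import time
  import urllib.robotparser
  from collections import deque
  from urllib.parse import urlparse

  import requests  # third-party

  USER_AGENT = "ExampleCrawler/0.1 (+https://example.com/bot)"  # easy to identify and block
  DELAY_PER_HOST = 5.0  # at most ~1 request per 5 s per host
  TIMEOUT = 2.0         # give up on slow pages instead of piling on

  robots_cache: dict[str, urllib.robotparser.RobotFileParser] = {}
  last_hit: dict[str, float] = {}
  queue: deque[str] = deque(["https://example.com/"])  # breadth-first, not per-domain DFS

  def allowed(url: str) -> bool:
      host = urlparse(url).netloc
      rp = robots_cache.get(host)
      if rp is None:
          rp = urllib.robotparser.RobotFileParser(f"https://{host}/robots.txt")
          rp.read()
          robots_cache[host] = rp
      return rp.can_fetch(USER_AGENT, url)

  while queue:
      url = queue.popleft()
      host = urlparse(url).netloc
      if not allowed(url):
          continue
      wait = DELAY_PER_HOST - (time.monotonic() - last_hit.get(host, 0.0))
      if wait > 0:
          time.sleep(wait)  # never hammer one host
      try:
          # one request at a time; the loop itself is sequential, per the point above
          resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=TIMEOUT)
      except requests.RequestException:
          continue
      last_hit[host] = time.monotonic()
      # ...extract links, score page quality, and push new URLs to the END of the queue...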

I've designed my site to hold up to traffic spikes anyway, and the bots I'm getting aren't as crazy as the ones I hear about from other, bigger website operators (like the OpenStreetMap wiki, still pretty niche), so I don't block much of them. I can't vet every visitor, so they'll get the content anyway, whether I like it or not. But if I see a bot racking up HTTP 499 "client went away before the page finished loading" entries in the access log, I'm not wasting my compute on those assholes. That's a block. I hadn't had to do that before, in a decade of hosting my own various tools and websites.


Agreed. Smug dismissal of new ideas is such a lazy shortcut to trying to look smart. I'd much rather talk to someone enthusiastic about something than someone who does nothing but sit there and say "this thing sucks" every time something happens even if person #2 is incidentally right a lot of the time.

Smug acceptance of new ideas is such a lazy shortcut to trying to look smart. I'd much rather talk to someone who has objectively analyzed the concept rather than someone who does nothing but sit there and say "this thing is great" for no real reason other than "everyone is using it".

Incidentally, "everyone" is wrong a lot of the time.


While I agree in principle, someone has to make decisions about resource allocation and decide that some ideas are better than others. That takes a finely tuned BS detector and a sense for technical feasibility, which I would argue is a real skill. It thus becomes subtly different to be an accurate predictor of success versus a default cynic, if 95% of ideas aren't going to work.

> if 95% of ideas aren’t going to work

Well, if only 95% of our ideas don't work, with a little hard work and sacrifice, we are livin' in an optimish paradise.


I wonder if we could make an LLM or other modern machine learning framework finally figure out how to compile to Itanic in an optimized fashion.


No. The problems involved are fundamental:

1) Load/store latency is unpredictable - whenever you get a cache miss (which is unpredictable*), you have to wait for the value to come back from main memory (which is getting longer and longer as CPUs get faster and memory latency roughly stays the same). Statically scheduling around this sort of unpredictable latency is extremely difficult; you're better off doing it on the fly.

2) Modern algorithms for branch prediction and speculative execution are dynamic. They can make observations like "this branch has been taken 15/16 of the last times we've hit it, we'll predict it's taken the next time" which are potentially workload-dependent. Compile-time optimization can't do that.

*: if you could reliably predict when the cache would miss, you'd use that to make a better cache replacement algorithm


> this branch has been taken 15/16 of the last times we've hit it

That is kind of how it worked more than 30 years ago (pre 1995), but not since, at least in OoO CPUs.

In fact it was found that having more than a 2-bit saturating counter doesn't help, because when the situation changes it takes too many bad predictions in a row to get to predictions that, actually, this branch is not being taken any more.

What both the Pentium Pro and PowerPC 604 (the first OoO designs in each family) had was a global history of how you GOT TO the current branch. The Pentium Pro had 4 bits of taken/not-taken history for the last four conditional branches, and this was used to decide which 2-bit counter to use for a given branch instruction. The PowerPC 604 used 6 bits of history. The Pentium Pro algorithm for combining the branch address with the history (XOR them!) is called "gshare". The PPC604 did something a little bit different but I'm not sure what. By the PPC750 Motorola was using basically the same gshare algorithm as Intel.

There are newer and better algorithms today -- exactly what is somewhat secret in leading edge CPUs -- but gshare is simple and is common in low end in-order and small OoO CPUs to this day. The Berkeley BOOM core uses a 13 bit branch history. I think early SiFive in-order cores such as the E31 and U54 used 10 bits.
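For anyone who wants to see how little state gshare actually needs, a toy sketch (history length and table size are arbitrary here; real predictors hash more address bits into much bigger tables):

  # Toy gshare: 2-bit saturating counters indexed by (branch address XOR global history).
  HISTORY_BITS = 4             # Pentium Pro-ish, per the above
  TABLE_SIZE = 1 << HISTORY_BITS
  counters = [1] * TABLE_SIZE  # 2-bit counters, start at "weakly not taken"
  history = 0                  # global taken/not-taken history

  def predict(pc: int) -> bool:
      index = (pc ^ history) & (TABLE_SIZE - 1)
      return counters[index] >= 2  # 2 or 3 means "predict taken"

  def update(pc: int, taken: bool) -> None:
      global history
      index = (pc ^ history) & (TABLE_SIZE - 1)
      if taken:
          counters[index] = min(counters[index] + 1, 3)  # saturate high
      else:
          counters[index] = max(counters[index] - 1, 0)  # saturate low
      history = ((history << 1) | int(taken)) & (TABLE_SIZE - 1)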


Fair point, I oversimplified a bit. Either way, what matters is that it's dynamic.


No, VLIW is fundamentally a flawed idea; OoO is mandatory. "We need better compilers" is purely Intel marketing apologia.


Isn't VLIW how a number of GPUs worked internally? That said, a GPU isn't the same as a general-purpose CPU.


Yes, as others noted, AMD used VLIW for TeraScale in the 2000-6000 series: https://en.wikipedia.org/wiki/TeraScale_(microarchitecture)

They are used in a lot of DSP chips too, where you (hopefully) have very simple branching if any and nice data access patterns.


And typically just fast RAM everywhere instead of caches, so you don't have cache misses. (The flip side is that you typically have very _little_ RAM.)


Some older ones, yeah (TeraScale comes to mind) but modern ones are more like RISC with whopping levels of SIMD. It turns out that VLIW was hard for them too.


Countdown timers and 'Only 4 left' are often scams, but it should be noted that a few sites like eBay get a pass, since they're simply stating true facts about the auction.


Isn't an auction itself a giant dark pattern?

Also, eBay mixes auctions with Buy It Now on the same item.


Buy-It-Now combined with an auction is exactly like selling a car listed with O.B.O. (e.g. "$5000 or best offer"). This doesn't seem like a dark pattern to me for either side of the transaction.


I like it a lot. Sometimes I want something NAOW and don't mind a slightly higher price than I'd pay if I waited a few days.


Auctions are great price discovery mechanisms. Nothing wrong with that IMO


Only if you consider voluntary interpersonal economics a dark pattern


This one is a stretch. 'Dark pattern' makes me think of something like a burglar hiding in the darkness of shadows or nighttime, not race.

And the website in question is hosted by the Australian government, so American censorship doesn't come into the picture.


Thanks, I updated the text putting the US bit in parentheses.

For black-list/white-list, replacing them with block-list/allow-list (also more descriptive) is a clearer example of the rationale for changing the terminology. In general it is about the whole range of feelings and perceptions around "dark" and how they lead to biases in people, often without them being aware of it. If we become conditioned so that uses of "dark" invoke gut feelings of sneaky, shady, illegal, secretive, nefarious, evil, etc., some of that may seep through into how people with dark skin are considered. Whether that is true or not, in any case, the alternative terminology being more descriptive, it is low-hanging fruit to adopt it.


> For black-list/white-list, replacing them with block-list/allow-list (also more descriptive) is a clearer example of the rationale for changing the terminology.

Sometimes it is more descriptive, but sometimes other words will be more descriptive, too. (Usually the words "blacklist" and "whitelist" are not hyphenated from what I could see, though) Sometimes the list is used to block and allow something, but sometimes other words such as exclude and include will be better. To really be more descriptive you might write e.g. "allow by default but deny whatever is listed", and "deny by default but allow only what is listed", etc.

> If we become conditioned that uses of "dark" invoke gut feelings of sneaky, shady, illegal, secretive, nefarious, evil, etc.

At least to me, it does not. It might be secretive (because it is dark, it cannot be seen; however, just because it cannot be seen does not necessarily imply that they intend to keep it secret and prevent anyone from knowing what it is), but that does not necessarily mean it is illegal, nefarious, or evil.

> Whether that is true or not, in any case, the alternative terminology being more descriptive, it is low-hanging fruit to adopt it.

I do agree, if you actually do have a better more descriptive terminology, it will be better, although being more descriptive can also make the wording too long, so that can be a disadvantage too. Also, sometimes words are suggested, which do not sound good, or are too similar to the other word.


> Usually the words "blacklist" and "whitelist" are not hyphenated from what I could see, though

Yes, I use blocklist / allowlist myself, without the dashes.

> Sometimes the list is used to block and allow something, but sometimes other words such as exclude and include will be better.

Good example. I agree. Using the most descriptive variant is a good practice then, and no need to fall back to a vaguer container concept.


If someone has cognitive dissonance over hearing the word "dark" and immediately jumps to a racist interpretation, it's really not my problem to fix. Racism exists in many forms, and the road to hell is paved in good intentions. I would argue that avoiding the word "dark" because it reminds you of black people is pretty damned racist.


Indeed there's racism in many forms, shapes, and sizes. At what point should it be addressed, and where? I am certainly no expert here. What I observe in society, when it comes to cultural change in general, is that often a change is set in motion by a particular group whose activism triggers a kind of overreaction on the theme, which in turn leads to severe resistance by others, followed by some 'middle road' becoming the new cultural norm over time. This can take many years.

You saw that with feminism, where at some point many fierce feminists held quite extreme views on the desired role of men in society. The vanguard opened the way, and over many years feminist ideas started to permeate everyday society. On racism, Black Pete, the helper of Saint Nicholas in a yearly children's festivity in the Netherlands, is an example where initially practically no one thought it racist. Until it was made a theme by activists. Now, a couple of years later, about three quarters of the country see soot-faced Petes (from the chimney through which they dispatch gifts), while a third clings to tradition with blackface Pete and the argument "it isn't racist, and never was".


Where does it end?

My friend's brother got fired for saying something at work, except HR would not tell him what it was he said. Instead, they gave him a pamphlet filled with "problematic" phrases and suggested alternatives; it was many pages and may or may not have even contained his particular unsanctioned phrase. Who knows.

Included in the pamphlet were phrases such as (with minimal paraphrasing) "that falls on deaf ears", which offends the deaf community, "this is a blind spot", which offends blind people, "we're coming up short on that", which offends short people, "that's a tall order", which offends tall people, and more.

I'm really hard pressed to accept that on the off chance someone gets upset about their height because someone uses distance or length to compare two concepts in a work meeting, that it should be anyone's problem other than that person.

I feel similarly about words like "dark", or "whitelist/blacklist", which have documented nonracial etymology, etc. At some point we draw the line, but we draw it to reject absurdity, not to embrace it.

I'd much prefer we spend all this time and organizational effort actually tackling racial inequality, dismantling racial infrastructure, implementing reparations, etc. instead of finding ever more ways to pat ourselves on the back for minimal effort.


> I'd much prefer we spend all this time and organizational effort actually tackling racial inequality, dismantling racial infrastructure, implementing reparations, etc.

As most people do. Guess all of that will happen simultaneously in the chaotic cauldron of society, including language evolution to that happy middle ground over time when terms find common well-accepted meaning.


I agree that "dark pattern" does not make me think of race either, but I think that "deceptive design" is a better term anyway.


It is not really about what it directly means. It is about changing the social ideas of white or lighter things meaning good, while black or darker things meaning bad.


I certainly do not want old graphics cards to become ewaste for no good reason.


> you're wondering if the answer the AI gave you is correct or something it hallucinated

Regular research has the same problem of turning up bad forum posts and other bad sources by people who don't know what they're talking about, albeit usually to a far lesser degree depending on the subject.


Yes, but that is generally public, with other people able to weigh in through various means like blog posts or their own papers.

Results from the LLM are for your eyes only.


The difference is that LLMs mess with our heuristics. They certainly aren't infallible, but over time we develop a sense for when someone is full of shit. The mix-and-match nature of LLMs hides that.


You need different heuristics for LLMs. If the answer is extremely likely/consistent and not embedded in known facts, alarm bells should go off.

A bit like the tropes in movies where the protagonists get suspicious because the antagonists agree to every notion during negotiations because they will betray them anyway.

The LLM will hallucinate a most likely scenario that conforms to your input/wishes.

I do not claim any P(detect | hallucination) but my P(hallucination | detect) is pretty good.


moment's given me no trouble at all. I certainly haven't found it to be full of traps. Addressing the most common complaint: a moment object is mutable, sure - that's a valid design choice, not a trap. Follow the docs and everything works perfectly well IME.


As a specific point, I have not found a safe way to represent a date or a time in Moment. When I point this out, I generally get agreement from people who are more used to other languages, and the claim from JS devs that "you should never be representing a date or time; everything should be a datetime".


moment is far smaller if you include it without locales you don't need.

I don't care how much they talk themselves down on their homepage, begging me to choose a different library - I like it and I'll continue using it.

> We now generally consider Moment to be a legacy project in maintenance mode. It is not dead, but it is indeed done.

> We will not be adding new features or capabilities.

> We will not be changing Moment's API to be immutable.

> We will not be addressing tree shaking or bundle size issues.

> We will not be making any major changes (no version 3).

> We may choose to not fix bugs or behavioral quirks, especially if they are long-standing known issues.

I consider this a strength, not a weakness. I love a library that's "done" so I can just learn it once and not deal with frivolous breaking changes later. Extra bonus that they plan to continue providing appropriate maintenance:

> We will address critical security concerns as they arise.

> We will release data updates for Moment-Timezone following IANA time zone database releases.


I couldn't agree more. I have no idea why the moment devs are trying to kill moment.


IMO it's the ISPs who are intentionally misleading people. The average Joe might have some inkling of how big a gigabyte is these days, but nobody except a network engineer cares what a gigabit is. I can't imagine how many people buy gigabit fiber expecting a gigabyte per second. It would sound much less impressive if it were marketed as 125 MB/s, like it should be. They should at least be required to show both, not make people convert units if they want to find out how fast their advertised internet is supposed to download their 50 GB game.
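The conversion is simple enough that hiding it feels deliberate; a quick worked example (ignoring protocol overhead):

  # Gigabits to gigabytes: the math the ads make you do yourself.
  link_gbps = 1.0                      # "gigabit" fiber
  mb_per_s = link_gbps * 1000 / 8      # = 125 MB/s
  game_gb = 50                         # a typical big game download
  seconds = game_gb * 1000 / mb_per_s  # = 400 s
  print(f"{mb_per_s:.0f} MB/s, so a {game_gb} GB game takes about {seconds / 60:.0f} minutes")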


I don't think that counts as intentionally misleading since bits/second is the correct measurement for any serial connection and has been since the days of Baudot. Joe Blow might misunderstand it but that's on Joe.

It's not like the situation with hard drives where they're going against industry convention for marketing purposes.


You could also blame Windows. Linux counts storage bytes in base 10 but still counts RAM in base 2.

