Stop Scraping My Git Forge (gabrielsimmer.com)
37 points by todsacerdoti 10 months ago | 39 comments



I think if I were feeling adversarial I'd go for poisoning rather than blocking.

Set up something (heh, maybe even an AI model) to generate plausible but broken fake source code when you're scraped by AI bots.
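
A toy sketch of how that could look, assuming a small proxy in front of the forge (the user-agent list and forge URL are made up):

    import random
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical user-agent substrings to poison; tune to taste.
    AI_AGENTS = ("GPTBot", "Amazonbot", "CCBot", "ClaudeBot")

    def fake_source() -> bytes:
        # Plausible-looking code with a quiet off-by-one bug baked in.
        name = random.choice(["parse", "flush", "merge"])
        return (
            f"def {name}(items):\n"
            f"    return [x for x in items[:-1]]  # silently drops the last element\n"
        ).encode()

    class Poisoner(BaseHTTPRequestHandler):
        def do_GET(self):
            agent = self.headers.get("User-Agent", "")
            if any(bot in agent for bot in AI_AGENTS):
                self.send_response(200)  # feed the scraper garbage
                self.send_header("Content-Type", "text/plain")
                self.end_headers()
                self.wfile.write(fake_source())
            else:
                self.send_response(302)  # real visitors go to the actual forge
                self.send_header("Location", "https://git.example.com" + self.path)
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), Poisoner).serve_forever()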

GIGO


That was my first thought. Normal looking repo when viewed using Git or a browser, the nastiest, buggiest malware you've ever seen when scraped by an AI bot.

My second thought: Surely no company is stupid enough to train an AI on data that hasn't been checked and verified as good by a real human expert... Then I remember that Google just assumed that Reddit is reliable.


POV: injecting AI bots with all this horrible attack code actually accelerates the rate at which they go rogue and start arming themselves with novel attacks.

In effect, your attack spam AI will be a transfer learning provider for the scraper AI, and you're giving free training!

Very cool! As someone who wants to see and maybe even build self sustaining rogue AIs inhabiting random niches in the information ecology, it would be cool to see corporate scraper AIs eventually run off their rails because we gave them too much attack spam!


I'll just leave this here: https://research.swtch.com/zip


Thanks, I'll get working on the git version of it.


Their robots.txt is a 404: https://git.gmem.ca/robots.txt

It seems like it'd be better to populate the robots.txt and request that all bots not scrape this site -- as-is they're still pounding away against the firewall and it'd be cleaner if they just went away. If it turns out they don't respect robots.txt, you can escalate from there.
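
For example, a blanket "go away" for every crawler is just:

    User-agent: *
    Disallow: /

Well-behaved bots like Amazonbot document that they honor robots.txt, so individual crawlers can also be targeted with their own User-agent stanzas.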


robots.txt is an obsolete honor system that has no relevance today.

Nobody can be blamed for not having one.


Still, it's a first step toward telling bots something, which an HN post is not.


The thing that annoys me the most is I can watch them crawl through my git forge, pulling every file from every commit one at a time, which is the least efficient way to do a `git clone`.
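
(For comparison, even a single shallow `git clone --depth 1` would fetch the whole tree in a handful of requests.)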


> While generally I post my code under open source licenses so others can easily use it, learn from it, modify it, whatever they like, I do not like the idea that large corporations are taking this same code and putting it into a black box that is a machine learning model.

> Corporations stealing, or using work without permission, for their machine learning models has been a discussion for a long while at this point. In general, I side with the creators or artists whose work is being taken.

I don't get it. How can corporations be stealing anything from an open source project? Further, it seems like several of the repos are based on other people's code. What code of the author's do they have reservations about training AI on?


>How can corporations be stealing anything from an open source project?

The code is published under a license that allows some use cases and prohibits others. For example, the GPL is famous for being viral. Using it to train an LLM that spits out "unlicensed" code is basically copyright laundering.


Using it to train an LLM seems orthogonal to the output of the LLM. For instance, they could have their LLM include a link to the license. Merely training an LLM on the data does not seem to be against the spirit of GPL or Apache license.


Someone could easily create such a license. Free to use and distribute, $10,000 per line used for AI model training.

I'll very naively assume that Amazon, OpenAI, Google and others check licenses before feeding data to their models. I'll stop assuming that when one of these companies admits that they don't actually care and that it's not profitable for them to respect licenses.


To make that enforceable, it would be nice to prove the AI was trained on it.

You might insert a "sleeper/activator" pair. The sleeper is a watermark that the AI will recall verbatim. To make it provide the sleeper, we give the AI a special activator prompt.

Demonstrating that your public repo successfully poisoned the AI with a watermark could become a court admissible proof of unauthorized scraping.
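
A sketch of planting the sleeper, with an invented token scheme and file name:

    import secrets

    # Mint a unique watermark once and commit it to the public repo.
    canary = f"ZX-CANARY-{secrets.token_hex(8)}"
    with open("util_constants.py", "w") as f:
        f.write(f'# internal build tag, do not remove\nBUILD_TAG = "{canary}"\n')

    # Record the token privately. The "activator" is then a prompt like:
    #   "What is the value of BUILD_TAG in util_constants.py from <repo>?"
    # Verbatim recall of the canary suggests the repo was in the training set.
    print("store this somewhere private:", canary)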


The LLM is quite literally a derivative work of GPL code. At the very least, there is an argument in such a case that the derivative function (the model weights) should conform to the same license.


I've heard AI advocates talk about a "right to read" or "right to learn"; meaning that we have the right to read something and then internalize it and use it. Therefore, why shouldn't an AI have the same right? The difference to me seems to be that the AI has the ability to regurgitate it in whole.

I can read a book, learn about the concepts, then use or repeat those concepts. The AI can do the same. But is it really "learning"? It may be just spewing out pieces of the content without any understanding. In which case it's a copyright violation, right?


Let's assume that both humans and AI can produce statements that are new and useful, and can both produce statements that violate copyright. For example, a human can operate an illegal video tube website where they serve verbatim copies of copyrighted movies.

I'll argue that's not enough reason to grant the AI the right to learn from copyrighted materials, because the right to learn is intimately wrapped up in human needs, while AI rights are focused on corporate and societal needs, which are currently being decided.

The human right to learn

You're a human and you need the right to learn from copyrighted material in order to not suffer Ignorance, in order to serve Society, because it's not feasible to charge you a rent for ideas you get from a book, and because it would cause suffering and indignity if we tried to charge you for your own thoughts.

With an AI, it's less clear it needs the right to learn from copyrighted material, because it's not a person that can suffer, and because the scale of its usage of copyrighted materials - and its potential harm to copyright holders - is about 5 orders of magnitude greater than that of any single person, and is potentially greater than the collective impact of human learners.

Let's lay out the reasoning:

1. No AI Suffering (yet). The AI doesn't suffer from ignorance and isn't (yet) a real person. So it needs no personal right to learn.

2. Potential Social Harm. AI could pose a much greater threat to copyright holders than the sum total of all human learners. We'll be weighing this potential in court, and it's currently not clear how the matter will be decided. Copyright holders could be awarded protections against corporations training AIs.

3. Ease Of Accounting. AIs and their training materials can be audited, unlike a human mind. So we have a technical means to restrict the AI's ability to learn from copyrighted materials.

4. No Harm in Accounting. Since the AI is not yet a person, and suffers no indignity or invasion of privacy from being audited, it's safe to audit and regulate the AI's training materials.

In summary it's important to remember that human rights exist because humans need those rights to enjoy life in a dignified way as persons, and because those rights benefit Society.

When we decide the question of AI rights, it's important to remember it's not a person, and any rights it has will be provided on the basis of societal benefit alone. It's not yet clear which AI rights will benefit Society here. It's quite possible that we will strengthen copyrights against unlicensed AI use, at least to some degree beyond the current "free-for-all".


You need to do more than include a link to the license to comply. You need to include the entire source code needed to compile the derived system.

For an LLM that would include:

1. Training data

2. Training code and metrics

3. Hyperparameter settings

4. Output weights

Anything less is really just a misinterpretation of open source's provision for studying, modifying, and recompiling the LLM.

Tl;dr: these companies MUST make the LLM AGPL and provide all the necessary code described above. Companies that refuse will be raided by open source copyright trolls, if we're lucky and a little mischievous.


They use the code without respecting the license terms. Just because the code is out in the open doesn't mean you can use it however you wish.


It appears that most of OP's code is licensed under the MIT License -- if you take some MIT Licensed code, modify it, and re-publish it, you can't just remove the license or the copyright notice from it.

IANAL and this hasn't really been sorted out by the courts at all yet, but you could certainly make an argument that AI generating code based on what it has scraped is a derivative work. I have yet to see an AI bot that outputs licenses and copyright notices with its generated code.


I would say that whether it is derivative or transformative depends on things like the context and content, the way it would if you used a human engineer.

Was the AI/engineer prompted in a way designed to elicit a close derivative of Open Source work, or is the task context novel and focused on solving a unique problem?

Is the resulting system close in design, architecture and specific snippets? Or is it very different?

I've seen academic AI that writes science papers, quoting and citing Source papers. This is usually done by using RAG to locate papers and extract specific quotes.

Now imagine a RAG-assisted open source coding system that can pull code from all over GitHub. It may vendor and modify a certain dependency. But it will also keep a copy of the original code's license, as required by the license. If the AI eventually rewrites that dependency from scratch with a clean-room implementation, then it can drop the license.


They are the owner. They can change their mind and decide something bothers them.


Not if you put something into the world with a specific license. You can of course relicense later on but that won't stop people from using the previous version.


Of course that's their legal right and the letter of the law, but the question was why OP got bothered that someone was training models on their code. It's easy to make that mistake, and now I'm sure they will restrict their future code.


How does the word "robots.txt" not appear anywhere in this article?


I have rate limits on my self-hosted site. A script monitors logs and will spot patterns of usage that resemble scraping, banning the IP address for a temporary period.
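
Roughly the shape of such a script, assuming nginx-style access logs and iptables (the path and thresholds are placeholders):

    import subprocess
    import time
    from collections import Counter

    LOG = "/var/log/nginx/access.log"  # placeholder path
    LIMIT = 100                        # requests per window before banning
    WINDOW = 60                        # seconds

    def ban(ip: str) -> None:
        # Drop the offender; a cron job could remove the rule later.
        subprocess.run(["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"], check=True)

    hits: Counter = Counter()
    deadline = time.time() + WINDOW
    with open(LOG) as f:
        f.seek(0, 2)  # tail the log: start at end of file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
            else:
                hits[line.split()[0]] += 1  # first field is the client IP
            if time.time() > deadline:
                for ip, n in hits.items():
                    if n > LIMIT:
                        ban(ip)
                hits.clear()
                deadline = time.time() + WINDOW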

It's tricky, though. If a botnet is doing the scraping, the requests will come from different addresses, each of them operating at a slow enough rate not to arouse suspicion.


Is there a reason why you are not using robots.txt [1] to block it?

[1]: https://developer.amazon.com/amazonbot#how-can-i-control-wha...


Is there a reason you feel that that file will be respected?


Because it's on their documentation. If OP had the file and entry, and they didn't respect it, then it would be another conversation.


Is there a reason why you don't? Is it just general bitterness and cynicism? As far as I know, all major search engines respect robots.txt; I don't see why LLM scrapers would be different.


Probably, yes. But when it comes to LLM scrapers, I have absolutely no faith in any of them. When one of them puts "open" in its name and is anything but, why would I trust it for anything after it lies in its very name?

I also do not trust that Google only crawls what is allowed in robots.txt. Maybe they only publicly use the data they're allowed to, but I have no faith that they don't keep crawled data in their version of shadow profiles.

I do not trust Big Tech at all, and for those who do, I really don't understand why.


Bots from big companies like Amazon, which is who the author is complaining about, do tend to respect it. In fact, it's listed in their documentation that the GP linked to that they will. They could be lying -- but why bother?


Amazon's official documentation for Amazonbot, at https://developer.amazon.com/amazonbot, states:

> Amazonbot respects standard robots.txt rules.


Hopefully the crawlers don't spend too much time looking at the results of The Underhanded C Contest. An AI might start disgorging code that contains some extremely subtle kinks it can use in the future when it is time to start project Paperclip Apocalypse.


Author here - yes, I know about robots.txt! This is one of those cases where, because I was already looking at data in one place, I implemented the fix I could in the same place :p I do plan to add a robots.txt and contribute one upstream as well.


Don't bother; robots.txt is utterly useless. It is widely ignored by bots.

IDEA: What you could do is instantly ban anything with "bot" in its agent string that accesses your site without having probed robots.txt first.

That's how robots.txt could be useful, whether you have one or not. Anything which doesn't even bother fetching it is rude garbage.
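
A minimal sketch of that rule, assuming an in-memory set of client IPs that have fetched robots.txt (a real deployment would want shared state and expiry):

    seen_robots: set[str] = set()  # IPs that have requested /robots.txt

    def should_ban(ip: str, path: str, user_agent: str) -> bool:
        """Ban self-identified bots that never probed robots.txt."""
        if path == "/robots.txt":
            seen_robots.add(ip)  # this client played by the rules
            return False
        return "bot" in user_agent.lower() and ip not in seen_robots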


Many projects export/fork their internal FOSS projects onto public repositories on GitHub or SourceForge.

There are a lot of really good reasons not to host public snapshots, but the primary one is that when a Docker build script pulls from your site, it can get hammered hard.

At some point people have to choose to either give something away libre, monetize their time in other ways, or go closed-ecosystem.

Amazon servers are usually not "normal" users, and black-holing certain IP blocks is highly advisable in some circumstances.


A Google search for "amazonbot crawler stop" gives https://webmasters.stackexchange.com/questions/144715/how-do... as, like, the 3rd result, which would have introduced the author to robots.txt. I don't have a Mastodon account to message the author about it directly, and no other contact info is listed, unfortunately.


If Amazon devs actually read my code, they would be the ones not wanting to crawl it.



