What worries me most is that this is likely to only increase the gap between large corporate creators and small independent creators.
Large AI companies probably are not too worried about making deals with corporate content creators. Having access to content from a trigger-happy creator is only going to increase their advantage over competitors, after all. And if those creators were instead to try to introduce legislation, the AI companies would risk losing access to content from small creators without the means to sue too.
We seem to be moving into a world where corporate content cannot in any way be reused, remixed, or even archived. You cannot even own a copy - it is only accessible for a monthly fee and can disappear at any time. Meanwhile, anything created by independent creators is fair game to steal and rip off. Copyright was intended to promote and protect human creativity, but instead we got a rent-seeking mechanism used to stifle original creation.
> And if those creators were instead to try to introduce legislation, the AI companies would risk losing access to content from small creators without the means to sue too.
this is the point of a class action suit isn't it?
if it turns out training isn't fair use then Microsoft/Google/OpenAI will suddenly have class action suits for billions if not trillions of damages against them
($150,000 damages per willful infringement, after all)
Have you never seen the outcome of a class action? They’re all slaps on the wrist, less than speeding tickets, and the class members get like a free hotdog or Red Bull as compensation if they’re lucky.
Yeah that's the real goal. You can make it impossible for big corporations to use this technology because it paints a huge target on them while ignoring the de facto state of open source models that are impossible to enforce regulation against. Best of both worlds maybe.
How shocking! A monopoly right granted by the government to exclude others from use of an idea (patents) and creative works (intellectual property) was intended to help the little guy, but ended up helping the big boys extract rents instead, while exploiting the little guy because they sold their rights to survive and get access to a platform?
It’s almost as if governments (even capitalist ones) work with industry and concentration of power perpetuates this kind of consolidation further. They keep us distracted so we don’t have enough collective willpower to get together and demand reform, or even better — create our own alternative open ecosystems.
How so? Copyright was introduced along with the printing press to protect the works of authors. When you buy a copy of a book, it's not exactly the author's private property that you buy, is it?
This isn’t the case. In Ancient Rome it was a common practice for someone in the audience to note down poets’ performances, then give that transcript to a team of amanuenses who would produce copies for sale, with none of the proceeds going back to the original creator. In the pre-copyright world, no one saw any problem with this practice; as the other poster mentioned, the creator economy was patronage-based. All that the poets objected to (Martial has at least one biting epigram about this) was people putting their own names on the poetry instead of crediting the creator. That is, plagiarism instead of copyright violation.
> In Ancient Rome it was a common practice for someone in the audience to note down poets’ performances
Which was not at scale, and in a different model of economy. Since it was the printing press that introduced the issue, what's the point in using an example from times when such issues didn't exist?
> In the pre-copyright world, no one saw any problem with this practice
Because in the pre-industrial-revolution world the scale of the problem probably wasn't noticeable.
I know works of art were occasionally copied prior to printing press, hence the usually. I guess the modern analogy to your example would be something like CAM/TS films. You go to a show and copy the content. Are you arguing this should be legal ?
I doubt you are playing here with a good-faith definition of “at scale”. In any event, copying of works for sale in Antiquity certainly happened at scale: we know that many literary works spread quickly across the Mediterranean through commercial production. Furthermore, some of the earliest printed books were made in such limited editions that Roman mass production can certainly be compared. The development of the printing press is generally viewed in contrast to the medieval manuscript era that immediately preceded it, but that was a time marked by a decline in literacy rates and in the amanuensis workforce since Antiquity.
As for going to a show and copying the content, yes, I would argue that this should be legal. Plenty of people on HN are from cultures that never entirely accepted copyright on entertainment, even if their countries’ governments were pressured to enact copyright legislation.
> I doubt you are playing here with a good-faith definition of “at scale”.
I could say the same thing about you because you make it sound like the appearance of some works in various places on the map is equivalent to their vast abundance.
> development of the printing press is generally viewed in contrast to the medieval manuscript era that immediately preceded it
If you skip the Renaissance.
> As for going to a show and copying the content, yes
I guess it's one thing to argue something like that from the spectator's POV and another from the artist's. If you actually have evidence that there are large circles of professional artists who argue their work should be copied at will with no compensation then ok, you are right.
> the appearance of some works in various places on the map is equivalent to their vast abundance.
As I said, historians know that some popular works not only appeared across the map quickly, they were commercially sold in the marketplace such that the educated class was able to purchase their own copies of prominent recent works with, of course, no money going back to the creator. Again, I don’t think your definition of “abundance” is good faith.
With regard to your last point, why should the artists’ desire for compensation outweigh the desire of audiences to consume the media for free, or other artists’ desire to rework prior art for free? This is a moral debate that is quite culturally dependent, and though you want to claim that your views are the right ones, that just won’t fly on a forum as international as HN. Many posters on HN grew up with pirated DVD and cassette/CD stands at the market (some might even still have them where they live), or in their countries Bittorrent or now pirate streaming sites are things used by ordinary people.
Which illustrates the difference in scale I'm talking about.
> why should the artists’ desire for compensation outweigh the desire of audiences to consume the media for free
Because if you disincentivize the artist there is no media to consume. Why should your desire to consume for free deprive me of consuming at all if no artist is willing to accept such conditions?
Ancient Rome also used lead for everything including pots, pans, pipes and wine, and possibly lead-poisoned itself out of existence, so it might not be the best example to follow.
I'd guess the point is that prior to the invention of the printing press, an "author" would be financed through a patron. That patron could monetize their initial investment by either hiring people to copy a work by hand or allowing access to the book.
Once the printing press arrived, the patron / publisher needed another way to monetize their initial investment. As modes of reproduction became easier, copyright became more restrictive.
Depending on your side of the copyright argument, it either allows whoever is making the initial investment (publisher, author) to be a generous patron of human creative progress ... or ... it allows them to maintain a monopoly on knowledge and be able to profit off it.
> *or ... it allows them to maintain a monopoly on knowledge and be able to profit off it.*
The length of modern copyright terms is absurd and harmful.
The USA started with 14 years (plus optionally another 14), which was better.
Last I checked - quite a few years ago - I think there were academic papers calculating that an "optimal" copyright term is probably around 10-14 years.
Copyright in the United States begins in 1776 and even earlier in Europe. At this time the economy consisted almost solely of agriculture as well as some light manufacturing. So it seems highly unlikely that the authors of the constitution saw copyright as "integral to capitalism functioning" rather than as a way to promote the arts.
Regarding your second point:
>The first statutory police force is believed to be the High Constables of Edinburgh, who were created by the Scottish parliament in 1611 to "guard their streets and to commit to ward all person found on the streets after the said hour".[0]
I understand that you will likely view this source through a communist lens, and I wouldn't take every primary source at face value either, but having a city watch and bailiff was common. As governments became more centralized these were replaced by "police", but it is misleading to say police were simply invented one day, and that they were invented solely with ulterior motives.
> So it seems highly unlikely that the authors of the constitution saw copyright as "integral to capitalism functioning" rather than as a way to promote the arts.
They didn't. GP is speaking out of ignorance and desire to prove that "capitalism bad".
Any time I see the phrase "democratize access," my spidey-sense starts tingling. It's almost never used to describe an action that's an unadulterated good for society. It's USUALLY used to describe something sketchy at best, or even outright evil, with the justification that only "the bad guys" have access currently, and everything would be better if EVERYONE had access.
Look, I get that unethical corporations using this pirated training data for their artist-usurpation machines is bad, right? But EVERYONE being able to dismiss the rights and wishes of current artists while they work to create artist-usurpation machines of their own? That's not any better! You don't need to "democratize access" to that!
Generally speaking, democratizing access is about removing monopoly power that extracts illegitimate rents -- and that tingles in a good way.
Libraries have always been at the forefront of democratizing access, and hasn't that been an unadulterated good? Or whether it's community colleges that democratized access to higher ed, or the deregulation of air travel that democratized air travel through cheaper prices that made it more available to the masses, instead of artificially restricting routes.
I can't think of a single example of "democratizing access" that is "sketchy at best" or "outright evil". They all seem pretty great to me!
But I guess I'm also firmly on the side of AI training here -- I see no reason for additional compensation to authors/artists for training on their data, when anyone can go to a library or museum and then go and create their own works influenced by what they've seen. Who cares if I hire an expensive consultant who's read a lot of books at the library, or a cheap AI who's read a lot of books from Books3? Why would authors/artists deserve extra compensation for the latter but not the former? There's just no clear legal principle behind that.
I mean that's part of the conversation that needs to be had. I would argue libraries are an unadulterated good, but it is generally considered at best unethical and at worst illegal to re-use content that isn't your own, at least without a proper citation.
Then there's also the issue with things like art, music, and code. Where does the line fall with scraping Github, Soundcloud, DeviantArt, or Instagram and using things like that without permission? Most of the code on Github is open source, but there's a lot of difference between the GPL and BSD licenses.
> but it is generally considered at best unethical and at worst illegal to re-use content that isn't your own, at least without a proper citation.
No it's not at all, except in extremely limited circumstances.
When George Lucas made Star Wars, did he cite all the Westerns and space opera serials and movies that influenced him? When you give a presentation at work on why you should move to a sharded database, do you cite the history of academic work on sharded databases? When you use Times New Roman in a document, do you cite the British newspaper The Times, or Robert Granjon's prior serif designs from the 1500's?
Of course not.
Legally, you can do whatever you want with ideas and styles and whatnot, which is what AI is about. Legally, you only run into problems when you reproduce sections of copyrighted works verbatim, without a license, in a manner that's not considered fair use. Your answer to "where does the line fall" is quite clear legally -- it's the line demarcated by fair use, which has nothing to do with licenses. AI doesn't change that.
I am not a lawyer, but it seems right to me to say that the weights are a derivative work of the training set.
> A “derivative work” is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted. A work consisting of editorial revisions, annotations, elaborations, or other modifications, which, as a whole, represent an original work of authorship, is a “derivative work”.
As I understand it, derivative works must be created with the legal use of the original work, or be fair use, otherwise they are infringing.
No, as you can see from your very definition. But here's a good example:
If you take a book and turn it into a movie, that's a derivative work. Anyone can see the direct resemblance -- the transformation or adaptation.
But if you take a book, convert each letter to a number, add up the numbers that make each sentence, and then sell that as a list of "random" numbers, that's not a derivative work. The end result is sufficiently transformed that copyright no longer applies. Ownership of the original work has no relevance.
And AI weights are like that. They're a complete transformation. They're not a derivative work. The only thing you have to make sure of is that they haven't been overtrained to the extent that they can regurgitate whole chapters of the texts they were trained on, for example. But that's not something they're currently able to do, and obviously copyright law will force companies to ensure it stays that way. (Not to mention that companies would do it anyways, due to the economic motivation of reducing model sizes to cut costs.)
>convert each letter to a number, add up the numbers that make each sentence...The end result is sufficiently transformed that copyright no longer applies
the problem with this as an example is that copyright would not apply to this transformative work, not the original author's copyright nor your new authorship because this transformative work contains no creative human expression (unless the original book was designed to add up to some fortune cookie, of course, in which case you have not transformed it)
A nuttier, chewier example would be retelling a litigious story like Moana ("consider the copyright, across all these leaves... make way!"), from the pig's perspective or something, and seeing what would fly and what wouldn't.
Weights are simply a lossy compression of the training data set.
Now, I understand the argument that perhaps the specific work has been homeopathically diluted down to nothingness in the weights and so therefore has only been used to contextualise the compression process of other works, but if the weights can be reasonably used to generate copyright infringing text (and condensations and abridgements and transformations are explicitly listed in the law, verbatim copying is not necessary), or even answer substantial questions about it, then that shows that the weights included that data.
If I take a sound file and compress it down so it's poor quality but I can still make out the tune, that doesn't mean that I've avoided copyright law.
> Weights are simply a lossy compression of the training data set.
No they're not -- they're more like the dictionary generated to produce a lossless compressed data set. But then we throw out the compressed data itself, and keep only the dictionary.
> but if the weights can be reasonably used to generate copyright infringing text (and condensations and abridgements and transformations are explicitly listed in the law, verbatim copying is not necessary)
First of all, they haven't been shown to substantially generate infringing text beyond the kinds of short snippets covered by fair use. And my previous comment already explained that longer texts are not going to happen, for both legal and economic reasons.
But secondly, you're wrong about "condensations and abridgements and transformations". You can absolutely sell a page-long summary of a book without getting permission, for instance. What do you think things like CliffsNotes are all about? Or all those two-page "executive summaries" of popular business books?
You can't abridge a 1,000 page book to 500 pages and sell that, but you can summarize its ideas in a page and sell that. Which is basically the approximate level of understanding that LLMs seem to absorb.
That's how I see it as well. To democratize art, music etc. now means to remove the skill component with the usage of all the work done so far. No one is actually prevented from pursuing those things and if you don't want your art to become training data you're ruining democratization and are somehow against the will of the people.
The level of disrespect for artists I've seen here and on other forums with regards to this technology has been staggering. The entitlement for work you didn't do is so gross.
As a fanfiction writer, I find the entitlement of authors and corporations over work that I created gross. I transformed their work into something else, advertising the original work and making a bigger market for it, and in exchange we get C&Ds and copyright claims. Someone writes a reskin of The Odyssey and then a corporation claims every derivative for 75 years.
I think this is an oversimplification. The concern about entrenching the positions of big tech companies is, I think, genuine, and I do believe it's important to find ways to foster competition and opportunity for smaller players. Possibly the law needs to evolve and/or there need to be licensing solutions for this content that work both for creators, and for those looking to train models (and, in some sense, licensing payments that are "means-tested").
If we are thinking of the same tweet, it was from an influencer who shows off these different tools on his YouTube channel. So of course he's subscribed to all of them. Quite the disingenuous framing
>Equity and Concourse type by Matthew Butterick
>The lawyer suing all the ai companies
Yeah I see you are against giving everyone access to technology now
I wonder how this affects things like the Oxford English Dictionary, wiktionary, and other corpus-based linguistics, which rely on sample sentences of the word usage in order to determine the context.
Because language is an evolving thing, it is almost certain that they have referenced sentences from copyrighted sources. E.g. I'm willing to bet that they have the sentence where Cory Doctorow introduces the term "enshittification". (The OALD 7ed in its foreword even states "Corpus analysis now makes it possible to draw authentic examples from a vast range of attested contemporary usage. A concordance will display hundreds or thousands of them to choose from.")
I suspect that the inclusion of a few sentences -- especially those that introduce a new word or usage of a word -- is fair use, but the inclusion of the entire texts is not.
This then brings up an interesting point where the computer scientists/linguists developing tools like WordNet or other NLP databases would be at an advantage to those that take the approach of throwing a lot of data into a neural network and hoping for the best. Yes, it is a lot more work/effort to develop those NLP databases, but in the end they may end up being more robust, especially around the question of copyright.
It shouldn't affect corpus linguistics but the other way around, because all these things have long ago (even before computers) been legally contested by authors and publishers w.r.t. what can be done to text by dictionary makers and corpora managers without the authors' permission, so unless new laws get passed, the current precedents establishing what's permissible for corpus linguistics would still be valid and also be relevant for treatment of machine learning models.
In essence, the long-established principle for text analysis is that facts about text (concordances, collocation statistics, n-gram counts) are neither copyrightable nor derivative works, and thus can be calculated, gathered, used and distributed even if the copyright holders of the source data object. Now a court might judge that training a large language model is substantially different or that it's effectively the same, but such a decision wouldn't affect corpus linguistics and how they use sample sentences, only whether LLMs get the same treatment or not.
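To illustrate what "facts about text" means in practice, here's a rough sketch of n-gram counts and a keyword-in-context concordance using only the Python standard library. The tokenizer and the function names (`tokenize`, `ngram_counts`, `concordance`) are my own simplifications for illustration; real corpus tools are far more careful about tokenization.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters/apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens: list[str], n: int = 2) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def concordance(tokens: list[str], keyword: str, width: int = 2) -> list[tuple]:
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    toks = tokenize("The cat sat on the mat. The cat ran.")
    print(ngram_counts(toks, 2).most_common(3))
    print(concordance(toks, "sat", width=1))
```

Note the difference between the two outputs: the n-gram table is pure statistics, while the concordance keeps short snippets of the source, which is closer to the sample-sentence case.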
Sample sentences like those in the OED are pretty much the textbook example of fair use.
They are obviously transformative (the whole "dictionary" part), copy factual information (use of a word, not what the sentence itself is saying), are not substantial (one sentence out of many thousands), and do not impact the original work's value (nobody would buy a dictionary instead of a novel because it contains a sample phrase from that novel). The OED cites its sources, which also strengthens its case.
Compare that to AI, which is more than happy to write a short story to the prompt "Write a story about Bucky and Captain America falling in love, and living happily ever after in a mountain cabin." (Transformative? Maybe. Factual? No. Substantial? Yes. Impacts value? Yes.) Works like the AI's output have been dealt with in lawsuits like Salinger v. Colting, and it simply is not allowed. The big question right now is: what about the AI model itself?
(We noticed The Pile was recently taken offline, so we hosted it: https://thenose.cc. Apparently Books3 was also a part of The Pile, so feel free to download.)
Does anyone have more content about what makes Books3 so special relative to Bibliotik? Was it processed somehow, or just compiled into a single file?
I feel like I’m missing something from this and all of the other articles about Books3. It sounds like he downloaded all of the books from a book piracy site then rehosted them with the “Books3” name. Surely there must be more to the story? Or is the story simply that a professor hosted pirated content under his own name under the guise of AI training?
This is the kind of effort that could have been done anonymously, just as all of the pirated books had already been uploaded and hosted anonymously. I’m not sure why he expected any different outcome by re-pirating everything under his own name.
The journalists seem to be loving it, though. All of the tech journals have an article about this guy.
> Does anyone have more content about what makes Books3 so special relative to Bibliotik?
They are essentially the same content (that is, the documentation for the copy of Books3 separately hosted on Hugging Face says that it is all of Bibliotik in plaintext form, presumably as of a particular point in time).
> This is the kind of effort that could have been done anonymously
Sure, it's something each group training an AI could do independently at greater aggregate cost until someone succeeds in taking the original source down, but not only would that be costlier, it would also involve less transparency and comparability across model architectures, or at least require the transparent, comparable trained version to be different from the full version.
> Sure, it's something each group training an AI could do independently at greater aggregate cost until someone succeeds in taking the original source down, but not only would that be costlier, it would also involve less transparency and comparability across model architectures, or at least require the transparent, comparable trained version to be different from the full version.
I meant he could have uploaded it under a pseudonym rather than broadcasting to the world that he was the one doing the uploading.
>He sees the widespread practice of training AI on copyrighted data as outrageous, and finds it infuriating that this behavior gets defended with claims that it’s democratizing access to information. “Open source doesn’t mean you took a bunch of people’s shit and gave it away for free,” he says. “That's theft.”
>Whether the defendant had purchased a signed copy or flagrantly shoplifted a dog-eared paperback wouldn’t matter during arguments over whether The Bedwetter, Too was a derivative rip-off or a transformative parody.
This strikes at the heart of why this case is about to be laughed out of court.
The argument the plaintiffs are making is that ChatGPT is a "derivative work", i.e. letting people use the software is akin to distributing carbon copies of the book at issue, with at most slight modifications (typical derivative works include translations, screenplay adaptations, etc.).
Since ChatGPT obviously cannot literally produce the full text of the book on command, the very strained position they're trying to advance is that short, several-paragraph summaries constitute a derivative work.
That is to say, they're arguing that writing, say, a review, or a book report, is an act of copyright infringement tantamount to taking a book, translating it into Japanese, and selling that translation.
It's a deeply stupid and wrongheaded argument, and it deserves to die a quick death.
Writing a review or a book report is very much creating a derivative work in copyright law. Copyright law then says that these derivative works are a fair use. It does not follow that other derivative works that you personally feel are as serious are also fair use.
What's next for models like GPT now that a lot of sites will outright block CCBot, GPTBot and others? How big of an impact is this going to have on the LLM itself? Isn't OpenAI in a bit of a pickle in regards to this?
The problem with my question is the following:
Content gets syndicated anyway, so if DigitalOcean blocks GPTBot (which it does), pretty much every single one of those tutorials will be syphoned off to other sites, which are unlikely to block GPTBot themselves. How will DigitalOcean (or any other company) address this?
It looks to me like it's Catch-22 in every direction you look, and unless you're someone like The New York Times who can afford to outright protect the data with licensing...
It's just something that's been on my mind lately but I don't understand the finer details of it.
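For context on what "blocking GPTBot" usually means mechanically, here's a small sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URL are hypothetical, made up for illustration; this is not any real site's file. Only the `GPTBot` and `CCBot` user-agent names come from the discussion above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (assumption): block two AI crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Report whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/tutorials/some-article"  # hypothetical URL
    for agent in ("GPTBot", "CCBot", "SomeOtherBot"):
        print(agent, allowed(agent, url))
```

This only tells a well-behaved crawler what the site asks for. It is a request, not an enforcement mechanism, and it says nothing about copies of the same content syndicated on other sites, which is the problem raised above.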
It's going to be a walled garden dystopia, with everyone and everything asking to sign-up, subscribe or whatever to monetize content.
Scraping & siphoning content, ad blockers: both sides have been in an arms race for years. AI and LLMs will just be the last straw before almost anything of value is behind a pay/subscription wall.
Is all the data on the internet from 2010 to 2025 that much more valuable than all the data on the internet from 2010 to 2022? Better data yes, but more data up to the present day? Do we really need that to keep improving AI?
GPT is already capable of incredible generalized language understanding and I'd wager we've long since hit diminishing returns from raw internet data. RLHF, fine tuning, and better (and more data efficient) architectures are what we need now.
People have somehow conflated artificial intelligence with "oracle that knows everything" and so keeping models up to date with recent knowledge has become a must. Of course, that could be done by fine tuning techniques and better architectures that can outsource knowledge retrieval to tools (there is a lot of very interesting ongoing work on that topic), but even these approaches need to not be "blocked" in their data retrieval tasks to work well.
I'm sure you can do lots of interesting research using outdated datasets but for companies creating products this will not be sufficient.
Language shifts, so NLP models need to understand those shifts. For example, compare the use of "gay" before ~1980 and after. Some words change in spelling, like "to-morrow" present in works around 1800 changing to "tomorrow" in current usage. Some words are also coined, like "woke" or "enshittification".
If your data or models don't account for those then they can make mistakes. For example, if a model is only trained on modern sources (and does not know about Early Modern English 2nd person pronouns "thy"/"thine"/etc.) then it can easily get confused when determining parts of speech, which then affects other down-stream processing.
People involved in AI have incentive to say 'yes' to everything.
Nobody will make anything of value in public that they didn't want to be released for free anyway.
ChatGPT's vacuum has brought back a desire for privacy and will probably contribute to destroying piracy too.
ChatGPT has destroyed the 'study hard and get reward loop' for collaborative effort on the internet. If you use chatGPT, it absorbs all your question data and gives you nothing in return. You can't commit to random people, as they are expected to leak your IP onto gpt.
Isaac Newton using chatGPT would upload the core of calculus to GPT in research questions, and see no personal benefit for doing so.
There is no greater thief in the history of academic work than electronics.
Easy to work around. Contract someone outside of the jurisdiction to provide a dataset, then it's up to them to deliver it to you. You "weren't aware" of the data source until people outside the organization start shouting or the police start asking questions, but the model "doesn't contain any of the data" so you continue shipping the model/product.
You will need to train the model in the same jurisdiction too, to avoid any kind of intervention into the training process. Ideally that would be a shelter company that "sells" you "training services" in that jurisdiction.
I don't know a lot of judges, but shenanigans like that (maybe the second or third time) are a great way to get a summary judgement, and find your way to county for a few days to think about the intent of a law.
Big publishers are more than happy to settle with AI companies - they just want their slice of the pie after all. But who is going to protect, say, your Hacker News comments? Are you going to sue the AI company? Is YC going to sue? Are you going to sue YC for not banning crawlers in their robots.txt?
Why would I even bother doing that? I just write comments, it's not some ineffable wisdom. I'm writing them in public. I don't expect to profit from them somehow.
Heck, collecting whatever cents I might be owed for being a drop in the ocean is a losing move in my country.
My guess is the big players' hope is to steal enough content and then build a self-training LLM based off synthetic content (rehashed original works) before the theft part matters. Not sure how close they are to achieving it, but this seems to be a common SV gamble.
Steal or do something shady, raise enough money / power so by the time you're noticed, you have the money to win in the courts.
You already see the propaganda about “none of this should matter because the cure for cancer is on the way courtesy of AI.” I mean maybe it is, but it smells fishy to me.
> An LLM can be trained to find relevant knowledge online.
Why do you think chatGPT lost its Web Search plugin lately? Copyright lawsuits. You can't even use copyrighted content in the prompt because it will make the model makers liable.
But this doesn’t make sense - how is using ChatGPT to find information different from using a search engine? Especially if ChatGPT clearly lists its sources?
Given the already huge cost of training, and the evident lack of concern the LLM folks seem to have for copyright, why wouldn't the AI groups purchase subs to scrape the paywalled content?
They would possibly need to apply some effort to appear human, but that should only throttle the rate, not stop their scraping altogether.
It's more difficult to scrape paywalled content, no?
Clearly, places like Reddit have wised up to this and are making API usage non-free for example, so while it's not impossible, you can see the limitations being put into place already. Twitter is another one.
It seems like all this data is now considered gold and people lock up gold?
Books3 is the easy/fast/cheap method, but if the quality of the model really brings in some sort of revenue, is there anything stopping a company from buying/checking out of the library all these books, scanning and OCR'ing them, and adding them to the model the hard way?
I lean to the side of AI progress over copyright, though I fear AI becoming stupid if the incentive to create original works goes down, so I think society needs to figure out some social contract at least, like subsidies to keep new content flowing.
For instance blogs will stop posting if chatbots in search grab all the answers directly from their sites bypassing all ad revenue.
I don't know why you would have that concern. LOTS of people write because they just want to share. Now broaden it out to include speech. LOTS of people talk - it's what we do.
Progress in AI will happen because humans like to express themselves. The challenge isn't copyright. It's figuring out how to capture the vast content that just isn't getting captured. Also, this is really only an issue for "new intelligence" - if you really think there is such a thing. Personally, I do not. I think like 99% of all human intelligence is in the out of copyright corpus.
> LOTS of people write because they just want to share.
The blogging scene from the early millennium is now a shadow of its former self, and one of the most often stated reasons for abandoning blogging is “my site just wasn’t getting many views any more”. In a world where AI generated content abounds, there will be even fewer eyeballs on whatever one shares and therefore less feeling of reward for sharing. Moreover, the people still blogging are often loading their content with referral links, because in an economy full of glamorous influencers, even ordinary people are tempted to seek some financial reward for sharing beyond the mere pleasure of it. Fewer eyeballs due to AI competition means fewer people clicking those referral links.
Yah, blogs are already a rounding error in the corpus, and that has nothing to do with llms. Those of us who are still blogging are already doing it in spite of ~waves hand broadly at the world~.
I'm not truly sure that llms mean fewer eyeballs, though. They produce mediocre content in an arena where high quality content matters. There's already a massive pile of crap on the internet; it's already all about surfacing the relevant and the interesting bits.
> LOTS of people write because they just want to share.
You just gave up the "for a living" group, who arguably produce overall better content (of course there are exceptions), and focused on hobbyists. I'd call that a self-defeat.
But not all, and not most, produce intellectual content for a living. And those who do, you seem to be ok with ditching because, if I read you correctly, it's a small loss.
That's not what I was trying to convey, so I'll try again.
Humans generate a massive amount of natural language, and 99% of it never gets someplace where GPT/LLM training can consume it. If we can capture just a couple percent of that, then there will be no need for GPT/LLM to make use of content from those who don't want their writing to be consumed.
From a preservation angle, how big is Books3? Is it easy enough for mortals to mirror for that unlikely future where it might be possible to self-train reasonably good models from scratch if provisioned with data?
books3.tar.gz itself is ~37gb compressed. Often really the entire "The Pile" dataset (composed of both the mostly compressed archives, along with a ~450gb compressed jsonl compilation of the data) is being discussed. That's around 825gb.
Not speaking for ML, but I love jsonl as a distribution format. Lots of storage overhead vs a more appropriate bulk container, but it makes it trivial to stream, sample, concatenate, etc. Anything else (eg csv or parquet) is going to require better tooling and/or just a little shell magic to handle headers.
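To make the "trivial to stream, sample, concatenate" point concrete, here is a minimal Python sketch. The shard contents and the `text` field are made up for illustration and are not taken from The Pile; the helper names `stream_jsonl` and `reservoir_sample` are hypothetical as well.

```python
import io
import json
import random
from typing import Iterable, Iterator, List, TextIO


def stream_jsonl(handle: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, without loading the whole file."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def reservoir_sample(records: Iterable[dict], k: int, seed: int = 0) -> List[dict]:
    """Keep a uniform random sample of k records in a single pass (reservoir sampling)."""
    rng = random.Random(seed)
    sample: List[dict] = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


# Two made-up shards. Concatenating JSONL is plain text concatenation,
# since there is no header row or footer to reconcile.
shard_a = '{"text": "first"}\n{"text": "second"}\n'
shard_b = '{"text": "third"}\n'
combined = shard_a + shard_b

records = list(stream_jsonl(io.StringIO(combined)))
sample = reservoir_sample(stream_jsonl(io.StringIO(combined)), k=2)
```

With a real file you would pass an open file handle instead of `io.StringIO`; the point is only that each line stands alone, so nothing needs to know about the file as a whole.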
39,516,981,435 books3.tar.gz -- 36.5% compression ratio
108,371,325,720 tar -- uncompressed
Recompressed with 7-zip and xz:
25,221,357,605 b3.7z # with flags: -m0=ppmd (23.3%)
27,077,329,052 books3.tar.xz # with flags: -e9 (25.0%)
To see what other slower compressors could do, I checked results from a random 1,000 books. MCM would achieve about half the original tar.gz file size, but it's very slow.
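For anyone who wants to check the percentages, they are just compressed size divided by the uncompressed tar size. A quick Python sketch using the byte counts quoted above (the `ratio_percent` helper is only an illustrative name):

```python
# Byte counts copied from the figures quoted above.
TAR_BYTES = 108_371_325_720

archives = {
    "books3.tar.gz": 39_516_981_435,
    "b3.7z": 25_221_357_605,
    "books3.tar.xz": 27_077_329_052,
}


def ratio_percent(compressed: int, original: int) -> float:
    """Compressed size as a percentage of the original, rounded to one decimal."""
    return round(100 * compressed / original, 1)


ratios = {name: ratio_percent(size, TAR_BYTES) for name, size in archives.items()}

for name, pct in ratios.items():
    print(f"{name}: {pct}% of the uncompressed tar")
```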
Interesting. When it's copyrighted works in a digital form (plain text, ePub, whatever) it's a legal issue.
If you ROT13 the text, is the data still a copyright violation? I imagine so, since it's a trivial thing to restore it to a legally volatile form.
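To show just how trivial the restoration is, here is a small Python sketch using the standard library's `rot_13` text codec. The sample sentence is made up; ROT13 is its own inverse, so applying it twice returns the original text.

```python
import codecs

# A made-up sample sentence, standing in for any plain-text work.
original = "Call me Ishmael."

# ROT13 shifts each ASCII letter 13 places; punctuation and spaces are untouched.
obscured = codecs.encode(original, "rot_13")

# Applying the same transform again restores the text exactly.
restored = codecs.encode(obscured, "rot_13")

print(obscured)
print(restored)
```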
I understand that an unresolved issue is whether, once ingested into an LLM, the trained LLM is in violation of copyright. One wonders if human readers too are in violation of copyright for having been "trained" as well when they read a book.
Is a "brain transplant" from one LLM to another a thing? Perhaps just a trivial copy of the node weights or whatever they're called. That would not let the target LLM off the hook with regard to copyright violation, I expect.
But what if one LLM "taught" another. Maybe that is not a thing yet.
It would be interesting to see someone dig into the differences between the capture of language and culture, and control over how it gets fed back into cultural discourse, that we fear corporations achieving with commercialized LLMs, and the control pre-internet publishers had (and to a certain extent, still do), over what content, and in what format, language was disbursed and distributed. I am not at all certain that having your work used to train LLMs is a bigger threat to writers' ownership of their work than publishing houses were and are.
I also suspect that if we had an effective mechanism to prevent use of copyrighted work in training, it would not necessarily behoove an artist to opt their content out. Will you really want to be excluded from what may well become the canonical mechanism for searching and generating language?
Is there enough data on the web for an LLM to be agnostic about the source languages, Russian, Chinese, English, German, etc? Where training on Russian and Chinese and English and German etc sources would also incorporate enough information about translation that if the AI learned about some topic only through Chinese sources, it could still recognize/use/apply/express those ideas in English?
That’s a fascinating question. I’m not sure. It seems like if it learned about a topic in Chinese, it would be able to express it in English, but I haven’t seen this tested.