Training AI (lmnt.me)
54 points by zdw 5 months ago | 57 comments



It's unfortunate, though completely predictable, how Generative AI has divided people so solidly into camps. On one side you have people who use the word "theft" as a kind of shibboleth. On the other, you have tech leaders saying things like, "maybe those creative roles shouldn't have existed in the first place".

As someone who is a technologist but also has an art degree, I feel out of place in this conversation. While I totally understand the economic anxiety and philosophical discomfort around this new technology, I also see it as a kind of Library of Alexandria: a synthesizer built on the collective knowledge output of all of humanity. It's incredible!

But things didn't turn out so well for the first Library of Alexandria. Beware of mobs!


How can a company pose as a kind of library, i.e. use and charge for all those copyrighted works, while I can go to prison if I download a movie for free? Should I also pretend I'm my own library? That's my issue with these companies.


The Library of Alexandria never deprived authors of work, nor did it prevent anyone from earning a living from writing.

It would be incredible if it existed in an egalitarian, post-capitalistic climate rather than one where it deprives people of their livelihood.


It also burned down, incidentally, when Caesar set fire to a different part of the city and the fire spread to the library.

Unsure how "beware of mobs!" fits in there.

It's as if the comment is intentionally written to be inflammatory. I suppose that is unfortunate.


We need an Atlas Shrugged but against AI.


I don't feel out of place.

I feel frustration and, in some situations, contempt: with both sides, with the situation, with the collective decisions (and lack thereof) that got us here, with the hoi polloi getting distracted, yet again, by their corporate masters (1), and with a system eternally being ransacked by executives; executives so inept and profoundly sociopathic that they aren't even aware there's a quiet part not to say out loud.

(The contempt isn't evenly distributed, fwiw)

(1) https://x.com/doctorow/status/1804240404986184072?t=2z7-WEwB...


Yeah. Not just the sociopaths (who I think are a small minority): I think the window of societal awareness and conventions has shifted profoundly in tech, and we're very much in an echo chamber.

The tech industry today is pretty much 1980s greed-is-good Wall Street, screwing over everyone it can; the main difference is that today we don't realize we're behaving like the stereotypical coked-up bad guys.

We do tend to have some socially progressive ideas among tech workers, which is great. What's not great is anchoring our self-perceptions to that do-gooder side and then giving ourselves a blank check for society-destroying greed and obliviousness in other regards. (Can we please respect people's pronouns and also not screw over their economics, privacy, thinking, government, and general wellbeing?)


> we don't realize we're behaving like the stereotypical coked-up bad guys.

I recently came to the realization that the 80s could be fueled by cocaine because we lacked the tools to identify a coked-out maniac. People must have just seen such people as "engaged, enthusiastic, and passionate".

Now we have the tools to see the current snake-oil salesmen (this is, what, our nth cycle? VR, the gig economy, blockchain, AR, ...), but people are still defending them. It's difficult to watch again and again.


> Publishing text or images on the web does not make it fair game to train AI on. The “public” in “public web” means free to access; it does not mean it's free to use.

US copyright law (the very law that assigns that copyright) disagrees with this. "Fair use" means you can use a work in certain transformative ways, not merely "access" it.


The European Union's Artificial Intelligence Act allows copyright holders to reserve the right to "opt out", which US law still does not.

I think it is still very unclear, though, whether this will have any effect, or whether it is going to be a dud.

One weakness is that it does not specify how. There are a number of opt-out protocols out there, and with that variety it naturally follows that a site might use one protocol while a robot supports another. For example, I found that Cara.app uses DeviantArt's "noai" tag only in its HTML headers, but when I read through the source of Spawning.ai's "datadiligence" library [1], I found that it checks for that tag only in HTTP headers (a sketch of the mismatch follows the link). So even when a company scraping the web for content intends to comply with EU law, there is still the risk that it violates it.

[1]: https://github.com/Spawning-Inc/datadiligence/
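
To make that mismatch concrete, here is a minimal Python sketch of a scraper checking both channels. This is my own illustration, with made-up names (RobotsMetaParser, noai_signals), not the actual datadiligence API: a "noai" directive can ride in an X-Robots-Tag HTTP response header or in a robots meta tag inside the HTML, and a scraper that checks only one channel will miss sites that use the other.

    import requests
    from html.parser import HTMLParser

    class RobotsMetaParser(HTMLParser):
        # Collects the content of <meta name="robots" ...> tags in the page.
        def __init__(self):
            super().__init__()
            self.directives = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
                self.directives.append((attrs.get("content") or "").lower())

    def noai_signals(url):
        resp = requests.get(url, timeout=10)
        # Channel 1: the HTTP response header (what datadiligence reportedly checks).
        in_header = "noai" in (resp.headers.get("X-Robots-Tag") or "").lower()
        # Channel 2: the robots meta tag in the HTML (what Cara.app reportedly sets).
        parser = RobotsMetaParser()
        parser.feed(resp.text)
        in_html = any("noai" in d for d in parser.directives)
        return in_header, in_html

A site that opts out through only one channel comes back as (True, False) or (False, True) here, which is exactly the compliance gap described above.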


Read up a bit on that law; precedent is rather unfavorable to mass corporate use.


I guess it comes down to whether you think training an AI is like storing content in a database to be retrieved later, or like giving it to a person to learn from (and quote, be inspired by, copy from memory, etc.).

I'm in the latter camp, so I don't see why anyone would be OK with their content being seen by humans but not used to train an AI.


Why would we ever equate humanity's creative response to art with a machine's rote ingestion of data into mechanical output?

Morally, legally, socially, these are entirely different.


So far, ingestion has held up in court. https://www.theverge.com/2024/2/13/24072131/sarah-silverman-...


I never said it should be illegal; I said it should be treated differently in a legal framework.


It was simply a footnote.


Because there's money to be made by granting AI an equivalent legal status to human beings (which negates many copyright issues) and by normalizing the output of AI as being equivalent to that of human beings.

None of this is about the science. All of it is about the money.


If a human copies or traces over another artist's work and then presents it as their own, that's widely considered immoral and in some cases it's illegal.

The difference between taking inspiration from a work and copying it hinges on actions and intent, but in the case of AI both of those are obfuscated and we only see the end result.

I think this makes it very difficult to gauge whether particular AI work is problematic.


You cannot copy from memory. I have a semi-photographic memory of nearly all of Star Wars (all three episodes). Do you think that I can reproduce it and distribute it to others? Legally? Morally?


Yes you can, both legally and morally, and yet nobody tries to ban you from watching Star Wars.


How do you square that with Bright Tunes v. Harrisongs [1], which found liability for a tune that the composer unintentionally (by most accounts, and in the opinion of the court) reproduced from memory?

[1] https://www.americanbar.org/groups/intellectual_property_law...


If I recreate Star Wars from memory using Blender and then distribute it to others, do you think that is legal? Moral?


That isn’t what is happening with AI training though… If you apply that to AI, that is more like training using only Dan Brown novels and then asking the AI to recreate The Da Vinci Code.

How much of the original work you use in your transformation is a major factor in derivative-works analysis. So you could probably make a single Star Wars scene from memory using Blender, but not the whole movie.


I mean, we don't have to guess – this is what current copyright law protects against.

The question is whether it should be illegal even if you never profit from it, or somehow dilute the IP owner's brand.


>I mean, we don't have to guess – this is what current copyright law protects against.

I don't think this is clear to others, given that the parent comment answered Yes instead of No.


Well, totally fair because I also did not read very carefully. You said 'distribute' – and in that one word is where lawyers and judges earn their pay :)


I believe AI uses publicly accessible content similarly to how humans do: we discuss what we read, use it, and create new content from it. This process is legal and acceptable. The fact that an artificial entity is doing it doesn't make it wrong. Moreover, I think intellectual property shouldn't be owned. In today's world, intellectual works should automatically benefit society.


That’s fine, but then we need to stop privatizing the AI. If a coalition of companies (and maybe government?) were to train open model weights, and run those models on equal-access hardware (i.e. all residents of the country get N compute credits per year) then we’d be much closer to a system where “intellectual works should automatically benefit society” — however, that is not what we are seeing today.


Why? I don't need to equally and freely share the expertise I develop by consuming publicly available information. In fact, I personally profit from it. Should I compensate every YouTube creator, author, and journalist for the money I've made in my career that their publicly available work contributed to, in terms of my learning/education?


> AI uses publicly accessible content similarly to how humans do

It absolutely does not. A bunch of companies with pretty low ethical standards are scraping everything they can find. The people who wrote all this content never consented to this, and many of them explicitly opted out.


Since this is news.yc, I get to be a little pedantic to begin with. In this case all the little mistakes start to add up quickly.

- 'We own what we make the moment we make it.' ... you hold a copyright.

- 'Publishing text or images on the web does not make it fair game to train AI on'. ... a postulate without further grounding elsewhere in the text.

- 'The “public” in “public web” means free to access; it does not mean it's free to use.' ... Actually, it is, barring restrictions on copying, trademarks, etc., as per (implementations of) the Berne convention.

- 'Someone who publishes my work as their own (theft)'. No, this is a) a copyright violation or b) plagiarism.

- 'or republishes my work (like quoting or linking back)' This depends, but linking is not (re)publication of your work. Quoting counts as fair use copying.

- 'doesn’t have the right to make the choice for me to let my content be used for training AI'. A link is not your content; you have no rights to it, period. If someone makes a fair-use quote of your content, they may do so, as a quote, under fair use. I don't really see quoting as reposting, since reposting implies a substantial amount of copying; if the quote is not substantial, an AI would be permitted to use it in a similar manner. However, since an AI training system does not typically retain a copy of your quote, copyright probably doesn't apply at all (exceptions excepted).

- "Whether reposting my content elsewhere is in good faith or not". Previous examples were not of reposting (probably) but if they count as reposting, then you have standing to sue the person doing so either way. AI training is a red herring here.

- "To add insult to injury, that person may not have the knowledge—or even the power—to do so if they’re posting content they don’t own on a site they also don’t own, like social media.". Let's make a difference between ownership and holding copyrights. If you post things, you must hold the copyright, or have permission from the copyright holder. If you do not hold the copyright holder, posting content to a site you do not own is not permitted by law, and very likely a violation of the TOS of the site where you posted.

- "I can play whac-a-mole with those bots on servers I control—which I don’t like doing, for the record—but I have none of that control anywhere else." ... sure, but you seem to be pulling together disparate concepts and several misunderstandings of law here to begin with. Your conclusion(s) do not follow from the premises, given real world laws, treaties, and customs.


“Someone who publishes my work […] or republishes my work […] doesn’t have the right to make the choice for me to let my content be used for training AI.”

Fair enough, but what does that have to do with the default value of robots.txt? Opt-in or opt-out, they’re still making the choice for you.


"Owning ideas" is a relatively new concept -- intellectual property is a few centuries old.

"Ownership" is completely opposite to a natural state of things, and can only be enforced with force. Mix that abstract concept with the abstract of information for extra complexity. Add to it that Information can't be "owned" in the same sense that physical property can, and the fact that (if not copyrighed), it's a non-excludable, non-rivalrous good (it can benefit everyone, equally, at the same time).

On the other hand, it's nice to incentivize innovation, competition and creativity, or at least, not hinder it.

It would be very interesting to go to 2124 and check if these concepts are still alive, or if somehow we have evolved our social norms to be closer to the natural world. Once you convey information to any entity outside of your brain/body, it naturally belongs to the world.


And copyright definitely isn’t ownership. You don’t own the work. You have a limited exclusive right to how the work is used, distributed, or monetized.

A right is not ownership.


Why is training humans different from training AI, then? If we make a few edits to the text:

> "Someone who publishes my work as their own (theft) or republishes my work (like quoting or linking back) doesn’t have the right to make the choice for me to let my content be used for educating a student. This is where I struggle the most with the “opt-out” style of student education on the web."

Imagine a student who gets into a top-ranked research university, studies lots of material from proprietary textbooks, takes in hours of instruction from professors who have strict IP contracts with their university, and then graduates - can they then go produce for-profit content built on that knowledge base without paying a fraction of their earnings to the university that trained them?


John Gruber's article, which this one is a response to: https://daringfireball.net/2024/06/training_large_language_m...


> The whole point of the public web is that it’s there to learn from — even if the learner isn’t human.

What a joke.

Right, right: in 1989 Tim Berners-Lee and team all said to each other, "We build this thing to learn from, even if the learner isn't human."

And lo, the world wide web was born.


Even though I'm an Apple fan myself, he's tough to stomach.


Has there been any progress on this at all since ChatGPT? I remember when the Zoom story about their TOS broke, and it got covered on morning news in the USA.

But Zoom was about using videos for AI training; everything else is about text. And it looks like nobody really cares about text, except maybe the NY Times, which is suing OpenAI.

What is the deal with this? It pisses me off that a few major companies are about to gobble up the entire internet and capitalise on it solely for themselves. They're not going to give anything back to anyone; they even want you to pay a monthly fee just to get access to your own data.


>and capitalise on it solely for themselves.

It benefits millions of people who use these models every day; even the most powerful ones, like GPT-4o and Claude 3.5 Sonnet, are available to the public for free (with some rate limits).


I don't know that "nobody cares about text"; it's more that they're waiting for the NY Times to set a precedent they can use to crush OpenAI. OpenAI is also making content deals with a bunch of publishers to allow it to use their work for training, so maybe people don't care as long as they're still paid?


I hope the irony of "the generation of piracy becomes the knights of copyright" in the wake of AI training isn't lost on everyone else.

Not saying I didn't pirate stuff or wouldn't do it today, and I'm not calling out this particular author. Just recognizing that people, by and large, are simply self-interested while wearing a cloak of morality.


Is it the same generation? I grew up in the Napster, MIT license, software-patents-are-bad, "information just wants to be free, man" days, and have more or less the same views today, particularly when it comes to AI model training and such.

But time moves on, the internet skews younger, and Gen Z is all the rage now, and starting to set talking points. I think when all your music has always been streamed for free or like $8/mo, playing by the rules feels different.

But I am curious how many former "intellectual property isn't property" folks are now on the side of "scraping web content to train a model is immoral".


I bet the Venn diagram for those two overlaps a lot. Both are black-and-white perspectives on very complex, nuanced topics.


This is what I've been knocking around inside my head. I'm part of the LimeWire and BitTorrent generation. I remember everyone in high school and college pirating everything they could, from music to textbooks and professional software.

I personally believe large corporations won't be able to control the productive AI market without regulatory capture. Creating models is just too easy once you have the data and the compute. I see all these pushes for regulation as red herrings that will likely result in cementing the current leaders in place rather than protecting the people who created the IP that went into their products.


> I see all these pushes for regulation as red herrings that will likely result in cementing the current leaders in place.

The unfortunate truth is that the large corporations were going to do this with or without the legislation; that's a separate, largely independent issue.

At the core, scale and control over data ultimately make or break these technologies, and large players (hiring the few talents capable of properly utilizing these technologies) are simply better equipped to leverage the markets.

The difference is intent, and the public commons. At least with the legislation, the public commons does not become fair game for corporate plundering, just as the public cannot (legally) plunder "corporate" data.

Without the legislation, corporations simply own and control all your data, unless you do a great deal of explicit work, both technical and legal, to defend your rights.

> I'm part of the LimeWire and BitTorrent generation

Myself as well. However, this is where intent comes into play. Bringing data into the public commons as quickly and as often as possible is most useful to the public, not to corporations. Obviously, some of that public are in the employ of corporations, but generally the intent of said generation was to "pirate, then buy". Studies on the matter bore this out: free advertising and increased sales (for large-scale ventures, at least). Smaller vendors had trouble recouping costs, it is true, but attempts were made to build in compensation mechanisms for those cases as well.

In both cases, what is common is the intent to benefit the public as broadly as possible, either through legislation (which is intended to serve the public interest) or through broadening the commons (likewise). And in both cases, the technology employed suffered, and suffers, gaping harmful holes that may not be possible to address except by some form of legislation or recompense.

The fact that corporations can and will exploit legal frameworks should not however, be a reason to think of all such efforts as "red herrings".


I really appreciate this well thought out reply. It's helped frame a few of my thoughts better. Thank you.


> The “public” in “public web” means free to access; it does not mean it's free to use.

I strongly disagree with this. The public web is as much "free to publish" as "free to use." If you don't want people to have unrestricted access to your content, put it behind a paywall, throttle access server-side, or just don't publish at all.

The whole copyright system is already a step too far in the modern age when the marginal cost of personal copying is zero, I refuse to further give ground on freedom of association, which is what the internet is at its core. If someone republishes your copyright content, you can use the existing legal mechanisms to get it taken down and recover damages, but if you don't want people to use the content, don't publish it openly on the internet.


… and this is how you lose an AI war to an enemy who has been ignoring copyright laws for decades. Unless it's clear that "more data" does NOT produce faster progress, it's just safer to allow indiscriminate scraping.


I don't see how opt-in solves the problem of people posting content they don't own. They could opt in your content just as they can currently post it on their own website.


At least with opt-in, you would be able to prove that you were allowed to train your model on the content. With opt-out, that is impossible: you cannot prove that something does not exist.


I wonder what happens to content posted on, say, Instagram, Reddit, forums, etc.

They usually have a clause about getting a license to all the content posted on their site, including the right to sub-license it, modify it, and create derivative works.

Doesn't that basically mean that once content is posted there, they can do whatever they want with it? What happens in those cases, where the creator has given permission (although perhaps unknowingly)?


Reddit has already made licensing deals about user content with Google: https://www.reuters.com/technology/reddit-ai-content-licensi...


It's been almost three years since the release of GPT-3, and there still isn't any regulation or standardization for opting out of (or in to) AI training.

Can we expect any changes in this area at all? Can anyone point me to some resources?


Regulation: the recent European Union Artificial Intelligence Act [1], in article 105 and together with EU directive 2019/790 article 4 [2], establishes that copyright holders have the right to opt out. The US has no such regulation.

Standardisation: not really. There are a bunch of novel protocols, but apart from robots.txt (which can only name robots you already know; see the example after the links), there isn't any established standard. The Open Future think tank has posted a couple of overviews of these [3] [4].

[1] https://www.europarl.europa.eu/doceo/document/TA-9-2024-0138...

[2] https://eur-lex.europa.eu/eli/dir/2019/790/oj

[3] https://openfuture.eu/publication/defining-best-practices-fo...

[4] https://openfuture.pubpub.org/pub/considerations-for-impleme...
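
To make the robots.txt approach concrete, an AI-training opt-out might look like the sketch below. GPTBot and Google-Extended are documented crawler tokens (OpenAI's and Google's, respectively), but any bot not named here is simply unaffected, which is the "robots you know" weakness mentioned above.

    # robots.txt: opt out of AI training, one known crawler at a time.
    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    # Ordinary crawlers (e.g. search indexing) remain welcome.
    User-agent: *
    Allow: /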


These training-dataset arguments, which act like "fair use" isn't a real thing written into copyright law, are getting old and intellectually disingenuous.

This statement is false in so many circumstances: ‘The “public” in “public web” means free to access; it does not mean it's free to use.’ What is access without use even?

Because there are plenty of ways I can use that public content without the copyright holder having any say. It could be satire, quoting sizable sections of text, making a parody, teaching, news, etc.

Maybe you don't think AI training should count as fair use, but there are already countless ways to create transformative/derivative works from others' copyrighted works without their consent.

This isn’t anywhere close to the black and white problem so many make it out to be.


Self-important drivel.




