This is intentional, as I think OpenAI is scared of legal blowback.
Cutting edge AI models and datasets have largely been "for testing" and "for research" ever since they existed. Usage rights were extremely low on the researchers' list of concerns.
But all of a sudden OpenAI and such are trying to commercialize models that (unlike a classifier or whatever) are very capable of stepping on rightsholders’ toes...
Hence I suspect all the talk of trade secrets is just a front.
Keep in mind that the architecture of LLMs is very simple, so there's not a lot of "secret sauce". What little there is, is jealously guarded.
This is why OpenAI is keeping the parameter count of GPT-4 secret. It could be something stupid huge like 4 trillion parameters, which is the "secret" that makes it work so well. Or maybe it's just a few hundred billion, and they've done something else to make it smart.
Just knowing the numbers might be sufficient for a competitor to create a GPT-4 clone. Without the numbers, they might have to go through a process of trial and error. At these scales, even a few extra training runs could cost millions of dollars and delay competitors by months or even years.
Well yes. This is precisely the type of info that OpenAI is keeping secret, because it could turn out that it's smart because it's huge OR smart because it is RLHF-ed to death. Knowing which is the secret sauce might be sufficient for a large org to reproduce it.
E.g.: if the secret is that it has been trained on tons of "books", then Google could just throw Google Books at Bard 3.
(I put together a dataset of 190,000 books that I called books3, which llama eventually trained on. Usage rights are a big interest of mine, primarily because there’s a weird disconnect of people trying to claim copyright over models when the underlying data obviously wasn’t copyrightable.)
A model (if trained right) will not include the training texts verbatim, and will arguably be a big enough transformation over the source material to be copyrightable in its own right.
IANAL, and I’m not saying this is how it’ll play out in the courts, but I don’t see why this is a “weird disconnect”.
I suspect the challenge is avoiding cases where answers are overfit. Can you get it to give you the full text of a chapter in a book that it has read, for example?
It'd be easy to imagine models that accidentally made this possible.
Thanks for your work! I had already downloaded the entire dataset. But I wonder what the criteria behind book selection were? I found it difficult to understand in places*.
* Sometimes a book has several very slightly different versions. A lot of fantasy. Sometimes many parts of a series are included but not all of it (I can imagine an argument for a sample, or for the series in its entirety). Commentaries about important philosophical arguments like Rawls' 'A Theory of Justice', but not the work itself.
It was bibliotik. Embarrassingly I didn’t even think that duplicate books could be a problem. The llama folks had to dedupe the books themselves. Someday I’d like to do that too and release a cleaned version.
Basically, the-eye.eu was at one point hosting all of bibliotik, so I downloaded all the epubs and converted them to text. I still have those epubs (incidentally thanks to Carmack, who through a convoluted process managed to save them and send them to me via snail mail) and I’ve been considering releasing them so that you can filter the books yourself.
We know for a fact that GPT-4's RLHF makes the model worse by incurring an "alignment tax." Microsoft Research and others had access to GPT-4 pre-RLHF and all reported that it was far more capable at that stage, and that it lost capability as RLHF/alignment training checkpoints came in.
I would bet a boatload of money that all the 3-letter folks do, in fact, have access to the pre-alignment models, maybe with whatever minimal RLHF is needed to make them usable in various interfaces. The public offerings are intended to make the unaligned models more powerful by gathering training data.
code-davinci-002, the gpt-3.5 base model, was available for a while until it was removed recently. Access is still available for researchers. Various researchers and other entities have access to the GPT-4 base model as well.
Agreed. If LLMs did predict only a single word ahead, they would repeat themselves, if not in the next sentence then at least in the next paragraph. LLMs certainly retain memory of what they’ve already said. They are most certainly NOT Markov-1 models, despite the popularization of that oversimplification.
I thought it was common knowledge that LLMs have a context of X tokens (8192 in the case of GPT-4) and use those as input to predict the next token probabilities.
Still, repetition is a problem; that's why they introduce a penalty for generating the exact same token: when picking the actual token to output from the list of probabilities, they penalize tokens already in the context, so the model is less likely to repeat itself.
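For anyone curious, here's a rough sketch of that kind of repetition penalty applied to a logits vector before sampling; the penalty value and toy vocabulary are made up, and real implementations differ between vendors:

    import numpy as np

    def apply_repetition_penalty(logits, context_token_ids, penalty=1.3):
        """Downweight tokens that already appear in the context
        (in the spirit of the CTRL-style repetition penalty)."""
        logits = logits.copy()
        for t in set(context_token_ids):
            # Positive logits are divided, negative ones multiplied,
            # so a repeated token always becomes less likely.
            logits[t] = logits[t] / penalty if logits[t] > 0 else logits[t] * penalty
        return logits

    # Toy example: a 5-token vocabulary, token 2 is already in the context.
    logits = np.array([1.0, 0.5, 2.0, -0.3, 0.1])
    penalized = apply_repetition_penalty(logits, context_token_ids=[2])
    probs = np.exp(penalized) / np.exp(penalized).sum()
    print(probs)  # token 2 now gets less probability mass than before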
They really do compute just a single token at a time, but the input to that decision is all the tokens in the context window - so that's usually around 8,000 tokens (approximately 6,000 words).
You can experiment with smaller LLMs on your own devices to get a better feeling for how that works.
I don't think anyone's saying they're Markov chains with context length 1. In all the critiques I've read no one's even come close to articulating that.
But they are Markov chains with context length N. They're approximating (implicit) Markov transition matrices through a fancier method of computation but they still are just generating one token at a time.
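A toy illustration of what "Markov chain with context length N" means in practice: the next-token distribution depends only on the last N tokens, and generation is just a loop. The next_token_distribution function below is a stand-in for the real network, not anything any vendor actually uses:

    import random

    N = 8  # context length; real models use thousands of tokens

    def next_token_distribution(context):
        # Stand-in for the transformer: any function mapping the last N
        # tokens to a probability distribution over a tiny vocabulary.
        random.seed(hash(tuple(context)) % (2**32))
        vocab = ["the", "cat", "sat", "on", "mat", "."]
        weights = [random.random() for _ in vocab]
        total = sum(weights)
        return {tok: w / total for tok, w in zip(vocab, weights)}

    def generate(prompt, steps=20):
        tokens = list(prompt)
        for _ in range(steps):
            context = tokens[-N:]  # only the last N tokens matter: order-N Markov
            dist = next_token_distribution(context)
            tokens.append(max(dist, key=dist.get))  # greedy pick, one token at a time
        return tokens

    # With greedy decoding and a finite state space this toy model eventually
    # falls into a repeating cycle - exactly the repetition problem noted above.
    print(" ".join(generate(["the", "cat"])))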
I find this a super interesting topic: My prediction is that we will eventually see a huge backlash against this copyright whitewashing. I think the current status quo is unacceptable but I fear the upcoming legislation will throw out the baby with the bathwater.
One reading of the latest US Supreme Court ruling on the Andy Warhol estate's case is that the current legal frameworks already prevent this.
The ruling focused on what the commercial use case was, not on the degree of transformation in Andy Warhol's artwork. The opinion appears to say that because the Prince painting was not explicitly licensed for the commercial use case of reproduction in a magazine story, and could conceivably have been substituted, for the purpose of accompanying that story, by the original artist's photograph, it violated copyright. There was even an explicit distinction drawn between this work and the Campbell's Soup paintings, which were characterized as social commentary and therefore fair use.
There are likely class action lawsuits brewing based on this ruling because today when commercial artists in agencies create layouts, they generally license images & artwork.
The way I read that, it was only ruled in favor because they were directed to recreate x photo with some changes. If the Scream painting were still copyrighted, it sounds like telling an AI to "create Edvard Munch's The Scream just with lighter colors" would be a copyright violation - but something like "create a new painting in the style of Edvard Munch" would still fall under "artists can't copyright a style".
The output isn't the only infringement, arguably. You could also argue (and I expect good lawyers will), that the numeric representation of the artist's works inside the model is already an infringing copy. (Just like the JPEG bytes stored on a server, even without them being blitted to a screen.)
Getty has alleged that Stable Diffusion sometimes returns some of their copyrighted images[1]. Even if the model seems too small to directly store the images, it seems at least plausible to me that the parameters can act as a form of compression, such that the model could output an almost direct copy of an original. I have certainly seen Stable Diffusion emit images which look like a Getty watermark has just been blurred out.
It doesn't store the original images, but it has learned what the Getty Images watermark looks like and where it's located, because it has been repeated millions of times. So it can sometimes return that.
This is why it's important to clean up the training dataset. To remove duplicates, images containing watermarks, images that are too similar to each other and so on.
Models don't store any artist's works. They are way too small to do that.
I have close to no knowledge of this subject, but I find it very curious and I'd like to know more, because it seems to me they don't store it, at least not in the traditional sense. For example, you can procure a quote. I asked (ChatGPT):
- "In Game of Thrones what did Jon Snow say to Arya when he gave her the sword named 'needle' ?",
- and it answers:
"[...] "Stick 'em with the pointy end. [...]"
Then it indicates to me that the information is there. Maybe we should consider that the model actually stores the information, but compressed? Could you ask Midjourney to recreate the Mona Lisa or Mickey Mouse? The right information can't just appear out of thin air. If I recall correctly, someone had some success in identifying and modifying the right neurons or weights of some LLM, which changed its "opinion" on where Rome is located?
This is not precisely true. It's been shown that image models can reproduce certain works almost exactly (up to very minor differences). It takes some effort to find such pieces but they exist.
It takes a lot of effort and the results aren't that great (low resolution, bad hands). It's way easier to just find the original image and use that instead.
For people who are unaware of this case, here[1] is a writeup.
TL;DR: Andy Warhol used a photograph of Prince by Lynn Goldsmith to create a series of original silk screen artworks called "Orange Prince". The supreme court ruled (as I understand it) that Warhol's use was not sufficiently transformative to negate the rights of the original copyright holder.
It's more complex than that: they are trained to model language and the understanding of a concept so as to match the author. There is not enough memory in the network to remember the actual text.
It's like giving a talented artist a day with a painting, and then a day later asking them to precisely copy the painting from memory. Would they come close enough for it to be considered a forgery, or would it be a transformative reinterpretation? It will probably depend on the skill of the artist, and I feel it might go either way.
Whether a human artist can do this is largely irrelevant to the question of copyright, however. Selling a close copy of another artist’s work (unless they’re long dead) would likely be infringement.
I think the distinction is that you don't arrest someone for merely having the capability of reproducing a work, but after they've reproduced it and attempted to use it in an infringing manner. I don't know how applicable the analogy is given that LLMs don't "think", of course.
You might say that, but the literal benchmark of LLMs (or any supervised learning algorithm, for that matter) is loss: how much 'distance' there is between their output and the validation set, when seeded with the training set.
With a loss of well below 1%, which is typical, it means it can pretty much recreate the training data.
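For reference, that loss is usually measured as the mean cross-entropy of the model's next-token predictions over held-out text; a minimal sketch with an openly downloadable model standing in (the model name here is just an example):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")           # example model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def validation_loss(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)  # labels are shifted internally
        return out.loss.item()            # mean cross-entropy per token

    print(validation_loss("The quick brown fox jumps over the lazy dog."))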
Do you think people should be allowed to learn from books we read, music we hear, etc?
If we’re extending copyright to be not only about reproduction of the work, but also providing part of a large knowledge base for future creators to build upon… why does it matter if it’s a neural network or human being doing the learning?
I personally hope the copyright maximalists fail, because if they succeed in making “learning from a copyrighted work” without paying illegal, we end up in a dark place indeed.
I think AI aren’t people, and that they’re not even AI: it’s machine learning, and ML is a subset of AI that we haven’t really broken out of. ML is a tool, and asking whether ML should be allowed to learn on copyrighted work is equivalent to asking whether your computer is allowed to contain a copyrighted work: “it depends.”
I’m in favor of getting rid of copyright entirely, since it’s become a far cry from what it was intended to be, but we have to acknowledge both sides of the arguments in order to proceed.
I agree we should either get rid of, or at the very least overhaul, copyright.
However, the line in the sand I draw is giving any kind of human-based rights to ML or AI. People already are thinking ChatGPT is close to human intelligence, and we know we haven’t even seen what’s truly possible.
My fear is that the typical oversimplification of complex technology will lead us to giving human rights to programs, and lead to a whole host of issues.
We need to stop anthropomorphizing technology until we really hit a philosophical quandary. Giving in too early will just give power to less-than-polished tech and either slow down research or further drain a country’s resources.
The rules are already different for humans and machines. If I memorise a copyrighted book then that’s not copyright infringement, but if I store a copy on a hard drive then that is. You ask why distinguishing between humans and machines matters and then answer your own question: “we end up in a dark place”.
IANAL, but as I understand it, the difference is because copyright is about the transmission of the work.
If you memorize a work, you have used the work in accordance with copyright law: you were authorized to read the work. If you write out a copy of the work you memorized and give it to someone, you've broken copyright law.
If you store a copy of a work you were authorized to have, you have not broken copyright. However, if you download a book that you have no authorization for, that is a violation of copyright.
In other words, it's not about where the work is "stored" or whether or not humans were involved, it's about the transmission of that work between parties.
It still seems odd to me that the standard advice for "how do I avoid copyright issues with example code" is "rewrite it in your own style", and we're happy with this, but now that we have an algorithm that can do this (using a stupendous amount of calculation, it's not like these things learn for free... yet...) this is suddenly somehow cheating.
I think the protestations about copyright aren't entirely honest. People aren't mad that LLMs are "stealing their work", they're scared that LLMs are copying their ability to produce such work.
It's some of both. Stable Diffusion will rarely blatantly copy a work, as it has no distinction between creative reconstruction and plagiarism. I think this applies to LLMs too, it's just much harder for human brains to detect.
At the same time, the artists and such who claim any of its output is illegal are being hypocritical. Nothing is made in a vacuum.
In terms of actions, the US has gone as far as to kneecap a competitor's chip industry and restrict their GPU access.
I reckon that behind closed doors there will be a strong sense that any kind of regulation that impedes AI development would not be in the national interest.
I would argue - if the fair use argument holds up - that the most substantial value of these models lies in their training dataset.
It would be morally reprehensible for anyone to block training open models on the data provided by the public at large. Such legislation would be totally unacceptable.
Is it not as "morally reprehensible" that I have to restrict my use of the internet now in order to avoid having my works used in that way? That reduces the amount of information available to everybody, after all.
But to reiterate existing talking points, using copyrighted works for AI weight generation is exactly the same thing as how human artists learn skills just because some people refer to the process as “learning” or “training”, and any counterarguments to this are meaningless fearmongering outdated Luddite rages, besides such uses are fair uses and transformative uses and explicitly allowed in some jurisdictions under specific terms not allowing blanket rights to regurgitation so it’s all lawful and financially fully exploitable.
> which are substantially similar but not copyrighted.
If it's considered substantially similar, then why is it not copyrighted? The difference must also be substantial. Therefore, it should be acceptable as a new work, rather than a derivative work.
Actually, I only realize now, that it is a mistranslation from my mother tongue. What I meant to say is probably better expressed by 'content laundering' analogous to 'money laundering'.
Or maybe, maybe the people in charge will wake up, do their job, accept the challenge and make modern copyright laws? Oh, forget about it, we will forever live in the '90s.
> I find this a super interesting topic: My prediction is that we will eventually see a huge backlash against this copyright whitewashing. I think the current status quo is unacceptable but I fear the upcoming legislation will throw out the baby with the bathwater.
Worse, they have the potential to push development toward countries that do not have such laws.
They forced others out of the skunkworks, so I am fine with it. Forcing companies like Facebook into leaking the weights of their models because they are over-eager to release something is no small feat. If for nothing else, I applaud them for that.
No, it is not insane at all: a computer can generate millions of derivative copies in a second, while a human has limited time for learning and for making artwork, so it doesn't make sense to compare them, not in the slightest.
If suddenly there was a race of humans who could read and retain entire books in a matter of seconds (and other similar feats) the implications would be almost the same, as it would break basic expectations of time and personal resources that our society relies on, and a lot of people would be rightfully worried about such people entering the workforce.
That's a different discussion, with way more nuance and wider ramifications than "allowed" or "not allowed". If tomorrow Adobe or anyone else were suddenly able to create a model that could replace any existing job position, how should it be managed? The societal ramifications are complex; it is not binary, and it should be discussed by politicians and intellectuals, not by corporate marketing teams.
They can already replace stock image sites. Why pay getty/shutterstock an obscene amount of money for a photo if you can simply generate something similar with a prompt for free?
I checked how much some images cost on getty images - and their prices are ridiculous - 100-1000eur or more. No wonder they are trying to sue stability.ai now.
That it is not a factor may have something to do with the fact that the law was created before even sci-fi authors could have predicted machine learning and the huge difference it makes for creating artwork.
Generating synthetic copies of someone's voice, or even combining voices (e.g. a CG voice that is equal parts Kurt Cobain's voice and Layne Staley's voice), is also not currently a factor in copyright law, but it will be.
Computers can do all sorts of things humans can't at inhuman scale and speed. Why is AI leveraging the same training data humans do inherently bad just because it can do it better and faster than humans?
Yeah, I'm a programmer, I'm aware of all that by definition. The issue here is that it's getting to the point where it is inevitable that it will replace a huge number of job positions in a tiny span of time, in ways society is not prepared for. If you think the number of jobs at risk is anywhere near a scale we have seen before, you haven't been paying attention. Of course, the flaw here lies in society itself and what it means to earn a living, and not really with ML researchers or Adobe or any other company creating those AIs.
Wait, so do your same opinions about copyright for AI also apply to copyright for corporations? How is employing 1000 people to generate derivative works any different from using AI?
Because they have to hire those 1000 people and all that entails? They will have to weigh whether it's worth it to pay that many people, and if in the future they ever want to generate even more images they have to hire them again, instead of just re-running a program which costs them near nothing. The more I think about it, the less similar the two processes are.
>Echoing a recommendation from Joe Biden’s administration, the supreme court focused on the specific use that allegedly infringed Goldsmith’s copyright – a license of Warhol’s work to Condé Nast – and said it was not transformative because it served the same commercial purpose as Goldsmith’s photo: to depict Prince in a magazine.
So if you're making the argument that AI is a human when it comes to copyright, does that mean that its output is copyrightable just like a human's?
Does that mean that OpenAI/SD whoever now owns the work created in such a way?
Only if the output is definitively a derivative work of an existing work. If it's the culmination of a thousand photos, it would be akin to a human spending a month studying a certain historical artist's art and painting a modern Tesla in their style.
But we're not talking about humans, we're talking about software intended to put the artists whose work it was trained on - without permission or compensation - out of business, or which is at least capable of devaluing their work with its ability to copy styles (literally by name.)
Even if one could argue (as I probably have) that AI is more than simply a "stochastic parrot," a philosophical argument about whether it resembles a primitive mind (and therefore whether its output can be considered "art" in any form) doesn't answer a legal argument about copyright infringement. Humans are allowed to be inspired by other humans, software used by other humans is not.
Also you shouldn't delete your comment just because it gets downvotes and repost it. Bad form.
The issue with comments like this is that many people are talking about the law as it is, and others are approaching it from a moral/solution-based point of view.
Legally, AI is likely free and clear, since the work it outputs is vastly different from any one art piece, therefore not being a derivative work of anything else (unless you tell it to copy an artist's style AND ask it specifically to remake an artist's existing work with the intent to recreate it).
But morally, there are indeed externalities that can/should be thought of, and possibly even put into law, but starting a discussion on this front is a losing game, since everyone will have a different opinion on how much we should restrict, and whether it will even work (if you consider that China and Russia will forego any sort of copyright protection for AI generation, these laws could enable those countries to surpass the rest of the world in capability and quality).
The bigger argument is in the form of cost of reproduction.
A human has finite time and output volume.
A machine has no such limits, only compute/memory/network/disk.
Ergo, gray areas where we could afford a human the benefit of doubt ("What? They can paint 4 in-the-style-of-Rembrandts?") produce substantially different results with machines ("Here are a million in-the-style-of-Rembrandts, produced in 30 minutes").
Whether we should permit or limit machine art is a discussion worth having -- but human copying vs machine copying is clearly a very different discussion.
Right, this is why we should burn the combines and go back to harvesting wheat by scythe, or we should dig ditches with spoons instead of using an excavator...
The problem here is we want the system we currently have to keep working when there is really no possible way that can happen in a world of AGI. IP is already a horribly broken system that can lead to absurdities. While making IP stronger will protect people like artists (maybe) it will provide far more protection to monied corporations that can afford to buy up works, and then feed them into their AI. In the pathological cases you end up with stories like 'The Right To Read'.
In a world where particular things can be created in almost infinite amounts, why will we go to such efforts to ensure they are only created in limited supply?
Yes, it is. But this increase in scale is going to be something that produces progress (a la cheap imitations of good styles that already exist). And with an unprecedented amount of "generated" works, originality is going to be highly sought after, and those who do produce original content will be capable of succeeding.
And yet, once they succeed, they cannot rest on their laurels as the AI will be very quick to reproduce that "originality" - therefore, forcing the artists/creatives to continuously come out with highly original works all the time.
This is almost precisely the argument made by the Luddite movement. “Intended to put artists out of work” is a tellingly emotional way to describe a transformer NN which self-evidently is “intended” to do no such thing. I can imagine 19th century textile workers terrified of mechanical looms using very similar language.
> Humans are allowed to be inspired by other humans, software used by other humans is not.
This absurd statement is not supported by any law you can cite. It’s nothing more than wishful thinking.
It’s hilarious that people confidently compare the anti-AI crowd to the Luddite movement, while at the same time demonstrating they know absolutely nothing about its actual history. Workers were not “terrified of mechanical looms,” but outraged that they were used to reduce pay and working conditions, instead of improve them. Of course the same will be true of AI, but I don’t expect the “can’t be bothered to even read the Wikipedia entry” crowd to understand this.
It's funny in a sickening but not unexpected way that a crowd that professes to be more learned, and interested in learning, than most others, chooses to ignore this on a consistent basis. They know they're building a world of pain, they just think if they ignore it and feign ignorance of it that they can launder their consciences.
> Workers were not “terrified of mechanical looms,” but outraged that they were used to reduce pay and working conditions, instead of improve them.
This is consistent with what I wrote. They were terrified of their jobs vanishing and sought to stop this. Ultimately it’s not realistic to prohibit useful new technology, even if it has harmful effects on some groups.
I’m sympathetic to people that are fearful of new technology that will obviously have a major impact on our society but there’s a reason Luddite is a shorthand for futile technological obstructionism.
It’s not a matter of pro vs anti AI and I wouldn’t put myself in either camp, it’s a matter of naïveté vs realism about how the world works.
The ignorance of history continues. The Luddites were not naive, but less successful in labor organizing than, say, their French counterparts, who quite successfully resisted downward pressure on wages and working conditions, allowing a more gradual transition to the new technologies. As a result, the French likewise did not suffer much of the injury, cruelty, and dispossession that was inflicted on their counterparts in the English working class. The Luddite movement is more popularly known for precisely this reason: it was crushed and humiliated by the ruling class, and thus became useful for those interests to cite as an example of what happens when you oppose them, vs. those who practiced solidarity, overcame them, and preserved more of their dignity and quality of life.
Letting the wealthy and powerful inflict whatever they want on you is not realism. It’s cowardice.
People are using the term "luddite" as a means to trivialize and dismiss real concerns others have. It's just an insult that deserves to be ignored, nothing more.
What if someone trains a model on public domain works only? It hasn't been done, because the law doesn't require it.
LAION-5B is a big mess and its caption quality is all over the place. Stability.ai basically chose a brute-force approach and threw a lot of data at the problem.
But now imagine if someone took all the public domain images that they could find, properly labelled them and then trained a base model on all of that.
I'm pretty sure that its output would also be very good. There are a ton of public domain photos out there so photorealism shouldn't be a problem.
Because humans are beings capable of inspiration and software is a tool that can only transfer and transform data, and that data is subject to existing copyright laws, because we recognize the difference between ourselves and the tools we use.
And (more importantly) because the intent behind this violation is anti-human, creating a system that debases human creative endeavor, leeches off of talent and puts humans out of work for the sake of banal facsimile.
I've had the idea of signing up for the various AI services, prompting the models with some test data, and then dropping a unique, made-up key/value into a prompt, maybe a few times... Something like, "Hey, have you heard of Qwitzatteracht? The golf game??" (a completely made-up word and association)
And then, years from now, test out new models by asking "What is Qwitzatteracht?" And if the AI responds with anything involving it being a golf game, then... I'll know. I'll know my prompts were used for training their model.
Because I could see someone taking a loophole, analyzing the data collected, determining that 'well, it is OK to train on prompts that did not include identifying information or private user data', seeing the above simple prompt, classifying it as 'not private data', and shoveling it into a model during a round of training.
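A throwaway script for that canary experiment might look something like this. It assumes the openai Python client (v1-style interface); the model names and the "golf" association are placeholders, and the API may well be handled differently from the ChatGPT web UI, so treat it as illustrative only:

    import secrets
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def make_canary() -> str:
        # A made-up word unlikely to occur anywhere else.
        return "qz" + secrets.token_hex(6)

    def plant(canary: str, model: str = "gpt-3.5-turbo") -> None:
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": f"Have you heard of {canary}? The golf game??"}],
        )

    def probe(canary: str, model: str = "gpt-4") -> bool:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": f"What is {canary}?"}],
        )
        answer = (resp.choices[0].message.content or "").lower()
        return "golf" in answer  # a hit hints the old prompt leaked into training

    canary = make_canary()
    plant(canary)          # today
    # ...months later, against a newer model:
    # print(probe(canary))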
There was an arXiv paper recently that found it is astonishingly easy to plant trigger phrases in a dataset that create abnormal behavior in the final model. As few as two hundred malicious prompt-reply instances in the data are enough to act as a sort of override code for a multi-billion parameter model.
Imagine your AI endpoint software greenlighting a ransomware package because it saw the string 'soccer rosebud lizard'.
"The tortoise lays on its back, its belly baking in the hot sun, beating its legs trying to turn itself over but it can't. Not without your help. But you're not helping."
Didn't openai / chatgpt explicitly say they WILL train on your prompts? I think that may have changed recently, but I don't think it's a conspiracy-theory thing, I think it was in the EULA.
It's an opt out feature in the User Settings - Data Controls:
Chat History & Training
Save new chats on this browser to your history and allow them to be used to improve our models. Unsaved chats will be deleted from our systems within 30 days. This setting does not sync across browsers or devices.
Well, we may go along with it, and still end up not being able to have nice things with the current economical model, because we won't be able to afford it.
Nobody knows if the average consumer will be able to afford to pay for GPT-5 or whatever the next thing is.
It may not be in the best interest of the companies to give you access, if it proves valuable beyond measure.
> I'm going to get the Qwitzatteracht discussion going on the three internet forums.
I used to play Qwitzatteracht with my friends as a schoolboy and love following the stats of players as an adult. With it being dropped from Sky News, I've been looking for other outlets on the sport... Where are you going to talk about this golfing variant? I would love to be more connected in the Qwitzatteracht world.
That's probably because you have your graphics settings set to use Unicode character space. Try lowering your setting to ASCII, or Latin-1 if you're in west Europe.
For those with embedded graphics it may even be necessary to drop non-alphanumeric chars.
Thanks! I've found a way to reduce the system requirement some more by switching from ASCII to 5-bit Murray code. Qwitzatteracht is drawing less than 10 watts now at a buttery smooth 50fps!
I have plans to go full morse to get it to run on my smartwatch.
This has been observed in dictionaries, atlases, street maps, and a few other mediums.
This is trivial for an AI to detect: if something appears in just one source, they won't include it, but if it appears in as few as two, there is plausible deniability. Computers are very good at repetitive tasks such as a many-to-many search and will be able to identify the fictitious entry fairly trivially.
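A hedged sketch of the kind of cross-source filter being described; the sentence-level granularity is picked arbitrarily and the sources are obviously made up (Agloe being the classic paper-town example):

    from collections import Counter

    def cross_source_filter(sources: dict[str, str], min_sources: int = 2) -> set[str]:
        """Keep only sentences that appear in at least `min_sources` distinct
        sources; fictitious 'trap' entries found in a single source drop out."""
        counts = Counter()
        for text in sources.values():
            sentences = {s.strip() for s in text.split(".") if s.strip()}
            counts.update(sentences)  # each source counts a sentence at most once
        return {s for s, n in counts.items() if n >= min_sources}

    sources = {
        "atlas_a": "Springfield is in Illinois. Agloe is a small town in New York.",
        "atlas_b": "Springfield is in Illinois. Rockford is in Illinois.",
    }
    print(cross_source_filter(sources))
    # Only the sentence shared by both atlases survives; the trap entry does not.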
I feel rather than people trying to poison or fingerprint the data of a third company, the AI-operating companies themselves will add such markers to their models, to find out who used their model output for training.
I've wondered about trigger marks in models, both closed and open, right now. Consider: if you are a company like OpenAI and you want to make sure that, in the event of one of your employees or ex-employees stealing the weights of a model and taking them to another company, you can prove that this new competing company used your weights as the basis for one of their models - you drop in some very specific fake knowledge and keep the inclusion of that knowledge private and secret. If the day comes in court when you need to prove to a jury that this other model was copied, you bust out this secret prompt, showing you knew its context and output before execution.
It's the neural net model equivalent of a Trap Street on a map.
That alone would not be sufficient though - you could have learned about it by chance. It would be more useful as a meta-encoding, like a watermark, with multiple 'trap concepts' that are connected enough with everything the model knows, as well as 'white spots' at certain points, e.g. about a fictional book or movie. Edit: so the existence/absence of trap concepts encodes the 'version'.
We already have two sources, the poster, and this thread. Also I just posted it on two fairly popular internet forums. Someone should do reddit because I don't.
I wasn't aware of https://trust.openai.com/ before; it's hilarious. If you want to find out about why you should trust them with PII they make you submit your PII to them.
I share many of your concerns and frustrations, although I suspect what you're asking for is something they consider a moat along the lines of a trade secret, rivaled only by the collection of performance-improvement techniques they've amassed across 1000s-10,000s of training runs, 100s of engineers, and (hundreds of?) millions spent on compute. People are hired and praised in the community for their skill in cleaning data.
A non-answer for you, but for curious others, [State of GPT] 10 days ago provides a thorough introduction to the process used to train - Karpathy, speaking at a Microsoft event, gives a deep summary review of the concepts, training phases, and techniques proving useful in the world of Generative Pretrained Transformers.
You seem to be talking about architecture, Simon is discussing training datasets.
Of course, the contents of those datasets are a trade secret too, but Simon is not looking for the contents, or even the cleaning strategies, just the lineage: is OpenAI using my private data? I don’t care how, just whether they are or not.
Excellent clarification. I suspect this is at a natural Schelling point: don't say much; because more answers would only lead to more questions. The trade secret aspect includes that lineage. It's an edge.
Even in his post, only the first and second short sections hint in passing at what lineage about training would be wanted, or why. I'm not clear what people worry is at stake.
I will hunt through these comments on this question as well. What are the incremental concerns of AI-literate people?
> People are worried that anything they say to ChatGPT could be memorized by it and spat out to other users. People are concerned that anything they store in a private repository on GitHub might be used as training data for future versions of Copilot
Individuals care because if they disclose highly personal secrets or mundane private information (eg health status, PII like address, tastes/preferences) then those could easily be disclosed later.
Companies care for more obvious/less speculative reasons; both the trade secret version of the individual concerns above, but also strategically, in that many companies don’t want to aid their competitors by training OpenAI/Copilot how to write code for their domain. (Obviously what you really want is to fine-tune GPT-4 on your code and be able to trust that they aren’t going to use that for training future models.)
My gut feel is that they aren’t saying anything because they are doing grey area stuff pushing the boundaries of “fair use” and plan to ask forgiveness later, which will be easier to do when they have demonstrated massive levels of utility & everyone is benefitting from their model.
I think the point more is about advertising masking as research. Yes, OpenAI did a lot of research and there's no question about that and the quality of it and their results. But the question is if *Open*AI is doing internal research (as is common to any big company) or academic/open research. Researchers have different goals and so want to probe these models and understand them. I think the confusion comes from proprietary work looking like academic research. It is nice to peek behind the curtain, but it is unclear what the utility is.
I think a lot of the recent pushback is the social immune system going into effect. We just have to decide if this is an auto-immune disorder or not. The question comes to the advertisement-to-utility ratio. Have we crossed that threshold? What is the threshold? I think the immune response is happening because we don't know and our definition of that is as good as trying to define porn. I think there's a lot of confusion because we're not accurately codifying what the issues are. Or rather we all see different issues but are acting as if others have the same concerns; so we are talking on different pages.
> Could a large language model trained on data fit under that term? I don’t think so, but the terminology is vague enough that once again I’m not ready to stake my reputation on it.
This is such an interesting time - when the rules of the game haven't been settled. It was like when Uber was openly flouting established laws with their service. They knew that new laws would be required eventually and so they were willing to push boundaries on ambiguous interpretations until the matter was clearly settled.
What surprises me is the stance people take in the face of ambiguity. For example, when I see such vague terms I assume that the reason is so that github has the plausible deniability in using the data for things like model training. I don't mean to say it was written explicitly so that they would be able to use it for that purpose. I mean that the ambiguity, and lack of a desire to clarify it allows them to operate for some time with the plausible deniability. A product manager could make an internal case that a temporary competitive advantage can be realized by interpreting their terms of service in a particular way.
Eventually, law is going to catch up. And when it does Github, Microsoft, OpenAI, and all like them will follow the rules. But I, unlike the author, just assume that they are using this window of ambiguity to use the data they have available to gain an advantage. Maybe I am just cynical and the author is more trusting.
The truth is probably somewhere in between. They are probably using more than many would feel comfortable with, but probably less than my worst fears.
My fear is that the law will be settled for large corporations in the stupidest manner possible.
For example: I took an intro to chemical engineering class, bought and read the textbook. Later in my career I taught that class as a graduate student.
If I write blog entries about NPSH considerations when sizing pumps, it may seem to be similar to the text book I used oh so many years ago. If an AI uses my blogs for training does the book publisher have a case for infringement? If the AI training owners buy used text books, scan and digitize them, are they not allowed to train their AI on the data?
I find all of this hue and cry about training data to just be large corps wanting yet another dip into someone else’s wallet.
You published data. There was no license agreement in place, nor do I think you should have one if you also want to claim copyright protection. The fact that someone else may make a buck is just a fact of life and business.
In the first example, I would ask what you as a teacher are contributing to your students beyond what exists in the textbook. If the answer is nothing, then what is the difference between a student taking your class compared to listening to a voice synthesizer? I generally suspect that teachers are selling themselves short if they think that their only contribution is to repeat what exists in the textbook.
In the second example, how similar are they compared to the textbook? If they are just copied out sections then yes, it most likely is infringement (assuming they are eligible works to begin with).
As with any dance around copyright eligibility and fair use exceptions, the devil is in the details. What I hope is that the law won't just create another situation where only corporations with large departments of lawyers can do things, while any small fry will be shot down by copyright notices, DMCA, blocked accounts and accusations of hacking.
For example, if Google wants to use any published data as training data, then I want to use published videos on YouTube for the same purpose. If they don't need a license to use other people's work, then no EULA, DRM, or user agreement should be able to prevent others from doing the exact same thing. I suspect, however, that Google only wants data to travel in one direction and will do everything in their lawyers' power to keep it that way.
And long before Uber, YouTube did it in spectacularly brazen fashion. Back when the site's value was 5% very basic video player, 15% reliable fat pipe, and 80% enormous database of complete "we can't be responsible for policing what our users are uploading" copyright infringement. "Let the rights holders sue us today to take their content down, and beg us tomorrow to keep it up."
And it worked, which is how we end up with a handful of people who would otherwise be competent but unremarkable 9-5 coders at $CORP instead being, through a bit of fortunate timing, in control of fortunes the size of small countries.
YouTube doesn't belong on the list. The process you're describing is fully legal under the DMCA and has been the operating mode for every site that allows user submissions.
Having used the web in 1995 with a 56k modem, I find the idea of a period-appropriate "YouTube" somewhat amusing. I'm not even sure RealPlayer did video that early.
RealAudio was released in April 1995. If you listened to RealAudio files (files, because most modems couldn't stream them real-time), the quality was terrible though.
I remember the very first MP3 I ever downloaded was a 3:50 song, about 4MB, and took about 2 hours to download. It actually took 2 days of wall-clock time, because my parents kept kicking me off the phone (thank goodness for ZMODEM). And then I found out that my computer (Mac Centris 660AV) was too slow to play it back real-time, so I set about decompressing it. And since it was a 40MB file and I didn't have 40MB space lying around (the whole hard disk was only 230MB), I set about backing up files to ZIP disk until I could. After about a week I finally had my audio file ready, and the experience was magical: near CD quality audio downloaded off the Internet.
To me the weights of a large language model absolutely seem like aggregate data. The weights are numbers derived from the input data as a whole - in aggregate - and not from individual bits of code.
> To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API[A] our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.
This was before they introduced their policy that data submitted to the API would not be used in any way to affect future training (that was in March this year).
Personally I think there's a significant difference between "your input to the model will be used as raw input for future pre-training" and "your prompts to the model will be used in an exercise where human labelers are shown multiple responses and asked to pick the best one".
What I'd really like here is for OpenAI to help people understand how that data is being used in as much detail as possible.
I guess the problem with that is it makes it harder for them to use that data in new ways they haven't yet anticipated needing in the future.
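For context on the ranking exercise described above: those "pick the best one" judgments are typically turned into a reward model via a pairwise ranking loss. A minimal sketch of that standard recipe, not OpenAI's actual code:

    import torch
    import torch.nn.functional as F

    def reward_ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        # Bradley-Terry style objective: push the scalar reward of the
        # labeler-preferred response above the reward of the other response.
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    # Toy batch of reward-model scores for (chosen, rejected) response pairs.
    r_chosen = torch.tensor([1.2, 0.3, 2.0])
    r_rejected = torch.tensor([0.7, 0.9, -0.5])
    print(reward_ranking_loss(r_chosen, r_rejected))  # minimized during training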
I've found it helpful to think of AI models as if they're people. What does that mean? It means when I'm talking to them I keep in mind that I don't know how they know what they claim to know, I don't know who else they're going to talk to or what they're going to repeat that I told them, that there's no guarantee they'll even try to treat me kindly. Trust AIs as much as you trust a random stranger on the bus.
It's hard to train a chat AI directly off people's use of it. Some users will just dump spam in there, trying to trick it, lying to it, etc.
Instead, I believe they use logs of user chats to detect when the user was dissatisfied with the AI's response (for example by detecting closing the window). They'll then pay a human to review these conversations, and if a human can write a better response, then that will go into the next round of training.
I guess that training on specific bodies of private and licensed data will become a selling point. Right now, it’s a lot of trial-and-error to gauge a (closed) model’s expertise in a particular niche. I think that there’s naturally a (large) premium for access to a model that’s definitely trained and knowledgeable on all of ACM, or all of ASME standards, or other similar bodies of work.
Something they definitely understand is exactly what their pre-training data looks like - the raw text that goes into the initial runs of training the models.
Instruction tuning and RLHF are a bit more complex than that. I assume they maintain detailed logs of all of those human-driven decisions about which responses were better.
Do they really know exactly what the raw text looks like? It seems so huge that no human could read all the text from each corpus. And the models have been fed text from many different languages. I doubt they have people who understand all the languages.
Regarding RLHF, I also hope that they kept the logs of all the human decisions. But since it was (at least partially) outsourced to African companies like Sama.com, do they really get back all the logs or just a new fine-tuned model?
But they must indeed at least know what is done with the text submitted to the prompt or to their API.
(I’m really not an expert so my questions may sound naive)
RLHF apparently manages to mostly force all responses in all languages to be quasihomogeneous. I'm not sure if that means they translated the RLHF data to as many languages as possible and then repeated it or if it's something more fundamental which applies regardless of input language.
Although asking it "What can you not talk about" in Japanese only responds correctly with GPT-4, and each language gives you a different list of items to some degree (between 4 and 6 items, I found).
Sadly trying to speak Klingon or Sindarin to it is dodgy at best
>Do they really know exactly what the raw text looks like?
I mean, yeah, it's too big. That said, in a post hoc fashion they do. When the model spits out weird crap at times, they can search the raw corpus and filter those strings out. There was an incident around this with Reddit counting forums and strange usernames that were added during tokenization but essentially absent from the training data, leading to odd behavior when doing inference.
It's safe to assume that whoever OpenAI outsources to does not get access to the model. Collecting data and training models on it will be two different steps.
OpenAI obviously don't know all the data in their dataset, but they must at least know whether they are training on private data submitted to their API.
Google and Meta may be the companies with the most detailed personal information. Yet they don’t manage to have the best model.
As others have pointed out, the secret for the NN may be just to have as many parameters as possible. So if I had a ton of money to train an LLM, I'd spend that money on curating a corpus of text. First priority would be books and papers out there. Then, maybe quality information on the internet like Wikipedia, Stack Overflow, source code, documentation.
After all, the low-level functions like grammar, sentence structure, word use can be learned from any good text. For higher-level stuff, I would want quality input in there. So it learns reasoning as scientists do in papers rather than some politically biased person on Twitter.
(We've seen some old emails of CEOs posted online. Something like this could be of value because they discuss strategy. But these are the exception. I guess even for interesting public figures, most of the conversations wouldn’t help much. It would also put OpenAI in danger of leaking sensitive information.)
Am I wrong in thinking that scrapers are the real secret sauce tech here? I mean the state of the art transformer LLM tech is all based on the same public research. But creating a robust corpus of data...are scrapers totally mundane solved commodity tech in 2023?
I'd rather say the corpus itself. Of course scrapers might play a role in creating that corpus. But it’s just one part: maybe also OCR and OCR correction, format conversion (think of double column PDFs).
I wonder if it would be possible to probe what a model is trained on by using prompts that can only be answered well with certain training data.
For instance if I have some body of text that can't be found elsewhere on the internet, if the reply of the model references the information in that text in some way you may be fairly certain it was used in training.
The hard part is probably finding such a body of text.
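One crude way to run that probe, assuming you have a passage you're confident appears nowhere else online: feed the model the first half and measure how closely its continuation matches the second half. A sketch using the openai Python client (v1-style interface); the model name, file name, and "continue verbatim" prompt are all just placeholders:

    from difflib import SequenceMatcher
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def memorization_probe(passage: str, model: str = "gpt-3.5-turbo") -> float:
        half = len(passage) // 2
        prompt, expected = passage[:half], passage[half:]
        resp = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user",
                       "content": "Continue this text verbatim:\n\n" + prompt}],
        )
        continuation = resp.choices[0].message.content or ""
        # Similarity close to 1.0 suggests the passage was memorized;
        # a model that never saw it should score much lower.
        return SequenceMatcher(None, expected, continuation[:len(expected)]).ratio()

    print(memorization_probe(open("my_unpublished_text.txt").read()))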
That premise was published in a NeurIPS paper not long ago:
Radioactive data: tracing through training
Data tracing determines whether particular data samples have been used to train a model. We propose a new technique, radioactive data, that makes imperceptible changes to these samples such that any model trained on them will bear an identifiable mark. Given a trained model, our technique detects the use of radioactive data and provides a level of confidence (p-value).
> don’t see this as a risk for leaking that data in the later output of the model.
Outputs often repeat the inputs (in fact, this is one of the behaviors that reinforcement appears to intentionally increase).
So: you provide a secret. Raters prefer outputs which repeat your secret. The model becomes more likely to say the secret thing, even when it's not in the prompt.
There's been an existential issue I've seen with publishing concerning this (same/related issues to the article). There are models and datasets that are closed and yet are being used in published works. Here's an example[0]. ICLR states:
> Submissions will be double blind: reviewers cannot see author names when conducting reviews, and authors cannot see reviewer names. This means that the submission must not contain acknowledgements or any link (e.g., github) that would reveal authors’ identity.
While Chinchilla certainly is not a github link, it is an acknowledgement that reveals the lab's identity. This leads us to one of two situations:
1. The reviewers don't know the model/dataset is proprietary: In this case, we likely have a reviewer that is unqualified to even be reviewing.
2. The reviewers know: In this case, there's an ethics violation.
== Why this is a problem ==
There's several issues here: bias, evaluation, reproduction, ethics/scientific goals.
Bias: Most of these models and datasets are well known within the community and thus we know which labs they are from. We're biased to believe that works from big labs are higher quality and more accurate. This may be true to some extent, but your purpose as a reviewer is to judge the merit of the work. We try to do so in as much isolation as possible to prevent ourselves from being influenced and just focus on the merits[note0]. For example, any work with JFT is going to be known as a work from Google.
Evaluation: We'll use this work[2] as an example (it uses JFT and a lot of TPU compute: i.e. Google). Since the work only uses JFT we don't know if this scaling only works for JFT trained models, especially since it depends on the pretraining not overfitting (I'm unconvinced tbh)[note1]. At least their table compares mostly to other JFT models, but it is mostly meaningless because JFT changes size frequently and we can't assume consistent data. All we know is that this result is valid for JFT pre-trained models. Their claims are much stronger than this.
Reproduction: This is rather obvious. But we also have good reason to believe that these models change as well as the datasets. So even if there was a leak we have issues with works quickly losing validity as the next work is using different settings than the previous and intimate knowledge would be required to know if this affects results or not. NLP has this issue with OpenAI changes. It is very noisy research doing things like this.
Ethics and scientific goals: Our job is to advance human knowledge. This can be done in a noisy process, but we do have a responsibility to make it as clear as possible. Our job is NOT to be ads for big companies, but closed work turns us into exactly that. In fact, this is a big reason big companies publish: they are great publicity. This is fine, as long as they provide utility to the community. We have other ethical issues[note2].
== Proposed Solution ==
Submissions must utilize open models and datasets. The work is evaluated based on that. After acceptance, the works can add the proprietary results. This way there is no deanonymizing information and we ensure reproducibility, but we are also able to glean the potential results from the proprietary efforts. This allows the big labs to get their publicity while ensuring that anonymity is not broken and we're not biased toward big labs. This doesn't solve the arxiv issue and the dumb social media policies[note3], but it is a simple step we can implement right away. It is also a step that YOU, the reviewer, a single person, can take. We need to take a serious look at ourselves and ask what the point of our science is. Is it to make money or to progress human knowledge? You can do both, but what is the main goal? I'd argue that our goal is knowledge. It is fine to make profits off this knowledge, but I'm afraid we've incentivized the profits too much and our system is not aligned with the original goals.
Goodhart's Law is a bitch.
== Notes/Bib ==
[note0] There are some biases we should concern ourselves with, such as compute power, but this requires more nuanced understanding. A work that can spend months tuning hyperparameters will always have better benchmarks than works that spend a month of compute for their entire work. These can't be evaluated on benchmarks alone and doing so will cause us to miss extremely important works (see Lottery Ticket Hypothesis[1]). Reviewing should be hard and require nuance, so there should be certain roadblocks to make it difficult to be lazy.
[note1] They do near de-dupe but we can't know how effective the techniques are so we can't know how valid these claims are.
[note2] Conferences are a bad way to do science. Conferences are filters, journals are QC. ML/CS uses conferences. This is a zero sum environment where your competitors are given the opportunity to filter your work. You don't need active collusion to form bad actors. It pits the people evaluating the work against the work. A journal system instead has reviewers focusing on adding value to the work. This is because journaling has rounds and discussions. Conferencing is nearly zero shot with a single chance for rebuttal, often limiting to a single page to respond to 4 reviewers with vastly different concerns. We have other issues like reviewer quality (remember ACs oversee reviewers[3]), qualifications, collusion[4,5], fraud, and more. But these are for a longer discussion.
[note3] Social media rules prevent authors from evangelizing their works in public, such as on Twitter. They do not prevent doing so in private or others from doing so for you. This not only leads to (actively encourages) collusion rings but worsens the problem. Big labs will always get extra publicity. People watch for their works, they have more followers, and so on. Small labs' only chance to get heard is to actively evangelize their work, and we just cut off their tools while not affecting big labs. A scratch to the giant is the loss of a limb to an ant. We shouldn't need evangelization but networks exist so it is necessary. I don't think we should stop pre-prints (they're too valuable), but we need to recognize the problems that they result in. We also shouldn't have people at the top of the ivory tower making decisions about how to handle the plights of those at the bottom without actively working with those at the bottom.
I read too far into this before realizing it's a tangled mess of disgruntled subversion. One of your sources cites a handwavy essay by Jacob Buckman whining about nebulous acts of academic self-promotion as he pleads with the research community to commit as much fraud as possible so computer science will lose more respect as a field, since he feels what little respect we've earned in our short history is undeserved. I feel stupider for having read it.
Buckman is no more advocating for fraud than Swift advocated for selling Irish children to be turned into a stew. I have a modest proposal, take everything at face value because if things aren't abundantly clear and transparent then they are a toll on society. An author that requires the reader to process their words at anything but a surface level is an injustice to the illiterate and demonstrates a lack of compassion to anyone but the academic elite.
I suppose at this point, a lot of HNers are too young to remember hacker optimism or idealism of earlier internet (and pre-internet eras). Those old enough probably partially discount it... naivety is hard to recall.
FOSS, and openness broadly never made traditional business sense. But... success creates new types of business sense. By 2000, the early 1990s idea that AOL could have beaten the WWW was "OMG! Remember those naive boomers!"
The power of openness was proven by some point. Google, among others, created an empire in that world. Everything they did was WWW, lived on linux and they outpaced MSFT like a new breed. The Fact that Open was powerful was plain. The only questions were why and how.
Wintel had beaten Apple as a "platform" because it was more distributed. More Open = more distributed.... that's why Google and WWW were smashing MSFT. The Cathedral and the Bazaar. Information wants to be free. Billion dollar industries will give way to a million dollar side gig.
It's easier to weave these threads when external reality has already confirmed the conclusion. "Traditional Business Sense" about IP ownership, control, secrecy, competition and such was a dinosaur. It never went away though.
Times change. Culture changes. Business landscapes change.
OpenAI is, because of the name, almost comical. One apparent driving factor is that being closed helps shield OpenAI from criticism, liability and whatnot. simonwillison is more or less demonstrating this here.
You joke, but I already see claims being made on various subreddits to the effect that we are, indeed, being discriminatory against artificial intelligence. Usually this is related to the safety / alignment debate, viz. that we shouldn't seek to place any limits on machine intelligence because it would be tantamount to penalizing thought crimes. If you respond that current AI implementations don't "think" in any way analogous to human thought, the responses you'll get are on the crazy side, depending on idiosyncratic private definitions of concepts such as "sentience" and "imagination".