
ChatGPT doesn't have a concept of sources. It has weights that together define a function that allows it to guess the most likely next word from the context. As a neat side effect of this contextual next-word guessing, it can often share accurate information.

If ChatGPT were required to share its sources, it would need a completely different approach. I'm not commenting on whether or not that would be a bad thing, but it would render the current iteration completely useless. You can't strap a source-crediting mechanism on top of a transformers-based model after the fact.
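
A toy illustration of why the weights carry no sources. This is a minimal sketch, not how ChatGPT is built: it swaps the transformer for a bigram counter, and the document names and texts are made up. The point it shows is that training merges every document into one table of numbers, so nothing in the table says which document a count came from.

```python
from collections import Counter, defaultdict

# Hypothetical two-document "training set"; names and texts are made up.
corpus = {
    "blog_a": "the clown has been around since 1963",
    "wiki_b": "the clown has been on tv since 1963",
}

# "Training": merge bigram counts from every document into one table.
# Which document contributed which count is discarded at this step.
weights = defaultdict(Counter)
for text in corpus.values():
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        weights[prev][nxt] += 1


def next_word(context_word):
    """Return the most frequent follower of context_word."""
    return weights[context_word].most_common(1)[0][0]


print(next_word("since"))  # the follower both documents agree on
print(weights["since"])    # a bare count, with no record of its origin
```

To credit sources you would have to keep the document names alongside the counts from the start, which is a different data structure and not an add-on.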




> You can't strap a source-crediting mechanism on top of a transformers-based model after the fact.

I've read that ChatGPT is not connected to the net, but if it were: couldn't you have it do a Google search (or better yet, a corpus search) for the string it generated and then return the most significant matches (significance by string matching, not Google rank)? It would be really crude, but wouldn't this just be a handful of lines of code that don't interfere with the "transformers-based model" code at all?
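
Roughly what I have in mind, as a crude sketch. Assumptions: the corpus is a small in-memory dict with made-up document names, `best_matches` is a name I invented, and similarity is the standard library's `difflib.SequenceMatcher` ratio. It only finds text that looks like the output; it does not show what the model actually drew on.

```python
from difflib import SequenceMatcher


def best_matches(generated, corpus, top_n=3):
    """Rank corpus documents by string similarity to the generated text."""
    scored = []
    for source, text in corpus.items():
        # ratio() is 1.0 for identical strings and falls toward 0.0
        # as the two strings share less.
        ratio = SequenceMatcher(None, generated.lower(), text.lower()).ratio()
        scored.append((ratio, source))
    scored.sort(reverse=True)
    return [(source, round(ratio, 2)) for ratio, source in scored[:top_n]]


# Hypothetical stand-in for a real search index.
corpus = {
    "doc_clown": "Ronald McDonald was first introduced in 1963.",
    "doc_coffee": "Coffee is prepared by brewing hot water over ground beans.",
}

print(best_matches("Ronald McDonald was introduced in 1963", corpus, top_n=1))
```

A real version would need an actual index instead of a linear scan, but the matching step sits entirely outside the model.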


Why couldn't you, as a human, do that to verify it?

The other day I had GPT write a rap battle between Burger King and Ronald McDonald. One of the stanzas came back:

    Burger King:
    Your burgers are plain, your buns a bore.
    Your clown's been around since '63,
    I'm sure my flame-grilled taste will leave you impressed
    My burgers are fresh, my fries are the best
It turns out that yes, Ronald McDonald was first introduced in 1963. https://en.wikipedia.org/wiki/File:McDonald%27s_commercial_(... (from https://en.wikipedia.org/wiki/Willard_Scott#Created_Ronald_M... )

So here's the challenge for you - who do you compensate for that line?

The complaint that people have isn't that GPT isn't citing its sources but rather that it isn't compensating the people who created the data that has that information.

... and now, if you're ever asked about historical clown trivia and pull out the "Ronald has been around since 1963", who should you give a royalty to? Me (for writing this), GPT (for making me aware of it), Wikipedia (for the source of my links in this post), the estate of Willard Scott for the Joy of Living (which Wikipedia cites), or some random blog author whose clown trivia happened to be part of the training set for GPT?


Because I want to credit, not verify? Because I want to trace the flow of information?

It isn't just monetary compensation that's important here.

I come at this from the point of view of a scientist who is expected to reference ideas. Not necessarily back to their original source, but at least back to a source that can theoretically point back to another link in the chain.

Sure, I can manually search for a reference based on what ChatGPT gave me. Or someone could spend a few minutes adding a few lines of code to ChatGPT to save millions of people some minutes of time.

-----

What would be awesome is an LLM that you can feed data to, and it can then write a paper based solely on the data you feed it.


I've still got the question - who should I credit with the bit that Ronald has been around since 1963?

I had it write a poem the other day, in the style of "Roses are Red", about coffee and bacon.

   Roses are red
   Bacon is greasy
   My coffee is hot
   Together they please me
If this is something that someone considers to be a derivative work of other things... who do I credit?

    Identify a word that have different meanings to two different professions at the same time and the professions that use them.  Give the definition of the word for each profession. Write a joke using this word.
to which I got back:

    The word is "band." 

    Definition for a Musician: A group of musicians who play music together.
    Definition for an Astronomer: A dark region in the sky with less stars.

    Joke: What did the astronomer say when the musician asked him to join his band? "I'm sorry, I don't do solos in the dark!"
How do you credit that?

---

> What would be awesome is an LLM that you can feed data to, and it can then write a paper based solely on the data you feed it.

https://platform.openai.com/docs/guides/fine-tuning


> If this is something that someone considers to be a derivative work of other things... who do I credit?

Based on a quick search, the best credits would be ChatGPT as the arranger, and "Roud Folk Song Index number 19798" as the inspiration.

> "Joke: What did the astronomer say when the musician asked him to join his band? "I'm sorry, I don't do solos in the dark!""

> "How do you credit that?"

That you credit to ChatGPT. It's not referencing facts or discoveries, so credit isn't as important as it is for articles. If you want to credit an inspiration then I'm sure there's an index of joke forms out there that has an appropriate number to cite.

I can't actually find a definition for band in astronomy that is "a dark region in the sky with less stars." So it seems to be a pretty poor joke.

> https://platform.openai.com/docs/guides/fine-tuning

This does it solely based on the data you feed into it? And by data I mean scientific data that you discovered, and want formatted into a particular research article style.

Edit to add: Possible sources for the line "together they please me":

1) https://www.google.com/books/edition/Poetical_Works_of_Louis...

2) https://www.google.com/books/edition/Florio_s_First_fruites/...


Why did you pick that index rather than some other source material? "Roses are Red" dates back to 1784 (the year, not an index number) as a nursery rhyme. Does it need to be credited, or is it in the public consciousness to the point where one can create a poem based on it without knowing its original source?

    Write a haiku about bacon and coffee.  Identify the syllable count for each word and line used in the haiku.
    Example:
    Bacon (2) sizzles (2)
    Aroma (3) of (1) coffee (2) too (1)
    Mouthwatering (4) bliss (1)

    Smoky (2) bacon (2)
    Brewing (3) coffee (2) aroma (3)
    Makes (1) mornings (2) bright (2)
The second poem is from GPT. Do we need to credit the dictionary where it got the syllable count for each word? Or where it got that coffee (rather than bacon) is brewed? Or that bacon and coffee are things more often consumed in the morning?

    Identify four foods or beverages that are frequently consumed in the morning and how each is prepared for breakfast.

    1. Coffee: prepared by brewing hot water over ground coffee beans.
    2. Cereal: prepared by pouring cereal into a bowl and adding milk.
    3. Toast: prepared by toasting bread and adding butter and/or jelly.
    4. Eggs: prepared by scrambling, frying, poaching, or boiling them.
There is a difference between "identifying a source where this information can be found" and "this is the (copyrighted) source of the data that GPT used to draw upon to come up with the statement."

The first is an exercise for the reader (and much better done and evaluated by the reader). The second is what people are concerned about.


I'm concerned about both. I'm a "people".

> Why did you pick that index rather than some other source material?

I told you why references were important in scientific documents already.


Scientific documents - certainly. If you are writing a research paper or encyclopedia, I expect it to be well cited.

If you are writing something that is synthesizing knowledge (not just reporting the facts), the "where are all the places that knowledge came from" is an impossible task for human or machine.

If I ask GPT to create a poem in the style of "Roses are Red" about coffee and bacon - why should that request need to be cited with the same degree of scrutiny as an encyclopedia or research paper?

If, on the other hand, you're trying to use GPT to write such a paper... I would hold that you're doing it wrong. It doesn't do that well. The model is "about" transforming language. To do so accurately, it contains a fair bit of 'knowledge'. OpenAI makes no claims about the accuracy of the content that GPT produces (it's improved and can answer questions more accurately - but if you want to know the answer, it is no better than your next-door neighbor who has read a lot).

If you are claiming that the "Bacon is greasy" poem that GPT wrote is infringing any more than a child's "roses are red, my cat is orange, his eyes are green, nothing rhymes with orange", then I believe you will face an uphill battle.

To say that there is plagiarism and infringement going on - that needs examples rather than an "I think it works this way and is just regurgitating material it was fed from elsewhere."



