I can't speak to whether or not Nature should publish this, but I don't find the outcomes in the paper spectacular. In brief, they create an ontology -- in the initial examples, four made-up color words, each corresponding to a real color, and three made-up 'function' words representing 'before', 'in the middle of' and 'triple'.
They then ask humans and their AI to generate sequences of colors based on short and progressively longer strings of color words and functions.
The AI is about as good at this as humans, which is to say 85% successful for instructions longer than three or four words.
The headline says that GPT4 is worse at this than MLC. I am doubtful about this claim. I feel a quality prompt engineer could beat 85% with GPT4.
The claims in the document are that MLC shows similar error types to humans, and that this backs some sort of theory of mind I don't know anything about. That's as may be.
I would be surprised if this becomes a radical new architecture based on what I'm reading. I couldn't find a lot of information as to the size of the network used; I suppose if it's very small that might be interesting. But this read to me very much like an 'insider' paper, its target being other academics who care about some of these theories of mind, not people who need to get sentences like 'cax kiki pup wif gep' turned into real color circles right away.
>I feel a quality prompt engineer could beat 85% with GPT4.
I generally don't pull the "you're not prompting it good enough" card unless I have direct experience with the task, but this does raise a few flags - "between 42 and 86% of the time, depending on how the researchers presented the task."
I would have liked them to show how they presented this task, at least something more than a single throw away line, given how integral it is to the main claim.
I don't find this argument persuasive. If you need to engineer the prompt in order for the model to give you the answer you want, and if it's performing poorly without that kind of assistance from a human who understands the task and the desired outcome... then it's perfectly OK to not bother and give the LLM a low mark on a benchmark. After all, the whole point was to test the LLM's language and reasoning skills.
"Prompt engineering" makes sense if your goal is more utilitarian, but it's not a license to cheat on tests. If you want a machine learning model to generate aesthetically pleasing images, or to follow certain rules in a conversation, then it's OK to rely on hacky solutions to get there.
I kind of dislike the term "Prompt Engineering". It mystifies what to me is an ordinary process. I'm not talking about chanting magic words, I'm just talking about taking the structure of an LLM into account, even taking how humans would approach the problem into account.
In Microsoft's AGI paper, there is a test for planning they put to GPT-4. They give detailed instructions on the constraints of a poem and expect it to spit out the constrained poem in one shot. Of course it fails, and the conclusion is that it can't plan. But if you think about it, this is a ridiculous assertion. No human on earth would have passed that test with working memory alone. Even by our standards, it's a weird way to present the information, but they did so anyway. This is actually a major issue with most planning-benchmark assessments for LLMs I've come across.
These are the kinds of problems I'm talking about. If I changed the request to encourage a draft/revise generation process and it consistently passed these tests, then it's very fair to say it can plan. That's not "hacky". It's just truth, and saying otherwise is denying real-world usage for a misplaced sense of propriety towards benchmarks. A benchmark is only as useful as the capabilities it can assess.
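To make the draft/revise point concrete, here is a minimal sketch of what I mean, using the OpenAI Python SDK (the model name, poem constraints and revision budget are all illustrative, not the paper's or Microsoft's actual setup):

    from openai import OpenAI

    client = OpenAI()

    def chat(messages):
        resp = client.chat.completions.create(model="gpt-4", messages=messages)
        return resp.choices[0].message.content

    # Hypothetical constrained-poem task in the spirit of the planning test.
    task = ("Write a four-line poem where every line ends with a colour word "
            "and the first letters of the lines spell BLUE.")

    # Instead of demanding a one-shot answer, draft first, then let the model
    # check its own output against the constraints and revise.
    draft = chat([{"role": "user", "content": task}])
    for _ in range(3):  # a few revision rounds
        critique = chat([
            {"role": "user", "content": task},
            {"role": "assistant", "content": draft},
            {"role": "user", "content": "List every constraint the poem violates. "
                                        "If it violates none, reply exactly OK."},
        ])
        if critique.strip() == "OK":
            break
        draft = chat([
            {"role": "user", "content": task},
            {"role": "assistant", "content": draft},
            {"role": "user", "content": "Fix these problems and output only the "
                                        "revised poem:\n" + critique},
        ])

    print(draft)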
a "this is how this is presented, this is how the model performed" would have been very prudent in my opinion. Maybe these guys already covered the kind of things i'm talking about(i can accept that!)....but i can't know that if they don't tell us.
how is "Engineering" "about chanting magic words"? The wires may be getting crossed on if it's an engineering discipline or just the use of the word engineer, meaning to build mindfully, I feel the latter is appropriate and the former more misguided but still it's not "Prompt Wizarding" ;)
Precisely! You’re trying to learn the art of poking the box in the right way. As an engineer, I really despise the bastardisation of the term to refer to jobs or tasks that are inherently non-scientific.
I think adjusting the prompt makes sense. We have to keep in mind that this implicitly happens on the human side as well. The person creating the initial question will consider how understandable it is for them and edit until they're happy. Then the questions will be reviewed by co-authors and possibly tested on a few people before running the full experiment. They'd probably review the explanation again or change the task if they realised half the people can't follow the instructions.
We just call it "editing for clarity" rather than "prompt engineering" when we write for other people.
I would claim that most people are very bad at communicating in a way that can be interpreted, by human or AI, in a single shot. Most of the time some back and forth is required to clear things up. I think it's unfair to expect gold from garbage input, since meat brains can't do that either without some back and forth.
Those prompts should be available. It's ridiculous that they're not.
So what if we have an AI prompter-assister... we take an input prompt: run it through an AI to transform that prompt into proper AI talk - then that is what's prompted to the actual AI... That way the AI will always have the best prompt possible, because the Prompt-helper helps your prompt...
While I think that's a valid point, for the purpose of the paper having a competent prompt writer would be needed to make a fair comparison.
I think the researchers would know how to get the best from their own AI bot, so that's a level of competency that should be extended to comparisons; otherwise user competency becomes a source of bias. I do feel you're correct in your concerns, though: the systems shouldn't need experts to use them, nor should they need the user to already know the right answer, which leads me to my next point:
When it comes to real world expectations, perhaps instead we need a large group of random people (with no prior experience) working with each bot to complete a set of tasks in order to determine how it truly performs - something that could be enhanced if the answers weren't easy to check.
Disagree - a model may have capabilities that it only deploys in certain circumstances. In particular, if you train on the whole internet, you probably learn what both stupid and smart outputs look like. But you probably also learn that the modal output is stupid. So it’s no ding on the model’s capabilities if it defaults to assuming it should behave stupidly.
What you're basically saying is "the model as trained can't do well at this task, so let's use our own cognitive skills to help it."
That's a problem for comparative benchmarking, right? You're no longer testing the model; you're testing the model in tandem with the prompt engineer. This raises several big questions:
1) How do you know that the engineer's knowledge of the "correct" answer isn't being subtly encoded in the prompt? Essentially the Clever Hans phenomenon.
2) If you want this to be a fair game, how do you give precisely the same kind of advantage to whatever you're benchmarking the LLM against?
3) Last but not least, if you're not going to throw in your prompt engineer for free with the product, will your results be reproducible by your customers?
To be clear, I don't think there's some cosmically objective way of doing this. If you're using prompts written for humans, you're already putting the computer at a disadvantage. But at the very least, you're measuring something meaningful: how the model will behave in the real world.
The supplement discusses how they presented the task. Notably, they first gave all the training examples and then told the model what the task is and didn't ask it to reason or create any context before spitting out the answers. So basically the simplest way to interact with it, but probably not a great way to get solutions for this problem if that was the task at hand.
Agreed. And there are going to be some significant domain decisions for GPT-4 to consider: Will the made-up words be existing single tokens? Will those single tokens be heavily overloaded tokens, e.g. a letter of the alphabet? Will the representation of the circles be, say, "B" for blue, or something else?
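Purely for illustration, here are two such encodings of the same few-shot examples (the nonce words and colours are invented, not the paper's actual stimuli); exactly this kind of choice could plausibly move GPT-4 between the two ends of that 42-86% failure range:

    # Hypothetical few-shot examples: made-up word(s) and the colour sequence
    # they should produce.
    examples = [("dax", "RED"), ("lug", "BLUE"), ("dax lug dax", "RED BLUE RED")]

    # Encoding A: spell everything out with full colour names.
    spelled_out = "\n".join(f"Input: {w}\nOutput: {c}" for w, c in examples)

    # Encoding B: compress the outputs to heavily overloaded single letters.
    single_letters = "\n".join(
        f"{w} -> {''.join(c[0] for c in cs.split())}" for w, cs in examples
    )

    print(spelled_out)
    print(single_letters)  # e.g. "dax lug dax -> RBR"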
Along with this are questions as to whether you're going to treat GPT-4 as a zero shot or multi-shot oracle; while they have this idea of 'context free' challenges in the paper, they, crucially, train their network first.
Anyway, I like this paper less the more I think about it.
The article suggested that GPT-4 FAILED between 42 and 86 percent of the time. So its success rate would be 14 to 58 percent, which, compared to the novel NN's 80 percent, seems significant.
1. The level of variance displayed here is very atypical for a language model unless you're making significant changes to how the information is presented. This alone is cause for more elaboration.
2. With this big a variance already showing, it's hard to say for sure that they hit the top end.
Maybe it seems strange but most LLM evaluation papers still don't actually bother to tie in even some extremely well known/basic "output boosters" like chain of thought and so on. But at least in these instances, we know how they presented it.
a "this is how we presented the task, this is how it performed" would have been very nice.
But isn't that their point? That their new neural network needs less prompt crafting because it is partly better at reading the intent and meaning of words and partly less likely to hallucinate?
Sure, I also think GPT-4 can be brute-forced into generating correct answers with a very precise prompt, but that's not how you would describe the problem to a human, and I assume the humans performing well in this test weren't prompted that way either, but were given a fairly informal description of the exercise.
>I couldn't find a lot of information as to size of the network used
In the original paper,[0] in the section called 'Implementation of MLC', there's a description of sorts (Greek to me):
"Both the encoder and decoder have 3 layers, 8 attention heads per layer, input and hidden embeddings of size 128, and a feedforward hidden size of 512. Following GPT63, GELU64 activation functions are used instead of ReLU. In total, the architecture has about 1.4 million parameters."
If I am evaluating how intelligent an AI is against a human then IMO it's only fair they both get the same prompt.
It should be counted as a fault of the AI if it can't understand the prompt as well as the human, or another AI, can. That's all part of being a useful entity: how well you understand a prompt and how well you can infer the original meaning from the prompt you have been given.
Honestly I'm surprised by how low Nature has fallen. Is it even a useful signal of paper quality anymore? NeurIPS and ICLR aren't great, but in general I've found their work to be more rigorous than that of Nature, despite the fact that they are shorter conference papers compared to Nature's journal papers.
Right now, sadly, the only useful signal in Deep Learning research is the research group. If OpenAI releases a paper I know it's something good that works at scale; similarly, if Kaiming He, Piotr Dollar and team at Meta AI release a paper, it tends to be really good and SOTA. Google DeepMind maintains high quality; Google Brain has been more of a mixed bag. If Berkeley releases a paper, it has a 50% chance of going directly to the trash; Stanford has a much lower percentage (I also disambiguate based on specific groups). Of course I'm going to be heavily biased and this system is not great, but conferences and journals have managed to become such a useless signal that I find this method to be more accurate.
For anyone wondering about the architecture: this is a 1.4 million parameter transformer model with both an encoder and decoder. The vocabulary size is comically small; it only understands 8 words.
It learns new ideas with very few examples, to the extent you can express new ideas with a vocabulary of eight words.
English speakers tend to have a vocabulary of 30k words or so depending on what exactly you measure.
GPT-4 has 1.7 trillion parameters.
Of course scaling up is quite unpredictable in what capabilities it gets you, but it's not that much of a stretch to imagine that a GPT-4-sized model would have a reasonably sized vocabulary. Certainly worth testing if you have the resources to train such a thing!
It's the mapping of parameters to vocabulary claim embedded in your response that needs validation. 1.7 trillion parameters means what exactly?
Let's start with working vocabulary. Working vocabulary doesn't just mean knowing n words. It means putting n words together in factorially many valid combinations to construct sentences. And 30k btw is an insane working vocabulary. Most people know 1000 words on average in English. All of their sentences are structured from those 1000 words. This is true for most cases, except certain ones like Mandarin or German, where basic words can be used to assemble more complex words.
Certainly GPT-4 knows something. Presumably that something can be mapped to a working vocabulary. How large a vocabulary that is requires a testable, reproducible hypothesis supported by experimental proof. Do you have such a hypothesis with proof? Does anyone? Until we do it's just a guess.
> Most people know 1000 words on average in English
Maybe I misunderstand but that sounds so stupid and wrong I don't know where to start. Standard vocab estimates are 15K - 30K, averaging about 20K (these from memory)
> As a result, estimates vary from 10,000-17,000 word families[16][19] or 17,000-42,000 dictionary words for young adult native speakers of English.[12][17]
> A 2016 study shows that 20-year-old English native speakers recognize on average 42,000 lemmas, ranging from 27,100 for the lowest 5% of the population to 51,700 lemmas for the highest 5%. These lemmas come from 6,100 word families in the lowest 5% of the population and 14,900 word families in the highest 5%. 60-year-olds know on average 6,000 lemmas more.[12]
> According to another, earlier 1995 study, junior-high students would be able to recognize the meanings of about 10,000-12,000 words, whereas for college students this number grows up to about 12,000-17,000 and for elderly adults up to about 17,000 or more.[20]
Does the average include people who don't speak English? If about 4% of the world's population are native speakers and the number of words known tails off after that, I can imagine it could almost be approximately true. And maybe we are counting babies in English-majority countries as native speakers even though they haven't learned all their 20k words yet. Of course GP's point is still invalid in that case.
100% of people know the most common 1000 words. The remainder of those who know more words fall into a consistent curve across languages that follows Zipf’s law. This is different than “most people know 1000 words on average.”
I don't care to pay for access to GPT-4, but one could easily use one of the vocabulary estimation tests, which use some statistics plus knowledge of word appearance frequency, to estimate its vocabulary size. https://mikeinnes.io/2022/02/26/vocab is one such test which explains the statistical ideas, and there are many others based on similar principles.
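The frequency-band idea behind those tests is roughly the following (a toy sketch with made-up band edges; knows_word could be a person clicking yes/no or a wrapper around an LLM prompt):

    import random

    def estimate_vocab(ranked_words, knows_word, samples_per_band=20):
        """ranked_words: list ordered from most to least frequent.
        knows_word: callable returning True/False for a given word."""
        band_edges = [0, 1000, 2000, 5000, 10000, 20000, 50000, len(ranked_words)]
        total = 0.0
        for lo, hi in zip(band_edges, band_edges[1:]):
            if lo >= len(ranked_words):
                break
            hi = min(hi, len(ranked_words))
            # Sample a few words from this frequency band and test them.
            sample = random.sample(ranked_words[lo:hi], min(samples_per_band, hi - lo))
            known_fraction = sum(knows_word(w) for w in sample) / len(sample)
            total += known_fraction * (hi - lo)  # extrapolate to the whole band
        return int(total)

    # e.g. estimate_vocab(frequency_ranked_list, lambda w: ask_the_model_about(w))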
I think a fairer comparison would be Toki Pona, a micro-language with ~120 words. You can express lots of things if you have a great deal of patience, Up-Goer Five style.
In logic there's also SKI combinator calculus, which is Turing-complete with three symbols; unlike the Morse or A-Z examples, but like the Toki Pona example, each symbol has a semantic meaning.
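For the curious, here is a toy reducer that makes the "three symbols" point concrete; it is just a sketch of the standard rewrite rules (I x -> x, K a b -> a, S a b c -> a c (b c)), nothing to do with the paper:

    # Terms are 'S', 'K', 'I', a variable name, or an application tuple (f, x).
    def step(t):
        if isinstance(t, tuple):
            f, x = t
            if f == 'I':                                         # I x -> x
                return x, True
            if isinstance(f, tuple) and f[0] == 'K':             # (K a) b -> a
                return f[1], True
            if (isinstance(f, tuple) and isinstance(f[0], tuple)
                    and f[0][0] == 'S'):                         # ((S a) b) c -> (a c)(b c)
                a, b, c = f[0][1], f[1], x
                return ((a, c), (b, c)), True
            f2, ch1 = step(f)
            x2, ch2 = step(x)
            return (f2, x2), ch1 or ch2
        return t, False

    def reduce_ski(term):
        changed = True
        while changed:
            term, changed = step(term)
        return term

    # S K K behaves like the identity: ((S K) K) y reduces to y.
    print(reduce_ski(((('S', 'K'), 'K'), 'y')))  # -> 'y'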
If you just want to describe ideas in an abstract realm like sequences of colors, like this paper, it's not surprising you don't need many words.
There was a post on here a few months ago about training using single characters as tokens instead of words, and it worked really well, being able to create new Shakespeare-like text despite not using human words as tokens. What a (human) word is can be learned by the model instead of being encoded in the tokenization.
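That post was most likely about a char-level model in the char-RNN / nanoGPT style; the tokenization side of it really is this simple (a toy sketch, not the actual post's code):

    text = "To be, or not to be"            # stand-in for the Shakespeare corpus
    chars = sorted(set(text))               # the entire vocabulary: unique characters
    stoi = {c: i for i, c in enumerate(chars)}
    itos = {i: c for c, i in stoi.items()}

    encode = lambda s: [stoi[c] for c in s]             # string -> list of token ids
    decode = lambda ids: "".join(itos[i] for i in ids)  # token ids -> string

    print(encode("to be"))                  # a handful of small integers
    print(decode(encode("to be")))          # "to be"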
The link is to a news article about a recent paper, not a link to the research itself.
The editorialization of this being a breakthrough was done by the reporter, not the authors of the paper.
I see people here comparing this to GPT-4 but I can't find that section in the actual paper nor in the underlying data. Can someone point me in the right direction?
The article only says "To make the neural net human-like, the authors trained it to reproduce the patterns of errors they observed in humans’ test results." This is not much.
I'd especially like to know the deciding factor in why it worked while the LLM failed.
> By contrast, GPT-4 struggled with the same task, failing, on average, between 42 and 86% of the time, depending on how the researchers presented the task. “It’s not magic, it’s practice,”
This seems dumb. Sure, an LLM can't learn a new word. An LLM isn't an entire system. You could make a system (which extensively used an LLM) to do this fairly easily.
What you are seeing is a nice trick from the ChatGPT application, where they keep the history of your conversation (not in its entirety, but a sort of summarized version) and, in a single turn, feed it all back in.
In a single turn it can be told to use word x to mean y and generate output, but I don't think that's "learning".
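Roughly, the loop looks like this (a bare-bones sketch with the OpenAI Python SDK; the model name is illustrative, and a real client summarizes or truncates the history rather than resending all of it):

    from openai import OpenAI

    client = OpenAI()
    history = [{"role": "system", "content": "You are a helpful assistant."}]

    def turn(user_message):
        # The model itself is stateless: every turn re-sends the whole history.
        history.append({"role": "user", "content": user_message})
        resp = client.chat.completions.create(model="gpt-4", messages=history)
        answer = resp.choices[0].message.content
        history.append({"role": "assistant", "content": answer})
        return answer

    turn("From now on, 'wif' means: repeat the previous colour word three times.")
    print(turn("blue wif"))  # the definition is only 'remembered' because it was re-sent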
It's called in-context learning. I wouldn't call it a trick so much as the most exciting and useful feature of LLMs. A lot of research is being done to expand the context window, which would allow you to prime the model with entire books or codebases. GPT-4 has a max context window of 32k tokens, up from 2k with GPT-3.5. Who's to say iteratively shifting the output distribution isn't a form of learning?
If you want to call in-context learning learning, that's fine. I thought the presented paper was about the model being updated so that it could be used in "fresh contexts", but I might have misunderstood what they meant by "fresh context". I thought they meant next prompt.
I think everyone is aware that if you say "I use the word jimbo to mean dog. Write a story about dogs using my way" it will work.
The model wasn't updated after the initial training; they're definitely using in-context learning. The advancement here, as I understand it, is that they've achieved in-context learning with a very small model. From the paper:
>> Few-shot instruction-learning task. MLC was evaluated on this task in several ways; in each case, MLC responded to this novel task through learned memory-based strategies, as its weights were frozen and not updated further.
I have a theory that we can get the same level of output with an optimized pipeline of various specialized chains, where the output is forced into a ruleset to mimic English grammar. GPT even seems to say this pattern has a lot of potential.
This guy is a science journalist; the authors of the source research paper are more relevant if you're trying to infer anything about the material from the authors: https://www.nature.com/articles/s41586-023-06668-3
FWIW I think even the original paper is more of a master class in riding an academic bubble than anything of important substance. I'm just not sure what looking at other articles by this journalist is supposed to convey.
Splendid sales headline - however, human communication is varied and multi-layered. Since there are too many angles to summarize, suffice it to say that IMHO a healthy skepticism is warranted, along with evidence and test results.
I will take this article with a bag of salt, but I think we are very close to solving the human communication problem. Not because the machines are perfect, but because humans, our benchmark, are prone to mistakes, and so a "close enough" machine would pass as a human just fine.
There is inherently no solution to the human communication problem. The mistake has always been in thinking technology would unite us and improve communication when clearly it's doing the opposite.
That's a category error. You don't submit your manuscript to a prestige journal in order to publish it but to take advantage of their PR machinery. The services Nature provides are very good and cost-effective, but you have to meet their standards.
As research papers, the papers published in Nature/Science/etc. are rarely that useful. Because the journals target the general scientifically literate audience, the interesting details are usually hidden in supplementary material. And the supplements are often little more than giant info dumps that nobody has paid sufficient attention to.
It would be nice to live in a world where such PR services are unnecessary, but that's not the world we are living in. Academia is very competitive, and there is never enough funding and never enough good jobs.
So, to retain their relevance, you claim they abandon academic standards and sensationalize an openly accessible study that anyone can read and thus use to dispute the article. Okay.
It isn't exactly unheard of for them to do that. They'll just wipe their hands of it afterwards if it turns out to be bad science by saying that their goal isn't to publish research that's guaranteed correct.
I don't think it is quite that intentional, but just as human economic systems regularly create bubbles there are blind spots in the academic science collective. People desperately want to get in on LLMs right now.
I am no expert in this field, but this seems an important result, even if on a limited model.
I don't think that anyone was very sure if generalisation was actually happening with current architectures. Happy to be corrected by anyone who knows more.