
If you asked me to add two large numbers in my head and give my best guess, I might not do any better than GPT-3 does. And I think I could probably do better than the average person.


You would give something in the ballpark, not gibberish.


Sure, but in your head is a squishy organ, and in GPT3's "head" is a bit of silicon purpose-built to make such calculations.


The multiplication problem can be solved by spacing the digits (to enforce one token per digit) and asking the model to do the intermediate steps (chain-of-thought). It's not that language models can't do it. With this method LMs can solve pretty difficult math, physics, chemistry and coding problems.

But without looking at individual digits and using pen and paper, we can't do long multiplication either. Why would we ask a language model to do it in one step?
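
To make that concrete, here's roughly what such a prompt can look like. This is just an illustrative sketch (the few-shot wording and the helper names are made up, not taken from any paper); the resulting string goes to whatever completion endpoint you're testing:

    def space_digits(n: int) -> str:
        # One space between digits so the BPE tokenizer sees one token per digit.
        return " ".join(str(n))

    def multiplication_prompt(a: int, b: int) -> str:
        # A single worked example plus an instruction to write out the steps.
        return (
            "Multiply step by step, writing out each partial product.\n\n"
            f"Q: What is {space_digits(12)} x {space_digits(34)}?\n"
            "A: 1 2 x 4 = 4 8. 1 2 x 3 0 = 3 6 0. "
            "4 8 + 3 6 0 = 4 0 8. The answer is 4 0 8.\n\n"
            f"Q: What is {space_digits(a)} x {space_digits(b)}?\n"
            "A:"
        )

    print(multiplication_prompt(1234, 5678))

The spaced digits plus the worked example are what let the model carry out the intermediate steps instead of guessing the product in one shot.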


Because a sufficiently intelligent model should be able to figure out that there are intermediate steps towards the goal and complete them autonomously. That's a huge part of general intelligence. The fact that GPT-3 has to be spoon fed like that is a serious indictment of its usefulness/cleverness.


This has the same flavor as the initial criticisms of AlphaGo.

"It will never be able to play anything other than Go", cue AlphaZero.

"It will never be able to do so without being told the rules", cue MuZero.

"It will never be able to do so without obscene amounts of data", cue EfficientZero.

---

It has to be spoonfed because

1) It literally cannot see individual digits. (BPEs; see the tokenizer sketch after this list.)

2) It has to infer context (mystery novel, comment section, textbook)

3) It has a fixed compute budget per token. (96 layers, from quantum physics to translation.)
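
A rough way to see point 1 for yourself (GPT-2's BPE is the same one GPT-3 uses; exact splits depend on the vocabulary, so treat the output as indicative):

    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")

    print(tok.tokenize("48729 * 61543"))          # digits arrive merged into multi-digit chunks
    print(tok.tokenize("4 8 7 2 9 * 6 1 5 4 3"))  # spaced out: one token per digit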

To make Language Models useful one must either:

Finetune after training (InstructGPT, text-davinci-002) to ensure an instructional context is always enforced...

...Or force the model to produce so-called "chains-of-thought"[1] to make sure the model never leaves an instructional context...

...Or force the model to invoke a scratchpad/think for longer when it needs to[2] (see the sketch after this list)...

...Or use a bigger model.
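
For the scratchpad option, a toy sketch of the idea (the <scratch> tags and the worked example are my own invention, not the format from [2]):

    FEW_SHOT = (
        "Q: A train travels 60 km/h for 2.5 hours. How far does it go?\n"
        "<scratch>\n"
        "distance = speed * time\n"
        "60 * 2.5 = 150\n"
        "</scratch>\n"
        "A: 150 km\n\n"
    )

    def scratchpad_prompt(question: str) -> str:
        # Invite the model to fill in its own scratchpad before committing to "A:".
        return FEW_SHOT + f"Q: {question}\n<scratch>\n"

    print(scratchpad_prompt("What is 37 * 48?"))

You let the model generate until it closes the scratchpad and emits an "A:" line; the extra tokens it writes there are exactly the extra compute from point 3 above.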

---

It's insane that we're at the point where people are calling for an end to LLMs because they can't reliably do X (where X is a thing they were never trained to do and can only do semi- or totally unreliably as a side effect of their training).

Ignoring, of course, that we can in fact "teach"/prompt (or, absolute worst case, finetune) these models to perform said task reliably with comparatively little effort. Which, in the days before GPT-3, would have been a glowing demonstration that a model was capable of doing/learning a task.

Nowadays, if a (PURE NEXT-TOKEN STATISTICAL PREDICTION) model fails to perfectly understand you and reliably answer correctly (literally AGI), it's a "serious indictment of its usefulness".

Of course, we could argue all day about whether the fact that a language model has to be prompted/forced/finetuned to be reliable is a fatal flaw of the approach, or an inescapable consequence of training on such varied data: you need at least a little guidance to ensure you're actually getting the desired kinds of predictions/outputs...

...or someone will find a way to integrate chain-of-thought prompting, scratchpads, verifiers, and inference into the training loop, setting the stage for obsoleting these criticisms[3]
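
For flavor: the cheapest "verifier" you can bolt on today is plain majority voting over sampled chains-of-thought. A minimal sketch, not the framework in [3] itself, with `sample_completion` standing in for whatever sampled model call you have:

    import re
    from collections import Counter
    from typing import Callable, Optional

    def extract_answer(completion: str) -> Optional[str]:
        # Assumes the chain ends with something like "The answer is 4 0 8."
        m = re.search(r"answer is\s*([\d\s]+)", completion)
        return m.group(1).replace(" ", "") if m else None

    def self_consistent_answer(prompt: str,
                               sample_completion: Callable[[str], str],
                               n_samples: int = 16) -> Optional[str]:
        # Sample several chains-of-thought and keep the answer they agree on.
        votes = Counter()
        for _ in range(n_samples):
            answer = extract_answer(sample_completion(prompt))
            if answer is not None:
                votes[answer] += 1
        return votes.most_common(1)[0][0] if votes else None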

---

"It will never be able be able to maintain coherence", cue GPT-3.

"It will never be able to do so without being spoonfed", cue Language Model Cascades.[3]

So what's next?

"It will never be able to do so without obscene amounts of data". Yeah for sure, and that will never change. We'll never train on multimodal data[4], or find a scaling law that lets us substitute compute for data[5], or discover a more efficient architecture, or...

LLMs are just big pattern matchers (~100B point-neuron "synapses" vs ~1000T human brain synapses) that merely copy from/interpolate their ginormous datasets (less data than the optic nerve processes in a day) and whose successes imply less than their failures (the scaling laws say otherwise).

[1] https://arxiv.org/abs/2201.11903

[2] https://twitter.com/OriolVinyalsML/status/101752320805926092...

[3] https://twitter.com/dmdohan/status/1550625515828088838

[4] https://www.deepmind.com/publications/a-generalist-agent

[5] https://arxiv.org/abs/2206.14486


“Your head evolved to pass and receive electric charges at very high rates, so why can’t you just tune yourself to AM radio waves?”



