
I think the argument was that GPT-4 can't learn to do math from more data. I'd be surprised if that's not true.


ChatGPT makes mistakes doing basic arithmetic or sorting numbers.

Pretty sure we have enough data for these fundamental tasks.


It's more than enough data for a specialized tool, yes.

It's not even remotely enough data for a statistical language processor.


Why are young children able to quickly surpass state-of-the-art ML models at arithmetic tasks, from only a few hours of lecturing and a "training dataset" (worksheets) consisting of maybe a thousand total examples?

What is happening in the human learning process from those few thousand examples, to deduce so much more about "the rules of math" per marginal datapoint?


Are they? Even before OpenAI made it hard to force GPT to do chain of thought for basic maths, it usually took over a dozen digits per number before it messed up arithmetic when I tested it.

How many young children do you genuinely think would do problems like that without messing up a step before having drilled for quite some time?

I'm sure there are aspects of how we generalise that current LLM training processes do not yet capture, but so much of the human learning process involves repeating very basic stuff over and over again and still regularly making trivial mistakes, because we keep tripping over stuff we learned how to do right as children but keep failing to apply with sufficient precision.

Frankly, making average humans do these kinds of things consistently right manually, even for small numbers, without putting a process of extensive checking and revision around it, is an unsolved problem. And convincing an average human to apply that kind of tedious process consistently is an unsolved problem.


> How many young children do you genuinely think would do problems like that without messing up a step before having drilled for quite some time?

You're overestimating how many examples "drilled for quite some time" represents. In an entire 12 years of public school, you might only do a few thousand addition problems in total. And yet you'll be quite good at arithmetic by the end. In fact, you'll be surprisingly good at arithmetic after your first hundred!

> I'm sure there are aspects of how we generalise that current LLM training processes do not yet capture, but so much of the human learning process involves repeating very basic stuff over and over again and still regularly making trivial mistakes, because we keep tripping over stuff we learned how to do right as children but keep failing to apply with sufficient precision.

LLMs fail when asked to do "short" addition of long numbers "in their heads." And so do kids!

But most of what "teaching addition" to children means is getting them to translate addition into a long-addition matrix representation of the problem, so they can then work the "long-addition algorithm" one column at a time, marking off columns as they process them.

Presuming they can do that, the majority of the remaining "irreducible" error rate comes from the copying-numbers-into-the-matrix step! (And that can often be solved by teaching kids the "trick" of inserting commas into long numbers that don't already have them, so that they can visually group and cross-check numbers while copying.)

LLMs can be told to do a Chain-of-Thought run through the whole long-addition algorithm the same way a human would (essentially, saying the same things a human would think to themselves while doing it). But for sufficiently large numbers (50 digits, say) they still won't perform within an order of magnitude of a human, because "a bag of rotary-position-encoded input tokens with self-attention, where the digits appear first as a token sequence, and then as individual tokens in sentences describing the steps of the operation" is just plain messier than an arbitrary-width grid-of-digits representation: it's more polluted with unrelated stuff, which makes it harder to apply rigor to "finding your place" (i.e. to learn hard rules as discrete 0-or-1 probabilities).
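
For concreteness, this is roughly the transcript such a Chain-of-Thought amounts to; a toy Python sketch (long_addition_cot and its wording are just made up for illustration, not anyone's actual prompt):

    # Toy sketch (hypothetical long_addition_cot): the per-column things a
    # human would "say to themselves" that such a Chain-of-Thought amounts to.
    def long_addition_cot(a: str, b: str):
        width = max(len(a), len(b))
        a, b = a.rjust(width, "0"), b.rjust(width, "0")
        carry = 0
        for col in range(width - 1, -1, -1):   # rightmost column first
            s = int(a[col]) + int(b[col]) + carry
            yield (f"column {width - col}: {a[col]} + {b[col]} + carry {carry}"
                   f" = {s}; write down {s % 10}, carry the {s // 10}")
            carry = s // 10
        if carry:
            yield f"leftmost carry: write down {carry}"

    for step in long_addition_cot("907", "495"):
        print(step)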

People (kids or not), when asked to do long addition, do it "on paper": a constant back-and-forth between their Chain-of-Thought and their visual field, with the visual field acting as a spatially-indexed memory of the current processing step. They expect to be able to "look at" a single column and "load" into their Chain-of-Thought the two digits indirected by their current visual attention cursor; the visual field has enough persistence to get them back to where they were in the problem if they glance away, and yet the "cursor" can be arbitrarily refocused, in both relative and absolute senses, depending on what the Chain-of-Thought says about the problem. Given an unbounded-length "paper" to work on, such a back-and-forth process extends robustly to an unbounded-length processing sequence. (Compare/contrast: a Turing machine's tape head.)
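
To make the "paper" model concrete, here's a toy sketch in Python (Paper and add_on_paper are hypothetical names; the point is just that all state except the carry lives on an unbounded grid, and the controller only ever reads and writes near the cursor):

    # The "paper + attention cursor" model: an unbounded grid of cells plus a
    # movable focus; the controller works one column at a time.
    class Paper:
        def __init__(self):
            self.cells = {}                          # (row, col) -> symbol; unbounded
            self.cursor = (0, 0)

        def look(self, drow=0, dcol=0):              # read relative to the cursor
            r, c = self.cursor
            return self.cells.get((r + drow, c + dcol), " ")

        def write(self, symbol, drow=0, dcol=0):     # write relative to the cursor
            r, c = self.cursor
            self.cells[(r + drow, c + dcol)] = symbol

        def move(self, drow=0, dcol=0):              # relative refocus
            r, c = self.cursor
            self.cursor = (r + drow, c + dcol)

    def add_on_paper(a: str, b: str) -> str:
        paper, width = Paper(), max(len(a), len(b))
        for col, (da, db) in enumerate(zip(a.rjust(width, "0"), b.rjust(width, "0"))):
            paper.cells[(0, col)] = da               # copy the operands onto the paper
            paper.cells[(1, col)] = db
        paper.cursor = (0, width - 1)                # focus on the rightmost column
        carry = 0                                    # the only "in-head" state
        for _ in range(width):
            s = int(paper.look(0, 0)) + int(paper.look(1, 0)) + carry
            paper.write(str(s % 10), drow=2)         # result digit under this column
            carry = s // 10
            paper.move(dcol=-1)                      # glance one column left
        if carry:
            paper.write(str(carry), drow=2)          # final carry, one column past the end
        return "".join(paper.cells.get((2, c), "") for c in range(-1, width))

    assert add_on_paper("907", "495") == str(907 + 495)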

Pure LLMs (seq2seq models) cannot "work on paper."

Consider what it is even theoretically possible to "model" inside a feed-forward NN's weights: it can certainly have successive embedding vectors act as "machine registers" that track 1. a set of finite-state machines, and 2. a set of internal memory cells (where each cell's value is likely represented by O(N) oppositional activations of vector elements, one per possible value the cell can take on). These abstractions together are likely what allow LLMs to perform as well as they do on bounded-length arithmetic. (They're not memorizing; they're parsing!)
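
As a toy illustration of that "registers" point (numpy; nothing here is a claim about any real model's weights): an FSM transition over a one-hot state encoding is just a matrix multiply, which is exactly the sort of hard 0-or-1 rule a linear layer can represent.

    import numpy as np

    # A parity FSM whose state lives in a 2-element one-hot vector, and whose
    # transition per input bit is a single matrix multiply.
    T = {0: np.eye(2, dtype=int),              # bit 0: state unchanged
         1: np.array([[0, 1], [1, 0]])}        # bit 1: even <-> odd

    state = np.array([1, 0])                   # one-hot "even so far"
    for bit in [1, 1, 0, 1]:
        state = T[bit] @ state                 # one FSM step
    print(state)                               # [0 1]: odd number of 1s seen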

But given the way feed-forward seq2seq NNs work, they need a separate instance of these trained weights, and their commensurate embedding-vector elements, for each digit they're going to be processing. Just like a parallel ALU has a separate bit of silicon dedicated to processing each bit of the input registers, an LLM must have a separate, independent probability model for the outcome of applying a given operation to each digit-token "touched" on the same layer. Any of these may be under-trained; and if (current, quadratic) self-attention is involved, the hidden-layer embedding-vector growth caused by training to sum really big numbers would quickly become untenable. (And it would likely be doubly wasted, representing the registers for each learned arithmetic operation separately rather than collapsing them down into any kind of shared "accumulator register" abstraction.)

---

That being said: what if LLMs could "work on paper?" How would that work?

For complete generality — to implement arbitrary algorithms requiring unbounded amounts of memory — they'd very likely need to be able to "look at the paper" an unbounded number of times per token output — which essentially means they'd need to be converted at least partially into RNNs (hopefully post-training.) So let's ignore that case; it's a whole architectural can of worms.

Let's look at a more limited case. Assuming you only want the LLM to be able to implement O(N log N) algorithms (which would be the limit for a feed-forward NN, as each NN layer can do O(N) things in parallel, and there are O(log N) layers) — what's the closest you could get to an LLM "working on paper"?
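
(An aside, to make that budget concrete: the canonical fit is a parallel scan, doing O(N) work per step for O(log N) steps. A throwaway numpy sketch of a Hillis-Steele prefix sum, purely illustrative and nothing to do with any actual model:)

    import numpy as np

    # Inclusive prefix sum via doubling: each of the ~log2(N) steps touches
    # all N positions at once.
    def prefix_sum_log_depth(x):
        x = np.asarray(x)
        shift = 1
        while shift < len(x):                              # O(log N) steps
            padded = np.concatenate([np.zeros(shift, dtype=x.dtype), x[:-shift]])
            x = x + padded                                 # O(N) work, in parallel
            shift *= 2
        return x

    print(prefix_sum_log_depth([1, 2, 3, 4, 5]))           # [ 1  3  6 10 15]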

Maybe something like:

• adding an unbounded-size "secondary vector" (like the secondary vector of a LoRA), that isn't touched in each step by self-attention, and that starts out zeroed,

• with a bounded-size "virtual memory mapping": a dynamic and windowed position-encoding of a subset of the vector into the Q/K vectors at each step, and a dynamic position-encoding of part of the resulting embedding (Q·Kᵀ·V) that maps a subset of the embedding vector back into the secondary vector

• where this position-encoding is "dynamic" in that, during training of each layer, that layer has one set of embedding vectors that it learns as being an "input-vocabulary memory descriptor table", describing the virtual-memory mappings of the secondary vector's state-at-layer-N into the pre-attention vector input at layer N [i.e. a matrix you multiply against the secondary vector, then add the result to the pre-attention vector]; and an equivalent "output-vocabulary memory descriptor table", mapping the post-attention embedding vector to writes of the secondary vector [i.e. a matrix you multiply against the post-attention embedding vector, then add to the secondary vector]

• and where the secondary vector is windowed, in that both memory-descriptor-table matrices indicate positions in a window: a virtual secondary vector that actually exists as a 1D projection of a conceptually-N-dimensional slice of a physical secondary N-dimensional matrix. Each pre-attention embedding contains 2N elements interpreted as "window bounds" for the N dimensions of the matrix, used to derive the secondary vector "virtual memory" from its physical storage matrix; and each post-attention embedding contains 2N elements interpreted either as "window bounds" for the next layer, or as "window commands" to be applied to the window (e.g. specifying arbitrary relative affine transformations of the input matrix, decomposed into separate scaling/translation/rotation elements for each dimension), with the "window bounds" of the next layer then being generated by the host framework by applying the affine transformation to the existing window bounds. (And again, with the output window bounds/windowing-command parameters being learned.)

I believe this abstraction would give a feed-forward NN the ability to, once per layer (rough sketch after the list),

1. "focus" on a position on an external "paper";

2. "read" N things from the paper, with each NN node "loading" a weight from a learned position that's effectively relative to the focus position;

3. compute using that info;

4. "write" N things back to new positions relative to the focus position on the paper;

5. "look" at a different focus position for the next layer, relative to the current focus position.

This extension could enable pretty complex internal algorithms. But I dunno, I'm not an ML engineer, I'm just spitballing :)


You can't reliably teach an LLM maths, in the same way that you can't take a locomotive offroading.




