The author addresses this and argues that it misses the point:
> Now, some might interject here and say we could, of course, train the LLM to ask for a calculator. However, that would not make them intelligent. Humans require no training at all for calculators, as they are such intuitive instruments. We use them simply because we have understanding for the capability they provide.
So the real question behind the headline is why LLMs don't learn to ask for a calculator by themselves, if both the definition of a calculator and the fact that LLMs are bad at math are part of the training data.
I have dyscalculia, and I still have no clue about calculators beyond having been taught how to make one give me the answer to a math problem. I'm a bit embarrassed to say that even now it sometimes takes me a few seconds to boot into being able to use one. We often discuss LLMs as if there were no divergence among humans; I don't know how many people find math intuitive, but I know plenty of people like me.
I do think it's interesting to think about why the LLM needs to be told to ask for a calculator and when to do that. And not just in individual prompts where a human manually asks it to "write some code to find the answer", but in general.
We often use the colloquial definition of training to mean something to the effect of taking an input, attempting an output, and being told whether that output was right or wrong. LLMs extend that to taking tokens (roughly character- or syllable-sized chunks of text) as input, doing some computation, predicting the next token(s), and seeing whether that was right or wrong. I'd expect the training data to have enough content to memorize single-digit multiplication, but I'd also expect the model to learn that this approach doesn't work for multiplying an 11-digit number by a 14-digit number.
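For contrast, that kind of problem is trivial once you take the "write some code" route mentioned above; here's a minimal sketch with made-up operands, using Python's arbitrary-precision integers:

```python
# Exact multiplication of an 11-digit by a 14-digit number.
# Trivial for arbitrary-precision integer arithmetic, yet far outside
# what next-token prediction can reliably memorize from training data.
a = 98765432101        # 11 digits (made-up example value)
b = 12345678901234     # 14 digits (made-up example value)
print(a * b)           # exact product, computed instantly
```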
The "use a calculator" and "look it up in a table" concepts were taught to the LLM too late, and it didn't internalize them as a way to perform better.
This still doesn't get at the point. With this example you've effectively constructed a prompt along the lines of: "Note: a calculator is available upon request *wink*; here's how you'd use it: ... Now, what's the eighth root of 4819387574?"
Of course the model will use the calculator you've explicitly informed it of. The article is meant to be a critique of claims that LLMs are "intelligent" when, despite knowing their math limitations, they don't generally answer "You'd be better off punching this into a calculator" when asked such a problem.
How have I told it there's a calculator? All I've given it is the ability to search for tools and enable the ones it wants.
> Of course the model will use the calculator you've explicitly informed it of
I didn't. I also gave it no system prompt pushing it to always use tools or anything.
It searches for tools with a query "calculator math root" and is given a list of things that includes a calculator. It picks the calculator, then it uses it.
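For anyone who hasn't seen this kind of setup, the flow is roughly the loop below. This is a minimal sketch; the names (`TOOLS`, `search_tools`) and the keyword matching are entirely hypothetical stand-ins, since the actual tool-registry API isn't specified in the thread:

```python
# Minimal sketch of the tool-discovery flow described above.
# TOOLS and search_tools are hypothetical; only the behavior
# (search for tools, pick one, use it) comes from the thread.

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}}, {})),  # toy only
    "web_search": lambda q: f"results for {q!r}",
}

def search_tools(query: str) -> list[str]:
    """Return the names of tools matching any keyword in the query."""
    keywords = set(query.split())
    # Toy matching: the query "calculator math root" hits "calculator".
    return [name for name in TOOLS if name in keywords]

# The model issues a search, picks a tool from the results, then calls it:
available = search_tools("calculator math root")     # -> ["calculator"]
result = TOOLS[available[0]]("4819387574 ** (1/8)")
print(result)                                        # ~16.23
```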
I see, that clarifies things for me; it's not quite like the example I gave, then.
Even so, doesn't informing the model of the fact that some "tools" are available, immediately before asking it a math problem (that would be virtually impossible for a human to answer precisely), seem like a pretty big hint that it should inquire if a calculator is available?
Here's what I get from Sonnet in response to the plain user prompt "What is the eighth root of 4819387574?":
"""
Let me solve this step by step.
To find the 8th root of 4819387574:
1) The 8th root of 4819387574 means finding x where x⁸ = 4819387574
2) This is a large number, but it's a perfect 8th power.
3) One way to approach this is to find factors:
4819387574 = 13⁸
"""
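That answer, for the record, is a confident hallucination, and it's easy to check with exactly the calculator the model didn't ask for:

```python
# Quick check of Sonnet's claim that 4819387574 is a perfect 8th power (13**8).
print(13 ** 8)              # 815730721 -- not 4819387574
print(4819387574 ** 0.125)  # ~16.23 -- the true eighth root isn't an integer
```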