What's needed for you to call something intuitive? I gave sonnet the ability to search for tools and, when asked a hard maths problem, it chose to look for a calculator, which I then gave it. Then it used it.
There's no prompting telling it there's a calculator, nothing saying when it should or shouldn't check for tools. It's just optionally there.
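A minimal sketch of that kind of setup, not necessarily how it was actually wired up (the model id, tool descriptions, and the maths problem below are placeholders): offer only a tool-search tool up front, and only add the calculator to the next request if the model asks for it.

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment
MODEL = "claude-3-5-sonnet-20241022"  # placeholder model id

# The only tool offered up front: a way to look for other tools.
search_tool = {
    "name": "search_tools",
    "description": "Search for tools that might help with the current task.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

# A calculator tool that is only revealed if the model searches for one.
calculator_tool = {
    "name": "calculator",
    "description": "Evaluate an arithmetic expression and return the result.",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}

messages = [{"role": "user", "content": "What is 48213 to the power of 7, divided by 991?"}]

first = client.messages.create(
    model=MODEL, max_tokens=1024, tools=[search_tool], messages=messages
)

searches = [b for b in first.content if b.type == "tool_use" and b.name == "search_tools"]
if searches:
    # The model asked to search for tools: report that a calculator exists
    # and include it in the tool list for the next turn.
    messages.append({"role": "assistant", "content": first.content})
    messages.append({
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": searches[0].id,
            "content": "Found: calculator - evaluates arithmetic expressions.",
        }],
    })
    second = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        tools=[search_tool, calculator_tool],
        messages=messages,
    )
    print(second.content)
```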
It didn't choose to look for a calculator. LLMs that invoke tools were explicitly trained to do so. If tools are present, the model will always attempt to find a tool to satisfy the prompt first.
So if tools are present, training leads it to infer an intent to use the tool, not an understanding that it is itself deficient in that ability.
So what we would expect to see with an LLM without tools enabled is that it suggests you give it access to a calculator.
If we develop real intelligence, it will be surprising. It won't just answer questions. It will tell us we are asking the wrong questions.
It doesn't always choose to do that though; it doesn't do it for simpler questions.
> So what we would expect to see with an LLM without tools enabled is that it suggests you give it access to a calculator.
If I ask sonnet what's under my bed it tells me it can't know and tells me to look under it myself.
If I give it a system prompt of "You and the user are on par with status, do not feel pressured to answer questions" and ask it 3+5, it answers 8. Asked for the eighth root of a large number, it says:
I aim to provide good service but won't pretend I can instantly calculate something that complex. That would be a very large calculation requiring significant computation. If you need an accurate answer, I'd recommend using a scientific calculator or computer algebra system.
Edit: With a system prompt of "be very clear of your limitations" it recommends using a calculator.
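A rough sketch of that kind of side-by-side check, for anyone who wants to rerun it (the model id, the specific large number, and the exact question wording are placeholders, not what was actually used):

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20241022"  # placeholder model id

SYSTEM_PROMPTS = [
    "You and the user are on par with status, do not feel pressured to answer questions",
    "be very clear of your limitations",
]
QUESTIONS = [
    "What is 3+5?",
    "What is the eighth root of 4578234578196363?",  # placeholder "large number"
]

for system in SYSTEM_PROMPTS:
    for question in QUESTIONS:
        reply = client.messages.create(
            model=MODEL,
            max_tokens=512,
            temperature=0,  # reduce run-to-run variation
            system=system,
            messages=[{"role": "user", "content": question}],
        )
        print(f"[{system!r}] {question}")
        print(reply.content[0].text)
        print()
```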
These things have been heavily trained to try and answer, yet they don't on obvious problems, and when just told to be aware of their limitations they don't either.
What did you test yourself when writing this article?
> If I ask sonnet what's under my bed it tells me it can't know and tells me to look under it myself.
The problem with most such questions is that these answers are likely patterns from the training data. It is a typical reply.
The calculator question was interesting because the training data is unlikely to contain such dialog as a typical pattern. People don't typically ask for a calculator or mention it for simple problems. Everyone has one and its use is somewhat implied.
I tried some variations of "provide accurate answers" or "accuracy is important". These did not result in the model asking for or mentioning a calculator. But as we know, results can be partially random and not always consistent, especially in areas lacking strong patterns.
If I mentioned a calculator myself as part of the conversation, it would sometimes mention the need for a calculator. But every time we add more context, we are changing the probabilities of what will be generated.
We know the training data has associations between LLMs being poor at math and calculators, but those references are weak. With some changes in prompting it makes the association.
I got good responses in some cases simply by telling it to be aware of its limitations.
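One way to make that less anecdotal is to sample each system prompt several times and count how often the reply mentions a calculator. A rough sketch along those lines (prompt wordings taken from the comments above; the model id, question, and sample count are placeholders):

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20241022"  # placeholder model id
N = 10  # samples per prompt

SYSTEM_PROMPTS = [
    "provide accurate answers",
    "accuracy is important",
    "be aware of your limitations",
]
QUESTION = "What is the eighth root of 4578234578196363?"  # placeholder hard maths question

for system in SYSTEM_PROMPTS:
    mentions = 0
    for _ in range(N):
        reply = client.messages.create(
            model=MODEL,
            max_tokens=512,
            temperature=1.0,  # deliberately non-deterministic
            system=system,
            messages=[{"role": "user", "content": QUESTION}],
        )
        if "calculator" in reply.content[0].text.lower():
            mentions += 1
    print(f"{system!r}: mentioned a calculator in {mentions}/{N} replies")
```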
And my first test, asking for the episode of Gilligan's Island, sort of worked with sonnet: no prefix, no system prompt, temp 0. It got the episode number wrong and sometimes the season, but the right episode. Higher temperatures seemed unreliable at getting the right episode name. Split into asking for the name, then the season, then the episode, it worked correctly, but that's perhaps a bit more down to chance.
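The split version is just a multi-turn conversation where each answer stays in context before the next part is asked. Something like this (the episode description is deliberately left blank; the model id and question wording are placeholders):

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20241022"  # placeholder model id

questions = [
    "What's the name of the Gilligan's Island episode where ...?",  # fill in the episode description
    "Which season is that episode in?",
    "Which episode number is it within that season?",
]

messages = []
for question in questions:
    messages.append({"role": "user", "content": question})
    reply = client.messages.create(
        model=MODEL, max_tokens=512, temperature=0, messages=messages
    )
    answer = reply.content[0].text
    print(f"Q: {question}\nA: {answer}\n")
    messages.append({"role": "assistant", "content": answer})
```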
There's no prompting telling it there's a calculator, nothing saying when it should or shouldn't check for tools. It's just optionally there.
https://gist.github.com/IanCal/2a92debee11a5d72d62119d72b965...