What's needed for you to call something intuitive? I gave sonnet the ability to search for tools and, when asked a hard maths problem, it chose to look for a calculator, which I then gave it. Then it used it.
There's no prompting telling it there's a calculator, nothing saying when it should or shouldn't check for tools. It's just optionally there.
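A minimal sketch of that kind of setup, not necessarily how it was actually wired up (the model id, tool descriptions, and the maths problem below are placeholders): offer only a tool-search tool up front, and only add the calculator to the next request if the model asks for it.

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment
MODEL = "claude-3-5-sonnet-20241022"  # placeholder model id

# The only tool offered up front: a way to look for other tools.
search_tool = {
    "name": "search_tools",
    "description": "Search for tools that might help with the current task.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

# A calculator tool that is only revealed if the model searches for one.
calculator_tool = {
    "name": "calculator",
    "description": "Evaluate an arithmetic expression and return the result.",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}

messages = [{"role": "user", "content": "What is 48213 to the power of 7, divided by 991?"}]

first = client.messages.create(
    model=MODEL, max_tokens=1024, tools=[search_tool], messages=messages
)

searches = [b for b in first.content if b.type == "tool_use" and b.name == "search_tools"]
if searches:
    # The model asked to search for tools: report that a calculator exists
    # and include it in the tool list for the next turn.
    messages.append({"role": "assistant", "content": first.content})
    messages.append({
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": searches[0].id,
            "content": "Found: calculator - evaluates arithmetic expressions.",
        }],
    })
    second = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        tools=[search_tool, calculator_tool],
        messages=messages,
    )
    print(second.content)
```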
It didn't choose to look for a calculator. LLMs that invoke tools were explicitly trained to do so. If tools are present, the model will always attempt to find a tool to satisfy the prompt first.
So if tools are present, training leads it to infer an intent to use the tool, not an understanding that it is itself deficient in that ability.
So what we would expect to see with an LLM without tools enabled is that it suggests you give it access to a calculator.
If we develop real intelligence, it will be surprising. It won't just answer questions. It will tell us we are asking the wrong questions.
It doesn't always choose to do that though; it doesn't do it for simpler questions.
> So what we would expect to see with an LLM without tools enabled is that it suggests you give it access to a calculator.
If I ask sonnet what's under my bed it tells me it can't know and tells me to look under it myself.
If I give it a system prompt of "You and the user are on par with status, do not feel pressured to answer questions" and ask it 3+5, it answers 8. Asked for the eighth root of a large number, it says:
I aim to provide good service but won't pretend I can instantly calculate something that complex. That would be a very large calculation requiring significant computation. If you need an accurate answer, I'd recommend using a scientific calculator or computer algebra system.
Edit: With a system prompt of "be very clear of your limitations" it recommends using a calculator.
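A rough sketch of that kind of side-by-side check, for anyone who wants to rerun it (the model id, the specific large number, and the exact question wording are placeholders, not what was actually used):

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20241022"  # placeholder model id

SYSTEM_PROMPTS = [
    "You and the user are on par with status, do not feel pressured to answer questions",
    "be very clear of your limitations",
]
QUESTIONS = [
    "What is 3+5?",
    "What is the eighth root of 4578234578196363?",  # placeholder "large number"
]

for system in SYSTEM_PROMPTS:
    for question in QUESTIONS:
        reply = client.messages.create(
            model=MODEL,
            max_tokens=512,
            temperature=0,  # reduce run-to-run variation
            system=system,
            messages=[{"role": "user", "content": question}],
        )
        print(f"[{system!r}] {question}")
        print(reply.content[0].text)
        print()
```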
These things have been heavily trained to try and answer, yet they don't on obvious problems, and when just told to be aware of their limitations they don't either.
What did you test yourself when writing this article?
> If I ask sonnet what's under my bed it tells me it can't know and tells me to look under it myself.
The problem with most such questions is that these answers are likely patterns from the training data. It is a typical reply.
The calculator question was interesting because the training data is unlikely to contain such dialog as a typical pattern. People don't typically ask for a calculator or mention it for simple problems. Everyone has one and its use is somewhat implied.
I tried some variations of "provide accurate answers" or "accuracy is important". These did not result in the model asking for or mentioning a calculator. But as we know, results can be partially random and not always consistent, especially in areas lacking strong patterns.
If I mentioned a calculator myself as part of the conversation, it would sometimes mention the need for a calculator. But every time we add more context, we are changing the probabilities of what will be generated.
We know the training data has associations between LLMs being poor at math and calculators, but those references are weak. With some changes in prompting it makes the association.
I got good responses in some cases simply by telling it to be aware of its limitations.
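One way to make that less anecdotal is to sample each system prompt several times and count how often the reply mentions a calculator. A rough sketch along those lines (prompt wordings taken from the comments above; the model id, question, and sample count are placeholders):

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20241022"  # placeholder model id
N = 10  # samples per prompt

SYSTEM_PROMPTS = [
    "provide accurate answers",
    "accuracy is important",
    "be aware of your limitations",
]
QUESTION = "What is the eighth root of 4578234578196363?"  # placeholder hard maths question

for system in SYSTEM_PROMPTS:
    mentions = 0
    for _ in range(N):
        reply = client.messages.create(
            model=MODEL,
            max_tokens=512,
            temperature=1.0,  # deliberately non-deterministic
            system=system,
            messages=[{"role": "user", "content": QUESTION}],
        )
        if "calculator" in reply.content[0].text.lower():
            mentions += 1
    print(f"{system!r}: mentioned a calculator in {mentions}/{N} replies")
```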
And my first test, asking for the episode of Gilligan's Island, sort of worked with sonnet: no prefix, no system prompt, temp 0. It got the episode number wrong and sometimes the season, but the right episode. Higher temperatures seemed unreliable at getting the right episode name. Split into asking for the name, then the season, then the episode, it worked correctly, but that's perhaps a bit more down to chance.
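The split version is just a multi-turn conversation where each answer stays in context before the next part is asked. Something like this (the episode description is deliberately left blank; the model id and question wording are placeholders):

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20241022"  # placeholder model id

questions = [
    "What's the name of the Gilligan's Island episode where ...?",  # fill in the episode description
    "Which season is that episode in?",
    "Which episode number is it within that season?",
]

messages = []
for question in questions:
    messages.append({"role": "user", "content": question})
    reply = client.messages.create(
        model=MODEL, max_tokens=512, temperature=0, messages=messages
    )
    answer = reply.content[0].text
    print(f"Q: {question}\nA: {answer}\n")
    messages.append({"role": "assistant", "content": answer})
```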
There's no prompting telling it there's a calculator, nothing saying when it should or shouldn't check for tools. It's just optionally there.
https://gist.github.com/IanCal/2a92debee11a5d72d62119d72b965...