What about both? Or, say, a set of standard tools a modern intelligent agent[0] should have some proficiency in: a calculator, a basic code interpreter for a single high-level language, a graphing tool[1], web search, database search, and then maybe a tool for managing its own context[2]. How far could we get with a dataset designed specifically to train the model in pure tool use? That is, one that assumes the model never actually knows the answer to a question (even if the base model does), and instead trains it to aggressively use tools to break the problem down into steps[3] - steps that are themselves mostly more tool calls, to query external sources, process information, simulate, etc., until the answer is computed. No direct answers, just tool calls glued together by thinking in terms of tool calls - thinking by tool calls.
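To make that concrete, here's a rough sketch of what one training example in such a dataset might look like. Everything in it - tool names, argument fields, the final "respond" step - is made up for illustration; I don't know of a dataset that actually uses this shape:

    # Hypothetical shape of one "pure tool use" training example: the assistant
    # never answers from memory; every assistant turn is a tool call, and the
    # target to imitate is the sequence of calls itself, not a free-form answer.
    # Tool names, argument fields, and placeholders are all invented.
    example = {
        "user": "Which of these two cities had more rainfall last year, A or B?",
        "assistant_turns": [
            {"tool": "web_search", "args": {"query": "annual rainfall city A last year"}},
            {"tool": "web_search", "args": {"query": "annual rainfall city B last year"}},
            {"tool": "calculator", "args": {"expression": "<rainfall_A> > <rainfall_B>"}},
            # Final turn just relays the computed result - no prose recall from weights.
            {"tool": "respond",    "args": {"from_step": 2}},
        ],
    }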
I wonder if this has been tried. It probably has, seeing how hot this area of research is today. If anyone knows of a paper or a dataset, I'd appreciate a link.
Anyway, I wonder what would happen if we tried this - basically retraining the model to trust its own toolbox, or, as some would say, to "shut up and multiply" - and did it across all tasks, not just math or coding ones.
--
[0] - Digital or otherwise.
[1] - Or the one tool that does all three, and which most people older than ~25 have likely used at least once in their lives: Microsoft Excel. Or any other spreadsheet app. Though for LLMs as they are now, I suppose a code interpreter would be a better unifying paradigm, being 1D instead of 2D.
[2] - E.g. changeNotesAndRethink("text", 0, 1) -> replace the current output with "text" and continue generation; changeNotesAndRethink("text", -1, 2) -> replace the fixed "assistant notes prompt" with "text", discard the last two outputs[4], and continue; etc. Honestly, I'm surprised I haven't seen this done so far - not in the popular places I know of, at least (vendor apps, TypingMind, ComfyUI); I've heard of some attempts long ago (back when LangChain was still seen as hot). Did giving the model control over the chat loop never pan out? Or is there some fundamental reason it doesn't work?
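For what it's worth, here's a minimal sketch of the kind of loop I'm imagining. generate() and parse_rethink_call() are stand-ins I'm inventing (any chat-completion call and any tool-call parser would do), and the semantics below are just one possible reading of my own made-up changeNotesAndRethink tool:

    # One possible reading of changeNotesAndRethink(text, target, discard):
    #   target ==  0 -> rewrite the model's own latest output
    #   target == -1 -> rewrite the persistent "assistant notes prompt"
    #   discard == N -> first drop the model's last N kept outputs
    # `generate` and `parse_rethink_call` are injected stand-ins, not real APIs.
    def run_turn(history, notes, generate, parse_rethink_call, max_steps=10):
        outputs = []  # assistant outputs kept so far within this turn
        for _ in range(max_steps):
            context = history + [{"role": "system", "content": notes}] + outputs
            reply = generate(context)
            call = parse_rethink_call(reply)  # (text, target, discard) or None
            if call is None:
                return reply                  # plain reply: the model is done rethinking
            text, target, discard = call
            if discard:
                outputs = outputs[:-discard]  # rewind the model's own last N outputs
            if target == 0:
                outputs.append({"role": "assistant", "content": text})
            elif target == -1:
                notes = text
        return reply                          # give up after too many self-edits

The point being that the model, not the host app, decides when the turn is actually over.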
[3] - I may have accidentally done this in-context with Claude 3.5 Sonnet - if I prompt it for chain-of-thought and happen to have Mermaid Diagram plugin enabled in TypingMind, it almost always ends up producing multiple diagrams as part of the CoT phase. Notably, this doesn't happen with my own equivalent plugin (PlantUML), so I wonder if it's just something about that specific tool, or if "thinking with (Mermaid) diagrams" was part of the training set.
EDIT:
[4] - APIs for tool-using models seem to allow several LLM outputs in a row. But that makes me think (and I apologize for this post being almost all footnotes, but ideas just keep coming) - what about rewinding past one or more user messages in a multi-turn conversation, while still retaining them? Like "Fill in the Middle" mode[5], just over the entire conversation instead of a single message?
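Something like this, roughly - a toy helper (made up, not any real API) that hollows out the assistant turns between two points of a conversation while keeping the user messages in place, so the model could be asked to regenerate the middle against both the prefix and the retained suffix:

    # Toy illustration: FIM over turns rather than tokens. Drop the assistant
    # replies between `start` and `end`, keep every user message, and hand the
    # model (prefix, retained middle user turns, suffix) to refill.
    # Hypothetical - no chat API I know of exposes this directly.
    def fim_over_turns(messages, start, end):
        prefix = messages[:start]
        retained_users = [m for m in messages[start:end] if m["role"] == "user"]
        suffix = messages[end:]
        return prefix, retained_users, suffix

    convo = [
        {"role": "user", "content": "first question"},
        {"role": "assistant", "content": "first answer"},
        {"role": "user", "content": "follow-up"},
        {"role": "assistant", "content": "second answer"},
        {"role": "user", "content": "please redo this from the top"},
    ]
    # Rewind past both assistant replies while keeping all user turns visible:
    prefix, kept_users, suffix = fim_over_turns(convo, start=1, end=4)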