DynaSaur: Large Language Agents Beyond Predefined Actions

throwup238 · on Dec 1, 2024

It looks like the key insight here is to have the LLM generate its own tools (as in GPT/Claude tool calling) via Python code generation and apply cosine similarity RAG to select which tools are available at each step using the tool description and the problem/step while using recent history to error correct.

The agent starts with some human created tooling like a tool to read the file system or create another tool using python code, then starts accumulating custom Python functions it wrote itself with tool calling metadata like descriptions and input/output types. Every time it reaches a step, if it doesn't find a relevant tool, it creates a new one. Apparently this improves performance on complex tasks (via GAIA benchmark) with diminishing returns on simpler tasks.

IanCal · on Dec 1, 2024

I played around making these things before, it's a fun exercise. Interesting to see that's where things may be heading.

My example was asking for a poem about the headlines (good example of info they don't have, and something that's very hard to do mechanically).

https://news.ycombinator.com/item?id=37015591

llm_trw · on Dec 1, 2024

I ended up training a bert on nothing but python for the embedding search. The results were crap. Then I used an llm to write a new docstring for each class/function definition in the training data and the results were better than state of the art.

There's so much wide open space to explore. It's a shame that everyone is wasting their time with the biggest possible models they can afford.

digdugdirk · on Dec 1, 2024

Do you have any more detailed info on this process? I've played around with using LLMs, but nothing in the training realm. I'd love to see a writeup or guide to the process you used there.

llm_trw · on Dec 1, 2024

No and it won't do you much good even if I did.

The tools have broken again since then - thanks tensorflow data loaders - and my code only works against a version of python that's no longer supported in LTS Ubuntu/Debian10+.

I have been mulling about running a subscription service where you get up to date code that works on topics like the above. If you're interested drop me a line at my profile email and I'll add you to a mailing list when/if I ever get around to doing it.

thom · on Dec 1, 2024

Seems like you could go further than this with something like DSPy and start evaluating which tools contribute to successful outcomes. Funny how much things start to look like Eurisko the more time goes on.

mountainriver · on Dec 1, 2024

This is what Voyager did awhile back, it’s interesting but I think only part of the answer

80hd · on Dec 1, 2024

Putting this idea out there, haven't seen anyone implement it:

Use vector embeddings to represent each task as a story, an abstraction of 1. the past, 2. the present, 3. the future - on a kind of global "story map".

Each embedding would be generated by all available sense inputs at a point in time. The most useful embeddings alg will be able to combine sight, hearing, internal monologue, visual imagination etc into one point on a high-dimensional map.

At each time step, find the closest successful "memory" (based on embedding of 1+2+3) and do some LLM exploration to adapt the memory to the new, novel situation.

Attempt the new "story", and do something like A* to get closer to the desired "future", tweaking the story each time and plotting failed attempts on the embedding map.

Theory being that over time, the map will become populated with successful attempts and embedding will be able to abstract between similar situations based on 1+2+3.

I'm not the guy to implement it, and I imagine new models training with a "reasoning step" are doing a similar thing at training-time.

johnsutor · on Dec 1, 2024

Interesting idea. Similarly, recent work appears to have used MCTS to explore sequential multi-agent systems (see https://arxiv.org/abs/2410.10762, https://arxiv.org/abs/2410.17238).

bongodongobob · on Dec 1, 2024

What do you mean by a story? Like a book?

80hd · on Dec 2, 2024

Story in the sense that we understand everything (perhaps even our most fundamental perceptions) through stories - events described over time with meaning/significance ascribed to particular things. There's a beginning, middle and end - in its most basic form.

If we model "situations" in AI in a similar way, my intuition tells me it would be similarly useful.

adtac · on Dec 1, 2024

The paper evaluates itself on the GAIA benchmark and it was my first time hearing about it, so I tried to evaluate myself as a human.

Here's a level 3 question from the GAIA paper (level 3 = hardest):

>In NASA’s Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute? Exclude any astronauts who did not spend any time in space. Give the last name of the astronaut, separated from the number of minutes by a semicolon. Use commas as thousands separators in the number of minutes.

I timed myself solving the problem. It took me 9 minutes, 5 Google searches, 14 web pages, multiple Ctrl+F in these pages and 1 calculator use to figure out the answer.

DynaSaur seems to have a 10% to 20% success rate at this level.

Try for yourself. This is one of the few empirically grounded reference levels for how far we are from AGI.

ethbr1 · on Dec 1, 2024

That seems similar to a ~7th grade reading comprehension question, if all the facts where at hand.

Out of curiosity, if anyone knows, what's SOTA for how well LLMs actually parse (English) grammar? In the way they're looking at the prompt.

A lot of correctness to the challenge questions seems to be identifying key phrases and requests. I.e. reading comprehension.

And multi-step tool use requires a higher bar than straight summarization, as one must more particularly differentiate between alternative information to focus on.

adtac · on Dec 1, 2024

The question above was not preceded by anything; that was the whole question. The facts are at hand in the sense that you have the internet and you're allowed to use it. The hard part is knowing what to search and recognising the answer when you see it. This is much harder than any 7th grade comprehension test I've done :)

golol · on Dec 1, 2024

I don't like the way LLM papers are written. LLMs receive inputs and produce outputs that are best represented as plaintext with some special characters. Simply showing a few examples of the agent's core LLM text continuation job would explain the architecture much better than figures. I can't help but feel that the authors which do this are intentionally obfuscating things.

ukuina · on Dec 1, 2024

I suspect they do this to give the paper more weight than a mere prompt deserves. As an example:

> Given a task u \in \mathcal{U} and a human-designed action set \mathcal{A}u with R \in \mathcal{A}u , at time step t , we sample a thought-action pair (h_t, a_t) \sim \pi\theta(a_t \mid \mathcal{A}u, u, c{t-1}) following the ReAct framework (Yao et al., 2023b). Here, c{t-1} = \{(h_1, a_1, o_1), \dots, (h_{t-1}, a_{t-1}, o_{t-1})\} represents the interaction history up to time t-1 . The action a_t is executed, and an observation o_t is returned from the environment, updating the context to c_t = c_{t-1} \cup \{(h_t, a_t, o_t)\} . If a_t contains a new function not present in \mathcal{A}_{t-1}^g , we update the generated action set by setting \mathcal{A}t^g = \mathcal{A}{t-1}^g \cup f(a_t) , where f(a_t) denotes the set of functions defined in action a_t .

This is a roundabout way to say: "We pick an action based on what’s happened so far, do it, see the result, and update the history. If it’s something new, we add it to the list of actions we can use."

MattPalmer1086 · on Dec 1, 2024

Hah, I had a paper published this year. My co-authors are academics but I am not. Honestly, I couldn't understand the first version of the paper we wrote, despite inventing the algorithm it described!

There is definitely a certain language and a precise mathematical approach which is needed to pass review for academic papers. It isn't nonsense, but does obfuscate obvious meanings.

NitpickLawyer · on Dec 1, 2024

The funny thing is that we're getting closer to being able to give that paragraph to an LLM and have it spit out the simpler explanation.

This is what chatgpt gave me for the prompt "can you explain this in two sentences". It's pretty close to what you wrote.

> The system follows the ReAct framework to decide on a thought and action at each step based on the task, available actions, and interaction history, updating its context with the results of the action. If the action introduces new functions, the system expands its action set to include these new capabilities.

ukuina · on Dec 4, 2024

Isn't that still too wordy? It has that inescapable "LLM-ness" to it.

PunchTornado · on Dec 1, 2024

I agree. And I view it as intellectual imposture. Instead of saying something really simple which can give good results, you obfuscate it a lot to make it sound more intelligent. Reviewers shouldn't accept these kind of papers and I'm thinking that we need a Sokal moment in AI research.

lbeurerkellner · on Dec 1, 2024

Typically these logs are available, but hard to read or just dumped in some JSON file.

However, they have been efforts like https://explorer.invariantlabs.ai/benchmarks/ that try to make agents more transparent in that way (show interaction logs).

exe34 · on Dec 1, 2024

it's a delaying tactic, isn't it, if you're working on version 2, you don't want version 1 to be too obvious that somebody might scoop your version 2. I just wish the reviewers would clamp down on this kind of thing.

mosses-sandals · on Dec 1, 2024

Writers basically said in this paper let us just save certain amounts of working code snippets generated by llm and hope that they are also needed in the future, at the same time concluding that the saved code is sparsed , so this research paper at this stage is just useless.

bloomingkales · on Dec 1, 2024

I don’t know if it’s useless, but as someone with no background in ML, I’ve ad-hoc come up with the exact same idea playing around with LLMs.

So, is this just a low hanging fruit idea that looks authoritative because it’s in an academic format?

mosses-sandals · on Dec 2, 2024

Yes it is just that a low hanging fruit

wokwokwok · on Dec 1, 2024

This is super big news if it’s real.

Basically, given an agent with an initial set of predefined actions and goal, they’re saying “decompose this into steps and pick and action to achieve each step”. Pretty standard stuff.

Then they say, hey, if you can’t solve the problem with those actions (ie. failed repeatedly when attempting to solve), write some arbitrary generic python code and use that as your action for the next step.

Then save that as a new generic action, and slowly build up a library of actions to augment the initial set.

The thing is, there’s no meaningful difference between the task “write code to solve this task” and “write code to solve this action”; if you can meaningfully generate code that can, without error, perform arbitrary tasks, you’ve basically solved programming.

So… that would be quite a big deal.

That would be a real “Devon” that would actually be able to write arbitrary code to solve arbitrary problems.

…which makes me a bit sceptical.

Still, this seems to have at least worked reasonably well (as shown by being a leader on the GAIA leaderboard) so they seem to have done something that works, but I’m left wondering…

If you’ve figured out how to get an agent to write error free deterministic code to perform arbitrary actions in a chain of thought process, why are you pissing around with worrying about accumulating a library of agent actions?

That’s all entirely irrelevant and unnecessary.

Just generate code for each step.

So… something seems a bit strange around this.

I’d love to see a log of the actual problem / action / code sequences.

Kiro · on Dec 1, 2024

Devin is real. What do you mean?

Anyway, this is pretty standard stuff already. In all my agent workflows the agents are able to write their own code and execute it before passing the result to the next agent. It doesn't need to be perfect since you always have an agent validating the results, sending the task back if necessary.

I haven't read the paper beyond the synopsis so I might be missing a crucial key takeaway and I presume it has a lot of additional layers.

wokwokwok · on Dec 1, 2024

As evidenced by the reaction to Devin, no, it’s not real.

There’s a limit, beyond which agent generated code is, in general, not reliable.

All of the people who claim otherwise (like the Devin videos) have shown to be fake (1) or cherry-picked.

Having agent generated code is arbitrary code to solve arbitrary problems is. Not. A. Solved. Problem.

Yet.

…no matter, no matter how many AI bros claim otherwise, currently.

Being able to decompose complex problems into part small enough to be able to be solved by current models would be a big deal if it was real.

(Because, currently the SoTA can’t reliably do this; this should not be a remotely controversial claim to people familiar with this space)

So tldr; extraordinary claims require extraordinary evidence. Which is absent here, as far as I can tell. They specifically call out in the paper that generated actions are overly specific and don’t always work; but as I said, it’s doing well on the leader board, so it’s clearly doing something, which is working, but there’s just noooooo way of seeing what.

[1] - https://www.zeniteq.com/blog/devins-demo-as-the-first-ai-sof...

IanCal · on Dec 1, 2024

> If you’ve figured out how to get an agent to write error free deterministic code to perform arbitrary actions in a chain of thought process

You don't have to have it perfect, and the more you reuse things that you know work the less you have to build each time (reducing places for errors)

> Just generate code for each step.

We don't do this as humans, we build and reuse pieces.

quicheshore · on Dec 1, 2024

This is a great application of dynamic tooling. But figure 5 is kind of flawed. It’s not a fair comparison, when the tool call you provide doesn’t work. Obviously the LLM with code execution capabilities will do better.

killerstorm · on Dec 1, 2024

Generating code to do stuff was the idea of OpenAI Codex in 2021.

This paper basically just adds a cache? Not really novel as we already have Codex, Code Interpreter, etc.