This is a really natural extension of CoT. I was experimenting for a month or two with a similar concept in a hobby project this past spring: https://github.com/knexer/llmtaskgraph . I'm really excited to see more people exploring in this direction!
I was focusing more on an engineering perspective; modeling a complex LLM-and-code process as a dependency graph (rough sketch after the list below) makes it easy to:
- add tracing to continuously measure and monitor even post-deployment
- perform reproducible experiments, a la time-rewinding debugging
- speed up iteration on prompts by caching the parts of the program you aren't working on right now
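To make that concrete, here's a toy sketch of the shape I mean (the names are made up for illustration, not the actual llmtaskgraph API, and fake_llm stands in for a real model call):

```python
# Toy sketch: tasks declare their dependencies; the runner records a trace of
# everything that ran and reuses cached outputs, so you can iterate on one
# prompt without re-executing (and re-paying for) everything upstream.
import json

def fake_llm(prompt):
    # Stand-in for a real chat-completion call.
    return f"<response to: {prompt[:40]}>"

class TaskGraph:
    def __init__(self):
        self.tasks = {}   # task name -> (function, list of dependency names)
        self.cache = {}   # task name -> cached output
        self.trace = []   # ordered record of what actually ran

    def task(self, name, deps=()):
        def register(fn):
            self.tasks[name] = (fn, list(deps))
            return fn
        return register

    def run(self, name):
        if name in self.cache:                    # skip / replay unchanged work
            return self.cache[name]
        fn, deps = self.tasks[name]
        inputs = [self.run(dep) for dep in deps]  # resolve dependencies first
        output = fn(*inputs)
        self.trace.append({"task": name, "output": output})
        self.cache[name] = output
        return output

graph = TaskGraph()

@graph.task("summarize")
def summarize():
    return fake_llm("Summarize this customer feedback: ...")

@graph.task("classify", deps=["summarize"])
def classify(summary):
    return fake_llm(f"Is this feedback positive or negative? {summary}")

print(graph.run("classify"))              # runs summarize, then classify
print(json.dumps(graph.trace, indent=2))  # the trace doubles as a debugging record
```

The cache is what makes the "only re-run the prompt you're editing" workflow cheap, and persisting the trace is what makes a run replayable later.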
My test case was using GPT-4 to implement the operators in a genetic algorithm, which tbh is a fascinating concept in its own right. I drifted away after a while (curse that ADHD), but had a great time with the project in the meantime.
This is great; we've built eerily similar tooling for our internal projects. Unfortunately, in our experiments with OpenAI chat completions, the "reproducible" part has proven simply not to be possible. There's nothing more frustrating than spending an hour debugging your chain, only to realize a binary classification prompt has decided to flip after hundreds of consistent executions!
Oh yeah, I empathize with that frustration! I've come to the conclusion that LLM applications require a fundamentally different approach to reliability than we're used to. Traditional programming is about composing abstractions to build more complex abstractions; because the low-level abstractions are so close to perfect, we can build these immense towers of abstractions and still have effective guarantees on their behavior. With LLMs, perfect abstractions are impossible, and composing them naively will exponentially magnify their imperfections.
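To put a rough number on "exponentially magnify" (pure back-of-the-envelope, and it assumes each step fails independently, which is probably generous):

```python
# Naive chain of steps that are each 95% reliable:
for n_steps in (1, 5, 10, 20):
    print(n_steps, "steps ->", round(0.95 ** n_steps, 2))
# 1 steps -> 0.95, 5 steps -> 0.77, 10 steps -> 0.6, 20 steps -> 0.36
```

Twenty pretty-good steps composed naively get you a pipeline that works a little over a third of the time, and that's before any failures feed into each other.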
Instead, I think composing LLMs needs to be done in a way that degrades gracefully, with resilience to failure being a fundamental consideration. Biology has similar properties; complex biological systems (ecosystems, cells, etc.) have feedback loops, redundancy, and most of all, diversity. If we take a similar approach to building LLM apps, we'll end up with things like:
- multiple different prompts used in parallel, with results joined e.g. with voting. A change in how one prompt behaves can thus only have a bounded effect on the system as a whole.
- some way for an LLM to productively express 'this thing you're asking me to do is nonsense', with monitoring and continuous evaluation hooked up to that signal, and maybe runtime retry behavior as well. This helps with situations where prompt A gets an "I'm afraid I can't do that" response, prompt B is then fed that refusal as if it were a valid answer, and the garbage cascades through the rest of the application. (A rough sketch of both ideas follows this list.)
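Something like this is what I have in mind for both; it's only a sketch, with fake_llm standing in for a real model call and a deliberately crude refusal check:

```python
# Sketch: several differently-worded prompts for the same classification;
# refusals are filtered out and surfaced to monitoring, the rest get a
# majority vote, so one misbehaving prompt can only shift one vote.
from collections import Counter

def fake_llm(prompt):
    # Stand-in for a real chat-completion call.
    return "positive"

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm afraid", "as an ai")

def looks_like_refusal(answer):
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def classify_with_voting(text, prompt_variants, on_refusal=print):
    votes = []
    for variant in prompt_variants:
        answer = fake_llm(variant.format(text=text)).strip()
        if looks_like_refusal(answer):
            on_refusal(f"refusal from variant: {variant!r}")  # hook for monitoring / retries
            continue
        votes.append(answer.lower())
    if not votes:
        return None  # every variant bailed; fail loudly rather than pass garbage downstream
    return Counter(votes).most_common(1)[0][0]

variants = [
    "Is the sentiment of this review positive or negative? {text}",
    "Label this review 'positive' or 'negative': {text}",
    "Does the author sound happy or unhappy? Answer positive/negative. {text}",
]
print(classify_with_voting("Loved it, would buy again.", variants))
```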
llmtaskgraph as a library is designed to make building, operating and maintaining systems with these sorts of features easier - without good observability, it's impossible to know if some feedback loop is doing its job, or which prompts in a pool are behaving well vs poorly, much less what effect they are having on the rest of the system.
Sorry for the wall of text, I got a bit nerd-sniped. :)