This is a really natural extension of CoT. I was experimenting for a month or two with a similar concept in a hobby project this past spring: https://github.com/knexer/llmtaskgraph . I'm really excited to see more people exploring in this direction!
I was focusing more on an engineering perspective; modeling a complex LLM-and-code process as a dependency graph (rough sketch after the list below) makes it easy to:
- add tracing to continuously measure and monitor even post-deployment
- perform reproducible experiments, a la time-rewinding debugging
- speed up iteration on prompts by caching the parts of the program you aren't working on right now
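To make that concrete, here's a toy sketch of the shape I mean (the names are made up for illustration, not the actual llmtaskgraph API, and fake_llm stands in for a real model call):

```python
# Toy sketch: tasks declare their dependencies; the runner records a trace of
# everything that ran and reuses cached outputs, so you can iterate on one
# prompt without re-executing (and re-paying for) everything upstream.
import json

def fake_llm(prompt):
    # Stand-in for a real chat-completion call.
    return f"<response to: {prompt[:40]}>"

class TaskGraph:
    def __init__(self):
        self.tasks = {}   # task name -> (function, list of dependency names)
        self.cache = {}   # task name -> cached output
        self.trace = []   # ordered record of what actually ran

    def task(self, name, deps=()):
        def register(fn):
            self.tasks[name] = (fn, list(deps))
            return fn
        return register

    def run(self, name):
        if name in self.cache:                    # skip / replay unchanged work
            return self.cache[name]
        fn, deps = self.tasks[name]
        inputs = [self.run(dep) for dep in deps]  # resolve dependencies first
        output = fn(*inputs)
        self.trace.append({"task": name, "output": output})
        self.cache[name] = output
        return output

graph = TaskGraph()

@graph.task("summarize")
def summarize():
    return fake_llm("Summarize this customer feedback: ...")

@graph.task("classify", deps=["summarize"])
def classify(summary):
    return fake_llm(f"Is this feedback positive or negative? {summary}")

print(graph.run("classify"))              # runs summarize, then classify
print(json.dumps(graph.trace, indent=2))  # the trace doubles as a debugging record
```

The cache is what makes the "only re-run the prompt you're editing" workflow cheap, and persisting the trace is what makes a run replayable later.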
My test case was using GPT-4 to implement the operators in a genetic algorithm, which tbh is a fascinating concept in its own right. I drifted away after a while (curse that ADHD), but had a great time with the project in the meantime.
This is great; we've built eerily similar tooling for our internal projects. Unfortunately, in our experiments with OpenAI chat completions, the "reproducible" part has proven simply not to be possible. There's nothing more frustrating than spending an hour debugging your chain, only to realize a binary classification prompt has decided to flip after hundreds of consistent executions!
Oh yeah, I empathize with that frustration! I've come to the conclusion that LLM applications require a fundamentally different approach to reliability than we're used to. Traditional programming is about composing abstractions to build more complex abstractions; because the low-level abstractions are so close to perfect, we can build these immense towers of abstractions and still have effective guarantees on their behavior. With LLMs, perfect abstractions are impossible, and composing them naively will exponentially magnify their imperfections.
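To put a rough number on "exponentially magnify" (pure back-of-the-envelope, and it assumes each step fails independently, which is probably generous):

```python
# Naive chain of steps that are each 95% reliable:
for n_steps in (1, 5, 10, 20):
    print(n_steps, "steps ->", round(0.95 ** n_steps, 2))
# 1 steps -> 0.95, 5 steps -> 0.77, 10 steps -> 0.6, 20 steps -> 0.36
```

Twenty pretty-good steps composed naively get you a pipeline that works a little over a third of the time, and that's before any failures feed into each other.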
Instead, I think composing LLMs needs to be done in a way that degrades gracefully, with resilience to failure being a fundamental consideration. Biology has similar properties; complex biological systems (ecosystems, cells, etc.) have feedback loops, redundancy, and most of all, diversity. If we take a similar approach to building LLM apps, we'll end up with things like:
- multiple different prompts used in parallel, with results joined e.g. with voting. A change in how one prompt behaves can thus only have a bounded effect on the system as a whole.
- some way for an LLM to productively express 'this thing you're asking me to do is nonsense', with monitoring and continuous evaluation hooked up to that signal, and maybe runtime retry behavior as well. This helps with situations where prompt A gets an "I'm afraid I can't do that" response, prompt B is then fed that refusal as if it were a valid answer, and the garbage cascades through the rest of the application. (A rough sketch of both ideas follows this list.)
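Something like this is what I have in mind for both; it's only a sketch, with fake_llm standing in for a real model call and a deliberately crude refusal check:

```python
# Sketch: several differently-worded prompts for the same classification;
# refusals are filtered out and surfaced to monitoring, the rest get a
# majority vote, so one misbehaving prompt can only shift one vote.
from collections import Counter

def fake_llm(prompt):
    # Stand-in for a real chat-completion call.
    return "positive"

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm afraid", "as an ai")

def looks_like_refusal(answer):
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def classify_with_voting(text, prompt_variants, on_refusal=print):
    votes = []
    for variant in prompt_variants:
        answer = fake_llm(variant.format(text=text)).strip()
        if looks_like_refusal(answer):
            on_refusal(f"refusal from variant: {variant!r}")  # hook for monitoring / retries
            continue
        votes.append(answer.lower())
    if not votes:
        return None  # every variant bailed; fail loudly rather than pass garbage downstream
    return Counter(votes).most_common(1)[0][0]

variants = [
    "Is the sentiment of this review positive or negative? {text}",
    "Label this review 'positive' or 'negative': {text}",
    "Does the author sound happy or unhappy? Answer positive/negative. {text}",
]
print(classify_with_voting("Loved it, would buy again.", variants))
```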
llmtaskgraph as a library is designed to make building, operating and maintaining systems with these sorts of features easier - without good observability, it's impossible to know if some feedback loop is doing its job, or which prompts in a pool are behaving well vs poorly, much less what effect they are having on the rest of the system.
Sorry for the wall of text, I got a bit nerd-sniped. :)