This is a really natural extension of CoT. I was experimenting for a month or two with a similar concept in a hobby project this past spring: https://github.com/knexer/llmtaskgraph . I'm really excited to see more people exploring in this direction!
I was focusing more on an engineering perspective; modeling a complex LLM-and-code process as a dependency graph makes it easy to:
- add tracing to continuously measure and monitor even post-deployment
- perform reproducible experiments, a la time-rewinding debugging
- speed up iteration on prompts by caching the parts of the program you aren't working on right now
My test case was using GPT-4 to implement the operators in a genetic algorithm, which tbh is a fascinating concept in its own right. I drifted away after a while (curse that ADHD) but had a great time with the project in the meantime.
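In case it helps make the tracing/caching bullets concrete, here's a minimal sketch of the kind of dependency graph I mean. The class and function names below are made up for illustration, not llmtaskgraph's actual API, and call_llm stands in for a real completion call:

```python
# Minimal sketch of an LLM-and-code dependency graph with caching and tracing.
# Illustrative only: these names are not llmtaskgraph's actual API, and
# call_llm stands in for a real completion call.
import hashlib
import json


def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completions call."""
    raise NotImplementedError


class TaskGraph:
    def __init__(self):
        self.tasks = {}   # name -> (fn, dependency names)
        self.cache = {}   # cache key -> result, so untouched subgraphs aren't re-run
        self.trace = []   # execution trace for post-deployment observability

    def task(self, name, deps=()):
        def register(fn):
            self.tasks[name] = (fn, deps)
            return fn
        return register

    def run(self, name):
        fn, deps = self.tasks[name]
        inputs = [self.run(d) for d in deps]
        key = hashlib.sha256(json.dumps([name, inputs], default=str).encode()).hexdigest()
        if key not in self.cache:   # only the parts you're iterating on get re-executed
            self.cache[key] = fn(*inputs)
        self.trace.append({"task": name, "cache_key": key})
        return self.cache[key]


graph = TaskGraph()

@graph.task("outline")
def outline():
    return call_llm("Outline a blog post about task graphs.")

@graph.task("draft", deps=("outline",))
def draft(outline_text):
    return call_llm(f"Write a draft following this outline:\n{outline_text}")

# graph.run("draft") executes the whole graph, reusing cached upstream results
# and leaving a trace you can inspect after the fact.
```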
This is great; we've built eerily similar tooling for our internal projects. Unfortunately, in our experiments with OpenAI chat completions, the "reproducible" part has proven to simply not be possible. There's nothing more frustrating than spending an hour debugging your chain, only to realize a binary classification prompt has decided to flip after hundreds of consistent executions!
Oh yeah, I empathize with that frustration! I've come to the conclusion that LLM applications require a fundamentally different approach to reliability than we're used to. Traditional programming is about composing abstractions to build more complex abstractions; because the low-level abstractions are so close to perfect, we can build these immense towers of abstractions and still have effective guarantees on their behavior. With LLMs, perfect abstractions are impossible, and composing them naively will exponentially magnify their imperfections.
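To put a rough number on the "exponentially magnify" point, assuming (purely for illustration) a made-up 95% per-step reliability and independence between steps:

```python
# Back-of-the-envelope: if each step in a naive chain is independently "correct"
# 95% of the time (a made-up number), end-to-end reliability decays exponentially.
for steps in (1, 10, 50):
    print(f"{steps} steps -> {0.95 ** steps:.0%} end-to-end")
# 1 step -> 95%, 10 steps -> ~60%, 50 steps -> ~8%
```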
Instead, I think composing LLMs needs to be done in a way that degrades gracefully, with resilience to failure being a fundamental consideration. Biology has similar properties; complex biological systems (ecosystems, cells, etc) have feedback loops, redundancy, and most of all diversity. If we take a similar approach to building LLM apps, we'll end up with things like:
- multiple different prompts used in parallel, with results joined e.g. with voting, so a change in how one prompt behaves can only have a bounded effect on the system as a whole (see the sketch after this list)
- some way for an LLM to productively express 'this thing you're asking me to do is nonsense', with monitoring and continuous evaluation hooked up to that signal, and maybe runtime retry behavior as well. This helps when prompt A gets an "I'm afraid I can't do that" response and you then hand that to prompt B as if it were a valid answer, and the failure cascades through the rest of the application.
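A minimal sketch of what the voting idea could look like; call_llm and record_refusal are hypothetical stand-ins rather than any real API, and the prompt wording is made up:

```python
# Sketch of "multiple different prompts in parallel, joined by voting", plus a
# crude refusal signal. Illustrative only: call_llm and record_refusal are
# stand-ins, not a real API.
import asyncio
from collections import Counter

PROMPT_VARIANTS = [
    "Is the following review positive? Answer YES or NO.\n\n{text}",
    "Classify the sentiment of this review. Reply with exactly YES or NO.\n\n{text}",
    "Would the author recommend the product? Reply YES or NO only.\n\n{text}",
]

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm afraid")

async def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completions call."""
    raise NotImplementedError

def record_refusal(answer: str) -> None:
    """Hypothetical monitoring hook: surface refusals instead of passing them downstream."""
    print("refusal observed:", answer[:80])

async def classify(text: str) -> str:
    answers = await asyncio.gather(*(call_llm(p.format(text=text)) for p in PROMPT_VARIANTS))
    votes = []
    for answer in answers:
        lowered = answer.lower()
        if any(marker in lowered for marker in REFUSAL_MARKERS):
            record_refusal(answer)
            continue
        votes.append("YES" if "yes" in lowered else "NO")
    if not votes:
        raise RuntimeError("every prompt variant refused; don't cascade garbage downstream")
    # Majority vote bounds the blast radius of any single prompt drifting.
    return Counter(votes).most_common(1)[0][0]
```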
llmtaskgraph as a library is designed to make building, operating and maintaining systems with these sorts of features easier - without good observability, it's impossible to know if some feedback loop is doing its job, or which prompts in a pool are behaving well vs poorly, much less what effect they are having on the rest of the system.
Sorry for the wall of text, I got a bit nerd-sniped. :)
This is fantastic. I’d love to see a system that uses an LLM to generate knowledge graphs from academic papers to make them machine readable.
Some kind of prompt like “does paper P contain idea A and does it suggest that A is true.” Then you could automatically categorise citations by whether they agree/disagree with the cited paper.
Sometimes I see papers with 2,000 citations and I wonder: how many of those are dis/agreeing with the paper.
By that logic, why not just directly ask the LLM whether a citation agrees or not? You're already trusting it to be correct with that graph in the first place...
You are correct, I wrote that in a rush and mixed up examples in my head.
I don’t think you need to trust the LLMs for this kind of thing to be very useful. The LLM could generate the KG with every node labelled as “autogenerated.” When you use the graph for research, you are still going to read the papers you are interested in so you can then update the relevant citation node with the label “human checked.”
If a research group uses the same graph over time, the nodes will gradually become "trustworthy" (i.e. verified by humans). Maybe even get reviewers to update a paper's graph during review and publish that for other groups to add to their graphs.
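A hedged sketch of what those provenance labels could look like as a data structure; the class and field names are made up for illustration:

```python
# Sketch of a citation-graph edge carrying provenance, so autogenerated labels
# can be promoted to "human_checked" as researchers read the papers.
# All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class CitationEdge:
    citing_paper: str                   # e.g. a DOI
    cited_paper: str
    stance: str                         # "agrees" | "disagrees" | "neutral"
    provenance: str = "autogenerated"   # set by the LLM pass
    checked_by: list[str] = field(default_factory=list)

    def verify(self, reviewer: str, stance: str) -> None:
        """A human confirms (or corrects) the LLM-assigned stance."""
        self.stance = stance
        self.provenance = "human_checked"
        self.checked_by.append(reviewer)
```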
That would be great but if we can’t even convince most researchers to use open science principles then good luck convincing them to spend the time to convert their papers into graphs.
> Sometimes I see papers with 2,000 citations and I wonder: how many of those are dis/agreeing with the paper.
One example of an author that is very influential, despite causing a lot of disagreement (even in more than one discipline) is Noam Chomsky, who is also the most cited person alive, and the second most cited person in recorded history after Aristotle. His views about generative grammar are in part revolutionary, in part plain wrong; your assessment of his views about the Palestine conflict and U.S. foreign politics will largely depend on your political leanings; and his contribution to formal language theory is fundamental regardless of your leanings (Chomsky hierarchy; Chomsky Normal Form).
> This has already been studied. Negative citations are vanishingly rare. So virtually all of them will be either neutral or positive.
Might be a difference between science/engineering (where that's true) and the humanities (where a larger share is negative).
Though with an LLM and sufficient context length, you could probably just use that prompt directly on the academic paper without ever generating a knowledge graph.
The more I read about ML, the more I begin to believe that - psychologically speaking - hierarchical structures (esp. graphs and trees) are absolutely core to advanced information processing in general.
We've sneaked symbolic AI into the connectionist model by making a graph of thoughts. A graph can explicitly implement any algorithm or data structure.
They could make it more efficient by implementing a kind of "hard attention". Each token should have access to a sparse subset of the whole input, so it would be like a node in a graph only having access to a few neighbours. That could solve the very-large-context issue. It can also be parallelised, running all thought nodes at once, each with a sparse view of the whole input, making it much faster.
For example when reading a long book, the model would spawn nodes for each person, location or event of interest, and they would track the source text as the action develops. A mental map of the book. That would surely help a model deal with many moving pieces of information.
Or when solving a problem, the model could spawn a node to work on a subproblem, parametrised by the parent node with the right inputs. Then the node would report back with the answer and the parent continues. This would work with recursive calls.
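A rough sketch of that "spawn a node per subproblem" idea, assuming a hypothetical call_llm and a placeholder relevant_slice in place of any real sparse-attention mechanism:

```python
# Sketch of "spawn a node per subproblem": each node sees only the slice of
# context its parent hands it, solves its piece (possibly spawning children),
# and reports a short answer back up. Illustrative only; call_llm and
# relevant_slice are stand-ins, not a real API.
def call_llm(prompt: str) -> str:
    """Stand-in for a real completion call."""
    raise NotImplementedError

def relevant_slice(context: str, subproblem: str) -> str:
    """Hypothetical sparse view: in practice this could be retrieval or hard attention."""
    return context[:2000]

def solve(task: str, context: str, depth: int = 0, max_depth: int = 3) -> str:
    if depth < max_depth:
        plan = call_llm(
            f"Task: {task}\nContext: {context}\n"
            "List up to 3 subproblems, one per line, or reply DONE if none are needed."
        )
        if plan.strip() != "DONE":
            # Child nodes could run in parallel; kept sequential here for clarity.
            sub_answers = [
                solve(sub, relevant_slice(context, sub), depth + 1, max_depth)
                for sub in plan.splitlines() if sub.strip()
            ]
            context += "\nSubproblem results:\n" + "\n".join(sub_answers)
    return call_llm(f"Task: {task}\nContext: {context}\nGive a short final answer.")
```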
The new CPU is the LLM, and the clock tick is one token.
I suspect that you'll still find strong hierarchy in an optimized/well-performing graph of thought. The human brain, for example, also has recurrence, but it's limited.
It seems pretty intuitive that you'd get a "task / subtask" split for example, with feedback from the latter, but semantic content largely flowing from the former to the latter.
- Complex generalization with a simple unstated justification: last 'paper' like this was ToT, and a tree is a graph with constraints.
- Framework is discussed cognitively, with units of "thoughts" being "scored". (AutoGPT redux: having the LLM eat its own output repeatedly improves things but isn't a panacea.)
- Only sorting demonstrated "due to space constraints" -- unclear what that means; it seems much more likely it was self-imposed time constraints.
- Error rate is consistently 14%.
- ~10x the cost for ~15% error rate in sorting instead of ~30%
Cool to see it going this way. Over the last two years, we've been busy reinventing contracts and financial models as the dependency graphs that they are, to provide a deterministic, intermediate representation of this information in finance.
Still not sold that it'll fly in finance without that sort of observable, intermediate representation.
Weird that they claim to use arbitrary graphs, while in reality it's a restricted subclass of DAGs with one-vertex loops sort of allowed, except those don't really make sense to represent as loops.
I think the paper "Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models" (AoT) is a big competitor to this paper. It shows that we can get similar results to Tree of Thoughts using just a single query. AoT even seems to surpass ToT in some of the experiments.
I keep feeling that LLMs are one direction to address the thorny "common sense" issue of AI. Mountains of training text incorporate, probably, most common sense (and a lot of nonsense). It's beautiful to see so many ideas coming out right now to make better use of the models. Including the fast progress made with image generation.
The point of using number sorting for this paper is that it's:
A) difficult to impossible for an LLM to do in a single pass, and
B) easy to verify the correctness of.
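A minimal sketch of that verification step, assuming a hypothetical call_llm stand-in rather than any specific API:

```python
# Sketch of the "easy to verify" property: parse the model's reply and check it
# against a known-correct sort. call_llm is a stand-in, not a specific API.
import random
import re

def call_llm(prompt: str) -> str:
    """Stand-in for a real completion call."""
    raise NotImplementedError

def check_llm_sort(numbers: list[int]) -> bool:
    reply = call_llm(f"Sort these integers in ascending order, comma-separated: {numbers}")
    parsed = [int(x) for x in re.findall(r"-?\d+", reply)]
    return parsed == sorted(numbers)    # exact match against ground truth

# e.g. check_llm_sort(random.sample(range(1000), 64))
```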
In general, the point isn't finding things that only an LLM can do, but finding things that LLMs can do with decent results at a lower cost than getting a human to do it.
It is only difficult for an LLM to sort a list of numbers if the list is longer than half of the context window. (Source: I tested this myself[1].) The sorts are not error-free every time, but with sufficient training they become error-free the vast majority of the time, even for long lists. This is not especially surprising, because transformers are capable of directly representing sorting programs.[2]
Of course you can train a neural network to sort numbers, but I'm talking about a general LLM which hasn't been trained to sort numbers specifically. A GPT network trained specifically to sort numbers is not what I would consider a Large Language Model.
I don't think efficiency is important at this point. Finding that it's possible "this way" opens the door for more work and more applications. (Which doesn't prevent others from already working on efficiency.)
How well do these papers replicate? In some of my experiments with GPT-4, I've seen chain-of-thought-style prompting make answers noticeably worse than plainly asking a question once.