This is a really natural extension of CoT. I was experimenting for a month or two with a similar concept in a hobby project this past spring: https://github.com/knexer/llmtaskgraph . I'm really excited to see more people exploring in this direction!
I was focusing more on an engineering perspective; modeling a complex LLM-and-code process as a dependency graph makes it easy to:
- add tracing to continuously measure and monitor even post-deployment
- perform reproducible experiments, a la time-rewinding debugging
- speed up iteration on prompts by caching the parts of the program you aren't working on right now
My test case was using GPT-4 to implement the operators in a genetic algorithm, which tbh is a fascinating concept in its own right. I drifted away after a while (curse that ADHD) but had a great time with the project in the meantime.
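In case it helps make the tracing/caching bullets concrete, here's a minimal sketch of the kind of dependency graph I mean. The class and function names below are made up for illustration, not llmtaskgraph's actual API, and call_llm stands in for a real completion call:

```python
# Minimal sketch of an LLM-and-code dependency graph with caching and tracing.
# Illustrative only: these names are not llmtaskgraph's actual API, and
# call_llm stands in for a real completion call.
import hashlib
import json


def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completions call."""
    raise NotImplementedError


class TaskGraph:
    def __init__(self):
        self.tasks = {}   # name -> (fn, dependency names)
        self.cache = {}   # cache key -> result, so untouched subgraphs aren't re-run
        self.trace = []   # execution trace for post-deployment observability

    def task(self, name, deps=()):
        def register(fn):
            self.tasks[name] = (fn, deps)
            return fn
        return register

    def run(self, name):
        fn, deps = self.tasks[name]
        inputs = [self.run(d) for d in deps]
        key = hashlib.sha256(json.dumps([name, inputs], default=str).encode()).hexdigest()
        if key not in self.cache:   # only the parts you're iterating on get re-executed
            self.cache[key] = fn(*inputs)
        self.trace.append({"task": name, "cache_key": key})
        return self.cache[key]


graph = TaskGraph()

@graph.task("outline")
def outline():
    return call_llm("Outline a blog post about task graphs.")

@graph.task("draft", deps=("outline",))
def draft(outline_text):
    return call_llm(f"Write a draft following this outline:\n{outline_text}")

# graph.run("draft") executes the whole graph, reusing cached upstream results
# and leaving a trace you can inspect after the fact.
```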
This is great; we've built eerily similar tooling for our internal projects. Unfortunately, in our experiments with OpenAI chat completions, the "reproducible" part has proven to simply not be possible. There's nothing more frustrating than spending an hour debugging your chain, only to realize a binary classification prompt has decided to flip after hundreds of consistent executions!
Oh yeah, I empathize with that frustration! I've come to the conclusion that LLM applications require a fundamentally different approach to reliability than we're used to. Traditional programming is about composing abstractions to build more complex abstractions; because the low-level abstractions are so close to perfect, we can build these immense towers of abstractions and still have effective guarantees on their behavior. With LLMs, perfect abstractions are impossible, and composing them naively will exponentially magnify their imperfections.
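To put a rough number on the "exponentially magnify" point, assuming (purely for illustration) a made-up 95% per-step reliability and independence between steps:

```python
# Back-of-the-envelope: if each step in a naive chain is independently "correct"
# 95% of the time (a made-up number), end-to-end reliability decays exponentially.
for steps in (1, 10, 50):
    print(f"{steps} steps -> {0.95 ** steps:.0%} end-to-end")
# 1 step -> 95%, 10 steps -> ~60%, 50 steps -> ~8%
```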
Instead, I think composing LLMs needs to be done in a way that degrades gracefully, with resilience to failure being a fundamental consideration. Biology has similar properties; complex biological systems (ecosystems, cells, etc) have feedback loops, redundancy, and most of all diversity. If we take a similar approach to building LLM apps, we'll end up with things like:
- multiple different prompts used in parallel, with results joined e.g. with voting, so a change in how one prompt behaves can only have a bounded effect on the system as a whole (see the sketch after this list)
- some way for an LLM to productively express 'this thing you're asking me to do is nonsense', with monitoring and continuous evaluation hooked up to that signal, and maybe runtime retry behavior as well. This helps when prompt A gets an "I'm afraid I can't do that" response and you then hand that to prompt B as if it were a valid answer, and the failure cascades through the rest of the application.
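A minimal sketch of what the voting idea could look like; call_llm and record_refusal are hypothetical stand-ins rather than any real API, and the prompt wording is made up:

```python
# Sketch of "multiple different prompts in parallel, joined by voting", plus a
# crude refusal signal. Illustrative only: call_llm and record_refusal are
# stand-ins, not a real API.
import asyncio
from collections import Counter

PROMPT_VARIANTS = [
    "Is the following review positive? Answer YES or NO.\n\n{text}",
    "Classify the sentiment of this review. Reply with exactly YES or NO.\n\n{text}",
    "Would the author recommend the product? Reply YES or NO only.\n\n{text}",
]

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm afraid")

async def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completions call."""
    raise NotImplementedError

def record_refusal(answer: str) -> None:
    """Hypothetical monitoring hook: surface refusals instead of passing them downstream."""
    print("refusal observed:", answer[:80])

async def classify(text: str) -> str:
    answers = await asyncio.gather(*(call_llm(p.format(text=text)) for p in PROMPT_VARIANTS))
    votes = []
    for answer in answers:
        lowered = answer.lower()
        if any(marker in lowered for marker in REFUSAL_MARKERS):
            record_refusal(answer)
            continue
        votes.append("YES" if "yes" in lowered else "NO")
    if not votes:
        raise RuntimeError("every prompt variant refused; don't cascade garbage downstream")
    # Majority vote bounds the blast radius of any single prompt drifting.
    return Counter(votes).most_common(1)[0][0]
```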
llmtaskgraph as a library is designed to make building, operating and maintaining systems with these sorts of features easier - without good observability, it's impossible to know if some feedback loop is doing its job, or which prompts in a pool are behaving well vs poorly, much less what effect they are having on the rest of the system.
Sorry for the wall of text, I got a bit nerd-sniped. :)
This is fantastic. I’d love to see a system that uses an LLM to generate knowledge graphs from academic papers to make them machine readable.
Some kind of prompt like “does paper P contain idea A and does it suggest that A is true.” Then you could automatically categorise citations by whether they agree/disagree with the cited paper.
Sometimes I see papers with 2,000 citations and I wonder: how many of those are dis/agreeing with the paper.
By that logic, why not just directly ask the LLM whether a citation agrees or not? You're already trusting it to be correct with that graph in the first place...
You are correct, I wrote that in a rush and mixed up examples in my head.
I don’t think you need to trust the LLMs for this kind of thing to be very useful. The LLM could generate the KG with every node labelled as “autogenerated.” When you use the graph for research, you are still going to read the papers you are interested in so you can then update the relevant citation node with the label “human checked.”
If a research group uses the same graph over time, the nodes will gradually become "trustworthy" (i.e. verified by humans). Maybe even get reviewers to update a paper's graph during review and publish that for other groups to add to their graphs.
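A hedged sketch of what those provenance labels could look like as a data structure; the class and field names are made up for illustration:

```python
# Sketch of a citation-graph edge carrying provenance, so autogenerated labels
# can be promoted to "human_checked" as researchers read the papers.
# All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class CitationEdge:
    citing_paper: str                   # e.g. a DOI
    cited_paper: str
    stance: str                         # "agrees" | "disagrees" | "neutral"
    provenance: str = "autogenerated"   # set by the LLM pass
    checked_by: list[str] = field(default_factory=list)

    def verify(self, reviewer: str, stance: str) -> None:
        """A human confirms (or corrects) the LLM-assigned stance."""
        self.stance = stance
        self.provenance = "human_checked"
        self.checked_by.append(reviewer)
```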
That would be great but if we can’t even convince most researchers to use open science principles then good luck convincing them to spend the time to convert their papers into graphs.
> Sometimes I see papers with 2,000 citations and I wonder: how many of those are dis/agreeing with the paper.
One example of an author that is very influential, despite causing a lot of disagreement (even in more than one discipline) is Noam Chomsky, who is also the most cited person alive, and the second most cited person in recorded history after Aristotle. His views about generative grammar are in part revolutionary, in part plain wrong; your assessment of his views about the Palestine conflict and U.S. foreign politics will largely depend on your political leanings; and his contribution to formal language theory is fundamental regardless of your leanings (Chomsky hierarchy; Chomsky Normal Form).
> This has already been studied. Negative citations are vanishingly rare. So virtually all of them will be either neutral or positive.
Might be a difference between science/engineering (where that's true) and the humanities (where a larger share is negative).
Though with an LLM and sufficient context length, you could probably just use that prompt directly on the academic paper without ever generating a knowledge graph.
The more I read about ML, the more I begin to believe that - psychologically speaking - hierarchical structures (esp. graphs and trees) are absolutely core to advanced information processing in general.
We've sneaked symbolic AI into the connectionist model by making a graph of thoughts. A graph can explicitly implement any algorithm or data structure.
They could make it more efficient by implementing a kind of "hard attention". Each token should have access to a sparse subset of the whole input, so it would be like a node in a graph only having access to a few neighbours. That could solve the very-large-context issue. It can also be parallelised, running all thought nodes at once, each with a sparse view of the whole input, making it much faster.
For example when reading a long book, the model would spawn nodes for each person, location or event of interest, and they would track the source text as the action develops. A mental map of the book. That would surely help a model deal with many moving pieces of information.
Or when solving a problem, the model could spawn a node to work on a subproblem, parametrised by the parent node with the right inputs. Then the node would report back with the answer and the parent continues. This would work with recursive calls.
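A rough sketch of that "spawn a node per subproblem" idea, assuming a hypothetical call_llm and a placeholder relevant_slice in place of any real sparse-attention mechanism:

```python
# Sketch of "spawn a node per subproblem": each node sees only the slice of
# context its parent hands it, solves its piece (possibly spawning children),
# and reports a short answer back up. Illustrative only; call_llm and
# relevant_slice are stand-ins, not a real API.
def call_llm(prompt: str) -> str:
    """Stand-in for a real completion call."""
    raise NotImplementedError

def relevant_slice(context: str, subproblem: str) -> str:
    """Hypothetical sparse view: in practice this could be retrieval or hard attention."""
    return context[:2000]

def solve(task: str, context: str, depth: int = 0, max_depth: int = 3) -> str:
    if depth < max_depth:
        plan = call_llm(
            f"Task: {task}\nContext: {context}\n"
            "List up to 3 subproblems, one per line, or reply DONE if none are needed."
        )
        if plan.strip() != "DONE":
            # Child nodes could run in parallel; kept sequential here for clarity.
            sub_answers = [
                solve(sub, relevant_slice(context, sub), depth + 1, max_depth)
                for sub in plan.splitlines() if sub.strip()
            ]
            context += "\nSubproblem results:\n" + "\n".join(sub_answers)
    return call_llm(f"Task: {task}\nContext: {context}\nGive a short final answer.")
```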
The new CPU is the LLM, and the clock tick is one token.
I suspect that you'll still find strong hierarchy in an optimized/well-performing graph of thought. The human brain, for example, also has recurrence, but it's limited.
It seems pretty intuitive that you'd get a "task / subtask" split for example, with feedback from the latter, but semantic content largely flowing from the former to the latter.
- Complex generalization with a simple unstated justification: last 'paper' like this was ToT, and a tree is a graph with constraints.
- Framework is discussed cognitively, with units of "thoughts" being "scored". (AutoGPT redux: having the LLM eat its own output repeatedly improves things but isn't a panacea.)
- Only sorting demonstrated "due to space constraints" -- unclear what that means; it seems much more likely it was self-imposed time constraints.
- Error rate is consistently 14%.
- ~10x the cost for ~15% error rate in sorting instead of ~30%
Cool to see it going this way. Over the last two years, we've been busy reinventing contracts and financial models as the dependency graphs that they are, to provide a deterministic, intermediate representation of this information in finance.
Still not sold that it'll fly in finance without that sort of observable, intermediate representation.
Weird that they claim to use arbitrary graphs, while in reality it's a restricted subclass of DAGs with one-vertex loops sort of allowed, except those don't really make sense to represent as loops.
I think the paper "Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models" (AoT) is a big competitor to this paper. It shows that we can get similar results to Tree of Thoughts using just a single query. AoT even seems to surpass ToT in some of the experiments.
I keep feeling that LLMs are one direction to address the thorny "common sense" issue of AI. Mountains of training text incorporate, probably, most common sense (and a lot of nonsense). It's beautiful to see so many ideas coming out right now to make better use of the models. Including the fast progress made with image generation.
The point of using number sorting for this paper is that it's:
A) difficult to impossible for an LLM to do in a single pass, and
B) easy to verify the correctness of.
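A minimal sketch of that verification step, assuming a hypothetical call_llm stand-in rather than any specific API:

```python
# Sketch of the "easy to verify" property: parse the model's reply and check it
# against a known-correct sort. call_llm is a stand-in, not a specific API.
import random
import re

def call_llm(prompt: str) -> str:
    """Stand-in for a real completion call."""
    raise NotImplementedError

def check_llm_sort(numbers: list[int]) -> bool:
    reply = call_llm(f"Sort these integers in ascending order, comma-separated: {numbers}")
    parsed = [int(x) for x in re.findall(r"-?\d+", reply)]
    return parsed == sorted(numbers)    # exact match against ground truth

# e.g. check_llm_sort(random.sample(range(1000), 64))
```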
In general, the point isn't finding things that only an LLM can do, but finding things that LLMs can do with decent results at a lower cost than getting a human to do it.
It is only difficult for an LLM to sort a list of numbers if the list is longer than half of the context window. (Source: I tested this myself[1].) The sorts are not error-free every time, but with sufficient training they become error-free the vast majority of the time, even for long lists. This is not especially surprising, because transformers are capable of directly representing sorting programs.[2]
Of course you can train a neural network to sort numbers, but I'm talking about a general LLM which hasn't been trained to sort numbers specifically. A GPT network trained specifically to sort numbers is not what I would consider a Large Language Model.
I don't think efficiency is important at this point. Finding that it's possible "this way" opens the door for more work and more applications. (Which doesn't prevent others from already working on efficiency.)
How well do these papers replicate? In some of my experiments with GPT-4, I've seen chain-of-thought-style prompting make answers noticeably worse than plainly asking a question once.