
I kind of want to try something like this at a larger scale in an always-on mode where I have a 'senate' of debate. Rather than responding to prompts on a case-by-case basis, provide a list of tasks (potentially with deadlines) and let the senate work on them, break off into groups to manage subtasks, challenge results, and make suggestions. Even potentially a tree of analysts where suggestions only get passed up the tree when the parent node thinks a lower analysis is particularly insightful.

I definitely think that directing models to approach a problem from a specific perspective can generate better or worse results. Creating a diverse set of perspectives along with critical analysis of their results should be able to produce some impressive results.

Things like this would generate a massive number of tokens, but the cost per token is definitely heading in the right direction to allow for this. There is also the possibility of setting up an AI only IRC server where anybody can connect their own models for a shared debating chamber.
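A minimal sketch of one debate round in that spirit, with a hypothetical `call_llm` helper stubbed in place of a real model API (the persona names and helper are made up for illustration):

```python
import random

def call_llm(prompt, persona):
    # Hypothetical stand-in for a real model call; swap in your API client.
    return f"[{persona}] take on: {prompt}"

def senate_round(task, personas):
    """One round: every persona proposes an answer, then critiques a peer's."""
    proposals = {p: call_llm(task, p) for p in personas}
    critiques = {}
    for p in personas:
        peer = random.choice([q for q in personas if q != p])
        critiques[p] = call_llm(f"critique this: {proposals[peer]}", p)
    return proposals, critiques
```

An always-on version would just loop this over the task list, spawning sub-senates for subtasks.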



In doing some DevOps-y type tasks recently (ansible, packer, docker, baking images with guestfish), I've found it very frustrating how much ChatGPT will confidently tell me to use flags on tools that don't exist, or hallucinate completely non-existent functions or behaviours. And then when I spend time trying what it suggests only to hit a wall and come back like wtf mate, it breezily goes "oh yes so you're right, good job figuring that out! You're so close now! Your next step is to do X and Y," and then serves up the same detailed tutorial as before but with the flag or whatever it was that it had wrong subtly changed.

It definitely makes me feel like I'm dealing with an overenthusiastic intern who is throwing stuff over the wall without checking their work, and like maybe having a second bot sitting in front of the first one being like ARE YOU SURE ABOUT THAT could really improve things.


You can't get more info from an LLM than it actually holds. As Anthropic pointed out, if an LLM knows the name but has no other info, it starts hallucinating. The same probably happens here: the LLM knows there must be a flag but can't remember all of them. A short reminder in the prompt will likely help (or web search, for GPT). Just my $0.02.


It certainly feels like you can, just by challenging it; then it happily finds other paths to what you want. So maybe internally it needs a second voice encouraging it to think harder about alternatives upfront.


The fact that you can't get more info from an LLM than it holds is actually a pithy description of this whole challenge.


I did a stint in DevOps and I found every model to be like this for all of the infra-as-code languages. Anything YAML-based was especially bad.

Even Amazon’s own offering completely made things up about Amazon’s own formats.

I’d be curious as to why that is. It seems like there would be enough training data, and for Amazon in particular it seems like they could make a validation tool the model could use.


Maybe I'm excessively anthropomorphizing, but it does feel a bit analogous to my own thought process, like "I need feature XYZ, and based on other tools I'm more familiar with it should be an --xyz flag, so let me google for that and see if I'm right or if I instead find a four-year-old wontfix on Github where someone asked for what I need and got denied."

Except... the model is missing that final step; instead it just belches out its hypothesis, all dressed up in chirpy, confident-sounding language, certain that I'm moments away from having everything working just perfectly.


Cursor has a neat feature where you can upload custom docs and then reference them with @Docs. I find this prevents hallucinations, as does using a reasoning model.


I've enjoyed watching Claude run commands with incorrect flags, hit the error, and then adapt.


I've also found LLMs to perform poorly at DevOps tasks. Perhaps there's a lack of training data. On the bright side this hints at better job security for platform engineers.


100%. This has happened enough to me that I wished I could just inject the man page docs into it to at least act as a sanity check.
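That injection is easy to sketch: pull the real man page and prepend it to the prompt so the actual flags are in context. Assuming `man` is available locally (the helper names here are made up):

```python
import subprocess

def man_page(tool):
    # Fetch the man page text, or None if `man` or the page is missing.
    try:
        return subprocess.run(["man", tool], capture_output=True,
                              text=True, check=True).stdout
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None

def build_prompt(question, tool):
    # Prepend the real docs so the model can't invent flags unchallenged.
    docs = man_page(tool)
    if docs:
        return f"Authoritative `man {tool}` output:\n{docs}\n\nQuestion: {question}"
    return question
```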


Spot on.


A year or so ago I experimented with splitting a user prompt down to a set of "different AI personas" that would each try to approach the user's problem in a different way and then bubble back up with a master arbiter for consensus.

I modeled it after the concept of advisors from Civilization II. It worked reasonably well though I think it was at least somewhat limited by being constrained to a single LLM (Mistral). It also lit my computer on fire.
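In rough outline, the fan-out-then-arbitrate shape looks like this, with `call_model` stubbed where the original experiment used a local Mistral (advisor names and prompts here are illustrative, not the original ones):

```python
ADVISORS = {
    "military": "Focus on risks and worst cases.",
    "economic": "Focus on cost and resource trade-offs.",
    "science": "Focus on technical feasibility.",
}

def call_model(prompt):
    # Hypothetical local-model call, stubbed for illustration.
    return f"response to: {prompt[:30]}..."

def advise(user_prompt):
    # Fan the prompt out to each persona, then hand everything to an arbiter.
    answers = {
        name: call_model(f"You are the {name} advisor. {style}\n\n{user_prompt}")
        for name, style in ADVISORS.items()
    }
    briefing = "\n".join(f"{n}: {a}" for n, a in answers.items())
    return call_model(f"As arbiter, reconcile these answers:\n{briefing}")
```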


What sort of personalities did you try? A group where some members have grudges against each other and will irrationally poke holes in each other’s plans could be a fun experiment.


With multiple groups with external and internal rivalries. The Always Sunny gang versus The IT Crowd.


I have played Disco Elysium, and can confirm that a bunch of inner voices arguing with each other can be fun.


In theory, couldn't this just be baked into a single adversarial model?


Not entirely. Since generation is autoregressive, the next token depends on the previous tokens. Whatever analysis and decisions it has spit out will influence what it does next. This tends to be self-reinforcing.

But it's also chaotic. Small changes in input or token choices can give wildly different outcomes, particularly if the sampling distributions are fairly flat (no one right answer). So restarting the generation with a slightly different input, such as a different random seed (or in OP's case, a different temperature) can give wildly different outcomes.

If you try this, you'll see some runs where it vehemently argues it is right and others where it argues just as hard that it is wrong. This is why LLM-as-judge is so poor by itself, but also why multiple generations, as used in self-consistency, can be quite useful for evaluating variance and therefore uncertainty.
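A toy sketch of that self-consistency idea: sample several independent generations and treat the vote spread as an uncertainty signal. The sampler here is a seeded random stub standing in for real model calls:

```python
import random
from collections import Counter

def sample_answer(question, seed):
    # Stub for one generation at nonzero temperature; a real version would
    # pass the seed/temperature through to the model's sampling API.
    return random.Random(seed).choice(["yes", "yes", "no"])

def self_consistency(question, n=9):
    answers = [sample_answer(question, seed=i) for i in range(n)]
    top, votes = Counter(answers).most_common(1)[0]
    return top, votes / n  # majority answer plus agreement ratio
```

A low agreement ratio is exactly the "fairly flat sampling distribution" case described above.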


Yes, but I guess the model is optimized for a relatively quick response, whereas these techniques let the model spend more time generating a higher-quality response.


To an extent, but different models are better at different things.

That is something I'm also curious about. Given models (that use the same tokenisation) that are better at different things, would there be interesting things to find by analysing the logprobs for tokens generated from identical inputs (including cross-feeding the generated tokens from one model to another)?

Surely there must be something notable at particular points when a model goes off on the wrong path.
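One simple probe along those lines: a divergence measure over two models' next-token logprobs (assuming a shared, aligned vocabulary, which is an assumption) would flag the positions where they disagree most:

```python
import math

def next_token_kl(logprobs_a, logprobs_b):
    # KL(A || B) for one next-token distribution, given per-token logprobs
    # aligned to the same vocabulary order.
    return sum(math.exp(la) * (la - lb)
               for la, lb in zip(logprobs_a, logprobs_b))
```

Spikes in this per-position divergence would be candidate points where one model goes off on the wrong path.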


Like, just endlessly grinding tokens, then processing the output and pulling out good ideas when the endless debate generates them?

It would be interesting to see what it comes up with given enough time and tokens.


This is being done, and you could apply it to a lot of domains. Go for it for whatever use case you have.


These ensembles have been tested throughout AI progress. Well-scaffolded larger models have historically come out ahead in both quality and speed/cost.

Perhaps this is a particularly effective ensemble, but I would need to see real data.


Yeah, but we'll finally get definitive proof that the government's been hiding super-intelligent axolotls from us all.


A society of mind, if you will. :)

This sounds like a fun thing to set up with a quick-enough local model.



