I kind of want to try something like this at a larger scale in an always-on mode where I have a 'senate' of debate. Rather than responding to prompts on a case by case basis, provide a list of tasks (potentially with deadlines) and let the senate work on them, break off into groups to manage subtasks, challenge results, make suggestions. Even potentially a tree of analysts where suggestions only get passed up the tree when the parent node thinks a lower analysis is particularly insightful.
I definitely think that directing models to approach a problem from a specific perspective can generate better or worse results. Creating a diverse set of perspectives along with critical analysis of their results should be able to produce some impressive results.
Things like this would generate a massive number of tokens, but the cost per token is definitely heading in the right direction to allow for this. There is also the possibility of setting up an AI only IRC server where anybody can connect their own models for a shared debating chamber.
In doing some DevOps-y type tasks recently (ansible, packer, docker, baking images with guestfish), I've found it very frustrating how much ChatGPT will confidently tell me to use flags on tools that don't exist, or hallucinate completely non-existent functions or behaviours. And then when I spend time trying what it suggests only to hit a wall and come back like wtf mate, it breezily goes "oh yes so you're right, good job figuring that out! You're so close now! Your next step is to do X and Y," and then serves up the same detailed tutorial as before but with the flag or whatever it was that it had wrong subtly changed.
It definitely makes me feel like I'm dealing with an overenthusiastic intern who is throwing stuff over the wall without checking their work, and like maybe having a second bot sitting in front of the first one being like ARE YOU SURE ABOUT THAT could really improve things.
You can't get more info out of an LLM than it actually holds. As Anthropic pointed out, if an LLM knows a name but has no other info about it, it starts hallucinating. The same probably happens here: the LLM knows there must be a flag but can't remember all of them. A short reminder in the prompt will likely help (or web search for GPT). Just my $0.02.
It certainly feels like you can just by challenging it; then it happily finds other paths to what you want. So maybe internally it needs a second voice encouraging it to think harder about alternatives upfront.
I did a stint in DevOps and I found every model to be like this for all of the infra-as-code languages. Anything YAML-based was especially bad.
Even Amazon’s own offering completely made things up about Amazon’s own formats.
I’d be curious as to why that is. It seems like there would be enough training data, and for Amazon in particular it seems like they could make a validation tool the model could use.
Maybe I'm excessively anthropomorphizing, but it does feel a bit analogous to my own thought process, like "I need feature XYZ, and based on other tools I'm more familiar with it should be an --xyz flag, so let me google for that and see if I'm right or if I instead find a four-year-old wontfix on Github where someone asked for what I need and got denied."
Except... the model is missing that final step; instead it just belches out its hypothesis, all dressed up in chirpy, confident-sounding language, certain that I'm moments away from having everything working just perfectly.
Cursor has a neat feature where you can upload custom docs, and then reference them with @Docs. I find this prevents hallucinations, and using a reasoning model helps as well.
I've also found LLMs to perform poorly at DevOps tasks. Perhaps there's a lack of training data. On the bright side this hints at better job security for platform engineers.
A year or so ago I experimented with splitting a user prompt down to a set of "different AI personas" that would each try to approach the user's problem in a different way and then bubble back up with a master arbiter for consensus.
I modeled it after the concept of advisors from Civilization II. It worked reasonably well though I think it was at least somewhat limited by being constrained to a single LLM (Mistral). It also lit my computer on fire.
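The advisor pattern described above can be sketched in a few lines. Note this is a hypothetical illustration: `ask_llm` is a stub standing in for a real chat-completion call, and the persona prompts are invented for the example.

```python
def ask_llm(system_prompt, user_prompt):
    # Stub: in practice this would call a real model endpoint
    # (e.g. a local Mistral instance). Here it just echoes its inputs.
    return f"[{system_prompt}] answering: {user_prompt}"

# Illustrative advisor personas, loosely in the Civ II spirit.
PERSONAS = [
    "You are a cautious security reviewer.",
    "You are a performance-obsessed engineer.",
    "You are a pragmatic maintainer who values simplicity.",
]

def senate(question):
    # Each persona answers the same question independently.
    opinions = [ask_llm(p, question) for p in PERSONAS]
    # A master arbiter sees all opinions and merges them into a consensus.
    arbiter_prompt = (
        "Given these answers, produce a consensus:\n" + "\n---\n".join(opinions)
    )
    return ask_llm("You are a neutral arbiter.", arbiter_prompt)
```

With a real backend, each `ask_llm` call is independent, so the persona round fans out in parallel and only the arbiter step is serial.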
What sort of personalities did you try? A group where some members have grudges against each other and will irrationally poke holes in each other’s plans could be a fun experiment.
Not entirely. Since generation is autoregressive, the next token depends on the previous tokens. Whatever analysis and decisions it has spit out will influence what it will do next. This tends to cause it to be self-reinforcing.
But it's also chaotic. Small changes in input or token choices can give wildly different outcomes, particularly if the sampling distributions are fairly flat (no one right answer). So restarting the generation with a slightly different input, such as a different random seed (or in OP's case, a different temperature) can give wildly different outcomes.
If you try this, you'll see some examples of it vehemently arguing it is right and others equally arguing it is wrong. This is why LLM-as-judge is so poor by itself, but also why multiple generations, as used in self-consistency, can be quite useful for evaluating variance and therefore uncertainty.
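The self-consistency idea can be sketched as: sample the same question several times, take the majority answer, and treat the agreement ratio as a rough uncertainty signal. The `generate` function below is a stub mimicking a temperature>0 model whose sampling distribution is biased but not deterministic, which is an assumption made purely for illustration.

```python
import random
from collections import Counter

def generate(question, seed):
    # Stub for a nonzero-temperature LLM call; different seeds can give
    # different answers, mimicking the flat-distribution case above.
    rng = random.Random(seed)
    return rng.choice(["A", "A", "A", "B"])  # biased toward one answer

def self_consistency(question, n=15):
    answers = [generate(question, seed) for seed in range(n)]
    counts = Counter(answers)
    best, votes = counts.most_common(1)[0]
    # Low agreement suggests a flat distribution, i.e. high uncertainty.
    return best, votes / n
```

When the agreement ratio is near 1.0 the samples concentrate on one answer; when it hovers near chance, restarting with a different seed or temperature (as the commenters describe) is likely to flip the outcome.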
Yes, but I guess the model is optimized for relatively quick response, whereas these techniques are allowing the model to spend more time to generate a higher quality response
To an extent, but different models are better at different things.
That is something I'm also curious about. Given models (using the same tokenisation) that are better at different things, would there be interesting things to find by analysing the logprobs for tokens generated from identical inputs (including cross-feeding the generated tokens from one to another)?
Surely there must be something notable at particular points when a model goes off on the wrong path.
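One simple version of this comparison can be sketched as follows: given per-token logprobs from two models (sharing a tokenizer) over the same sequence, flag the positions where they disagree sharply. The function name, the threshold, and the sample data are all hypothetical.

```python
def divergence_points(tokens, logprobs_a, logprobs_b, threshold=2.0):
    # Flag positions where the two models assign very different
    # log-probabilities to the same token: candidate "wrong path" points.
    flagged = []
    for i, tok in enumerate(tokens):
        gap = abs(logprobs_a[i] - logprobs_b[i])
        if gap > threshold:
            flagged.append((i, tok, gap))
    return flagged
```

For example, if one model is confident about a `--xyz` flag and another assigns it a very low logprob, that token would be flagged as a point worth scrutinising.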
These ensembles have been tested throughout AI progress. Well scaffolded larger models have historically come out ahead in both quality and speed/cost.
Perhaps this is a particularly effective ensemble, but I would need to see real data.