Hacker News

I wonder why they fail in this specific way. If you just let them do stuff, everything quickly turns to spaghetti. They seem to overlook obvious opportunities to simplify things, or to see a pattern and follow through. The default seems to be to add more rather than rework or adjust what’s already in place.


I suspect it has something to do with a) the average quality of code in open source repos and b) the way the reward signal is applied in RL post-training - does the model face consequences of a brittle implementation for a task?

I wonder if these RL runs can extend over multiple sequential evaluations, where poor design in an early task hampers performance later on, as measured by the number of tokens required to add new functionality without breaking existing functionality.
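To make the idea concrete, here's a toy sketch of that kind of multi-task reward (entirely hypothetical reward shaping on my part, not any lab's actual training setup): later tasks weigh token cost more heavily, so brittle early design shows up as expensive downstream work.

```python
# Hypothetical sketch: score an agent across sequential tasks on the same
# codebase, so sloppy design in task 1 is penalized via the cost (tokens
# spent) of completing task 2, task 3, and so on.

def episode_reward(task_results):
    """task_results: list of (passed, tokens_used), one per sequential task."""
    reward = 0.0
    for i, (passed, tokens_used) in enumerate(task_results):
        if not passed:
            break  # later tasks never run if an earlier one broke the build
        # weight token cost by position: extra tokens needed downstream
        # (because of brittle early design) hurt more
        reward += 1.0 - 0.001 * tokens_used * (i + 1)
    return reward

# a clean implementation keeps the follow-up task cheap
clean = episode_reward([(True, 200), (True, 250)])
# a brittle one makes the second task expensive
brittle = episode_reward([(True, 150), (True, 900)])
print(clean, brittle)
```

The point of the shape, not the numbers: a single-task reward never sees the second tuple at all, which is exactly the blind spot being described.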


Yeah, I've been wondering whether the increasing emphasis on coding RL is going to draw models towards very short-term goals, relative to just learning from open source code in the wild.


To me this seems like a natural consequence of the next-token prediction model. Within a single generation you can’t “backtrack” once you’ve emitted a token. You can only move forwards. You can iteratively refine (e.g. the agent can one-shot itself repeatedly), but the underlying mechanism is still present.

I can’t speak for all humans, but I tend to code “nonlinearly”, jumping back and forth and typically going from high level (signatures, type definitions) to low level (filling in function bodies). I also do a lot of deletion as I decide that actually one function isn’t needed, or when I find a simpler way to phrase a particular section.

Edit: in fact, thinking on this more, code is _much_ closer to a tree than a sequence of tokens. Not sure what to do with that, except maybe to try a tree-based generator which iteratively adds child nodes.
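For what it's worth, Python's standard `ast` module makes the tree structure explicit — even a one-liner parses into nested nodes, not a flat token stream:

```python
import ast

# a one-line function parses into a nested tree of typed nodes
tree = ast.parse("def f(x): return x + 1")

func = tree.body[0]                  # the FunctionDef node
print(type(func).__name__)           # FunctionDef
print(type(func.body[0]).__name__)   # Return: a child of the function

# a tree-based generator could add child nodes top-down — signature
# first, then fill in the body — much like the workflow described above
```

A tree-structured generator would get "high level first, details later" for free, since parents exist before their children.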


This would make sense to me as an explanation when it only outputs code. (And I think it explains why code often ends up subtly mangled when moved in a refactoring: where a human would copy-paste, the agent instead has to “retype” it, and often ends up slightly changing formatting, comments, identifiers, etc.)

But for the most part, it’s spending more tokens on analysis and planning than pure code output, and that’s where these problems need to be caught.


I feel like planning is also inherently not sequential. Typically you plan in broad strokes, then recursively jump in and fill in the details. On the surface it doesn’t seem to be all that much different than codegen. Code is just more highly specified planning. Maybe I’m misunderstanding your point?


All it does is generate soup. Some of which may taste good.

There is no thinking, no matter what marketing tells you.


LLMs are next token predictors. Their core functionality boils down to simply adding more stuff.


They do what you tell them to. If you regularly tell them to look for opportunities to clean up/refactor the code, they will.



