
I think it's disingenuous to characterize these solutions as "LLMs solving problems", given the dependence on a hefty secondary apparatus to choose optimal solutions from the LLM proposals. And an important point here is that this tool does not produce any optimality proofs, so even if they do find the optimal result, you may not be any closer to showing that that's the case.




Well, there go the goalposts, and a Scotsman denied. It's got infrastructure in which it operates, and it "didn't show its work", so it takes an F in maths.

Well, it produced not just the solutions to the problems but also the programs that generate them, which can be reverse-engineered.

A random walk can do mathematics, with this kind of infrastructure.

Isabelle/HOL has a tool called Sledgehammer, which is the hackiest hack that ever hacked[0], basically amounting to "run a load of provers in parallel, with as much munging as it takes". (Plumbing them together is a serious research contribution, which I'm not at all belittling.) I've yet to see ChatGPT achieve anything like what it's capable of.

[0]: https://lawrencecpaulson.github.io/2022/04/13/Sledgehammer.h...


Yeah, but random walks can't improve upon the state of the art on the many-dimensional numerical optimisation problems discussed here: they're easy enough to implement that they've already been tried and had their usefulness exhausted. This does present a meaningful improvement over them in its domain.

When I see announcements that say "we used a language model for X, and got novel results!", I play a little game where I identify the actual function of the language model in the system, and then replace it with something actually suited for that task. Here, the language model is used as the mutation / crossover component of a search through the space of computer programs.
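To make that concrete, here's a toy sketch of that framing, not the paper's actual system: an evolutionary loop where a `propose` function stands in for the LLM sampling step, playing the mutation/crossover role, while the surrounding selection machinery is the "secondary apparatus". All names (`evolve`, `toy_propose`) are hypothetical.

```python
import random

def evolve(seed_programs, propose, score, generations=10, population=8):
    """Evolutionary search where `propose` (a stand-in for an LLM sampling
    step) plays the mutation/crossover role: given parent programs, it
    emits a new candidate. Selection keeps only the best-scoring ones."""
    pool = list(seed_programs)
    for _ in range(generations):
        parents = random.sample(pool, min(2, len(pool)))
        child = propose(parents)
        pool.append(child)
        # The "secondary apparatus": truncation selection on the score.
        pool.sort(key=score, reverse=True)
        pool = pool[:population]
    return max(pool, key=score)

# Toy stand-ins: "programs" are bit strings scored by their count of 1s;
# the "LLM" just flips one random bit in a random parent.
def toy_propose(parents):
    p = list(random.choice(parents))
    i = random.randrange(len(p))
    p[i] = '1' if p[i] == '0' else '0'
    return ''.join(p)

best = evolve(['0000', '0001'], toy_propose, lambda s: s.count('1'))
```

Everything interesting lives in `propose`; the rest is bookkeeping, which is the point.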

What you really want here is to represent the programs using an information-dense scheme, endowed with a pseudoquasimetric such that semantically-similar programs are nearby (and vice versa); then explore the vicinity of successful candidates. Ordinary compression algorithms satisfy "information-dense", but the metrics they admit aren't that great. Something that does work pretty well is embedding the programs into the kind of high-dimensional vector space you get out of a predictive text model: there may be lots of non-programs in the space, but (for a high-quality model) those are mostly far away from the programs, so exploring the neighbourhood of programs won't encounter them often. Because I'm well aware of the flaws of such embeddings, I'd add some kind of token-level fuzzing to the output, biased to avoid obvious syntax errors: that usually won't move the embedding much, but will occasionally jump further (in vector space) than the system would otherwise search.
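A minimal sketch of that exploration scheme, under toy assumptions: `embed` here is a hypothetical placeholder for a real program-embedding model, and the "fuzzing" is just adjacent-token swaps rather than anything syntax-aware.

```python
import math
import random

def cosine_dist(a, b):
    """1 - cosine similarity; small means 'nearby' in embedding space."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def neighbours(candidate, corpus, embed, k=3):
    """Explore the vicinity of a successful candidate: rank the corpus by
    embedding distance and return the k nearest programs."""
    c = embed(candidate)
    return sorted(corpus, key=lambda p: cosine_dist(embed(p), c))[:k]

def fuzz(tokens, rate=0.1):
    """Token-level fuzzing: occasionally swap adjacent tokens. This usually
    stays close in embedding space, but sometimes jumps further."""
    out = list(tokens)
    for i in range(len(out) - 1):
        if random.random() < rate:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

# Toy embedding: character-frequency vector over a tiny alphabet.
embed = lambda s: [s.count(c) for c in 'abx']
```

With a real embedding model, `neighbours` over a corpus of prior successful candidates is the "explore the vicinity" step, and `fuzz` is the escape hatch for when the embedding's notion of "nearby" is wrong.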

So, an appropriate replacement for this generative language model would be some kind of… generative language model. Which is why I'm impressed by this paper.

There are enough other contributions in this paper that slotting a bog-standard genetic algorithm over program source in place of the language model could achieve comparable results; but I wouldn't expect it to be nearly as effective in each generation. If the language model is a particularly expensive part of the runtime (as the paper suggests might be the case), then I expect it's worth trying to replace it with a cruder-but-cheaper bias function; but otherwise, you'd need something more sophisticated to beat it.
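For contrast, the "bog-standard genetic algorithm over program source" baseline is something like the following sketch (hypothetical helper names, not from the paper): blind single-point crossover and character-level mutation on raw source text, with no notion of syntax or semantics.

```python
import random

def crossover(a, b):
    """Single-point crossover on raw source text: splice a prefix of one
    parent onto a suffix of the other."""
    i = random.randrange(1, min(len(a), len(b)))
    return a[:i] + b[i:]

def point_mutate(src, alphabet, rate=0.02):
    """Character-level point mutation. Unlike an LLM proposal step, this
    is blind to syntax, so most offspring won't even parse."""
    return ''.join(random.choice(alphabet) if random.random() < rate else c
                   for c in src)
```

Cheap per generation, but each generation is far less likely to produce a viable candidate, which is exactly the trade-off described above.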

(P.S.: props for trying to bring this back on-topic, but this subthread was merely about AI hype, not actually about the paper.)

Edit: Just read §3.2 of the paper. The empirical observations match the theory I've described here.


A random walk could not do the mathematics in this article, which was essentially the article's entire starting point.



