They created an environment to expose LLMs to problems and test their performance, using puzzles that were immune to benchmark hacking (a sketch of the idea below).

Your comment was about how this was unreasonably hard (for coding challenges).

Anecdotally, I've seen LLMs do all sorts of amazing shit that was obviously drawn from their training set, and then fall flat on their faces on simple coding tasks novel enough not to appear in it.
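
A minimal sketch of how such an environment can resist benchmark hacking (my own illustration, assuming a Tower of Hanoi-style puzzle like the ones in the Apple paper, not their actual code): the grader simulates the puzzle rules directly, so the only way to score is to emit a genuinely valid move sequence; there is no answer key that could leak into a training set.

    def check_hanoi(n: int, moves: list[tuple[str, str]]) -> bool:
        """Simulate Tower of Hanoi; True iff `moves` solves n disks from peg A to C."""
        pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
        for src, dst in moves:
            if not pegs[src]:
                return False                       # illegal: moving from an empty peg
            disk = pegs[src][-1]
            if pegs[dst] and pegs[dst][-1] < disk:
                return False                       # illegal: larger disk onto smaller
            pegs[dst].append(pegs[src].pop())
        return pegs["C"] == list(range(n, 0, -1))  # solved: all disks on the goal peg

    check_hanoi(2, [("A", "B"), ("A", "C"), ("B", "C")])  # -> True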



That Apple paper mainly demonstrated that "reasoning" LLMs - with no access to additional tools - can't solve problems whose solutions deliberately exceed their output token limits.

I don't think it has much relevance at all to a conversation about how good LLMs are at solving programming problems by running tools in a loop.

I keep seeing this idea that LLMs can't handle problems that aren't in their training data, and it's frustrating: anyone who has spent significant time working with these systems knows it obviously isn't true.


It demonstrated that there was a hard limit on the complexity of puzzle that LLMs could solve, no matter how many tokens they threw at it (using a form of puzzle construction that ensured the LLM couldn't just refer to its training data to solve it).
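
For scale (my own back-of-envelope arithmetic, using the Tower of Hanoi family the paper tested): the optimal solution for n disks is 2^n - 1 moves, so the required output grows exponentially with a one-word change to the problem statement, and past some n it exceeds any fixed token budget.

    def hanoi_moves(n, src="A", aux="B", dst="C"):
        """Yield the optimal move sequence for n disks as (from, to) pairs."""
        if n == 0:
            return
        yield from hanoi_moves(n - 1, src, dst, aux)
        yield (src, dst)
        yield from hanoi_moves(n - 1, aux, src, dst)

    for n in (5, 10, 20):
        count = sum(1 for _ in hanoi_moves(n))
        # 2**n - 1 moves: n=20 already needs 1,048,575, far beyond any output budget
        print(f"n={n}: {count} moves")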



