Something I've realized about LLM tool use is that if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute force that problem.
The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.
That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.
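To make that concrete, here's roughly the shape of the loop I mean. It's a minimal sketch, not any particular framework: `llm`, `tools`, and `succeeded` are placeholders for whatever model API, sandboxed tools, and success criteria you actually wire up.

```typescript
// A minimal sketch of "an LLM in a sandbox using tools in a loop".
// `llm`, `tools`, and `succeeded` are placeholders for whatever model API,
// sandboxed tools, and success criteria you actually define.

type ToolCall = { tool: string; args: string };
type AgentAction = ToolCall | { answer: string };

async function bruteForce(
  task: string,
  llm: (transcript: string) => Promise<AgentAction>,
  tools: Record<string, (args: string) => Promise<string>>, // e.g. compile, run_tests, read_file
  succeeded: (answer: string) => boolean,                    // the success criteria you defined
  maxSteps = 50
): Promise<string | null> {
  let transcript = task;
  for (let step = 0; step < maxSteps; step++) {
    const action = await llm(transcript);
    if ("answer" in action) {
      // Only accept an answer that meets the success criteria;
      // otherwise tell the model to keep going.
      if (succeeded(action.answer)) return action.answer;
      transcript += "\n[checker] answer rejected, keep trying";
      continue;
    }
    // Run the requested tool inside the sandbox and feed its output back.
    const output = (await tools[action.tool]?.(action.args)) ?? "unknown tool";
    transcript += `\n[${action.tool}] ${output}`;
  }
  return null; // ran out of attempts
}
```

The hard parts are exactly the arguments to that function: which tools go in the sandbox and what `succeeded` actually checks.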
> The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide, and how to define the success criteria for the model.
Your test case seems like a quintessential example where you're missing that last step.
Since it is unlikely that you understand the math behind fractals or x86 assembly (apologies if I'm wrong on this), your only means of verifying the accuracy of your solution is a superficial visual inspection, e.g. "Does it look like the Mandelbrot set?"
Ideally, your evaluation criteria would be expressed as a continuous function, but at the very least, it should take the form of a sufficiently diverse quantifiable set of discrete inputs and their expected outputs.
That's exactly why I like using Mandelbrot as a demo: it's perfect for "superficial visual inspection".
With a bunch more work I could likely have got a vision LLM to do that visual inspection for me in the assembly example, but having a human in the loop for that was much more productive.
I think it's irrelevant. The point they are trying to make is that anytime you ask an LLM for something that's outside of your area of expertise, you have very little to no way to ensure it is correct.
> anytime you ask an LLM for something that's outside of your area of expertise, you have very little to no way to ensure it is correct.
I regularly use LLMs to code specific functions I don't necessarily understand the internals of. Most of the time I do that, it's something math-heavy for a game. Just like any function, I put it under automated and manual tests. Still, I review it and try to gain some intuition about what is happening, but it is still very far from my area of expertise, yet I can be sure it works as I expect it to.
I’ve been thinking about using LLMs for brute forcing problems too.
Like LLMs kinda suck at TypeScript generics. They’re surprisingly bad at them. Probably because it’s easy to write generics that look correct, but are then screwy in many scenarios. Which is also why generics are hard for humans.
If you could have any LLM actually use TSC, it could run tests, make sure things are inferring correctly, etc. It could just keep trying until it works. I’m not sure this is a way to produce understandable or maintainable generics, but it would be pretty neat.
Also, while typing this I realized that Cursor can see TypeScript errors. All I need are some utility testing types, and I could have Cursor write the tests and then brute force the problem!
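Roughly what I have in mind is below. `Expect`/`Equal` are the usual type-challenges-style helpers rather than anything Cursor ships, and `KeysMatching` is just a made-up generic to test against:

```typescript
// Type-level "tests": these only compile when inference comes out right,
// so tsc itself becomes the pass/fail signal an agent can loop against.
// Expect/Equal are the usual type-challenges-style helpers (not a library
// Cursor provides); KeysMatching is a made-up generic under test.
type Expect<T extends true> = T;
type Equal<A, B> =
  (<T>() => T extends A ? 1 : 2) extends (<T>() => T extends B ? 1 : 2)
    ? true
    : false;

// Generic under test: keys of T whose values are assignable to V.
type KeysMatching<T, V> = {
  [K in keyof T]-?: T[K] extends V ? K : never;
}[keyof T];

// Each alias fails to compile if KeysMatching infers the wrong thing.
type _t1 = Expect<Equal<KeysMatching<{ a: string; b: number }, string>, "a">>;
type _t2 = Expect<Equal<KeysMatching<{ a: string; b: number }, boolean>, never>>;
```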
If I ever actually do this I’ll update this comment lol
Giving LLMs the right context -- e.g. in the form of predefined "cognitive tools", as explored with a ton of rigor here^1 -- seems like the way forward, at least to this casual observer.
> LLM in a sandbox using tools in a loop, you can brute force that problem
Does this require using big models through their APIs and spending a lot of tokens?
Or can this be done either with local models (probably very slow), or with subscriptions like Claude Code with Pro (without hitting the rate/usage limits)?
I saw the Mandelbrot experiment; it was very cool, but still a rather small project, not really comparable to a complex/bigger/older code base for a platform used in production.
The local models aren't quite good enough for this yet in my experience - the big hosted models (o3, Gemini 2.5, Claude 4) only just crossed the capability threshold for this to start working well.
I think it's possible we'll see a local model that can do this well within the next few months though - it needs good tool calling, not an encyclopedic knowledge of the world. Might be possible to fit that in a model that runs locally.
> it needs good tool calling, not an encyclopedic knowledge of the world
I wonder if there are any groups/companies out there building something like this
Would love to have models that only know 1 or 2 languages (e.g. Python + JS), but are great at them and at tool calling. Definitely don't need my coding agent to know all of Wikipedia or how to translate between 10 different languages.
Imagine two training sets:
1. A special code dataset
2. A bunch of "unrelated" books
My understanding is that a model trained on just the first will never beat a model trained on both. The Bloomberg model is my favorite example of this.
If you can squirrel away special data, then that special data plus everything else will beat any other model. But that's basically what OpenAI, Google, and Anthropic are all currently doing.
(Would also be interesting if one could have a few LLMs working together on a red/green TDD approach - have an orchestrator that parses requirements and dispatches a red goblin to write a failing test; a green goblin that writes code until the test passes; and then some kind of hobgoblin to refactor the code, keeping the test(s) green - working with the orchestrator to "accept" a given feature as done and move on to the next...
With any luck the resulting code might be a bit more transparent (stricter form) than other LLM code)?
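A rough sketch of that loop, with `callAgent` and `runTests` as hypothetical stand-ins for whatever agent framework and test runner you'd actually plug in:

```typescript
// Toy sketch of the red/green/refactor loop described above.
// `callAgent` and `runTests` are hypothetical stand-ins, passed in as
// parameters so the sketch stays self-contained.
type Goblin = "red" | "green" | "hobgoblin";
type TestResult = { passed: boolean; output: string };

async function implementFeature(
  requirement: string,
  callAgent: (role: Goblin, prompt: string) => Promise<void>,
  runTests: () => Promise<TestResult>
): Promise<void> {
  // Red goblin: write a test that should fail before the feature exists.
  await callAgent("red", `Write a failing test for: ${requirement}`);
  if ((await runTests()).passed) throw new Error("red test passed too early");

  // Green goblin: write code until the test passes.
  let result = await runTests();
  while (!result.passed) {
    await callAgent("green", `Make the failing test pass:\n${result.output}`);
    result = await runTests();
  }

  // Hobgoblin: refactor, accepting the feature only if the tests stay green.
  await callAgent("hobgoblin", "Refactor the new code; keep every test green.");
  if (!(await runTests()).passed) throw new Error("refactor broke the tests");
}
```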
Wasn't there a tool-calling benchmark by the Docker guys which concluded Qwen models are nearly as good as GPT? What is your experience with it?
Personally I am convinced JSON is a bad format for LLMs, and that code orchestration in a Python-ish DSL is the future. But local models are pretty bad at code gen too.
There's a fine-tune of Qwen3 4B called "Jan Nano" that I started playing with yesterday, which is basically just fine-tuned to be more inclined to look things up via web searches than to answer them "off the dome". It's not good-good, but it does seem to have a much lower effective hallucination rate than other models of its size.
It seems like maybe similar approaches could be used for coding tasks, especially with tool calls for reading man pages, info pages, running `tldr`, specifically consulting Stack Overflow, etc. Some of the recent small MoE models from Chinese companies are significantly smarter than models like Qwen 4B, but run about as quickly, so maybe on systems with high RAM or high unified memory, even with middling GPUs, they could be genuinely useful for coding if they are made to avoid doing anything without tool use.
I've been using a VM for a sandbox, just to make sure it won't delete my files if it goes insane.
With some host data directories mounted read-only inside the VM.
This creates some friction though. Feels like a tool which runs the AI agent in a VM, but then copies its output to the host machine after some checks, would help, so that it feels like you're running it natively on the host.
This is very easy to do with Docker. Not sure if you want the VM layer as an extra security boundary, but even so you can just specify the VM’s Docker API endpoint to spawn processes and copy files in/out from shell scripts.
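Roughly the shape I mean, written here as a small Node/TypeScript script instead of shell. The image name and paths are placeholders, and pointing DOCKER_HOST at the VM's Docker API endpoint gives you the extra VM boundary with the same commands:

```typescript
// Sketch of the copy-in/copy-out flow: source mounted read-only, a scratch
// volume for whatever the agent writes, output copied back after review.
// Image name and paths are placeholders.
import { execSync } from "node:child_process";

const image = "agent-sandbox:latest"; // assumed pre-built image with the agent's tools
const hostSrc = "/home/me/project";   // host code, mounted read-only
const name = "agent-run";

// Start the sandbox: read-only bind mount for the source, scratch volume for output.
execSync(
  `docker run -d --name ${name} ` +
  `-v ${hostSrc}:/src:ro -v agent-scratch:/work ${image} sleep infinity`
);

// ...the agent works inside the container via `docker exec` here...

// After whatever checks you want, copy the output back to the host and clean up.
execSync(`docker cp ${name}:/work/out ./agent-output`);
execSync(`docker rm -f ${name}`, { stdio: "inherit" });
```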
Hmm, excellent idea. Somehow I had assumed that it would be able to do damage in a writable volume, but it wouldn't be able to exit it; it would be self-contained to that directory.
One of my biggest ongoing challenges has been getting the LLM to use the tool(s) that are appropriate for the job. It feels like teaching your kids to, say, do laundry: you want to just tell them to step aside and let you do it.
My assembly Mandelbrot experiment was the thing that made this click for me: https://simonwillison.net/2025/Jul/2/mandelbrot-in-x86-assem...