Something I've realized about LLM tool use is that if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute force that problem.
The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.
That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.
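To make that concrete, here's roughly the shape of the loop I mean. It's a minimal sketch, not any particular framework: `llm`, `tools`, and `succeeded` are placeholders for whatever model API, sandboxed tools, and success criteria you actually wire up.

```typescript
// A minimal sketch of "an LLM in a sandbox using tools in a loop".
// `llm`, `tools`, and `succeeded` are placeholders for whatever model API,
// sandboxed tools, and success criteria you actually define.

type ToolCall = { tool: string; args: string };
type AgentAction = ToolCall | { answer: string };

async function bruteForce(
  task: string,
  llm: (transcript: string) => Promise<AgentAction>,
  tools: Record<string, (args: string) => Promise<string>>, // e.g. compile, run_tests, read_file
  succeeded: (answer: string) => boolean,                    // the success criteria you defined
  maxSteps = 50
): Promise<string | null> {
  let transcript = task;
  for (let step = 0; step < maxSteps; step++) {
    const action = await llm(transcript);
    if ("answer" in action) {
      // Only accept an answer that meets the success criteria;
      // otherwise tell the model to keep going.
      if (succeeded(action.answer)) return action.answer;
      transcript += "\n[checker] answer rejected, keep trying";
      continue;
    }
    // Run the requested tool inside the sandbox and feed its output back.
    const output = (await tools[action.tool]?.(action.args)) ?? "unknown tool";
    transcript += `\n[${action.tool}] ${output}`;
  }
  return null; // ran out of attempts
}
```

The hard parts are exactly the arguments to that function: which tools go in the sandbox and what `succeeded` actually checks.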
> The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide, and how to define the success criteria for the model.
Your test case seems like a quintessential example where you're missing that last step.
Since it is unlikely that you understand the math behind fractals or x86 assembly (apologies if I'm wrong on this), your only means of verifying the accuracy of your solution is a superficial visual inspection, e.g. "Does it look like the Mandelbrot set?"
Ideally, your evaluation criteria would be expressed as a continuous function, but at the very least, it should take the form of a sufficiently diverse quantifiable set of discrete inputs and their expected outputs.
That's exactly why I like using Mandelbrot as a demo: it's perfect for "superficial visual inspection".
With a bunch more work I could likely have got a vision LLM to do that visual inspection for me in the assembly example, but having a human in the loop for that was much more productive.
I think it's irrelevant. The point they are trying to make is that anytime you ask an LLM for something that's outside of your area of expertise, you have very little to no way to ensure it is correct.
> anytime you ask an LLM for something that's outside of your area of expertise, you have very little to no way to ensure it is correct.
I regularly use LLMs to code specific functions I don't necessarily understand the internals of. Most of the time I do that, it's something math-heavy for a game. Just like any function, I put it under automated and manual tests. Still, I review it and try to gain some intuition about what is happening, but it is still very far from my area of expertise, yet I can be sure it works as I expect it to.
I’ve been thinking about using LLMs for brute forcing problems too.
Like LLMs kinda suck at TypeScript generics. They’re surprisingly bad at them. Probably because it’s easy to write generics that look correct, but are then screwy in many scenarios. Which is also why generics are hard for humans.
If you could have any LLM actually use TSC, it could run tests, make sure things are inferring correctly, etc. It could just keep trying until it works. I’m not sure this is a way to produce understandable or maintainable generics, but it would be pretty neat.
Also, while typing this I realized that Cursor can see TypeScript errors. All I need are some utility testing types, and I could have Cursor write the tests and then brute force the problem!
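Roughly what I have in mind is below. `Expect`/`Equal` are the usual type-challenges-style helpers rather than anything Cursor ships, and `KeysMatching` is just a made-up generic to test against:

```typescript
// Type-level "tests": these only compile when inference comes out right,
// so tsc itself becomes the pass/fail signal an agent can loop against.
// Expect/Equal are the usual type-challenges-style helpers (not a library
// Cursor provides); KeysMatching is a made-up generic under test.
type Expect<T extends true> = T;
type Equal<A, B> =
  (<T>() => T extends A ? 1 : 2) extends (<T>() => T extends B ? 1 : 2)
    ? true
    : false;

// Generic under test: keys of T whose values are assignable to V.
type KeysMatching<T, V> = {
  [K in keyof T]-?: T[K] extends V ? K : never;
}[keyof T];

// Each alias fails to compile if KeysMatching infers the wrong thing.
type _t1 = Expect<Equal<KeysMatching<{ a: string; b: number }, string>, "a">>;
type _t2 = Expect<Equal<KeysMatching<{ a: string; b: number }, boolean>, never>>;
```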
If I ever actually do this I’ll update this comment lol
Giving LLMs the right context -- e.g. in the form of predefined "cognitive tools", as explored with a ton of rigor here^1 -- seems like the way forward, at least to this casual observer.
> LLM in a sandbox using tools in a loop, you can brute force that problem
Does this require using big models through their APIs and spending a lot of tokens?
Or can this be done either with local models (probably very slow), or with subscriptions like Claude Code with Pro (without hitting the rate/usage limits)?
I saw the Mandelbrot experiment; it was very cool, but still a rather small project, not really comparable to a complex/bigger/older code base for a platform used in production.
The local models aren't quite good enough for this yet in my experience - the big hosted models (o3, Gemini 2.5, Claude 4) only just crossed the capability threshold for this to start working well.
I think it's possible we'll see a local model that can do this well within the next few months though - it needs good tool calling, not an encyclopedic knowledge of the world. Might be possible to fit that in a model that runs locally.
> it needs good tool calling, not an encyclopedic knowledge of the world
I wonder if there are any groups/companies out there building something like this
Would love to have models that only know 1 or 2 languages (e.g. Python + JS), but are great at them and at tool calling. Definitely don't need my coding agent to know all of Wikipedia or how to translate between 10 different languages.
Imagine two training sets:
1. A special code dataset
2. A bunch of "unrelated" books
My understanding is that a model trained on just the first will never beat a model trained on both. The Bloomberg model is my favorite example of this.
If you can squirrel away special data, then that special data plus everything else will beat any other model. But that's basically what OpenAI, Google, and Anthropic are all currently doing.
(Would also be interesting if one could have a few LLMs working together on a red/green TDD approach - have an orchestrator that parses requirements and dispatches a red goblin to write a failing test; a green goblin that writes code until the test passes; and then some kind of hobgoblin to refactor the code, keeping the test(s) green - working with the orchestrator to "accept" a given feature as done and move on to the next...
With any luck the resulting code might be a bit more transparent (stricter form) than other LLM code)?
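A rough sketch of that loop, with `callAgent` and `runTests` as hypothetical stand-ins for whatever agent framework and test runner you'd actually plug in:

```typescript
// Toy sketch of the red/green/refactor loop described above.
// `callAgent` and `runTests` are hypothetical stand-ins, passed in as
// parameters so the sketch stays self-contained.
type Goblin = "red" | "green" | "hobgoblin";
type TestResult = { passed: boolean; output: string };

async function implementFeature(
  requirement: string,
  callAgent: (role: Goblin, prompt: string) => Promise<void>,
  runTests: () => Promise<TestResult>
): Promise<void> {
  // Red goblin: write a test that should fail before the feature exists.
  await callAgent("red", `Write a failing test for: ${requirement}`);
  if ((await runTests()).passed) throw new Error("red test passed too early");

  // Green goblin: write code until the test passes.
  let result = await runTests();
  while (!result.passed) {
    await callAgent("green", `Make the failing test pass:\n${result.output}`);
    result = await runTests();
  }

  // Hobgoblin: refactor, accepting the feature only if the tests stay green.
  await callAgent("hobgoblin", "Refactor the new code; keep every test green.");
  if (!(await runTests()).passed) throw new Error("refactor broke the tests");
}
```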
Wasn't there a tool-calling benchmark by the Docker guys which concluded Qwen models are nearly as good as GPT? What is your experience with it?
Personally I am convinced JSON is a bad format for LLMs, and that code orchestration in a Python-ish DSL is the future. But local models are pretty bad at code gen too.
There's a fine-tune of Qwen3 4B called "Jan Nano" that I started playing with yesterday, which is basically just fine-tuned to be more inclined to look things up via web searches than to answer them "off the dome". It's not good-good, but it does seem to have a much lower effective hallucination rate than other models of its size.
It seems like maybe similar approaches could be used for coding tasks, especially with tool calls for reading man pages, info pages, running `tldr`, specifically consulting Stack Overflow, etc. Some of the recent small MoE models from Chinese companies are significantly smarter than models like Qwen 4B, but run about as quickly, so maybe on systems with high RAM or high unified memory, even with middling GPUs, they could be genuinely useful for coding if they are made to avoid doing anything without tool use.
I've been using a VM for a sandbox, just to make sure it won't delete my files if it goes insane.
With some host data directories mounted read-only inside the VM.
This creates some friction though. Feels like a tool which runs the AI agent in a VM, but then copies its output to the host machine after some checks, would help, so that it feels like you're running it natively on the host.
This is very easy to do with Docker. Not sure if you want the VM layer as an extra security boundary, but even so you can just specify the VM’s Docker API endpoint to spawn processes and copy files in/out from shell scripts.
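Roughly the shape I mean, written here as a small Node/TypeScript script instead of shell. The image name and paths are placeholders, and pointing DOCKER_HOST at the VM's Docker API endpoint gives you the extra VM boundary with the same commands:

```typescript
// Sketch of the copy-in/copy-out flow: source mounted read-only, a scratch
// volume for whatever the agent writes, output copied back after review.
// Image name and paths are placeholders.
import { execSync } from "node:child_process";

const image = "agent-sandbox:latest"; // assumed pre-built image with the agent's tools
const hostSrc = "/home/me/project";   // host code, mounted read-only
const name = "agent-run";

// Start the sandbox: read-only bind mount for the source, scratch volume for output.
execSync(
  `docker run -d --name ${name} ` +
  `-v ${hostSrc}:/src:ro -v agent-scratch:/work ${image} sleep infinity`
);

// ...the agent works inside the container via `docker exec` here...

// After whatever checks you want, copy the output back to the host and clean up.
execSync(`docker cp ${name}:/work/out ./agent-output`);
execSync(`docker rm -f ${name}`, { stdio: "inherit" });
```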
Hmm, excellent idea. Somehow I had assumed that it would be able to do damage in a writable volume, but it wouldn't be able to exit it; it would be self-contained to that directory.
One of my biggest ongoing challenges has been getting the LLM to use the tool(s) that are appropriate for the job. It feels like teaching your kids to, say, do laundry: you want to just tell them to step aside and let you do it.
My assembly Mandelbrot experiment was the thing that made this click for me: https://simonwillison.net/2025/Jul/2/mandelbrot-in-x86-assem...