Hacker News | radarsat1's comments

> Is it speed?

> Is it that you can backprop through this computation? Do you do so?

With respect, I feel that you may not have read the article.

> Because the execution trace is part of the forward pass, the whole process remains differentiable: we can even propagate gradients through the computation itself. That makes this fundamentally different from an external tool. It becomes a trainable computational substrate that can be integrated directly into a larger model.

and,

> By storing points across nested convex hulls, this yields a decoding cost of O(k + log n).

and,

> Regardless of their eventual capability ceiling, they already suggest a powerful systems primitive for speeding up larger models.

So yes, and yes.

> Where are the benchmarks?

Not clear what they should benchmark it against. They do compare speed to a normal KV Cache. As for performance.. if it's actually executing a Sudoku solver with a 100% success rate, it seems pretty trivial to find any model doing < 100% success rate. Sure, it would be nice to see the data here, agree with you there.

Personally I think it would be really interesting to see if this method can be combined with a normal model MoE-style. It is likely possible; the router module should pick up quite quickly that it predicts the right tokens for some subset of problems deterministically. I like the idea of embedding all sorts of general solvers directly into the model, like a Prolog solver for example. In fact it never would have occurred to me to just go straight for WASM, pretty interesting choice to directly embed a VM. But it makes me wonder what "smaller" interpreters could be useful in this context.


I read the article and had the same question. It's written in such a way that it feels like it's answering these questions without actually doing so.

The right thing to benchmark against isn't a regular transformer, it's a transformer that writes programs that are then interpreted. They have a little visual demo where it looks faster but only because they make Python absurdly slow, and it's clearly not meant to be a real benchmark.

I spent the whole article thinking, wow, cool, but also ... how is this better than an LLM steering a regular computer? The closest we get is a statement about the need to "internalize what computation is" which doesn't say anything to me.

Fundamentally, running actual instructions on a real CPU is always going to be faster than running them via a neural network. So the interesting part is where they say you can backprop through it, but, ok, backprop is for cases where we don't know how to encode a function using strict logic. Why would you try to backprop through a Sudoku solver? It's probably that my imagination is just limited, but I could have used more on that.


Benchmark it against a fast Python interpreter optimized for AI tool calling, like Monty: https://github.com/pydantic/monty

Did you read the post you are responding to? It says:

> What's the benefit? Is it speed? Where are the benchmarks? Is it that you can backprop through this computation? Do you do so?

The correct parsing of this is: "What's the benefit? [...] Is it [the benefit] that you can backprop through this computation? Do you do so?"

There are no details about training nor the (almost-certainly necessarily novel) loss function that would be needed to handle partial / imperfect outputs here, so it is extremely hard to believe any kind of gradient-based training procedure was used to determine / set weight values here.


> There are no details about training

my understanding was that they are not training at all, which would explain that. they are compiling an interpreter down to a VM that has the shape of a transformer.

ie they are calculating the transformer weights needed to execute the operations of the machine they are generating code for.
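To make "calculating the weights" concrete, here is a toy sketch of a network whose weights are written down by hand rather than trained. It is not the article's construction (which targets a transformer executing a VM); the XOR function, the two-unit ReLU layer and the names `W1`, `b1`, `W2`, `xor_net` are all my own assumptions for illustration.

```python
import numpy as np

# Hand-chosen (not trained) weights for a 2-unit ReLU network computing XOR.
# Hidden unit 0 computes relu(a + b); hidden unit 1 computes relu(a + b - 1).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
# Output is h0 - 2*h1, which cancels the (1, 1) case back down to 0.
W2 = np.array([1.0, -2.0])


def xor_net(a: int, b: int) -> float:
    """Forward pass through the fixed network; no gradient step is ever taken."""
    h = np.maximum(0.0, W1 @ np.array([a, b], dtype=float) + b1)
    return float(W2 @ h)


truth_table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, out in truth_table.items():
    print(inputs, out)
```

The network is exact on every input because the weights encode the logic directly, which is the same sense in which an interpreter could be "compiled" into weights instead of learned.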


This is my interpretation as well.

EDIT: Actually, they do make this clear(ish) at the very end of the article, technically. But there is a huge amount of vagueness and IMO outright misleading / deliberately deceptive stuff early on (e.g. about potential differentiability of their approach, even though they admit later they aren't sure if the differentiable approach can actually work for what they are doing). It is hard to tell what they are actually claiming unless you read this autistically / like a lawyer, but that's likely due to a lack of human editing and too much AI assistance.


I'm curious if 1-bit params can be compared to 4- or 8-bit params. I imagine that 100B is equivalent to something like a 30B model? I guess only evals can say. Still, being able to run a 30B model at good speed on a CPU would be amazing.

At some point you hit information limits. With conventional quantisation you see marked capability fall-off below q5. All else being equal you'd expect an N-parameter 5-bit quant to be roughly comparable to a 3N-parameter ternary, if they are trained to the same level, just in terms of the amount of information they can possibly hold. So yes, 100B ternary would be within the ballpark of a 30B q5 conventional model, with a lot of hand-waving and sufficiently-smart-training.
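The arithmetic behind that "roughly 3N" figure can be checked in a few lines. This only counts raw storage capacity, assuming a ternary weight carries log2(3) bits and a q5 weight carries 5 bits; it says nothing about how much of that capacity training actually uses. The helper name `capacity_bits` is made up for the example.

```python
import math

# Bits of information a single ternary weight {-1, 0, +1} can hold.
TERNARY_BITS = math.log2(3)  # about 1.585


def capacity_bits(n_params: float, bits_per_param: float) -> float:
    """Upper bound on stored information: parameter count times bits per parameter."""
    return n_params * bits_per_param


ternary_100b = capacity_bits(100e9, TERNARY_BITS)
q5_30b = capacity_bits(30e9, 5)

# How many ternary weights match one 5-bit weight, and how the two models compare.
params_ratio = 5 / TERNARY_BITS
capacity_ratio = ternary_100b / q5_30b

print(f"ternary weights per q5 weight: {params_ratio:.2f}")
print(f"100B ternary / 30B q5 capacity: {capacity_ratio:.2f}")
```

So by this crude measure the two land within a few percent of each other, which is all the "ballpark" claim needs.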

I assume that theoretically, 1-bit models could be most efficient because modern models switched from 32 bit to 16 bit to 8 bit per parameter (without quantization).

It's not clear where the efficiency frontier actually is. We're good at measuring size, we're good at measuring FLOPS, we're really not very good at measuring capability. Because of that, we don't really know yet whether we can do meaningfully better at 1 bit per parameter than we currently get out of quantising down to that size. Probably, is the answer, but it's going to be a while before anyone working at 1 bit per param has sunk as many FLOPS into it as the frontier labs have at higher bit counts.

The thing with efficiency is that it is relative to both inference and training compute. If you do quantization, you need a more powerful higher precision model to quantize from, which doesn't exist if you want to create a frontier model. In this case the question is only whether you get better inference and/or training performance from training e.g. a native 1 bit model.

Currently the optimal training precision seems to be 8 bit (at least used by DeepSeek and some other open weight companies). But this might change with different training methods optimized for 1-bit training, like from this paper I linked before: https://proceedings.neurips.cc/paper_files/paper/2024/hash/7...


The paper has performance comparisons towards the end.

https://arxiv.org/abs/2402.17764


it also reminds me a bit of this diffusion paper [1] which proposes having an encoding layer and a decoding layer but repeats the middle layers until a fixed point is reached. but really there is a whole field of "deep equilibrium models" that is similar. it wouldn't be surprising if large models develop similar circuits naturally when faced with enough data.
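For anyone unfamiliar with the "repeat the middle until a fixed point" idea, here is a minimal NumPy sketch. It is not the method from the linked paper or from any deep equilibrium implementation: the single tanh layer, the rescaling of `W` to spectral norm 0.5 (which guarantees convergence) and the function name `fixed_point_layer` are assumptions chosen to keep it small.

```python
import numpy as np


def fixed_point_layer(x, W, max_iters=200, tol=1e-8):
    """Apply z <- tanh(W z + x) repeatedly until z stops changing."""
    z = np.zeros_like(x)
    for i in range(1, max_iters + 1):
        z_next = np.tanh(W @ z + x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next, i
        z = z_next
    return z, max_iters


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
# Rescale so the largest singular value is 0.5; tanh is 1-Lipschitz, so the
# update is a contraction and the iteration must converge.
W *= 0.5 / np.linalg.norm(W, 2)
x = rng.standard_normal(8)

z_star, n_iters = fixed_point_layer(x, W)
# At a fixed point, applying the layer once more changes (almost) nothing.
residual = np.linalg.norm(np.tanh(W @ z_star + x) - z_star)
print(n_iters, residual)
```

Real models give no such contraction guarantee, which is part of why finding repeatable blocks in a trained network is a search problem rather than a derivation.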

finding them on the other hand is not easy! as you've shown, i guess brute force is one way.. it would be nice to find a short cut but unfortunately as your diagrams show, the landscape isn't exactly smooth.

I would also hypothesize that different circuits likely exist for different "problems" and that these are messy and overlapping so the repeated layers that improve math for example may not line up with the repeated layers that improve poetry or whatever, meaning the basic layer repetition is too "simple" to be very general. that said you've obviously shown that there is some amount of generalizing at work, which is definitely interesting.

[1] https://arxiv.org/abs/2401.08741


This is interesting because I've been considering a similar project. I maintain a package for a scientific simulation codebase. It's all in Fortran and C++ with too much template code, which takes ages to build and is very error prone, and frankly a pain to maintain with its monstrous CMake spaghetti build system. Furthermore the whole thing would benefit from a rewrite around GPU-based execution, and generally a better separation between the API for specifying the simulation and the execution engine.

So I've been thinking of rewriting it in Jax and did an initial experiment to port a few of the main classes to Python using Gemini. It did a fairly good job. I want to continue with it, but I'm also a bit hesitant because this is software that the upstream developers have been working on for 20+ years. The idea of just saying to them "hey look I rewrote this with AI and it's way better now" is not something I would do without giving myself pause for thought. In this case it's not about the license, they already use a permissive one, but just the general principle of suggesting a "replacement" for their work.. if I was doing it by hand it might be different, I don't know, they might appreciate that more, but I have no interest in spending that much time on it.

Probably what I will do is just present the PoC and ask if they think it's worth attempting to auto-convert everything, they might be open to it. But yeah, auto-transpiling huge amounts of software for modernization purposes is a really interesting application of AI, amazing to think of all the possibilities. But I'm happy to have read the article because I certainly didn't think about the copyright implications.

If you really want to do that, the sensible thing is to keep it separate from the original and respect the original license. There would have been no outcry if that happened with chardet. If the different package is genuinely better, it will be used.

I think your last point raises the following question: how would you change your answer if you know they read all about guns and death and how one causes the other? What if they'd seen pictures of guns? And pictures of victims of guns annotated as such? What if they'd seen videos of people being shot by guns?

I mean I sort of understand what you're trying to say but in fact a great deal of knowledge we get about the world we live in, we get second hand.

There are plenty of people who've never held a gun, or had a gun aimed at them, and.. granted, you could argue they probably wouldn't read that line the same way as people who have, but that doesn't mean that the average Joe who's never been around a gun can't enjoy media that features guns.

Same thing about lots of things. For instance it's not hard for me to think of animals I've never seen with my own eyes. A koala for instance. But I've seen pictures. I assume they exist. I can tell you something about their diet. Does that mean I'm no better than an LLM when it comes to koala knowledge? Probably!


It’s more complicated to think about, but it’s still the same result. Think about the structure of a dictionary: all of the words are defined in terms of other words in the dictionary, but if you’ve never experienced reality as an embodied person then none of those words mean anything to you. They’re as meaningless as some randomly generated graph with a million vertices and a randomly chosen set of edges according to some edge distribution that matches what we might see in an English dictionary.

Bringing pictures into the mix still doesn’t add anything, because the pictures aren’t any more connected to real world experiences. Flooding a bunch of images into the mind of someone who was blind from birth (even if you connect the images to words) isn’t going to make any sense to them, so we shouldn’t expect the LLM to do any better.

Think about the experience of a growing baby, toddler, and child. This person is not having a bunch of training data blasted at them. They’re gradually learning about the world in an interactive, multi-sensory and multi-manipulative manner. The true understanding of words and concepts comes from integrating all of their senses with their own manipulations as well as feedback from their parents.

Children also are not blank slates, as is popularly claimed, but come equipped with built-in brain structures for vision, including facial recognition, voice recognition (the ability to recognize mom’s voice within a day or two of birth), universal grammar, and a program for learning motor coordination through sensory feedback.


> The hardest part was figuring out OpenLDAPs configuration syntax, especially the correct ldif incantations ..

As a long time Linux user on personal machines, I found myself for the first time a couple of years ago needing to support a small team and give them all login access to our small cluster. I figured, hey it's annoying to coordinate user ids over these machines, I should just set up OpenLDAP.. little did I know.. honestly I'm pretty handy at dealing with Linux but I was shocked to discover how complicated and annoying it was to set up and use OpenLDAP with NFS automounting home directories.

For the first time in my life I was like, "oh this is why people spend years studying system administration.."

I did get it working eventually but it was hard to trust it and the configuration GUI was not very good and I never fully got passwd working properly so I had to intervene to help people change their passwords.. in the end we ended up just using manually coordinated local accounts.

The whole time I'm just thinking, I must be missing something, it can't be this bad.. I'm still a bit flabbergasted by the experience.


> and the quality of the result can be measured automatically

this part is nontrivial though


> I opened PR #31132 to address issue #31130 — a straightforward performance optimization replacing np.column_stack() with np.vstack().T().

> The technical facts: - np.column_stack([x, y]): 20.63 µs - np.vstack([x, y]).T: 13.18 µs - 36% faster

Does anyone know if this is even true? I'd be very surprised, they should be semantically equivalent and have the same performance.

In any case, "column_stack" is a clearer way to express the intention of what is happening. I would agree with the maintainer that unless this is a very hot loop (I didn't look into it) the sacrifice of semantic clarity for shaving off 7 microseconds is absolutely not worth it.
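A quick way to check the equivalence question yourself, as a sketch: for 1-D inputs the two expressions give equal values, but `np.vstack([x, y]).T` is a transposed view, so its memory layout differs from a freshly built `(N, 2)` array. The array size and repeat count below are arbitrary choices of mine, and the timings will vary by machine and NumPy version, so no numbers are claimed here.

```python
import timeit

import numpy as np

x = np.arange(1000, dtype=float)
y = x ** 2

a = np.column_stack([x, y])   # builds an (N, 2) array directly
b = np.vstack([x, y]).T       # builds a (2, N) array, then returns a transposed view

same_values = np.array_equal(a, b)
print("equal values:", same_values, "| shapes:", a.shape, b.shape)

# The values match, but the memory layout can differ: b is a view onto the
# (2, N) array, so its rows are not contiguous in memory.
print("a C-contiguous:", a.flags["C_CONTIGUOUS"])
print("b C-contiguous:", b.flags["C_CONTIGUOUS"], "| b is a view:", b.base is not None)

# Rough timing; treat the result as indicative only.
t_col = timeit.timeit(lambda: np.column_stack([x, y]), number=1000)
t_vst = timeit.timeit(lambda: np.vstack([x, y]).T, number=1000)
print(f"column_stack: {t_col:.4f}s  vstack.T: {t_vst:.4f}s (1000 calls each)")
```

So "semantically equivalent" holds for the values, but downstream code that cares about contiguity sees a different array, which is one more reason not to swap them casually.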

That the AI refuses to understand this is really poor, shows a total lack of understanding of what programming is about.

Having to close spurious, automatically-generated PRs that make minor inconsequential changes is just really annoying. It's annoying enough when humans do it, let alone automated agents that have nothing to gain. Having the AI pretend to then be offended is just awful behaviour.


The benchmarks are not invented by the LLM, they are from an issue where Scott Shambaugh himself suggests this change as low-hanging, but low importance, perf improvement fruit:

https://github.com/matplotlib/matplotlib/issues/31130


Ah fair enough. But then it seems the bot completely ignored the discussion in question, there's a reason they spent time evaluating and discussing it instead of just making the change. Having a bot push on the issue that the humans are already well aware of is just as bad behaviour.


It's cool but do I really want a single browser tab downloading 2.5 GB of data and then just leaving it to be ephemerally deleted? I know the internet is fast now and disk space is cheap but I have trouble bringing myself around to this way of doing things. It feels so inefficient. I do like the idea of client-side compute, but I feel like a model (or anything) this big belongs on the server.


I don't think local as it stands with browsers will take off simply from the lead time (of downloading the model), but a new web API for LLMs could change that. Some standard API to communicate with the user's preferred model, abstracting over local inference (like what Chrome does with Gemini Nano (?)) and remote inference (LM Studio or calling out to a provider). This way, every site that wants a language model just has to ask the browser for it, and they'd share weights on-disk across sites.


It sounds good, but I'm not sure that in practice sites will want to "let go" of control this way, knowing that some random model can be used. Usually sites with chatbots want a lot of control over the model behaviour, and spend a lot of time working on how it answers, be it through context control, guardrails or fine tuning and base model selection. Unless everyone standardizes on a single awesome model that everyone agrees is the best for everything, which I don't see happening any time soon, I think this idea is DOA.

Now I could imagine such an API allowing a site to request a model from huggingface for example, and caching it long term that way, yes just like LM Studio does. But doing this based on some external resource requesting it, vs you doing it purposefully, has major security implications, not to mention not really getting around the lead time problem you mention whenever a new model is requested.


There will always be someone unhappy about literally any aspect of something new. If 2.5gb for a local LLM is considered problematic in 2026, I really cannot think what is safe anymore.

We went from impossible to centralised to local in a couple of years and the "cost" is 2.5gb of hard drive.


I didn't say that 2.5gb is unreasonable for an LLM. I said it's an unreasonable payload size for a website. Not the same.


> The ai tooling reverses this where the thinking is outsourced to the machine and the user is borderline nothing more than a spectator, an observer and a rubber stamp on top.

I find it a bit rare that this is the case though. Usually I have to carefully review what it's doing and guide it. Either by specific suggestions, or by specific tests, etc. I treat it as a "code writer" that doesn't necessarily understand the big picture. So I expect it to fuck up, and correcting it feels far less frustrating if you consider it a tool you are driving rather than letting it drive you. It's great when it gets things right but even then it's you that is confirming this.


This is exactly what I said in the end. Right now you rely on it fucking things up. What happens to you when the AI no longer fucks things up? Sorry to say, but your position is no longer needed.

