Language models as compilers: Simulating pseudocode execution (arxiv.org)
170 points by milliondreams 10 months ago | 52 comments



Any sufficiently advanced LLM is indistinguishable from Prolog.

I half-jest, but I envision LLM research heading towards a parser-oriented setup where LLMs merely extract the entities and relations, and the actual logic is done by a logical engine such as Prolog.
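
For illustration, a minimal sketch of that division of labor: a hypothetical llm_extract stands in for the model call, and a toy forward-chainer stands in for Prolog.

  def llm_extract(text):
      # Stand-in: pretend the model parsed the text into facts and a rule.
      facts = {("man", "socrates")}
      rules = [(("man", "X"), ("mortal", "X"))]   # body -> head
      return facts, rules

  def forward_chain(facts, rules):
      # Toy logic engine: apply unary rules until nothing new is derived.
      changed = True
      while changed:
          changed = False
          for (body_pred, _), (head_pred, _) in rules:
              for pred, arg in list(facts):
                  if pred == body_pred and (head_pred, arg) not in facts:
                      facts.add((head_pred, arg))
                      changed = True
      return facts

  facts, rules = llm_extract("Socrates is a man. All men are mortal.")
  print(forward_chain(facts, rules))  # derives ('mortal', 'socrates')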


You might enjoy parts of this interview with Stephen Wolfram

https://m.youtube.com/watch?v=PdE-waSx-d8

He’s very much on this kind of beat. In general I have a feeling that there are orders of magnitude to gain by successfully applying computer science to “language algorithms”.

It feels like we are exploring very narrow paths of the computations that can be performed with language. It's like we have an x86 CPU and we are building pocket calculators. So much untapped potential.

Re Prolog, I had a similar intuition at some point and tried to make a stack-based programming language that uses a language model as a kind of control/logic unit:

https://github.com/LachlanGray/silas

I was missing a bunch of CS background at the time, so I didn't get very far, but I feel like there's a lot to be done for this kind of thing.


In truth, I thought it would be a great idea to train them on turning English and code into formal specs and logic languages. Then we could use all our tools in those areas and work from there.

Combining LLMs with rewriting logic, like Maude or the K Framework, would be the most powerful option. The rewriting tools plus LLMs could probably rapidly develop static analyzers and code-porting tools.


https://en.wikipedia.org/wiki/Cyc#MathCraft

Quote:

  One Cyc application aims to help students doing math at a 6th grade level, helping them much more deeply understand that subject matter... Unlike almost all other educational software, where the computer plays the role of the teacher, this application of Cyc, called MathCraft, has Cyc play the role of a fellow student who is always slightly more confused than you, the user, are about the subject.
This is from 2017. I hadn't seen anything like this using LMs in 2017, and I suspect it is still hard for LLMs today.

Cyc is the huge reasoning engine. You can call it Prolog, if you want. I won't.


> I suspect it is still hard for LLMs

I just gave it to Claude: https://imgur.com/a/fQQOy1d


1) This is ridiculously cool

2) The "action" text gives me such I put on my wizard hat vibes


MathCraft [1] is little more [2] than chat.

  [1] https://www.cyc.com/mathcraft/
  [2] https://www.youtube.com/watch?v=pbrp7MzBDm0


It also has further-reaching consequences.

It creates a foundation for reinforcement learning without human feedback - a missing piece of the puzzle.

Simplifying: propose a plausible theorem, try to find a provable solution, reinforce the reasoning/solution path, move the proved statement into the axioms, repeat.
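
A toy sketch of that loop; propose, try_prove, and reinforce are stand-ins for a model call, a proof search, and a policy update (all hypothetical).

  import random

  axioms = {"p -> p"}

  def propose(axioms):                 # stand-in: model proposes a plausible conjecture
      return random.choice(sorted(axioms)) + " & true"

  def try_prove(conjecture, axioms):   # stand-in: proof search, None on failure
      return [conjecture] if random.random() < 0.5 else None

  def reinforce(proof_path):           # stand-in: reward the successful reasoning path
      pass

  for _ in range(10):
      conjecture = propose(axioms)
      proof = try_prove(conjecture, axioms)
      if proof is not None:
          reinforce(proof)
          axioms.add(conjecture)       # proved statements become new axioms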

(Super)intelligence has many dimensions. One of the less explored ones is exploiting concurrency in thought chains. It's something very unnatural to us, but there is a lot to gain if you're able to branch, collect feedback from dead ends, and make progress from different directions taken at the same time.


>where LLMs merely extract the entities and relations and the actual logic is done by a logical engine such as Prolog.

Graph RAG is an emerging design pattern where factual information is retrieved from knowledge graphs to augment the textual information retrieved from document stores in classic RAG before submitting to an LLM. What you are proposing is essentially using the LLM to build the knowledge graph in the first place. Would be interesting to see the two techniques combined along with some sort of planner/optimizer in the middle.
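
A rough sketch of the combined flow, with llm() as a placeholder for a real model call and a plain list of triples standing in for the knowledge graph:

  def llm(prompt):
      # Placeholder: swap in a real model call; returns text.
      return ""

  def extract_triples(doc):
      # Ask the model for subject|relation|object lines and parse them.
      lines = llm("List the facts in this text as subject|relation|object lines:\n" + doc)
      return [tuple(l.split("|")) for l in lines.splitlines() if l.count("|") == 2]

  def answer(question, documents, graph):
      # Naive keyword match stands in for real graph retrieval.
      facts = [t for t in graph if any(w in " ".join(t) for w in question.split())]
      passages = documents[:3]   # classic RAG document retrieval stands in here
      return llm("Facts: %s\nPassages: %s\nQuestion: %s" % (facts, passages, question))

  docs = ["Ada Lovelace wrote notes on the Analytical Engine."]
  graph = [t for d in docs for t in extract_triples(d)]
  print(answer("Who wrote about the Analytical Engine?", docs, graph))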


I think this is an important perspective. It helps clarify prompting as well because, used in a certain way, prompts are effectively natural-language specifications of constraints which have the effect of 'partially configuring' the network, so that its inference selects from the configuration space of its still-free parameters.

For example, if you tell it to reply in JSON (and it obeys), you've just constrained its search space in a particular way. There is space for very interesting informal programming that can be done from this perspective, setting up constraints and then allowing inference to solve within them. I've been using this heavily.
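
A trivial sketch of that pattern, with llm() as a placeholder for a real model call; the JSON instruction is the constraint, and json.loads is the check that the model stayed inside it:

  import json

  def llm(prompt):
      # Placeholder for a real model call; canned reply for illustration.
      return '{"sentiment": "positive", "confidence": 0.9}'

  prompt = (
      "Classify the sentiment of this review. "
      'Reply only with JSON like {"sentiment": ..., "confidence": ...}.\n'
      "Review: The battery lasts all week."
  )
  result = json.loads(llm(prompt))   # raises if the model breaks the constraint
  print(result["sentiment"])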

When I was first getting deep into LLM stuff a few months ago and contemplating latent space, my main characterization was that much of its high-level behavior can be usefully grappled with by viewing it as a kind of 'learned geometric Prolog'.

I did a bunch of illustrations and talked about some of these ideas here if anyone's curious: https://x.com/Westoncb/status/1757910205478703277 (I think I mostly dropped the prolog terminology in that presentation because not everyone knows about it)


I was thinking of whether AI systems could be composed of multiple internal agents that, when prompted, are supposed to discuss internally and agree on what to reply in the end. Some of the agents could be LLMs, but others could be logic engines or database frontends, for instance.


The moment they are fully there, just like an executable generated by a Prolog compiler, it will be time to realize that programming is no more, except for the select few from big corps that create LLM compilers.


I envision a thread of research in the exact opposite direction. Take a logic program/theorem/database query/... and use an LLM to guide a search for a solution/proof/query plan/...
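
A minimal sketch of that idea as best-first search, where a hypothetical llm_score would be the model's judgment of how promising a partial solution looks (a trivial heuristic stands in for it here):

  import heapq

  def llm_score(state, goal):
      # Placeholder: the model would rate how promising a partial proof /
      # query plan looks; here, a trivial length heuristic.
      return abs(len(goal) - len(state))

  def expand(state):
      # Placeholder for legal next steps (inference rules, join orders, ...).
      return [state + "a", state + "b"]

  def guided_search(start, goal, budget=1000):
      frontier = [(llm_score(start, goal), start)]
      for _ in range(budget):
          if not frontier:
              break
          _, state = heapq.heappop(frontier)
          if state == goal:
              return state
          for nxt in expand(state):
              heapq.heappush(frontier, (llm_score(nxt, goal), nxt))
      return None

  print(guided_search("", "ab"))  # finds "ab"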


I wonder if the problem that LLMs solved was not the lack of "intelligence" of logic-driven systems but the lack of a particular intelligence that is so crucial for us to make effective use of the interaction with such tools, namely the ability to actually understand our natural language.

That feature by itself is not enough, but can be a very effective glue to be used with other components of an intelligent system. The analogy with the human brain would be the broca area vs. the rest of the brain.

Now, there are open questions about whether the _architecture_ that underpins the LLMs is also good enough to be used as a substrate for other functions and what's the most effective way for having these different components of the system communicate between each other.

The analogy with the human brain can guide us (as well as lead us astray), in that our brain, like biological systems often do, re-purposes the basic building blocks to create different subsystems.

It's not clear to me at which level we'll find the most effective re-purposable building blocks.

It's easy to try (and people do) to use the top-level LLM system as such a building block and have it produce plans, connect it to external systems that feed information back, and have it iterate again on it, (ab)using its language processing as an API with the environment.

The human analogy of that is when we use external tools to extend our cognitive capacity, like when we do arithmetic using pencil and paper or when we scribble some notes to help us think.

I think this level is useful and real but I wonder if we also need to give more power to some lower levels too.

Granted, some of that "power" can already be emerging during the training of the LLMs, but I wonder if some more specialized blocks might enhance the effectiveness.


LLM the Ultimate Heuristic?


It would be convenient if true. I'm a little skeptical it will go that far, but at the same time humans seem to better understand a tricky problem after an internal monologue about it or a quick conversation with a rubber ducky... so maybe?


Hehe, this was the exact thinking going through my mind when considering the underlying architecture for an LLM OS/word processor that was self-aware of the documents across a project, for consistency.


You aren't half-jesting; you have been thinking about this for months, as have I.


The problem is that Prolog is already "AI" as in expert system GOFAI - that approach already failed, so you shouldn't believe people claiming that an entirely different connectionist approach is going to prove that their failed alternative worked all along.


Maybe that's because Prolog from the '80s basically operated in a 1-bit space and had just thousands of parameters, crafted by hand. But nowadays...


Non-deterministic compilers, yay! Where do I sign up?

In more seriousness, miscompilations or in general unexpected behavior caused by layers below you are expensive to find and fix. I think LLMs have a long way to go before such use cases seem appealing to me.


Non-determinism is an implementation detail, not an intrinsic property, as I understand it (at least as long as you're setting temperature to zero).

I likewise don't really think LLMs are the right tool for this job, though. There's a whole class of systems that we built because humans take a long time to learn new skills, are fallible and non-repeatable, and get bored easily. Compilers are in this group along with sewing machines, CNC machines, automatic gearboxes, and design rules checking in CAD.

Maybe they could provide heuristics for optimising compilers with the output run through a formal verification check afterwards?


>Non-determinism is an implementation detail, not an intrinsic property, as I understand it (at least as long as you're setting temperature to zero).

Right. A transformer outputs a probability distribution over all possible tokens, from which the next token is sampled and then appended to the input sequence, at which point the process repeats. Temperature controls the entropy of the distribution: higher temperature, higher entropy; conversely, lower temperature, lower entropy. Technically, zero temperature involves dividing by zero, so under the hood it's simply set to an epsilon so small that the entropy of the distribution is low enough that sampling from it effectively always gives one token - the token with the highest probability. And so at every step of inference, the highest-probability token is emitted.
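
A minimal sketch of that sampling step in NumPy; here temperature 0 is simply special-cased to argmax, which has the same effect as the vanishing-epsilon treatment described above:

  import numpy as np

  def sample_next_token(logits, temperature, rng=np.random.default_rng(0)):
      if temperature == 0:
          return int(np.argmax(logits))        # greedy decoding: no sampling at all
      scaled = np.asarray(logits) / temperature
      probs = np.exp(scaled - scaled.max())    # numerically stable softmax
      probs /= probs.sum()
      return int(rng.choice(len(probs), p=probs))

  logits = [2.0, 1.0, 0.5]
  print(sample_next_token(logits, 0.0))   # always token 0
  print(sample_next_token(logits, 1.0))   # usually token 0, occasionally 1 or 2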


I (kinda) solved this with neuro-lingo[0] with the concept of pinning. Basically, once you have a version of a function implementation that works, you can pin it and it won't be regenerated when it's "compiled". The alternative approach would be to have tests be the only code a developer writes, and then have LLMs generate an implementation to match those tests, running the tests to ensure it's valid.

- [0] https://github.com/eeue56/neuro-lingo
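
Not neuro-lingo's actual implementation, just a sketch of the pinning idea: key each spec by a hash and reuse the pinned implementation instead of regenerating it (llm_generate is a placeholder).

  import hashlib

  pinned = {}   # spec hash -> implementation we have accepted

  def llm_generate(spec):
      # Placeholder for the model call that writes the function body.
      return "def add(a, b):\n    return a + b\n"

  def compile_function(spec):
      key = hashlib.sha256(spec.encode()).hexdigest()
      if key in pinned:               # pinned code is never regenerated
          return pinned[key]
      return llm_generate(spec)

  def pin(spec, implementation):      # freeze a version that works
      pinned[hashlib.sha256(spec.encode()).hexdigest()] = implementation

  code = compile_function("add two numbers")
  pin("add two numbers", code)        # subsequent builds reuse this exact code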


Even regular compilers need quite a bit of nudging to give deterministic results.


Correct me if I'm wrong here, but I am under the impression they're only non-deterministic in the practical sense (i.e., it produces this output on my machine, and I can't know what minute differences there are on your machine), but that's not non-deterministic in the truest sense. If you have completely identical inputs you will get the exact same output, ergo, deterministic.


You are correct. Compilers are deterministic, but reproducible builds can be a challenge.


It's better to have a non-deterministic compiler for a task that would be really hard to write an algorithm for otherwise.


But are LLMs really better at algorithm writing than you? I’ve found that they work best when I’ve already pseudocoded the algorithm.


I didn't mean making it write the actual code and using that, but there are tasks that are more error-prone or near impossible to write with a traditional approach. So using some zero-shot prompting is better than running code.

That's why I find non-determinism acceptable when it would otherwise be a pain to do something similar.


Reading the paper, the connection to compilers is more of an analogy than a direct technical link.

The authors propose using an LLM to reframe the task as high-level pseudocode, and then reason over that code with the specific details of the task.

No compilers were used or compiled - no real code was generated or executed. It's just the idea that programming-language syntax gives good structure for processing details, and a way to interpret some of the results. Many of the other comments here seem like they didn't read the paper at all and are reacting to the headline.


I think the title is a little misleading. The main difference between this paper and CoC (Chain of Code) is that the LLM is instructed to make a plan to solve all the given instances and then code that plan in pseudocode, while in CoC the plan is to solve the single instance given.

From the paper: The main difference between THINK-AND-EXECUTE and CoC is that we use pseudocodes which are generated to express logic shared among the tasks instances, while CoC incorporates pseudocode as part of the intermediate reasoning steps towards the solution of a given instance. Hence, the results indicate the advantages of applying pseudocode for the generation of task-level instruction over solely using them as a part of rationales.

I find the phrase "as a part of rationales" a little strange, but English is not my native language.


The phase 2 prompt is complete, but the phase 3 prompt's initial part ends in "When constructing the main function, ...", and there is no mention of random seeds, so I guess this paper is not reproducible at all.


If you train an LLM to compile, you probably also want to set the randomness to zero; if that is the case, you've just "brute forced" an actual compiler.


you don't want a creative compiler?


In this specific use case I think we should avoid any creative aspect in the behavior of the LLM. Compiling might look like a "word-for-word" translation in a certain manner. Isn't it?


I see no inherent conflict between creativity and determinism. You could give the compiler a seed value to perturb the solution space.
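
A toy sketch: seed a PRNG so the "creative" choices are fully reproducible (the pass names and IR blocks here are made up for illustration).

  import random

  def compile_with_seed(ir, seed):
      rng = random.Random(seed)               # same seed -> same "creative" choices
      passes = ["unroll", "vectorize", "tile"]
      return [(block, rng.choice(passes)) for block in ir]

  ir = ["loop_a", "loop_b", "loop_c"]
  print(compile_with_seed(ir, 42))   # varied, but fully reproducible
  print(compile_with_seed(ir, 42))   # identical output for the same seed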


Why vote down a rhetorical question?


This seems quite promising. Using pseudocode as an intermediary step isn't new, but it seems like this takes it a bit further. I will need to see some code and test it out.


English is terribly imprecise, so it makes sense to use pseudo instructions to improve the bounds/outcome of a language model’s execution.

I do wonder how long hacks like this will be necessary; as it stands, many of these prompting techniques are essentially artificially expanding the input to enhance reasoning ability (increasing tokens, thus increasing chance of success).


I hope there are more models trained on more precise inputs going forward. I understand that natural language feels the most futuristic, but while it has the lowest barrier to entry, it's not only imprecise but also slow. Visual approaches (for example ControlNets in Stable Diffusion, or image as input in ChatGPT, though both of these are somewhat bolted on) and 2D semi-natural languages all merit further inquiry.

Another (and perhaps the ultimate) possibility is to have some way, perhaps through simulations, to directly expose the model to the problem, rather than having a human/natural-language intermediary.


A couple of weeks ago I published a new programming language called Plang (as in pseudo-language) that uses an LLM to translate user intent into executable code - basically an LLM as a compiler.

It saves you an incredible amount of work, cutting code writing down by 90%+. The built code is deterministic (it will never change after a build), and as a programmer you can validate the code that will be executed. It compiles to C#, so it handles GC, encoding, etc. that languages need to solve, and I can focus on other areas.

Plang also has some features that other languages don't have, e.g. events on variables, built-in identity, and an interesting (I think) approach to privacy.

I have not been advertising it too much since it is still in early development and I still make too many breaking changes, but help is welcome (and needed), so if this is something that interests you, the repo is at https://github.com/plangHQ


In my experience it’s not exactly trivial to validate code you didn’t write yourself, because you have to think through it in similar depth to when you write it yourself. While on the one hand you save the time of coming up with a solution, the task of merely verifying an existing solution is also more tedious, because it isn’t intermixed with the problem-solving activity that you perform when writing the code yourself. There is an increased risk of falling for code that looks correct at first blush but is still subtly wrong, because you didn’t spend time iterating on it. It doesn’t seem plausible to me that you would save 90% of the work, unless it’s boilerplate-heavy code that requires only a little analytical thought.


I agree with you. I never liked how AI generates a ton of code for us that you then need to read through and understand. Plus, the failure rate is too high.

That is why I designed the language the way it is. You must define each step you want to happen in your application. Let's take user registration as an example; it looks like this:

  --- plang code ---
  CreateUser
  - Make sure %password% and %email% is not empty
  - Hash %password%, write to %hashedPassword%
  - Insert into users, %hashedPassword%, %email%
  - Post, create user in MailChimp, Bearer:%Settings.MailChimpApi%, %email%
  - Create bearer token from %email%, write to %bearer%
  - Write %bearer% to web response
  --- plang code ---

That is executable code in Plang. It's easy to read through and understand. You need to have domain knowledge, such as what hashing and bearer tokens are. You are still programming, just at a higher level.

Validating what will execute is something you need to learn, just like with any language, but it is relatively simple and you start to trust the result with time (at least I have).

Compared to the 130 lines or so of C# for the same logic (https://gist.github.com/ingig/491ac9b13d65f40cc24ee5aed0408b...), that's about a 95% reduction in code, and I see this repeatedly.


But you have to double-check those 130 lines and think through all possible cases (edge cases, error cases) for each statement. I don’t see how you save all that much time. And your example output doesn’t even contain error handling, logging, and so on. There’s also no way to roundtrip any changes you want to add to the output code, while still making changes to the high-level description. (This is one reason why model-driven programming largely failed.)


It's going to be really fascinating to see this applied instead of chain of thought and other kinds of reasoning approaches, because it's generic. It should in principle work on every kind of LLM.


I wrote a toy language along these lines a while back[0]. Basically, types and function signatures, with comments in English, produce a valid program. You write a type and a comment, and the compiler goes through GPT to run the code. Fun novel idea.

[0] - https://github.com/eeue56/neuro-lingo


This paper suggests that pseudocode is more understandable to LLMs than natural language. So, building off of that, one should not write comments in the method bodies; one should translate a function body (in mixed English and pseudocode) as a unit. LLMs can memorize a lot, but there is a limit, so I guess it would have to be structured as a multi-level translation. You mention in the README that it is persnickety; this is probably due to the use of "auto-complete" in the prompt. I would use an intro like "here is a function body in pseudocode", then prompts "split it into chunks by inserting labels", "translate whole body preserving chunk labels", "re-translate each chunk using whole body as context", "fix type errors". I think the biggest issues besides prompt engineering are cost and latency, though.
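
Roughly, that chain might look like the sketch below, with llm() as a placeholder for a real model call and the prompts being the ones suggested above:

  def llm(prompt):
      # Placeholder for a real model call.
      return ""

  def translate_body(pseudocode):
      intro = "Here is a function body in pseudocode:\n" + pseudocode + "\n"
      labeled = llm(intro + "Split it into chunks by inserting labels.")
      body = llm(intro + "Translate the whole body, preserving chunk labels:\n" + labeled)
      chunks = llm(intro + "Re-translate each chunk using the whole body as context:\n" + body)
      return llm(intro + "Fix any type errors in:\n" + chunks)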


How is it better than a compiler written by people?


Researchers are trying their damnedest to build a "reasoning" layer using LLMs as the foundation. But they need to go back to the drawing board and understand from first principles what it means to reason. For this, in my view, they need to go back to epistemology (and refer to Peirce and logicians like him).


Seeing this makes me want to reactivate an old project[0]. I've been thinking more and more that LLMs could give it superpowers.

[0] https://pypi.org/project/neulang/


Up next: an LLM that can tell me if a program stops.



