Discovering algorithms by enumerating terms in Haskell (twitter.com/victortaelin)
143 points by agomez314 4 months ago | 48 comments



Maciej Bendkowski has some related work [1] on generating random lambda terms, but was unable to overcome what he calls the asymptotic sparsity problem:

  Sampling simply-typed terms seems notoriously more challenging than sampling closed ones. Even rejection sampling, whenever applicable, admits serious limitations due to the imminent asymptotic sparsity problem — asymptotically almost no term, be it either plain or closed, is at the same time (simply) typeable. [...] Asymptotic sparsity of simply-typed λ-terms is an impenetrable barrier to rejection sampling techniques. As the term size tends to infinity, so does the induced rejection overhead. In order to postpone this inevitable obstacle, it is possible to use dedicated mechanisms interrupting the sampler as soon as it is clear that the partially generated term cannot be extended to a typeable one. The current state-of-the-art samplers take this approach, combining Boltzmann models with modern logic programming execution engines backed by highly-optimised unification algorithms. Nonetheless, even with these sophisticated optimisations, such samplers are not likely to generate terms of sizes larger than one hundred.
I would be curious to see a more rigorous analysis of the sample complexity of generating well-typed expressions in, e.g., the STLC. Maybe there is a way to avoid or reduce the rejection rate before evaluation.

[1]: https://arxiv.org/pdf/2005.08856


There are quite a few publications that explore the concept of generating programs, using either typed or untyped functional languages.

For example, wake-sleep learning with neural networks on a Lisp-like language [1], or synthesis of Haskell guided by refinement types [2].

[1] https://royalsocietypublishing.org/doi/full/10.1098/rsta.202....

[2] https://dl.acm.org/doi/abs/10.1145/2980983.2908093


The trick is not just synthesizing valid functions, but doing so in a parallel communication-free manner, without compromising soundness or completeness. You want to massively scale up a discrete sampler without replacement. One very efficient way of doing this is by constructing an explicit bijection from the sample space to the integers, sampling integers, then decoding them into programs.
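
To make that concrete, here is a toy sketch of decoding integers into programs (my own, for a deliberately tiny expression grammar, not anyone's production code), so each worker can consume a disjoint slice of the naturals with no communication:

  -- Toy grammar; the point is only that unrank is (intended to be) a
  -- bijection from the naturals onto the terms.
  data Expr = X | Zero | Succ Expr | Add Expr Expr
    deriving Show

  -- Cantor unpairing: inverse of pair (i, j) = (i+j)*(i+j+1) `div` 2 + j.
  unpair :: Integer -> (Integer, Integer)
  unpair k = go 0 k
    where
      go w r | r <= w    = (w - r, r)
             | otherwise = go (w + 1) (r - w - 1)

  -- Decode an integer into a term; worker i of k decodes i, i+k, i+2k, ...
  unrank :: Integer -> Expr
  unrank 0 = X
  unrank 1 = Zero
  unrank n
    | even m    = Succ (unrank (m `div` 2))
    | otherwise = let (i, j) = unpair (m `div` 2) in Add (unrank i) (unrank j)
    where m = n - 2

  -- map unrank [0..5] == [X, Zero, Succ X, Add X X, Succ Zero, Add Zero X]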

While this technique enjoys certain advantages, i.e., it is embarrassingly parallelizable and guaranteed to enumerate distinct solutions with a bounded delay, it is also somewhat unnatural. By flattening the distribution onto the integers à la Gödel numbering, it destroys locality, does not play well with incremental decoding methods (left-to-right is currently en vogue in generative language modeling), and will fail if the sample space is uncountable.

Another key step is reducing symmetries in your sample space by quotienting it somehow (e.g., by α-equivalence). The author seems to be invoking some kind of equivalence relation by “superposition”, but the technical details here are a little fuzzy.
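
One standard way to get the α-equivalence quotient for free (which may or may not be what the author does) is to enumerate nameless de Bruijn terms, so each α-class has exactly one representative:

  -- Nameless (de Bruijn) terms: variables are indices into the stack of
  -- enclosing binders, so α-equivalent terms are literally equal.
  data Term = Var Int | Lam Term | App Term Term
    deriving (Eq, Show)

  -- \x. \y. x  and  \a. \b. a  are both:
  k :: Term
  k = Lam (Lam (Var 1))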

This problem is also closely related to model counting in the CSP literature, so a practical speedup could lead to improvements on a lot of interesting downstream benchmarks.

In general, the problem of program induction from input-output examples is not well-posed, so specialized solvers that can make stronger assumptions will usually have an advantage on domain-specific benchmarks. Most existing program synthesizers do not satisfy all of these desiderata (e.g., soundness, completeness, naturalness, incrementality).


>> The trick is not just synthesizing valid functions, but doing so in a parallel communication-free manner, without compromising soundness or completeness.

Right! A great way to do this is to learn a program by using it to prove the training examples while it is being learned. That's a very cool ability that some systems in the new wave of Inductive Logic Programming can pull off, but probably nothing else can, as far as I can tell.


>> There are quite a few publications that explore the concept of generating programs, either using typed or untyped functional languages.

That's Inductive Functional Programming (IFP), a kind of Inductive Programming that also includes Inductive Logic Programming (ILP). The canonical example of IFP is Magic Haskeller:

https://nautilus.cs.miyazaki-u.ac.jp/~skata/MagicHaskeller.h...

As an example of a modern ILP system, I suggest Popper:

https://github.com/logic-and-learning-lab/Popper/

Or Louise (mine):

https://github.com/stassa/louise

One of the DreamCoder papers describes Inductive Programming as a form of weakly supervised learning, in the sense that such systems learn to generate programs not from examples of programs, but from examples of the target programs' behaviours, i.e. their inputs and outputs. By contrast, LLMs or slightly older neural program synthesis systems are trained on examples that consist of pairs of (programming-task, program-solving-the-task).

Another way to see the difference between Inductive Programming systems and conventional machine learning systems used for program synthesis is that Inductive Programming systems learn by solving problems rather than from observing solutions.

The advantage is that, in this way, we can learn programs that we don't know how to write (because we don't have to generate examples of such programs) whereas with conventional machine learning we can only generate programs like the ones the system's been trained on before.

Another advantage is that it's much easier to generate examples. For instance, if I want to learn a program that reverses a list, I give some examples of lists and their reverse, e.g. reverse([a,b,c],[c,b,a]) and reverse([1,2,3],[3,2,1]) whereas e.g. an LLM must be trained on explicit examples of list-reversing programs; like, their source code.

IFP and ILP systems are also very sample-efficient, so they only need a handful of examples, often just one, whereas neural net-based systems may need millions (no exaggeration; I can give a ref if needed).

The disadvantage is that learning a program usually (but not always; see Louise, above) implies searching a very large combinatorial space, and that can get very expensive, very, very fast. But there are ways around that.


Thanks, I know ILP quite well and also your research, Muggleton et al, etc.

It is a very interesting field which I hope makes a comeback once systems become neurosymbolic.


Knowing ILP at all, let alone well, is so rare that I totally misread your comment.

I'm one of the last purely symbolic hold-outs in ILP I guess. But if you're interested in recent neurosymbolic work using MIL this is some recent work from my colleagues:

Abductive Knowledge Induction from Raw Data

https://www.ijcai.org/proceedings/2021/254


>> once systems become neurosymbolic

What does this mean?


There's a whole research area that tries to combine some ideas from symbolic AI with neural architectures.

IMHO, this might eventually become mainstream. For example, see all the work that merges theorem provers with RL and NN.


What are examples of "symbolic AI", "neural architecture" and a system that is "neurosymbolic" ?


Almost everything that comes out of DeepMind.


What makes it symbolic?


There's usually a DL part (the neural network), welded together with a classical AI algorithm, e.g. Monte Carlo Tree Search, or some kind of theorem prover.


What's the symbolic part? What does 'DL' mean here?


The symbolic part is the MCTS and/or theorem prover. DL is deep learning, i.e. the multilayered neural network. Neurosymbolic approaches form a significant branch of current machine learning. You can read up on all of it if you're actually curious and asking all this in good faith, but I suspect you're not.


Another relevant publication in this line of research: https://dl.acm.org/doi/10.1145/3632919


I did some work on this about a decade ago, using RL on the STLC, and that was the same problem I faced. It's too bad so few well-typed expression trees are very useful programs.


What the author is getting at is a pretty cool research area called program synthesis, where the goal is to create a program that satisfies a specification.

Most techniques are essentially brute force enumeration with tricks to improve performance, so they tend to struggle to find larger programs. A lot of active research is in improving performance.

Compared to asking an LLM to write a program, program synthesis approaches guarantee that a solution will satisfy the specification, which can be very powerful.

In particular, as the author has discovered, one area where program synthesis excels is finding small intricate bitwise operator heavy programs that can be hard to reason about as a human.

The most famous example of program synthesis is Microsoft's FlashFill, which is used in Excel. You give it a few input output examples and FlashFill will try to create a small program to generalize them, and you can apply the program to more inputs, which saves you a bunch of time. An example from the paper is:

  Input -> Output

  International Business Machines -> IBM
  Principles Of Programming Languages -> POPL
  International Conference on Software Engineering -> ICSE

  String Program: Loop(\w : Concatenate(SubStr2(v_1, UpperTok, w)))
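
Just to show the behaviour being synthesized, here is a hand-written Haskell rendering of what that DSL program does (not how FlashFill finds it):

  import Data.Char (isUpper)

  -- Keep the first letter of every capitalised word.
  initials :: String -> String
  initials s = [head w | w <- words s, isUpper (head w)]

  -- initials "International Conference on Software Engineering" == "ICSE"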

Here's a few papers:

EUSOLVER: https://www.cis.upenn.edu/~alur/Tacas17.pdf

FlashFill: https://www.microsoft.com/en-us/research/wp-content/uploads/...

BlinkFill: https://www.vldb.org/pvldb/vol9/p816-singh.pdf

Synquid: https://cseweb.ucsd.edu/~npolikarpova/publications/pldi16.pd...


Note that FlashFill is an example of inductive program synthesis, i.e. program synthesis from an incomplete specification, e.g. one in the form of input-output examples, program traces, or natural language descriptions.

Program synthesis from a complete specification is known as deductive program synthesis, and the simplest example is compilation of a program in a high-level language to machine code. A "complete" specification is what it says on the tin: it fully specifies the program to be synthesised.


Program synthesis is also pretty much equivalent to generating proofs of propositions by the Curry-Howard Isomorphism. There was a post from a few days ago about using ML to generate proofs in Lean. I'm sure there's ongoing research to do the same thing with synthesis (which imo is probably going to be more effective at pruning the search space than brute force).
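
To make the correspondence concrete: a term inhabiting a type is a proof of the corresponding proposition, so synthesis really is proof search. A trivial Haskell example:

  -- Proposition: (A -> B) and (B -> C) imply (A -> C).
  -- Any total term of this type is a proof of it; writing the term is a
  -- (very easy) instance of program synthesis.
  trans :: (a -> b) -> (b -> c) -> (a -> c)
  trans f g = g . f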


Either I have become paranoid or this feels LLM generated


or both


Might help to read some of the author's earlier work, like:

https://medium.com/@maiavictor/solving-the-mystery-behind-ab...

And the Omega documentation:

https://hackage.haskell.org/package/control-monad-omega-0.3/...
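
For a flavour of what Omega buys you, here's a minimal sketch (mine, not the author's code) that fairly enumerates untyped lambda terms; the plain list monad would get stuck forever in the first infinite branch, while Omega dovetails them:

  import Control.Monad (join)
  -- needs the control-monad-omega package linked above
  import Control.Monad.Omega (Omega, each, runOmega)

  -- Untyped lambda terms with de Bruijn indices.
  data Term = Var Int | Lam Term | App Term Term
    deriving Show

  -- All terms with n free variables, dovetailing the productions so
  -- no infinite branch starves the others.
  terms :: Int -> Omega Term
  terms n = join . each $
       [ pure (Var i) | i <- [0 .. n - 1] ]
    ++ [ Lam <$> terms (n + 1)
       , App <$> terms n <*> terms n
       ]

  main :: IO ()
  main = mapM_ print (take 10 (runOmega (terms 0)))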

But I'll admit, I want to see a paper or code.


I don't think the original submission has enough details for us to reproduce or even understand what's being done. The Omega monad is just a diagonal search monad that handles infinite search spaces. I don't understand the syntax in the screenshot. What's the type of the terms being enumerated? I can see lambdas, but what are @inc and the pluses and minuses?


Yeah, also the submission headline is wrong. He was using Haskell only for the toy example. He is using HVM for the actual Discrete Program Search. See also the follow up post: https://x.com/VictorTaelin/status/1819774880130158663


"Perhaps I'm being really naive and an exponential wall is about to humble (and embarrass) me."

There is something fascinating and awe-inspiring about searching very large, enumerative datasets for useful results. Like wandering Borges's Library of Babel in search of not just meaning but something true and useful.



I'm confused, what is the search space? All possible functions from Int -> Int? And how do you verify a result is optimal across inputs?


The search space I'm using is that of all functions of a given dependent type. That allows you to narrow the search space by using a strong enough type.

For example, if you search for an `Integer -> Integer -> Integer` function, it will consider Integers of different bit-sizes. But if you instead search for `∀(n: Nat). Int(n) -> Int(n) -> Int(n)`, you will only consider integers of the same bit-size, which is a much smaller space. You can make arbitrary restrictions to narrow your search.
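
Roughly, in Haskell terms (just to give the flavour; the real thing uses dependent types, not Haskell):

  {-# LANGUAGE DataKinds, GADTs, KindSignatures, RankNTypes #-}

  data Nat = Z | S Nat

  -- Width-indexed bit vectors: the type records the number of bits.
  data BitVec (n :: Nat) where
    Nil  :: BitVec 'Z
    Cons :: Bool -> BitVec n -> BitVec ('S n)

  -- A candidate of this type can only combine equal-width inputs into an
  -- equal-width output, so the enumerator never has to consider
  -- mismatched widths, unlike [Bool] -> [Bool] -> [Bool].
  type Adder = forall n. BitVec n -> BitVec n -> BitVec n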


So you're enumerating a search space of functions, with some constraints to keep the search tractable?

When you do this kind of thing, there are two worst cases:

a) Your constraints are too strong and the search space does not include the target.

b) The search space includes the target but it is too large to search in polynomial time.

How are you dealing with those?

To clarify, the happy case is when your search target is easy to find, i.e. when the search space is small and includes the target. But that happens... rarely (because program search spaces are large).


My guess is they sample closed lambda terms representing functions.


Uhm author here. Not sure why this tweet is on Hacker News, as it is just a non-technical "blog post". But I've posted a follow-up today with some code and details, if you're curious:

https://x.com/VictorTaelin/status/1819774880130158663

That's as much as I can share for now, though.


The only thing I could follow (I don't understand any of this otherwise) was that you give it input/output pairs and it gives you a Python function that generates those outputs. Does that mean that many math etc. problems can just be solved? What kind of Python code will it generate to return primes?


It will just return the smallest function that passes your tests.

It works by enumerating ALL possible functions and running them. Obviously, that naive approach is exponential, so, the entire point is whether we can apply some clever tricks (based on optimal evaluators) to make this search "slightly less intractable".

So, to answer your question directly: if you asked it to design a prime number generator, and if it found anything at all, it would probably be a simple, slow trial-and-error algorithm, or something along those lines.
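
A deliberately naive toy version of that loop, just to show the shape of it (the real search is over lambda terms, not this little arithmetic grammar):

  -- Expressions in one variable.
  data Expr = X | One | Add Expr Expr | Mul Expr Expr
    deriving Show

  eval :: Int -> Expr -> Int
  eval x X         = x
  eval _ One       = 1
  eval x (Add a b) = eval x a + eval x b
  eval x (Mul a b) = eval x a * eval x b

  -- All expressions of a given size, so smaller solutions come first.
  exprsOfSize :: Int -> [Expr]
  exprsOfSize 1 = [X, One]
  exprsOfSize n = [ op a b | k  <- [1 .. n - 2]
                           , a  <- exprsOfSize k
                           , b  <- exprsOfSize (n - 1 - k)
                           , op <- [Add, Mul] ]

  -- Enumerate, run, keep the first expression matching every example.
  synth :: [(Int, Int)] -> Expr
  synth io = head [ e | n <- [1 ..], e <- exprsOfSize n
                      , all (\(i, o) -> eval i e == o) io ]

  -- synth [(0,0), (1,1), (2,4), (3,9)]  ==>  Mul X X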


I imagine it like content addressable computation which implicitly means data too.

Why compute anything more than once? Find by identity and reuse.

This then opens up interesting things when applied in a network that can optimize pathing (like an internet, DHT-like overlay).


Hi LightMachine. I can't read Haskell (?). In this screenshot:

https://x.com/VictorTaelin/status/1819208143638831404/photo/...

What's the highlighted program (?) on line 1875997?


Did you know that there is an efficient algorithm for learning (guessing) a regular language from a series of examples/non-examples? Basically, you directly construct the smallest automaton that correctly discriminates the training inputs. It's Dana Angluin's L* algorithm.

The technique generalizes to finite tree automata, meaning you can discover generators of simple expression grammars, given examples/non-examples.

So if you can construe your problem as a labeled tree recognition problem, assuming it's solvable using a finite tree automaton, then you can discover the algorithm for it efficiently.

For example, if you have three strings of bits (2 inputs and 1 output) and line them up as a single string of triplets of bits, it does not surprise me that it is easy to discover the automaton state transition rules that give you the carry bit for each triple and tell you when to reject the triple because the output bit is not correct for the two input bits.
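
Concretely, for binary addition the only state you need is the carry; a quick sketch of that recognizer over triples (my rendering, not the author's code):

  import Data.List (foldl')

  xor :: Bool -> Bool -> Bool
  xor = (/=)

  -- One transition: the state is the carry bit, the input symbol is a
  -- triple (a, b, s) of addend bits and claimed sum bit, read LSB-first.
  -- Accept the symbol iff s is the correct sum bit; move to the new carry.
  step :: Bool -> (Bool, Bool, Bool) -> Maybe Bool
  step c (a, b, s)
    | s == (a `xor` b) `xor` c = Just ((a && b) || (a && c) || (b && c))
    | otherwise                = Nothing

  -- Accept a whole string of triples iff every sum bit checks out and
  -- the final carry is zero.
  accepts :: [(Bool, Bool, Bool)] -> Bool
  accepts = maybe False not . foldl' (\mc t -> mc >>= (`step` t)) (Just False)

  -- accepts (zip3 [True,True,False] [True,False,False] [False,False,True])
  --   == True   -- i.e. 3 + 1 == 4, bits given least-significant first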

The author has arranged the problem this way when sketching the ADC template and has also jump-started the search by assuming the solution space includes exactly one output bit for each pair of input bits. (That may seem like an obvious necessity, but that is a constraint which is not required by the tree automaton formulation, which need not differentiate "inputs" and "outputs".)


I'm confused about what this could be, with a title as technical as this but being posted somewhere with the signal:noise of Twitter, also where I can't view it.


For non-twitter, I was looking the other day at https://reasonablypolymorphic.com/blog/syntax-guided-synthes... , which unfortunately points at a dead page related to https://sygus.org/comp/2019/ , but seems related?


I believe the dead link in the article is referencing this journal paper[1], which is an expanded version of the original syntax guided synthesis paper.

[1] https://web.eecs.umich.edu/~weimerw/2022-481W/readings/Alur1...


Thanks! Looks like barliman is also (like its namesake?) slowly progressing: https://github.com/webyrd/Barliman


This kind of reminds me of what happens when you implement an interpreter in a relational programming language, which lets you do cool stuff like generating quines by specifying that the program and its output should be the same.

A quick search led to this paper, which is exactly what I'm talking about: http://webyrd.net/quines/quines.pdf


William Byrd is the creator of miniKanren btw. And an HN user.


Super interesting. Sounds like the algorithm I created, Prioritized Grammar Enumeration.

https://seminars.math.binghamton.edu/ComboSem/worm-chiu.pge_...

PGE is essentially a BFS/DFS traversal of the space of all formulas, using local enumeration of the AST. The biggest gains were from eliminating duplicate work (commutativity / associativity) and not going down bad branches (too much complexity added for no meaningful change to output). A lot of overlap in ideas here, and a lot of open research questions that could be worked on (like: can we use RL to help guide the search, A*-style?). There's definitely an exponential explosion or wall as the AST gets wider / deeper.
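
The de-duplication part is basically canonicalization of the AST; a tiny sketch of the idea in Haskell (not PGE's actual code):

  import Data.List (sort)

  data Expr = Var String | Const Double | Add [Expr] | Mul [Expr]
    deriving (Eq, Ord, Show)

  -- Flatten nested Add/Mul and sort their arguments, so e.g.
  -- x + (y + x) and (x + x) + y become the same tree and are only
  -- enumerated and evaluated once.
  canon :: Expr -> Expr
  canon (Add es) = Add (sort [ e' | e <- map canon es, e' <- flatAdd e ])
    where flatAdd (Add xs) = xs
          flatAdd e        = [e]
  canon (Mul es) = Mul (sort [ e' | e <- map canon es, e' <- flatMul e ])
    where flatMul (Mul xs) = xs
          flatMul e        = [e]
  canon e = e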

At one point I wrote the core algorithm in Haskell, which made it so much more concise and beautiful, but I eventually landed on Python: https://github.com/verdverm/pypge

In all of Genetic Programming / Symbolic Regression, everyone starts by trying to generate computer code and then switches to just math formulas. They are different classes of problems because code has more "genes" and is order-sensitive, whereas math is not.


Maybe this could be used with the Evolutionary Algorithm subset known as Genetic Programming, originally popularized by John Koza in the 80s?



My intuition tells me you will quickly run out of memory on anything that isn't just a small function. Your state space will just be so big that your superpositions will become too large.


It's kind of vague, but... it sounds like this is essentially a breadth-first search (where the graph being searched is represented by a grammar)?

Hard to tell.



