The TL;DR is that this appears to beat both OpenAI Codex (12 billion params) and DeepMind's AlphaCode (1 billion params), despite having only 770M parameters.
There is also a paper (https://arxiv.org/abs/2207.01780) and the code is available: https://github.com/salesforce/CodeRL
So this seems to buck the trend of needing enormous models, by being smarter about taking into account the output from running the AI-generated program on test cases.
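In its simplest form, that idea amounts to sampling a batch of candidate programs and keeping only the ones that pass the problem's example tests. Here is a rough sketch of that filtering step, my own illustration with placeholder helpers rather than the paper's code; CodeRL additionally feeds the pass/fail signal back into training as a reward, which this sketch does not attempt:

```python
import os
import subprocess
import tempfile

# Hypothetical sketch only: `candidates` would come from sampling a code LM
# (not shown here), and the helper names below are placeholders.

def run_candidate(source: str, test_input: str, timeout: float = 5.0) -> str:
    """Run one candidate Python program on a test input and return its stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run(
            ["python", path], input=test_input,
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.stdout.strip()
    finally:
        os.unlink(path)

def filter_by_example_tests(candidates, example_tests):
    """Keep only candidates whose output matches every (input, output) example."""
    survivors = []
    for source in candidates:
        try:
            if all(run_candidate(source, i) == o.strip() for i, o in example_tests):
                survivors.append(source)
        except Exception:
            pass  # a timeout or any other exception counts as a failure
    return survivors
```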
I would rather say that there isn't a trend of needing enormous models, but rather a trend of observing that enormous models (and straightforward increases in size) work. We've already seen many other tricks applied to transformer models (essentially, most of "BERTology") demonstrate that being smarter about what the training takes into account can bring a big benefit, but that is all orthogonal to the size issue.
That trend would be bucked only if it turned out that 770M parameters are sufficient for this task, i.e. that scaling this 'being smarter' approach from 770M up to the 12B parameters of OpenAI Codex yields no further improvement. I am quite convinced it would be noticeably better than this; my expectation is that the eventual outcome of this technique is an OpenAI Codex v2 (or whatever it ends up being called) that beats this by combining the technique with other improvements to the training process and a size beyond 12B parameters.
> So this seems to buck the trend of needing enormous models, by being smarter about taking into account the output from running the AI-generated program on test cases.
I'd be curious whether it finds degenerate solutions, and what they look like, when the test cases aren't conceptually complete...
"AI-powered code editors"? Hmmm. The first I heard of this idea was about a week ago watching Peter Zeihan dismantle the techo-utopians at https://ark-invest.com/. It seems like one of those future things (like fusion and flying cars) that are always going to remain in the future. I'll stick with boring old vim!
I don’t know what any of those things are, but this idea is both years old and already implemented by several companies (e.g. MS Copilot). Bury your head in the sand based on the signaling of elites all you want. Or you could add the Copilot configuration to your .vimrc and see for yourself…
Thank you. I watched the video. Zeihan's arguments I mostly can't comment on, but it's intriguing how he thinks the supply gap of programmers will be fixed by Gen Z entering the software industry. I'm expecting AI to fill the gap. Yes, LLMs currently kind of suck at program synthesis. But the tech will get better. AGI may still be far off, Metaculus's 2030-centered predictions may be wrong. Scalable, human-level program synthesis may be far off.
But can it improve enough to reduce the demand for programming talent by, all else being equal, 20%? 25%? That seems doable within the decade.
When I see a title with "toward[s]" in it, the first thing I wonder about is: why "toward[s]"? Why aren't we there yet?
Scrolling down to "Performance Results" it's obvious that the results of this approach are creeping upwards at a snail's pace. 40% correct programs when the model is given 1000 "guesses" (one thousand) is the best result out of all reported results. It's the result for the simplest, "intro"-level, problems.
From there, results get progressively worse for harder categories, like harder problem sets or "1@k" measurements, where the first program generated by a model is evaluated. For "intro" problems that's about 7%.
NB: that's accuracy. Not error. 7% accuracy. 7% of the time the model got the coding problem right first try.
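For reference, the "N guesses" numbers in this literature are usually pass@k scores, typically estimated with the unbiased estimator from the Codex paper (Chen et al., 2021). A minimal sketch of that standard estimator follows, with illustrative inputs rather than the paper's raw sample counts:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k programs,
    drawn without replacement from n generated samples of which c passed the
    hidden tests, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers only: 200 samples per problem, 14 of which pass.
print(pass_at_k(200, 14, 1))  # 0.07, i.e. 7% at k = 1
```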
Despite the slow creep upwards, this is really bad for this kind of approach. The main problem of Large Language Models as code generators is that they can generate code, but they don't know if it satisfies a specification (in this case, given by a combination of natural language text and input/output examples). The proposed approach (using reinforcement learning to capture, and use, the signal from unit tests) is supposed to address that precise limitation, but the results show that it just doesn't do very well at all. Not even when the system is allowed 1000 (one thousand) "guesses".
Why not? Because if you start with a code generator that generates almost (not quite) random code, you end up with almost (not always) incorrect programs. After that it doesn't matter how well you "filter" for correct results; you're unlikely to have any in your generated set. The solution? Generate fewer programs that are more likely to be correct. How to do that with an LLM? Who knows.
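To put some back-of-the-envelope numbers on that argument (the per-sample probability below is purely hypothetical, not a figure from the paper): if each sampled program is correct with probability p, a pool of k samples is only expected to contain k*p correct programs, and no filter can recover a correct program that was never generated.

```python
# Back-of-the-envelope only: p is a made-up per-sample success probability.
p = 1e-4    # hypothetical chance that a single sampled program is correct
k = 1000    # number of "guesses" the system is allowed to filter from

expected_correct = k * p            # expected correct programs in the candidate pool
at_least_one = 1 - (1 - p) ** k     # chance the pool contains any correct program,
                                    # assuming independent samples

print(expected_correct)             # 0.1
print(round(at_least_one, 3))       # 0.095
```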
And a small correction to the article:
>> Recent advances in deep learning, such as pretrained language models (LMs), have led to remarkable progress in program synthesis.
These "advances" in deep learning have led to remarkable progress in neural program synthesis. The approach proposed here, combining a generator with a tester, is certainly not a "remarkable advance" for the broader program synthesis field. Rather it is a primitive approach that has been tried and refined continuously since the 1970's or so. The program synthesis field has gone a lot further than what's described in the article. I'll point to Gulwani et al. again for a recent overview of the field:
Program Synthesis, Sumit Gulwani, Oleksandr Polozov, Rishabh Singh
Other details noted in the article about "program synthesis" (e.g. the poor performance on complex problems) also only apply to neural program synthesis.
For example, the following paper includes an experiment learning a program that identifies a palindrome (this is the hard part in the length-of-a-not-palindrome example in the article) together with other similarly complex programs (i.e. ones that are best expressed recursively):
Learning Higher-Order Programs without Meta-Interpretive Learning, Stanisław J. Purgał, David M. Cerna, Cezary Kaliszyk
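To make concrete what "best expressed recursively" means here, this is roughly the kind of target program such systems are asked to learn; a toy illustration of the task, not code from either paper:

```python
def is_palindrome(s: str) -> bool:
    # Recursive formulation: a string is a palindrome if its two ends match
    # and the substring between them is itself a palindrome.
    if len(s) <= 1:
        return True
    return s[0] == s[-1] and is_palindrome(s[1:-1])

assert is_palindrome("racecar")
assert not is_palindrome("abca")
```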
Is that a fair reading of my comment? I used "2%" as shorthand for "an increase of two points of accuracy score".
When the starting accuracy is 5%, saying that it was improved by 0.4 is not very elucidating. The point is that the approach is still very bad at generating correct programs and any improvement to the already abysmal state of the art is tiny and possibly insignificant.
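If I'm reading the numbers right, the two framings are related by simple arithmetic. A quick sketch with the figures quoted in this thread (treating the 5% score as the baseline and the 7% score as the improved result is an assumption on my part):

```python
baseline = 5.0                              # accuracy score, in points
improved = 7.0                              # accuracy score after the gain

absolute_gain = improved - baseline         # 2.0 points ("2%")
relative_gain = absolute_gain / baseline    # 0.4, i.e. a 40% relative increase

print(absolute_gain, relative_gain)         # 2.0 0.4
```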
It's like the old joke: the US president challenges the General Secretary of the USSR to a 100m race. The US president finishes first. The next day, Pravda runs the headline: "100m race: General Secretary finishes second. American President finishes next to last."
> The point is that the approach is still very bad at generating correct programs
It is the current state-of-the-art approach; nothing better has been developed according to that benchmark. What exactly are you complaining about, and what are you proposing? Unpublish paper and results?
That's not right. The approach described in the article and the linked paper is not "state of the art". It beats one benchmark which has only been used to compare one kind of system: neural program synthesis systems (Codex, AlphaCode, GPT-j and the one in the article, CodeRL). It almost completely fails to acknowledge the wider field of program synthesis that has existed long before that benchmark. The benchmark is ad hoc and arbitrary and it tells us nothing about the general capability of the compared systems, except that they are still very bad at that particular benchmark.
I linked the Gulwani report above because it is a good, recent introduction to the field of program synthesis research, which is still dominated by non-neural approaches, and for very good reasons. I linked to another paper that shows an example of a system of the non-neural kind that is common in program synthesis. I pointed out that, regardless of the one and only benchmark on which the LLM-based systems listed in the article above have been tested, the approach of code generation by LLM is primitive compared to the sophisticated search and testing strategies developed by the wider program synthesis community over the years. Again, see Gulwani et al. for examples.
The proposed approach is also primitive compared to existing, earlier neural program synthesis approaches, i.e. program synthesis approaches that do use a neural network, but not an LLM, for example DreamCoder [https://arxiv.org/abs/2006.08381] or everything from Dawn Song's group [https://sunblaze-ucb.github.io/program-synthesis/index.html] (none of which I'm affiliated with in any way, btw).
If you're looking for the state of the art in program synthesis, look elsewhere than LLM-based systems like the one in the article above. Even if you're looking for the state of the art in neural program synthesis, look elsewhere. What is described in the article above is a first, faltering step in a direction that may still yield some good results in the future. Or it may not. But it's nothing impressive for the time being.
>> Unpublish paper and results?
Gulwani et al. is a technical report, but it's a staple reference in the field and an easy introduction for outsiders. As far as I can tell, neither AlphaCode nor CodeRL (the Salesforce model described in the article) has been the subject of published work. The article above is linking to an arXiv preprint.
> It beats one benchmark which has only been used to compare one kind of system: neural program synthesis systems (Codex, AlphaCode, GPT-j and the one in the article, CodeRL).
Yes, that's what they call "state of the art": winning one benchmark is enough to call a result SOTA.
> The benchmark is ad hoc and arbitrary and it tells us nothing about the general capability of the compared systems, except that they are still very bad at that particular benchmark.
So, which benchmark is better in your opinion, why, and which systems demonstrate strong results on it?
> Gulwani et al. is a technical report, but it's a staple reference in the field and an easy introduction for outsiders
Sorry, I don't understand why you keep referring to that report. It looks to be 5 years old, and thus outdated. What value does it have, in your opinion, exactly?