The TL;DR is that this appears to beat both OpenAI Codex (12 billion params) and DeepMind's AlphaCode (1 billion params), despite having only 770M parameters.
There is also a paper (https://arxiv.org/abs/2207.01780) and the code is available: https://github.com/salesforce/CodeRL
So this seems to buck the trend of needing enormous models, by being smarter about taking into account the output from running the AI-generated program on test cases.
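In its simplest form, that idea amounts to sampling a batch of candidate programs and keeping only the ones that pass the problem's example tests. Here is a rough sketch of that filtering step, my own illustration with placeholder helpers rather than the paper's code; CodeRL additionally feeds the pass/fail signal back into training as a reward, which this sketch does not attempt:

```python
import os
import subprocess
import tempfile

# Hypothetical sketch only: `candidates` would come from sampling a code LM
# (not shown here), and the helper names below are placeholders.

def run_candidate(source: str, test_input: str, timeout: float = 5.0) -> str:
    """Run one candidate Python program on a test input and return its stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run(
            ["python", path], input=test_input,
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.stdout.strip()
    finally:
        os.unlink(path)

def filter_by_example_tests(candidates, example_tests):
    """Keep only candidates whose output matches every (input, output) example."""
    survivors = []
    for source in candidates:
        try:
            if all(run_candidate(source, i) == o.strip() for i, o in example_tests):
                survivors.append(source)
        except Exception:
            pass  # a timeout or any other exception counts as a failure
    return survivors
```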
I would rather say that there isn't a trend of needing enormous models, but rather a trend of observing that enormous models (and straightforward increases in size) work. We've already seen many other tricks applied to transformer models (essentially, most of "BERTology") demonstrate that being smarter about what the training takes into account can bring a big benefit, but that is all orthogonal to the size issue.
That trend would be bucked only if it turned out that 770M parameters are sufficient for this task, i.e. that scaling this 'being smarter' approach from 770M up to the 12B parameters of OpenAI Codex yields no further improvement. I am quite convinced it would be noticeably better than this; my expectation is that the eventual outcome of this technique is an OpenAI Codex v2 (or whatever it ends up being called) that beats this by combining the technique with other improvements to the training process and a size beyond 12B parameters.
> So this seems to buck the trend of needing enormous models, by being smarter about taking into account the output from running the AI-generated program on test cases.
I'd be curious whether it finds degenerate solutions, and what they look like, when the test cases aren't conceptually complete...
"AI-powered code editors"? Hmmm. The first I heard of this idea was about a week ago watching Peter Zeihan dismantle the techo-utopians at https://ark-invest.com/. It seems like one of those future things (like fusion and flying cars) that are always going to remain in the future. I'll stick with boring old vim!
I don’t know what any of those things are, but this idea is both years old and already implemented by several companies (e.g. MS Copilot). Bury your head in the sand based on the signaling of elites all you want. Or you could add the Copilot configuration to your .vimrc and see for yourself…
Thank you. I watched the video. Zeihan's arguments I mostly can't comment on, but it's intriguing how he thinks the supply gap of programmers will be fixed by Gen Z entering the software industry. I'm expecting AI to fill the gap. Yes, LLMs currently kind of suck at program synthesis. But the tech will get better. AGI may still be far off, Metaculus's 2030-centered predictions may be wrong. Scalable, human-level program synthesis may be far off.
But can it improve enough to reduce the demand for programming talent by, all else being equal, 20%? 25%? That seems doable within the decade.
When I see a title with "toward[s]" in it, the first thing I wonder about is: why "toward[s]"? Why aren't we there yet?
Scrolling down to "Performance Results" it's obvious that the results of this approach are creeping upwards at a snail's pace. 40% correct programs when the model is given 1000 "guesses" (one thousand) is the best result out of all reported results. It's the result for the simplest, "intro"-level, problems.
From there, results get progressively worse for harder categories, like harder problem sets or "1@k" measurements, where the first program generated by a model is evaluated. For "intro" problems that's about 7%.
NB: that's accuracy. Not error. 7% accuracy. 7% of the time the model got the coding problem right first try.
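For reference, the "N guesses" numbers in this literature are usually pass@k scores, typically estimated with the unbiased estimator from the Codex paper (Chen et al., 2021). A minimal sketch of that standard estimator follows, with illustrative inputs rather than the paper's raw sample counts:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k programs,
    drawn without replacement from n generated samples of which c passed the
    hidden tests, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers only: 200 samples per problem, 14 of which pass.
print(pass_at_k(200, 14, 1))  # 0.07, i.e. 7% at k = 1
```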
Despite the slow creep upwards, this is really bad for this kind of approach. The main problem of Large Language Models as code generators is that they can generate code, but they don't know if it satisfies a specification (in this case, given by a combination of natural language text and input/output examples). The proposed approach (using reinforcement learning to capture, and use, the signal from unit tests) is supposed to address that precise limitation, but the results show that it just doesn't do very well at all. Not even when the system is allowed 1000 (one thousand) "guesses".
Why not? Because if you start with a code generator that generates almost (not quite) random code, you end up with almost (not always) incorrect programs. After that it doesn't matter how well you "filter" for correct results; you're unlikely to have any in your generated set. The solution? Generate fewer programs that are more likely to be correct. How to do that with an LLM? Who knows.
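To put some back-of-the-envelope numbers on that argument (the per-sample probability below is purely hypothetical, not a figure from the paper): if each sampled program is correct with probability p, a pool of k samples is only expected to contain k*p correct programs, and no filter can recover a correct program that was never generated.

```python
# Back-of-the-envelope only: p is a made-up per-sample success probability.
p = 1e-4    # hypothetical chance that a single sampled program is correct
k = 1000    # number of "guesses" the system is allowed to filter from

expected_correct = k * p            # expected correct programs in the candidate pool
at_least_one = 1 - (1 - p) ** k     # chance the pool contains any correct program,
                                    # assuming independent samples

print(expected_correct)             # 0.1
print(round(at_least_one, 3))       # 0.095
```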
And a small correction to the article:
>> Recent advances in deep learning, such as pretrained language models (LMs), have led to remarkable progress in program synthesis.
These "advances" in deep learning have led to remarkable progress in neural program synthesis. The approach proposed here, combining a generator with a tester, is certainly not a "remarkable advance" for the broader program synthesis field. Rather it is a primitive approach that has been tried and refined continuously since the 1970's or so. The program synthesis field has gone a lot further than what's described in the article. I'll point to Gulwani et al. again for a recent overview of the field:
Program Synthesis, Sumit Gulwani, Oleksandr Polozov, Rishabh Singh
Other details noted in the article about "program synthesis" (e.g. the poor performance on complex problems) also only apply to neural program synthesis.
For example, the following paper includes an experiment learning a program that identifies a palindrome (this is the hard part in the length-of-a-not-palindrome example in the article) together with other similarly complex programs (i.e. ones that are best expressed recursively):
Learning Higher-Order Programs without Meta-Interpretive Learning, Stanisław J. Purgał, David M. Cerna, Cezary Kaliszyk
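To make concrete what "best expressed recursively" means here, this is roughly the kind of target program such systems are asked to learn; a toy illustration of the task, not code from either paper:

```python
def is_palindrome(s: str) -> bool:
    # Recursive formulation: a string is a palindrome if its two ends match
    # and the substring between them is itself a palindrome.
    if len(s) <= 1:
        return True
    return s[0] == s[-1] and is_palindrome(s[1:-1])

assert is_palindrome("racecar")
assert not is_palindrome("abca")
```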
Is that a fair reading of my comment? I used "2%" as shorthand for "an increase of two points of accuracy score".
When the starting accuracy is 5%, saying that it was improved by 0.4 is not very elucidating. The point is that the approach is still very bad at generating correct programs and any improvement to the already abysmal state of the art is tiny and possibly insignificant.
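If I'm reading the numbers right, the two framings are related by simple arithmetic. A quick sketch with the figures quoted in this thread (treating the 5% score as the baseline and the 7% score as the improved result is an assumption on my part):

```python
baseline = 5.0                              # accuracy score, in points
improved = 7.0                              # accuracy score after the gain

absolute_gain = improved - baseline         # 2.0 points ("2%")
relative_gain = absolute_gain / baseline    # 0.4, i.e. a 40% relative increase

print(absolute_gain, relative_gain)         # 2.0 0.4
```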
It's like the old joke: the US president challenges the General Secretary of the USSR to a 100m race. The US president finishes first. The next day, Pravda runs the headline: "100m race: General Secretary finishes second. American President finishes next to last."
> The point is that the approach is still very bad at generating correct programs
It is the current state-of-the-art approach; nothing better has been developed according to that benchmark. What exactly are you complaining about, and what are you proposing? Unpublish paper and results?
That's not right. The approach described in the article and the linked paper is not "state of the art". It beats one benchmark which has only been used to compare one kind of system: neural program synthesis systems (Codex, AlphaCode, GPT-j and the one in the article, CodeRL). It almost completely fails to acknowledge the wider field of program synthesis that has existed long before that benchmark. The benchmark is ad hoc and arbitrary and it tells us nothing about the general capability of the compared systems, except that they are still very bad at that particular benchmark.
I linked the Gulwani report above because it is a good, recent introduction to the field of program synthesis research, which is still dominated by non-neural approaches, and for very good reasons. I linked to another paper that shows an example of a system of the non-neural kind that is common in program synthesis. I pointed out that, regardless of the one and only benchmark on which the LLM-based systems listed in the article above have been tested, the approach of code generation by LLM is primitive compared to the sophisticated search and testing strategies developed by the wider program synthesis community over the years. Again, see Gulwani et al. for examples.
The proposed approach is also primitive compared to existing, earlier neural program synthesis approaches, i.e. program synthesis approaches that do use a neural network, but not an LLM, for example DreamCoder [https://arxiv.org/abs/2006.08381] or everything from Dawn Song's group [https://sunblaze-ucb.github.io/program-synthesis/index.html] (none of which I'm affiliated with in any way, btw).
If you're looking for the state of the art in program synthesis, look elsewhere than LLM-based systems like the one in the article above. Even if you're looking for the state of the art in neural program synthesis, look elsewhere. What is described in the article above is a first, faltering step in a direction that may still yield some good results in the future. Or it may not. But it's nothing impressive for the time being.
>> Unpublish paper and results?
Gulwani et al. is a technical report, but it's a staple reference in the field and an easy introduction for outsiders. As far as I can tell, neither AlphaCode nor CodeRL (the Salesforce model described in the article) has been the subject of published work. The article above is linking to an arXiv preprint.
> It beats one benchmark which has only been used to compare one kind of system: neural program synthesis systems (Codex, AlphaCode, GPT-j and the one in the article, CodeRL).
Yes, that's what they call "state of the art": winning one benchmark is enough to call a result SOTA.
> The benchmark is ad hoc and arbitrary and it tells us nothing about the general capability of the compared systems, except that they are still very bad at that particular benchmark.
So, which benchmark is better in your opinion, why, and which systems demonstrate strong results on it?
> Gulwani et al. is a technical report, but it's a staple reference in the field and an easy introduction for outsiders
Sorry, I don't understand why you keep referring to that report. It looks to be 5 years old, and thus outdated. What value does it have, in your opinion, exactly?