Is that a fair reading of my comment? I used "2%" as shorthand for "an increase of two points of accuracy".
When the starting accuracy is 5%, saying that it improved by 0.4 is not very illuminating. The point is that the approach is still very bad at generating correct programs, and any improvement to the already abysmal state of the art is tiny and possibly insignificant.
It's like the old joke: the US president challenges the General Secretary of the USSR to a 100m race. The US president finishes first. The next day, Pravda runs the headline: "100m race: General Secretary finishes second; American president finishes next to last."
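To make the arithmetic concrete (illustrative numbers only; the exact benchmark figures are in the paper), here is the difference between an absolute gain in percentage points and a relative improvement, as a quick Python sketch:

    # Percentage points vs. relative improvement, with illustrative numbers
    # (the exact benchmark figures are in the paper; these only show the shape).
    baseline = 5.0   # percent of benchmark problems solved by the previous best
    improved = 5.4   # percent solved after the claimed improvement

    absolute_gain = improved - baseline               # 0.4 percentage points
    relative_gain = 100 * absolute_gain / baseline    # 8.0% relative

    print(f"+{absolute_gain:.1f} points ({relative_gain:.1f}% relative); "
          f"{100 - improved:.1f}% of problems still unsolved")

Either way you slice it, roughly 95% of the problems still get no correct program.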
> The point is that the approach is still very bad at generating correct programs
It is the current state-of-the-art approach. Nothing better has been developed, according to that benchmark. What exactly are you complaining about, and what are you proposing? Unpublish the paper and results?
That's not right. The approach described in the article and the linked paper is
not "state of the art". It beats one benchmark which has only been used to
compare one kind of system: neural program synthesis systems (Codex, AlphaCode,
GPT-J, and the one in the article, CodeRL). It almost completely fails to
acknowledge the wider field of program synthesis that has existed long before
that benchmark. The benchmark is ad hoc and arbitrary and it tells us nothing
about the general capability of the compared systems, except that they are still
very bad at that particular benchmark.
I linked the Gulwani report above because it is a good, recent introduction to
the field of program synthesis research, which is still dominated by non-neural
approaches, and for very good reasons. I linked to another paper that shows an
example of a system of the non-neural kind that is common in program synthesis.
I pointed out that, regardless of the one and only benchmark on which the
LLM-based systems listed in the article above have been tested, the approach of
code generation by LLM is primitive compared to the sophisticated search and
testing strategies developed by the wider program synthesis community over the
years. Again, see Gulwani et al. for examples.
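To give a flavour of what I mean by search-and-test, here is a deliberately toy sketch of enumerative generate-and-test synthesis in Python. This is my own illustration, not code from any system mentioned here; real synthesizers of the kind Gulwani et al. survey add type-directed pruning, deduction, constraint solving and so on:

    from itertools import product

    # Toy enumerative synthesis: search a tiny expression grammar for a
    # one-argument integer function that matches input/output examples.
    # Deliberately naive: real systems prune by types, equivalence classes,
    # constraint solving, etc., instead of brute-force enumeration.
    LEAVES = ["x", "1", "2"]
    OPS = ["+", "-", "*"]

    def candidates(depth):
        """Enumerate expression strings up to the given nesting depth."""
        if depth == 0:
            yield from LEAVES
            return
        yield from candidates(depth - 1)  # shallower terms first
        for op, l, r in product(OPS, candidates(depth - 1), candidates(depth - 1)):
            yield f"({l} {op} {r})"

    def synthesize(examples, max_depth=2):
        """Return the first enumerated expression consistent with all examples."""
        for expr in candidates(max_depth):
            if all(eval(expr, {}, {"x": x}) == y for x, y in examples):
                return expr
        return None

    # Find f with f(1)=3, f(2)=5, f(3)=7; the search returns an expression
    # equivalent to 2*x + 1, e.g. "(x + (x + 1))".
    print(synthesize([(1, 3), (2, 5), (3, 7)]))

Even this brute-force loop tests every candidate against the examples before accepting it, which is more than plain LLM sampling does; the sophistication in the field is in making that search tractable.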
The proposed approach is also primitive compared to existing, earlier neural
program synthesis approaches, i.e. program synthesis approaches that do use a
neural network, but not an LLM, for example DreamCoder
[https://arxiv.org/abs/2006.08381] or everything from Dawn Song's group
[https://sunblaze-ucb.github.io/program-synthesis/index.html] - none of which
I'm affiliated with in any way, btw.
If you're looking for the state of the art in program synthesis, look elsewhere
than LLM-based systems like the one in the article above. Even if you're looking
for the state of the art in neural program synthesis, look elsewhere. What is
described in the article above is a first, faltering step in a direction that
may still yield some good results in the future. Or it may not. But it's nothing
impressive for the time being.
>> Unpublish the paper and results?
Gulwani et al. is a technical report, but it's a staple reference in the field
and an easy introduction for outsiders. As far as I can tell, neither
AlphaCode nor CodeRL (the Salesforce model described in the article) has been
the subject of published work. The article above links to an arXiv preprint.
> It beats one benchmark which has only been used to compare one kind of system: neural program synthesis systems (Codex, AlphaCode, GPT-J, and the one in the article, CodeRL).
Yes, that's what they call "state of the art": winning one benchmark is enough to call a result SOTA.
> The benchmark is ad hoc and arbitrary and it tells us nothing about the general capability of the compared systems, except that they are still very bad at that particular benchmark.
So, what benchmark is better in your opinion, why, and which systems demonstrate strong results there?
> Gulwani et al. is a technical report, but it's a staple reference in the field and an easy introduction for outsiders
Sorry, I don't understand why you keep referencing that report. It looks to be 5 years old and thus outdated. What value does it have, in your opinion, exactly?