Competitive Programming with AlphaCode (deepmind.com)
678 points by yigitdemirag on Feb 2, 2022 | 397 comments



It never ceases to amaze me what you can do with these transformer models. They created millions of potential solutions for each problem, used the examples provided with each problem to filter out 99% of incorrect solutions, and then applied some more heuristics and the 10 available submissions to try to find a solution.
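
Roughly, that pipeline amounts to something like the sketch below (my own illustration, not AlphaCode's actual code; run_program and the example pairs are stand-ins for whatever the real harness does):

    import subprocess, sys

    def run_program(source, stdin, timeout=2.0):
        # Run one candidate Python program on one test input and capture its stdout.
        result = subprocess.run([sys.executable, "-c", source],
                                input=stdin, capture_output=True, text=True,
                                timeout=timeout)
        return result.stdout.strip()

    def passes_examples(source, examples):
        # Keep a candidate only if it reproduces every provided example output.
        try:
            return all(run_program(source, inp) == out.strip()
                       for inp, out in examples)
        except Exception:  # timeout or any other failure to run
            return False

    def select_submissions(candidates, examples, budget=10):
        # Whittle millions of samples down to at most `budget` submissions.
        survivors = [c for c in candidates if passes_examples(c, examples)]
        return survivors[:budget]  # the real system also clusters and ranks these

The hard part, of course, is the sampling step that produces `candidates` in the first place.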

All these approaches just seem like brute force: let's just throw our transformer at this problem and see if we can get anything useful out of it.

Whatever it is, you can't deny that these unsupervised models learn some semantic representations, but we have no clue at all what that actually is and how these models learn it. But I'm also very sceptical that you can actually get anywhere close to human (expert) capability in any sufficiently complex domain by using this approach.


>> filter out 99% of incorrect solutions

And next year they can filter out 99.99%. And the year after that, 99.9999%. So literally, an exponentially greater number of monkey/typewriting units. (An AI produced Shakespeare play coming soon).

>> we have no clue at all what that actually is and how these model learn

This is why I'm super cool-to-cold about the AI/deep learning classes being sold to young people who would otherwise be learning fundamental programming skills. It appears to me like trying to teach someone to ride a horse before they understand what skin, bones, muscles, animals, and horses are.

>>get anywhere close to human (expert) capability in any sufficiently complex domain

You can get close enough to scalp a lot of billionaires, but at the end of the day it's always going to be human coders banging our heads against management, where they ask for shit they can't visualize and it's our job to visualize how their employees/customers will use it. Yes it involves domain specific knowledge, but it also requires, er, having eyeballs and fingers, and understanding how a biological organism uses a silicon-based device. That's kind of the ultimate DS knowledge, after all. Now, lots of coders just copy-pasta a front end, but after all the hooplah here I'd be extremely surprised if in ten years an AI has caught up to your basic web mill in Indonesia when it comes to building a decent website.


Surely if your discriminator gets orders of magnitude better like you're describing, we could train the transformer GAN style, and reduce the dependence on generating so many examples to throw away.


I like that you drew a connection with monkeys on typewriters.


Another way to frame it is that these models still perform very poorly at the task they're designed to do. Imagine if a real programmer needed to write a solution a hundred times before they were able to achieve (average) performance. You'd probably wonder if it was just blind luck that got them to the solution. You'd also fire them. What these models are very good at doing is plagiarizing content, so part of me wonders if they aren't just copying previous solutions with slight adjustments.


> Imagine if real programmer needed to write a solution a hundred times

To be fair, a lot of creative work requires plenty of trial and error. And since no problem is solved entirely from scratch, all things considered, the most immediate contributors to your result, and you yourself, might have iterated through dozens of possibilities.

My advantage as a human is I can often tell you why I am eliminating this branch of the search space. The catch is my reasoning can be flawed. But we do ok.

> just copying previous solutions with slight adjustments.

It's not just doing that; Copilot can do a workable job providing suggestions for an invented DSL. A better analogy than autocomplete is inpainting missing or corrupted details based on a surrounding context. Except instead of a painting, we are probabilistically filling in patterns common in solutions to leetcode-style problems. Novelty beyond slight adjustments comes in when constraints are insufficient to pin down a problem to a known combination of concepts. The intelligence of the model is then how appropriate its best guesses are.

The limitations of GPT-3 Codex and AlphaCode seem to be that they're relatively weak at selection and that they require problem spaces with enough data to distill a sketch of the domain and learn how to inpaint well within it. Leetcode-style puzzles are constructed to be soluble in a reasonable number of lines, are not open ended and have a trick to them. One can complain that while we're closer to real-world utility, we're still restricted to the closed worlds of verbose APIs, games and puzzles.

While lots of commenters seem concerned about jobs, I look forward to having the dataset oliphaunt and ship computer from A Fire Upon the Deep someday soon.


>> While lots of commenters seem concerned about jobs, I look forward to having the dataset oliphaunt and ship computer from A Fire Upon the Deep someday soon.

I think this is more worthy of debate than anything about DSL models or current limits to problem spaces.

I'm not concerned about my job, but I am concerned about a world where corporate money starts shifting toward managing AIs as beasts rather than coding clever solutions. I'm concerned about it because (1) It has always been possible in theory to invent an infinite number of solutions and narrow them down, if you have the processing power, to those that "work", but, this leaves us in a position where we don't understand the code we're running (as a society) or how to fix it (as individuals). And (2) because learning to manage an elephant, as a beast, is utterly different from learning to build an elephant, and it will lead to a dumbing-down of people entering the trade. In turn, they'll become more reliant on things just working the way they're expected to work. This is a very negative cycle for humanity as a whole.

Given the thing you're looking forward to, it's only about 30 years before no one can write code at all; worse, no one will know how to fix a broken machine. I don't think that's the thing we should advocate for.


"Understanding the code" might not be that big of a deal as you might think -- we have this problem today already. A talented coder might leave the company and the employer may not be able to hire a replacement who's as good. Now they have to deal with some magic in the codebase. I don't hear people giving advice not to hire smart people.

At least with AI, you can (presumably) replicate the results if you re-run everything from the same state.

There's also a very interesting paragraph in the paper (I'm in no position to judge whether it's valid or not) that touches on this subject, but with a positive twist:

Interpretability. One major advantage of code generation models is that code itself is relatively interpretable. Understanding the behavior of neural networks is challenging, but the code that code generation models output is human readable and can be analysed by traditional methods (and is therefore easier to trust). Proving a sorting algorithm is correct is usually easier than proving a network will sort numbers correctly in all cases. Interpretability makes code generation safer for real-world environments and for fairer machine learning. We can examine code written by a human-readable code generation system for bias, and understand the decisions it makes.


> Now they have to deal with some magic in the codebase. I don't hear people giving advice not to hire smart people.

People do advise against hiring people who write incomprehensible code.

Yeah every now and then you run across some genius with sloppy code style and you have to confine them to a module that you'll mark "you're not expected to understand this" when they leave because they're really that much of a genius, but usually the smart people are smart enough to write readable code.


>The limitations to GPT3 codex and AlphaCode seems to be they're relatively weak at selection

This really does seem like the key here--the knowledge apparently is all in the language model, we just haven't found the best ways to extract that knowledge in a consistent and coherent manner. Right now it's just: generate a bunch of examples and cherry pick the good ones.


The way you put it sounds a lot like the P ?= NP problem:

If it's easy to tell whether a solution is valid, is it also easy to generate it?
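
A toy way to see the asymmetry, using subset-sum (my example, nothing to do with AlphaCode): checking a proposed answer is a cheap, linear-time test, while the naive way to produce one searches an exponential space.

    from itertools import combinations

    def verify(numbers, subset, target):
        # "Easy to tell whether a solution is valid": a cheap check.
        return set(subset) <= set(numbers) and sum(subset) == target

    def generate(numbers, target):
        # "Is it also easy to generate it?": naively, try up to 2^n subsets.
        for r in range(len(numbers) + 1):
            for subset in combinations(numbers, r):
                if sum(subset) == target:
                    return list(subset)
        return None

    nums = [3, 34, 4, 12, 5, 2]
    sol = generate(nums, 9)           # finds e.g. [4, 5]
    print(sol, verify(nums, sol, 9))  # [4, 5] True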


How do you know the inner workings of the mind don't operate in a similar manner? How many different solutions to the problem are constructed within your mind before the correct one 'just arrives'?


I suspect there is some similarity between language models and the structure of language in the mind, but there's a whole lot more going on behind the scenes in the brain than simple runtime statistical model output. Intentionality, planning, narrativity, memory formation, object permanence... Language models are exciting and interesting because apparently they can do abstract symbolic manipulation and produce coherent text, but I wouldn't call AGI solved quite yet.


I was really impressed with a lot of the GPT-3 stuff I had seen people showing, so I gave it a spin myself. I was surprised by how repetitive it seemed to be: it would write new sentences, but it would repeat the same concepts among similar prompts. I wish I'd saved the examples; it was like when a chat bot gets in a loop, but GPT-3 varied the sentence structure. I think that if you look closely at transformer models' outputs you can expect the same sort of thing. It's like in high school when people would copy homework but use different wording.

I also think that, generally in ML and DL, the overarching progress gets hyped, but in the background there are murmurs about the limitations in the research community. That's how we end up with people in 2012 saying FSD is a couple of years away, but in 2022 we know we aren't even close yet. We tend to oversell how capable these systems are.


I'd be shocked if people pitching startups and research grants etc. all started saying "yeah, this stuff isn't going to work for a couple of decades in any kind of sustainable manner", even if these types of unknowable unknowns were known.


They specifically stated that they tested it on 10 challenges that were newer than their training data, so it couldn’t just be plagiarizing content.


What do you think then is the difference between going from 50th to 99.9th percentile in their other domains? Is there something materially different between Go, protein folding, or coding? (I don’t know the answer, just curious if anyone else does)


>> What do you think then is the difference between going from 50th to 99.9th percentile in their other domains? Is there something materially different between Go, protein folding, or coding?

Yes, it's the size of the search space for each problem. The search space for arbitrary programs in a language with Universal Turing Machine expressivity is infinite. Even worse, for any programming problem there are an infinite number of candidate programs that may or may not solve it and that differ in only minute ways from each other.

For Go and protein structure prediction from sequences the search space is finite, although obviously not small. So there is a huge difference in the complexity of the problems right there.

Btw, I note yet again that AlphaCode performs abysmally badly on the formal benchmark included in the arxiv preprint (see Section 5.4, and table 10). That makes sense because AlphaCode is a very dumb generate-and-test, brute-force search approach that doesn't even try to be smart and tries to make up for the lack of intelligence with an awesome amount of computational resources. Most work in program synthesis is also basically a search through the space of programs, but people in the field have come up with sophisticated techniques to avoid having to search an infinite number of programs- and to avoid having to generate millions of program candidates, like DeepMind actually brags about:

At evaluation time, we create a massive amount of C++ and Python programs for each problem, orders of magnitude larger than previous work.

They say that as if generating "orders of magnitude more" programs than previous work is a good thing, but it's not. It means their system is extremely bad at generating correct programs. It is orders of magnitude worse than earlier systems, in fact.

(The arxiv paper linked from the article quantifies this "massive" amount as "millions"; see Section 4.4).


Well with respect to Go the fundamental difference afaict is that you can apply self-supervised learning, which is an incredibly powerful approach (But note e.g. that even this approach wasn't successful in "solving" Starcraft). Unfortunately it's extremely difficult to frame real-world problems in that setting. I don't know anything about protein-folding and don't know what Deepmind uses to try to solve that problem, so I cannot comment on that.


> this approach wasn't successful in "solving" Starcraft

Why do you say that? As I understand it, AlphaStar beat pros consistently, including a not widely reported showmatch against Serral when he was BlizzCon champ.


Two possible reasons.

1. First, though I am not sure of this (i.e. this should be verified), I heard that the team working on AlphaStar initially tried to create a Starcraft AI entirely through "self-play," but this was not successful. (Intuitively, in a real-time game, there are too many bad options too early on that even with a LOT of time to learn, if your approach is too "random" you will quickly enter an unwinnable position and not learn anything useful.) As a result, they replaced this approach with an approach which incorporated learning from human games.

2. "including a not widely reported showmatch against Serral when he was BlizzCon champ." is a mischaracterization. It was not a "showmatch," rather there was a setup at Blizzcon where anyone could sit down and play against AlphaStar, and Serral at some point sat down to play AlphaStar there. He went 0-4 vs AlphaStar's protoss and zerg, and 1-0 vs its Terran. However, not only was he not using his own keyboard and mouse, but he could not use any custom hotkeys. If you do not play Starcraft it may not be obvious just how large of a difference this could make. BTW, when Serral played (perhaps an earlier iteration of) AlphaStar's terran on the SC2 ladder, he demolished it.

I remember when seeing the final report, I was a bit disappointed. It seemed like they cut the project off at a strange point, before AlphaStar was clearly better than humans. I feel that if they had continued they could have gotten to that point, but now we will never know.


"It seemed like they cut the project off at a strange point, before AlphaStar was clearly better than humans. I feel that if they had continued they could have gotten to that point" What if that's why they cut it off..


Apologies, I don't quite follow your reasoning.


I think the GP means that the AlphaStar team stopped working on the project because they felt it was reaching a dead end and unlikely to produce further results, or at least other ventures might have been more promising.

I think that's most likely the case too, otherwise why would they give up?


I guess I feel that there is a big discontinuous jump between "not clearly better than humans" and "clearly better than humans," where the latter is much, much more significant than the former. It seems like going on a hike and stopping before the summit.


> but he could not use any custom hotkeys.

IIRC you could and Serral did set his own custom keybindings on the machine. The main difference was different keyboard and mouse.


I looked into this again and the hotkey situation seems more unclear than I suggested. You could not log into your Battle.net account, so it would have been somewhat time consuming to change all of your settings manually. If I had to guess, I might wager that Serral changed some of the more important ones manually but not the others, but this is just conjecture and maybe he changed all of them. I don't know if anyone but Serral would know this, however.

In any case, Serral said this, which you can take as you will:

https://twitter.com/ENCE_Serral/status/1192023800961019904

"It was okay, I doubt i would lose too many games with a proper setup. I think the 6.3-6.4 mmr is pretty accurate, so not bad at all but nothing special at the same time."

On the one hand, surely it doesn't seem surprising that the player who lost, the human, would say the above, and so one may be skeptical of how unbiased Serral's assessment is. On the other hand, I would say that Serral is among the more frank and level-headed players I've seen in the various videogames I've followed, so I wouldn't be too hasty to write off his assessment for this reason.


Not once humans adapted to it afaik. AlphaStar got to top grandmaster level and then that was it, as people found ways to beat it. Now, it may be that the team considered the project complete and stopped training it. But technically - as it stands - Starcraft is still the one game where humans beat AI.


No, the version which played on ladder was much weaker than the later version which played against pros and was at BlizzCon -- the later version was at professional level of play.


There were numerous issues. The first (somewhat mitigated lately) was the extremely large number of actions per minute and (most importantly) the extremely fast reaction speed.

Another big issue is that the bot communicated with the game via a custom API, not via images and clicks. Details of this API are unknown - like how invisible units were handled - but it operated at a much higher level than what a human has to work with (pixels).

If you look at the games, the bot wasn't clever (which was a hope), just fast and precise. And some people far from the top were able to beat it convincingly.

And now the project is gone, even before people had a chance to really play against the bot and find more weaknesses.


That’s not entirely correct, as I know of at least one approach to neural program synthesis that employs self supervised learning.

https://arxiv.org/abs/2006.08381

It’s a slightly different, easier problem: generating programs based on example outputs, rather than natural language specifications.
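
A toy illustration of that setting (a made-up DSL of my own, not DreamCoder's): enumerate compositions of primitives and return the first one consistent with the given input-output pairs.

    from itertools import product

    # Tiny DSL of unary integer functions; search compositions up to depth 3.
    PRIMS = {
        "inc":    lambda x: x + 1,
        "dec":    lambda x: x - 1,
        "double": lambda x: x * 2,
        "square": lambda x: x * x,
    }

    def synthesize(examples, max_depth=3):
        # Return the first composition of primitives that agrees with all I/O pairs.
        for depth in range(1, max_depth + 1):
            for names in product(PRIMS, repeat=depth):
                def run(x, names=names):
                    for name in reversed(names):  # compose right-to-left
                        x = PRIMS[name](x)
                    return x
                if all(run(i) == o for i, o in examples):
                    return " . ".join(names)
        return None

    # f(x) = 2x + 1 is specified only by examples:
    print(synthesize([(1, 3), (2, 5), (5, 11)]))  # -> "inc . double"

DreamCoder's contribution is in making this kind of search tractable and in growing the library of primitives, but the problem statement is exactly this: examples in, program out.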


The difference is that DreamCoder has a hand-crafted PCFG [1] that is used to generate programs, rather than a large language model. So the difference is in how programs are generated.

________

[1] The structure of the PCFG is hand-crafted, but the weights are trained during learning in a cycle alternating with neural net training. It's pretty cool actually, though a bit over-engineered if you ask me.
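
For anyone who hasn't seen one, a PCFG generator in miniature looks something like this (a made-up arithmetic DSL, nothing like DreamCoder's actual library; the point is that the rule structure is fixed by hand while the weights are what gets re-estimated):

    import random

    # Toy probabilistic grammar: each nonterminal maps to (weight, expansion) pairs.
    PCFG = {
        "EXPR": [(0.4, ["TERM"]),
                 (0.3, ["(", "EXPR", "OP", "EXPR", ")"]),
                 (0.3, ["FUNC", "(", "EXPR", ")"])],
        "TERM": [(0.5, ["x"]), (0.5, ["1"])],
        "OP":   [(0.5, ["+"]), (0.5, ["*"])],
        "FUNC": [(0.5, ["abs"]), (0.5, ["neg"])],
    }

    def sample(symbol="EXPR"):
        # Recursively expand a nonterminal into a flat list of tokens.
        if symbol not in PCFG:
            return [symbol]  # terminal token
        weights, expansions = zip(*PCFG[symbol])
        chosen = random.choices(expansions, weights=weights, k=1)[0]
        return [tok for part in chosen for tok in sample(part)]

    for _ in range(3):
        print(" ".join(sample()))  # e.g. "( x + abs ( 1 ) )"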


Right, I think it’s a bit crazy not to use a grammar as part of the generation process when you have one. My guess is that constraining LLM generation with a grammar would make it way more efficient. But that’s more complicated than just throwing GPT3 at all of Github.

Also, my understanding is that Dreamcoder does some fancy PL theory stuff to factorize blocks of code with identical behavior into functions. Honestly I think that’s the key advance in the paper, more than the wake-sleep algorithm they focus on.

Anyways the point was more that self supervised learning is quite applicable to learning to program. I think the downside is that the model learns its own weird, non-idiomatic conventions, rather than copying github.


I guess you're right. The sleep-wake cycle is like a kind of roundabout and overcomplicated EM process. I've read the paper carefully but theirs is a complicated approach and I'm not sure what its contributions are exactly. I guess I should read it again.

Yes, it's possible to apply self-supervised learning to program synthesis, because it's possible to generate programs. It's possible to generate _infinite_ sets of programs. The problem is that if you make a generator with Universal Turing Machine expressivity, you're left with an intractable search over an infinite search space. And if you don't generate an infinite set of programs, then you're left with an incomplete search over a space that may not include your target program. In the latter case you need to make sure that your generator can generate the programs you're looking for, which is possible, but it limits the approach to only generating certain kinds of programs. In the end, it's the easiest thing to create a generator for programs that you already know how to write - and no others. How useful that is is an open question. So far no artificial system has ever made an algorithmic contribution, to my knowledge, in the sense of coming up with a new algorithm for a problem for which we don't have good algorithms, or coming up with an algorithm for a problem we can't solve at all.

My perception is influenced by my studies, of course, but for me, a more promising approach than the generate-and-test approach exemplified by DreamCoder and AlphaCode etc. is Inductive Programming, which is to say, program synthesis from input-output examples only, without examples of _programs_ (the AlphaCode paper says that is an easier setting, but I strongly disagree). Instead of generating a set of candidate programs and trying to find a program that agrees with the I/O examples, you have an inference procedure that generates _only_ the programs that agree with the I/O examples. In that case you don't need to hand-craft or learn a generator. But you do need to impose an inductive bias on the inference procedure that restricts the hypothesis language, i.e. the form of the programs that can be learned. And then you're back to worrying about infinite vs. incomplete search spaces. But there may be ways around that, ways not available to purely search-based systems.

Anyway program synthesis is a tough nut to crack and I don't think that language models can do the job, just like that. The work described in the article above, despite all the fanfare about "reasoning" and "critical thinking" is only preliminary and its results are not all that impressive. At least not yet. We shall see. After all, DeepMind has deep resources and they may yet surprise me.


Could you provide some papers/link related to Inductive Logic Programming? I'd like to look at other techniques in this space.


My pleasure!

- First, some more recent work, mostly overviews.

1. The following is the most recent overview of the field I'm aware of:

Inductive logic programming at 30 (Cropper et al, 2020)

https://www.doc.ic.ac.uk/~shm/Papers/ilp30.pdf

2. And a slightly shorter version of the same paper that summarises new trends:

Turning 30: New Ideas in Inductive Logic Programming (Cropper et al, 2020)

https://www.ijcai.org/Proceedings/2020/0673.pdf

3. Here's a short introduction to the relatively new ILP direction of learning Answer Set Programming:

Inductive Logic Programming in Answer Set Programming (Corapi et al, 2011)

https://link.springer.com/chapter/10.1007/978-3-642-31951-8_...

4. This is an overview of Meta-Interpretive Learning (MIL), a new approach to ILP that overcomes many difficulties of earlier approaches (Full disclosure: my own work is on MIL, though not the article linked):

Meta-Interpretive Learning: achievements and challenges (Stephen Muggleton, 2017)

https://www.doc.ic.ac.uk/~shm/Papers/rulemlabs.pdf

5. And this is a (short vesion) of a paper on δILP, a neural-net based ILP system:

Learning Explanatory Rules from Noisy Data (Evans and Grefenstette, 2018)

https://www.ijcai.org/Proceedings/2018/0792.pdf

- Next, some earlier work that is still relevant:

6. This is the inaugural paper of the field, that first named it (a little heavy reading though):

Inductive Logic Programming (Stephen Muggleton, 1990)

https://www.doc.ic.ac.uk/~shm/Papers/ilp.pdf

7. Here's an early paper on predicate invention, an important technique in ILP (only recently fully realised via MIL):

Predicate Invention in ILP - an Overview (Irene Stahl, 1993)

https://link.springer.com/chapter/10.1007%2F3-540-56602-3_14...

8. And an early overview of learning recursion (and performing predicate invention) that also lists several early ILP systems:

Inductive synthesis of recursive logic programs:achievements and prospects (Flener and Yilmaz, 1999)

https://core.ac.uk/download/pdf/82810434.pdf

That should be enough to get you started. I recommend reading in the order I linked to the various articles. I tried to give links to documents that I know can be read for free.

Unfortunately most of the material on ILP is either in scholarly articles, or, where there are textbooks, they tend to be older. That sounds bad, but there has been much new work recently with several new approaches.

Let me know if you're looking for more specific information. See my signature for contact details- I'm happy to answer emails about ILP :)


It should be emphasised that inductive programming is not tied to logic programming, and works for every other programming paradigm as well, e.g. functional programming [1, 2]. We could also do IP for imperative programming, although, as far as I am aware, nobody has done this.

[1] Feser et al's Lambda-Learner https://www.cs.utexas.edu/~swarat/pubs/pldi15.pdf

[2] S. Katayama's MagicHaskeller http://nautilus.cs.miyazaki-u.ac.jp/~skata/MagicHaskeller.ht...


That's absolutely true! But the OP asked about ILP in particular.

To be fair, logic and functional programming languages do have some advantages as target languages for Inductive Programming compared to imperative languages in that they have very simple syntax. For example, Prolog doesn't even have variable declarations. That's very convenient because the learning system only needs to learn the logic of the program, not the syntax of the language also. It's also much simpler to define language bias or program schemata etc constraints on the form of hypotheses in such languages, or even order programs by generality. For instance, Prolog has unification built-in and unification is used in ILP to order programs by generality (by testing for subsumption). All this machinery would have to be implemented from scratch in an imperative language.

Although the reason that logic and functional programming languages are given more weight in IP is probably for historical reasons, because Lisp and Prolog were, for a long time, "the languages of AI".

I'm trying to remember... I think there's been some IP work on imperative languages, maybe even Python. I'll need to check my notes.


> ILP to order programs by generality

Sorry, naive question: does ILP test candidate programs by increasing or decreasing generality?


Not naive at all! One common categorisation of ILP approaches is by whether they search for programs from the most to the least general (least general is more specific), or from the least to the most general. Some approaches do a little bit of both. Approaches that search from general to specific are known as "top-down" and approaches that search from specific to general are known as "bottom-up".

The "top" and "bottom" terms refer to a lattice of generality between programs, where generality is typically measured by subsumption or entailment etc. Subsumption in particular is a syntactic relation (that implies a semantic one, entailment) so "searching" a space of logic programs ordered by subsumption means in practice that the space of programs is constructed by generalising or specialising some starting program by means of syntactic transformation according to subsumption (e.g. a first order clause can be specialised by adding literals to it: P(x):- Q(x) subsumes P(x):- Q(x), R(x). The simplest intuition is to remember that by adding more conditions to a rule we make it harder to satisfy).

A more general program entails more logical atoms and ILP algorithms are typically trained on both positive and negative example atoms of a target program, so top-down approaches begin with an over-general program that entails all the positive examples and some or all of the negative examples and specialise that program until it entails only the positive examples. Bottom-up approaches start with an over-specialised program that entails none of the positive examples and generalise it until it entails all the positive examples.
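
If it helps to make "top-down" concrete, here is a deliberately propositional toy (real ILP searches first-order clauses ordered by subsumption, with unification, which this completely glosses over): start from the empty body, which covers everything, and add conditions until no negative example is covered.

    # Examples are attribute dicts; a "clause body" is a set of (attribute, value) tests.
    positives = [{"shape": "square", "size": "big"},
                 {"shape": "square", "size": "small"}]
    negatives = [{"shape": "circle", "size": "big"},
                 {"shape": "triangle", "size": "small"}]

    def covers(body, example):
        return all(example.get(attr) == val for attr, val in body)

    def specialise(positives, negatives):
        # Top-down: the empty body is maximally general; each added test
        # makes the rule harder to satisfy, i.e. more specific.
        body = set()
        candidates = {(a, v) for ex in positives for a, v in ex.items()}
        while any(covers(body, n) for n in negatives):
            # Greedily pick the test that keeps all positives and excludes most negatives.
            best = max(candidates - body,
                       key=lambda t: (all(covers(body | {t}, p) for p in positives),
                                      sum(not covers(body | {t}, n) for n in negatives)))
            body.add(best)
        return body

    print(specialise(positives, negatives))  # {('shape', 'square')}

A bottom-up learner would run in the opposite direction, starting from a maximally specific description built from the examples and dropping conditions to generalise.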

The mathematics of generalisation are at the core of ILP theory and practice. It's what sets ILP apart from statistical machine learning which is based on the mathematics of optimisation.


Thank you so much!


That’s a big question but I’m tempted to answer it with a yes. A protein sequence contains a complete description of the structure of a protein but a coding question contains unknowns and the answers contain subjective variability.


We have a clue as to what it is (these are just functions at the end of the day) but don't know how the model's learned parameters relate to the problem domain. I saw a talk (maybe of Jeff Dean?) a while back that discussed creating models that could explain why certain features weighed more than others. Maybe with more approaches targeted towards understanding, these algorithms could start to seem less and less like a semantically opaque computational exercise, and more in line with how we humans think about things.


GitHub Copilot scares me every time I write code on my personal PC and get those auto-suggestions. I am happy we don't have it at work yet.

It is clear writing code will soon be a thing of the past; maybe it is a bad idea to train our children to code. Let's make sure we milk every penny before the party is over!


Maybe… maybe… tools like Copilot will allow us to work at a higher level of abstraction (like optimizing compilers have allowed us to do).

I say maybe because so far the code that Copilot has generated for me has been impressive for what it is, but riddled with obvious and subtle bugs. It’s like outsourcing my function implementations to a C-student undergraduate intern. I definitely wouldn’t use any of its code without close scrutiny.

AI will make some software engineering tasks more efficient and more accessible but human programmers are not going anywhere any time this side of the Singularity.


I sometimes read these and wonder if I need to retrain. At my age, I’ll struggle to get a job at a similar level in a new industry.

And then I remember that the thing I bring to the table is the ability to turn domain knowledge into code.

Being able to do competitive coding challenges is impressive, but a very large segment of software engineering is about eliciting what the squishy humans in management actually want, putting it into code, and discovering as quickly as possible that it’s not what they really wanted after all.

It’s going to take a sufficiently long time for AI to take over management that I don’t think oldies like me need to worry too much.


The thing is that we don't know. What I've also been seeing for a while (for at least a decade) is that whatever profession seemed to be in danger, whichever profession came out on top of (guess) lists like "these will be replaced by AI soon", each and every one of them thought it couldn't happen to them, and they all had (and continue to have) explanations, usually involving how that job needs human ingenuity. (Unlike all the others, of course :) )

Now, I completely agree with you that a significant part of our job is understanding and structuring the problem, but I'm not sure it can't be done in another way. We usually get taken in, when we think about what machines will be able to do, by assuming that just because we use intelligence (general/human intelligence) to solve a task, it must be a requirement. Think chess. Or even calculating (as in, with numbers). Or Go. Etc.

The funny thing is that we don't know, until someone does it. I've been thinking for a while that a lot of what I do could be done by a chat bot. Asking clarification questions. Of course, I do have a lot of background knowledge and that's how I can come up with those questions, but that knowledge is probably easy to acquire from the internet and then use it as training data. (Just like we have an awful lot of code available, we have a lot of problem descriptions, questions, comments and some requirement specifications/user guides.)

The hard part would probably be not what we have learned as a software developer, but the things we have learned while we were small kids and also the things that we have learned since, on the side. I.e. being a reasonable person. Understanding what people usually do and want. So the shared context. But I'm not sure it's needed that much.

So yeah, I can imagine a service that will talk to a user about what kind of app they want (first just simpler web sites, web shops, later more and more complicated ones) and then just show them "here is what it does and how it works". And then you can say what you'd like to be changed. The color or placement of a button (earlier versions) or even the association type between entities (oh, but a user can have multiple shipping addresses).


I think programmers are relatively "safe" from AI for the simple reason they are the ones who talk to AI.

The job of programmers is to have machines do stuff so that humans don't have to, and of course, they do it for themselves too. Scripts, libraries, compilers: they are just tools to avoid flipping bits by hand. If something like Copilot is not embraced by all programmers, it is because it is often less than helpful, and even then, some have adopted it. If we have a super-advanced AI that can have a high-level understanding of a problem and writes the app for you, then it is not much more than a super-compiler, and there will be programmers who tell the super-compiler what to do; think of it as a new, super high level programming language. The job will evolve, but there will always be someone who tells the computer what to do.

And if there is no one needed to tell the computer what to do, that's what some people call "the singularity". Programming, or its evolution will probably be the last technical job. Social jobs may continue further, simply because humans like humans because they are human. Maybe the oldest profession will also be the last profession.


What I was trying to convey is that I'm not sure at all that you'll need a programmer (i.e. someone who has the mindset and the skills of a person we call today as such) to talk to the AI. Because the AI may just be able to understand a sloppy description that the average user (or the average product owner) is able to communicate. And when/if not then it will be able to either ask clarification questions (like "what do you mean by account?") or just generate something and then let the user figure out if it's doing the right thing for them. If not, they can ask for changes or explain what they think was misunderstood.

And my (weak) conjecture is that we may not need an AGI/human-level AI for this. In which case we might still want to have some software written. But you're right, I'm also not sure that there will be a point where we still want software but have very intelligent machines. And while saying that programming will be the last technical job doesn't sound like a strong claim, I'd say it would probably be teachers :)

> The job will evolve, but there will always be someone who tells the computer what to do.

Which may very well be the users, if the machine is able to follow a conversation. Now the thing that may be the showstopper for now might exactly be this: that the machine should be able to hold a context for long enough (over multiple iterations of back and forth communication). As far as my limited knowledge goes, this is something that they have not yet figured out.

The "our kind will always be needed" is exactly the fallacy I was talking about and the one that the practitioners of every intellectual professions seem to have. They think they will be needed to interface between the machine (whether it's a legal or a medical system) and the client. Because they assume that the machine will not be able to communicate only to process the existing knowledge base.

But again, the whole field evolves through surprising leaps. Yep, Copilot is not insanely useful, but already amusing/frightening enough. It seems to pick up context from all over the code base. Sometimes it goes totally wrong, and generates gibberish (I mean generate non existent identifiers that make sense as English expressions but ones that don't exist anywhere in the code). But quite a few times it picks up the intent (the pattern/thought pattern) even if it is spread out over a file (or several ones).


I imagine I'll be editing this a bit, so I apologize if there are obvious typos left from any changes I make while I'm thinking. Sorry for the mini-essay. :)

Also, these points are not to be taken separately. They're part of a broader argument and should be treated as a unit.

1. Programming competitions are deliberately scoped down. Actual day-to-day work consists of meeting with stakeholders, conducting research, synthesizing that research with prior knowledge to form a plan, then executing. This work skips to the plan synthesis, relying on pattern-matching for the research component.

2. This current work, even if refined, would be insufficient to conduct daily programming work. This is just an extension of point 1; I acknowledge that you're talking about the future and a hypothetical better system.

3. The components required for your hypothetical programming bot are the components not covered by this work.

4. Context-aware/deep search tools are still very incomplete. There are some hints that better user-intent models are around the corner (i.e. companies like TikTok have built models that can adroitly assess users' intents/interests). I've seen no work on bringing those models to bear on something more nebulous like interpreting business needs. (But I also haven't been actively searching for them) Also, Google, who dumps a large amount of money into search every year, is among the best we have and it's definitely far from what we'd need for business-aware programming bots.

5. Conducting the research step in the programming process automatically will require better tools.

6. Conversational AI is still very incomplete. See Tay bot from Microsoft for examples of what goes wrong at scale. People, in general, are also not very aware of themselves during discussions and even very intelligent people get locked in a particular mindset that precludes further conversation. If a user tries fighting the bot by insisting that what they said should be sufficient (as they definitely do to other humans) that could pollute the bot's data and result in worse behavior.

7. Meeting with stakeholders part of the programming process automatically will also require better tools.

8. By points 5 & 7, critical domains still require more research. There is ongoing research in fields like Q&A, even some commercial attempts, but they're focused on mostly low-level problems ("construct an answer given this question and some small input")[0].

9. Advanced logical reasoning is advanced pattern matching + the ability to generate new reasoning objects on the fly.

10. Current systems are limited in the number of symbols they can manage effectively, or otherwise use lossy continuous approximations of meaning to side-step the symbol issue (it's a rough approximation of the truth, I think). See [1] for an up-to-date summary on this problem. Key phrase: binding problem neural networks

11. Current "reasoning" systems do not actually perform higher level reasoning. By points 9+10.

12. Given the rich history and high investment over time in these fields (points 4, 6, and 11), it is unlikely that there will be a sufficiently advanced solution within the next 15-40 years. These fields have been actively worked on for decades; the current influx of cash has accelerated only certain types of work: work that generates profit. Work on core problems has kept going at largely the same pace as usual because the core problems are hard -- extra large models can only take you so far, and they're not very useful without obnoxious amounts of compute that aren't easily replicated.

13. Given the long horizon in point 12, programmers will likely be required to continue to massage business inputs into a machine-usable format.

The horizon estimate in point 12 was a gut estimate and assumes that we continue working in parallel on all of the required subproblems, which is not guaranteed. The market is fickle and might lay off researchers in industry labs if they can't produce novel work quickly enough. With the erosion of tenure-track positions taking place in higher education (at least in the US), it's possible that progress might regress to below what it was before this recent AI boom period.

[0]: https://research.facebook.com/downloads/babi/ [1]: https://arxiv.org/pdf/2012.05208.pdf


> there will always be someone who tells the computer what to do

Until the computer starts telling people what to do


> Until the computer starts telling people what to do

My phone has me well trained. All it has to do is play a short message tone and I'll come running...


Uh... that already happens, and AI isn't even required.


That actually gave me chills. We're doomed!


"Maybe the oldest profession will also be the last profession."

-- GuB-42, Wednesday February 2, 2022


Problem is, to "talk with AI", most developers would need to 'retrain' (to use GP's word).

Writing and training a neural network is very different from writing a common program.


But it's not that. We're not talking about training narrowly intelligent ML systems for specific problems. You're right, that's a distinct skill. We're talking about a ML system that can write code based on some higher-than-now level of human input. What that level will/could be is what we're arguing about. Whether it has to be done by some kind of programmer-like person or whether it can be a more generic user/product-owner/product manager. I.e. someone who understands the problem domain but doesn't know too much about the solution domain/technology.

Those ML/AI systems will also have to be built, coded and trained, but that's a job for a very small set of people compared to the total number of end users (and the total number of developers on the market today). And, as the ML/AI field stands, it always seems to turn out that specialized algorithms that do what the ML layer cannot do get pretty quickly eliminated by the ML layer. So most solutions keep getting closer and closer to end-to-end.


Your logic is completely flawed. If there is a super AI with real intelligence that understands problems and codes it up for you, why wouldn't it be possible to go one step further and solve problems on its own? Why do you think that a human programmer has to feed a problem statement to the AI for it to work?


It is what I meant by "the singularity". AIs that are so intelligent that they don't need humans, including when it comes to building better AIs. The idea is that they get in a runaway self-improving cycle and what they do after that and the place of humanity is left to the imagination.

I don't believe in the singularity, but if we get to the point where AIs don't need human programmers anymore, things are going to get... interesting.


I sometimes get the feeling that all my coding is actually a class of mathematical transforms that I have no idea how to define but feel very strongly that it is definable and AI-able.

Well, it'd be a curious day when an AlphaGo moment hits coding. It would be funny if it happened at the same time as Fed rate increases and destabilizing world events this year (the path from median human to top human is shallow). Mass firing of a few million highly paid redundancies out of the blue? It would be quite a sight.

Or maybe it wouldn’t happen that way, but rather it would pave the way for a leaner set of startups that were built with the power to do the same thing at the same or better velocity with an order of magnitude or fewer people.


What professions are these? Chat bots didn't eliminate human CSRs. OCR didn't eliminate human data entry. Object detection hasn't eliminated human intelligence analysts. Machine translation hasn't eliminated human translators. Humans still make a living as professional Chess and Go players. Truck drivers were supposed to be on the chopping block a decade ago, yet they're more in demand now than ever. Human radiologists haven't gone anywhere. Even GPT-N hasn't eliminated human writers. Human transcriptionists haven't even been eliminated. We just have a lot more videos that automatically get shitty transcriptions instead of none at all now.


Have you tried out GitHub Copilot yet? I find it super interesting. It turns writing code into more of a loop: write some comments, let it write the code, review the code, realize what I actually need it to do, revise the comments, then tighten up the generated code.

Most surprisingly I can quickly tackle domains that require libraries I don't know because a combination of code generation and IDE hinting means I can write comments and pseudo code and the tool then provides at least a first pass best method to use.

Can't say if I write better code with Copilot but it's worth experiencing!


I've been playing with Copilot as well.

It's very good at handling boilerplate and making contextual suggestions.

I don't see it eating my cake, but it's definitely a very useful tool for saving time.


I think a good yardstick for this is something that is generative; so, for instance, can the system generate a good programming challenge question? This is still a no.


Although most good developers will likely keep their jobs for the foreseeable future, the relative importance, and payoff, of different skills might change.

Lower-level coding could become more and more automated, raising the values and wages of complementary skills such as requirements elicitation and understanding of business impact from technological decisions. [1]

Some of these, however, can be done by businesspeople who know how to think and express their ideas precisely, such that a neural model can turn them into a decent draft of code. (These days, many more youths learn to code before going into other fields. They have training for thinking precisely.) There can be fewer job opportunities for some groups of developers.

Thus, a hedge against possible job loss is still required. Owning substantial equity in a company/startup and other assets would be one good strategy.

[1] https://en.wikipedia.org/wiki/Complementary_good


AI might take over management quicker than you think. If the objective is to get the rocket into space, AI might know the requirements better than humans at this rate.


I agree with you on this, since we humans want things in a vague form, and it's still very hard for computers to infer insights from those ambiguous requirements. Not very easy to do that by taking derivatives of function compositions.


This is extremely impressive, but I do think it’s worth noting that these two things were provided:

- a very well defined problem. (One of the things I like about competitive programming and the like is just getting to implement a clearly articulated problem, not something I experience on most days.)
- existing test data.

This is definitely a great accomplishment, but I think those two features of competitive programming are notably different than my experience of daily programming. I don’t mean to suggest these will always be limitations of this kind of technology, though.


I don't think it's quite as impressive as you make it out to be. Median performance in a Codeforces programming competition is solving the easiest 1-2 problems out of 5-6 problems. Like all things programming the top 1% is much, much better than the median.

There's also the open problem of verifying correctness in solutions and providing some sort of flag when the model is not confident in its correctness. I give it another 5 years in the optimistic case before AlphaCode can reliably compete at the top 1% level.


This is technology that simply didn't exist in any form 2 years ago. For no amount of money could you buy a program that did what this one does. Having watched the growth of Transformer-based models for a couple of years now has really hammered home that as soon as we figure out how an AI can do X, X is no longer AI, or at least no longer impressive. How this happens is with comments like yours, and I'd really like to push back against it for once. Also, 5 years? Assuming that we have all of the future ahead of us, the thought that we only have 5 years left of being the top in programming competitions seems important and shouldn't be dismissed with "I don't think it's quite as impressive as you make it out to be."


I don't think that's what's happening. Let's talk about this case: programming. It's not that people are saying "an AI programming" isn't impressive or isn't AI, it's that when people say "an AI programming" they aren't talking about ridiculously controlled environments like in this case.

It's like self-driving cars. A car driving itself for the first time in a controlled environment, I'm sure, was an impressive feat, and it wouldn't be inaccurate to call it a self-driving car. However, that's not what we're all waiting for when we talk about the arrival of self-driving cars.


And if AI programming were limited to completely artificial contexts you would have a point, though I'd still be concerned. We live in a world, however, where programmers routinely call on the powers of an AI to complete their real code and get real value out of it. This is based on the same technology that brought us this particular win, so clearly this technology is useful outside "ridiculously controlled environments."


That's not significantly different from how programming has worked for the last 40 years, though. We slowly push certain types of decisions and tasks down into the tools we use, and what's left over is what we call 'programming'. It's cool, no doubt, but as long as companies need to hire 'programmers', then it's not the huge thing we're all looking out over the horizon waiting for.


Programmers do set up completely artificial contexts so AI can work.

None of the self-driving systems were set up by giving the AI access to sensors, a car, and the driver's handbook and saying "well, you figure it out from there". The general trend is: solve this greatly simplified problem, then this more complex one, up to dealing with the real world.


By AI programming I mean the AI doing programming, not programming the AI. Though soon enough the first will be doing the second and that's where the loop really closes...


>> This is technology that simply didn't exist in any form 2 years ago.

A few examples of neural program synthesis from at least 2 years ago:

https://sunblaze-ucb.github.io/program-synthesis/index.html

Another example from June 2020:

DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning

https://arxiv.org/abs/2006.08381

RobustFill, from 2017:

RobustFill: Neural Program Learning under Noisy I/O

https://www.microsoft.com/en-us/research/wp-content/uploads/...

I could go on.

And those are only examples from neural program synthesis. Program synthesis, in general, is a field that goes way back. I'd suggest as usual not making big proclamations about its state of the art without being acquainted with the literature. Because if you don't know what others have done every announcement by DeepMind, OpenAI et al seems like a huge advance... when it really isn't.


Of course program synthesis has been a thing for years, I remember some excellent papers out of MSR 10 years ago. But which of those could read a prompt and build the program from the prompt? Setting up a whole bunch of constraints and having your optimizer spit out a program that fulfills them is program synthesis and is super interesting, but not at all what I think of when I'm told we can make the computer program for us. For instance, RobustFill takes its optimization criteria from a bundle of pre-completed inputs and outputs of how people want the program to behave instead of having the problem described in natural language and creating the solution program.


Program synthesis from natural language specifications has existed for many years, also. It's not my specialty (neither am I particularly interested in it), but here's a paper I found from 2017, with a quick search:

https://www.semanticscholar.org/paper/Program-Synthesis-from...

AlphaCode is not particularly good at it, either. In the arxiv preprint, besides the subjective and pretty meaningless "evaluation" against human coders, it's also tested on a formal program synthesis benchmark, the APPS dataset. The best performing AlphaCode variant reported in the arxiv preprint solves 25% of the "introductory" APPS tasks (the least challenging ones). All AlphaCode variants tested solve less than 10% of the "interview" and "competition" (intermediary and advanced) tasks. These more objective results are not reported in the article above, I think for obvious reasons (because they are extremely poor).

So it's not doing anything radically new and it's not doing it particularlly well either. Please be better informed before propagating hype.

Edit: really, from a technical point of view, AlphaCode is a brute-force, generate-and-test approach to program synthesis that was state-of-the-art 40 years ago. It's just a big generator that spams programs hoping it will hit a good one. I have no idea who came up with this. Oriol Vinyals is the last author and I've seen enough of that guy's work to know he knows better than bet on such a primitive, even backwards approach. I'm really shocked that this is DeepMind work.


I've also worked in the area and published research in it a couple of years ago. I almost worked for a company focused on neural program synthesis, but they did a large pivot a couple of years ago and gave up, moving to much simpler problems, having decided that current research was not good enough to do well on problems like this. I had a paper accepted that translated between toy programming languages about 3 years ago. Toy here means roughly the complexity of simply typed lambda calculus, for a language I wrote in a couple hundred lines.

This and Copilot are much better than the level of problems being tackled a couple of years ago.


I don't agree. I don't know your work, but the approach in AlphaCode and Copilot (or the Codex model behind it) is a step backwards for neural program synthesis, and for program synthesis in general. The idea is to train a large language model to generate code. The trained language model has no way to direct its generation towards code that satisfies a specification. It can only complete source code from some initial prompt. The code generated is remarkably grammatical (in the context of a programming language grammar), which is certainly an advance for language representation, but some kind of additional mechanism is required to ensure that the generated code is relevant to the specification. In Copilot, that mechanism is the user's eyeballs. In AlphaCode the mechanism is to test against the few I/O examples of the programming problems. In either case, the whole thing is just hit-and-miss. The language model generates mostly garbage - DeepMind brags that AlphaCode generates "orders of magnitude" more code than previous work, but that's just to say that it's even more random, and its generation misses the target even more than previous work! Even filtering on I/O examples is not enough to control the excessive over-generation and so additional measures are needed (clustering and ranking of programs etc).
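
To be concrete about what that clustering and ranking amounts to, here is a rough sketch of the general idea (my reading of the approach, with run_candidate and the extra inputs as placeholders rather than anything from the paper's actual code):

    from collections import defaultdict

    def behaviour_signature(candidate, extra_inputs, run_candidate):
        # Fingerprint a candidate by its outputs on inputs beyond the given examples.
        outs = []
        for inp in extra_inputs:
            try:
                outs.append(run_candidate(candidate, inp))
            except Exception:
                outs.append("<error>")
        return tuple(outs)

    def pick_submissions(filtered_candidates, extra_inputs, run_candidate, budget=10):
        # Group behaviourally-identical candidates and submit one per large cluster.
        clusters = defaultdict(list)
        for cand in filtered_candidates:
            sig = behaviour_signature(cand, extra_inputs, run_candidate)
            clusters[sig].append(cand)
        # Larger clusters first: agreement between independent samples is used as
        # a weak proxy for correctness once the example tests stop discriminating.
        ranked = sorted(clusters.values(), key=len, reverse=True)
        return [cluster[0] for cluster in ranked[:budget]]

Which is to say: when the generator over-generates this badly, you end up needing statistics over the samples themselves just to decide what to submit.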

All this could be done 40 years ago with a dumb DSL, or perhaps a more sophisticated system like a PCFG for programs, with a verifier bolted on [1]. It's nothing new. What's new is that it's done with a large language model trained with a Transformer, which is all the rage these days, and of course that it's done at the scale and with the amount of processing power available to DeepMind. Which I'm going to assume you didn't have back when you published your work.

Honestly, this is just an archaic, regressive approach, that can only work because of very big computers and very big datasets.

___________

[1] Which, btw, is straightforward to do "by hand" and is something that people do all the time. In the AlphaCode work, the large language model simply replaces a hand-crafted program generator with a lot of data, but there is no reason to do that. This is the quintessential problem where a machine learning solution is not necessary because a hand-crafted solution is available, and easier to control.


I agree the approach itself is quite brute-force heavy. There is a lot of information that could be used, and that I'd hope is helpful, like the grammar of the language, traces/simulations of behavior, etc.

But when I say AlphaCode/Copilot is good I'm referring solely to the difficulty of the problems they are doing. There are many papers, including mine, that worked on simpler problems and used more structure to attack them.

I expect follow-up work will incorporate other knowledge more heavily into the model. My work was mainly on restricting tree-like models to only make predictions following the grammar of the language. Does that parallelize/fit well with a transformer? Unsure, but I would expect some language information/genuine problem constraints to be incorporated in future work.

Honestly I am pretty surprised how far pure brute force with a large model is going. I would not have expected GPT-3-level language modeling from more scale on a transformer and little else.


Well, I'm not surprised, because I know that large language models can learn smooth approximations of natural language. They can generate very grammatical natural English, so why not grammatical source code, which is easier? Of course, once you have a generator for code, finding a program that satisfies a specification is just a matter of searching, assuming the generated code includes such a program. But it seems like that isn't really the case with AlphaCode, because its performance is very poor.

I have to say that usually I'm the one speaking out against an over-reliance on machine learning benchmarks and against expecting a new approach to beat the state of the art before it can be taken seriously, but this is not a new approach, and that's the problem I have here. It's nothing new, repackaged as something new and sold as something it isn't ("reasoning" and "critical thinking" and other nonsense like that).

I agree that future work must get smarter, and incorporate some better inductive biases (knowledge, something). Or perhaps it's a matter of searching more intelligently because given they can generate millions of programs I'd have thought they'd be able to find more programs that approximate a solution.


Has someone tried classical program synthesis techniques on competitive programming problems? I wonder what would have been possible with tech from more than 2 years ago.


I don't know if anyone has tried it, but it's not a very objective evaluation. We have no good measure of the coding ability of the "median level competitor" so doing better or worse than that, doesn't really tell us anything useful about the coding capability of an automated system.

So my hunch is that it probably hasn't been done, or hasn't been done often, because the program synthesis community would recognise it's pointless.

What you really want to look at is formal program synthesis benchmarks and how systems like AlphaCode do on them (hint: not so good).


You don't think it's impressive, yet you surmise that a computer program could compete at a level of the top 1% of all humans in five years?

That's wildly overstating the promise of this technology, and I'd be very surprised if the authors of this wouldn't agree.


Agree. If an AI could code within the top 1%, every single person whose career touches code would have their lives completely upended. If that’s only 5 years out…ooof.


Top 1% competitive programming level means that it can start solving research problems, problem difficulty and creativity needed for problems goes up exponentially for harder problems and programming contests have lead to research papers before. It would be cool if we got there in 5 years but I doubt it. But if we got there it would revolutionize so many things in society.


I do kinda wonder if it'd lead to as good results if you just did a standard "matches the most terms the most times" search against all of github.

I have a suspicion it would - kinda like Stack Overflow, problems/solutions are not that different "in the small". It'd have almost certainly given us the fast inverse square root trick verbatim, like GitHub's AI is doing routinely.


Can't rule it out, but if AlphaCode gets to top 1% in five years, that's when it can basically do algorithms research. We can ask it to come up with new algorithms for all the famous problems and then just have to try and understand its solutions :O


100% agree. Someone (who?) had to take time and write the detailed requirements. In real jobs you rarely get good tickets with well-defined expectations; it's one of a developer's most important jobs to transform a fuzzy requirement into a good ticket.

(Side note: I find that many people skip this step, and go straight from fuzzy-requirement-only-discussed-on-zoom-with-Bob to code; open a pull request without much context or comments; and then a code reviewer is supposed to review it properly without really knowing what problem is actually being solved, and whether the code is solving a proper problem at all).


So what happens when OpenAI releases TicketFixer 0.8 which synthesizes everything from transcripts of your meetings to the comments to the JIRA ticket to the existing codebase and spits out better tickets to feed into the programming side?


Yup, I hope that'll happen. Then engineering would just end up being done at a higher level of abstraction, closer to what designers do with wireframes and mockups.

Kind of the opposite of the way graphic design has evolved. Instead of getting more involved in the process and, in many cases, becoming front-end developers, it'll become more abstract where humans make the decisions and reason about what to include/exclude, how it'll flow, etc.

Even TicketFixer wouldn't be able to do more than offer a handful of possible solutions to design-type issues.


Yeah, we need our TicketFixer to also include the No_Bob 0.2 plugin that figures out that a decent percentage of the time whatever "Bob" is asking for in that meeting is not what "Bob" thinks he is asking for or should be asking for and can squash those tickets. Without that we're gonna somehow end up with spreadsheets in everything.


Haha, yeah, there's that, but there are also things like "adding a dark mode." There are a dozen ways to accomplish that kind of thing, and every company's solution will diverge when you get down to the details.


Take my money.


Is the next step in the evolution of programming having the programmer become the specifier?

Fuzzy business requirements -> programmer specifies and writes tests -> AI codes


That's all we've ever been since we invented software.

First we specified the exact flow of the bits with punch cards.

Then we got assembly and we specified the machine instructions.

Then we got higher level languages and we specified how the memory was to be managed and what data to store where.

Now we have object oriented languages that allow us to work with domain models, and functional languages that allow us to work with data structures and algorithms.

The next level may be writing business rules and specifying how services talk to each other, who knows, but it will be no different than it is now, just at a higher level.


If it's anything like my job:

while(1) { Fuzzy business requirements -> programmer specifies and writes tests -> AI codes }


Maybe the problem transformation will be both the beginning _and_ end of the developer's role.


But it's easy to create an AI conversation that will refine the problem.


> One of the things I like about competitive programming and the like is just getting to implement a clearly articulated problem

English versions of Codeforces problems may be well-defined but they are often very badly articulated and easy to misunderstand as a human reader. I still can't understand how they got AI to be able to generate plausible solutions from these problem statements.


They used the tests. The specification being very approximate is fine, because they had a prebuilt way to "check" if their result was good.


Wait what, they cheated to get this result? Only pretests are available to competitors before submitting. If they had access to the full test suite, then they had a HUGE advantage over actual competitors, and this result is way less impressive than claimed. Can you provide a source for this claim? I don't want to read the full paper.


If AlphaCode had access to full test suite then the result is not surprising at all.

You can fit anything given enough parameters.

https://fermatslibrary.com/s/drawing-an-elephant-with-four-c...


I think there will always be limitations.

Software is, ultimately, always about humans. Software is always there to serve a human need. And the "intelligence" that designs software will always, at some level, need to be intelligence that understands the human mind, with all its knowledge, needs, and intricacies. There are no shortcuts to this.

So, I think AI as a replacement for software development professionals, that's currently more like a pipe dream. I think AI will give us powerful new tools, but I do not think it will replace, or even reduce, the need for software development professionals. In total it might even increase the need for software development professionals, because it adds another level to the development stack. Another level of abstraction, and another level of complexity that needs to be understood.


This seems to have a narrower scope than GitHub Copilot. It generates more lines of code for a more holistic problem, vs. GitHub Copilot, which works as a "more advanced autocomplete" in code editors. Sure, Copilot can synthesize full functions and classes, but for me it's the most useful when it suggests another test case's title or writes repetitive code like this.foo = foo; this.bar = bar etc...

Having used Copilot I can assure you that this technology won't replace you as a programmer but it will make your job easier by doing things that programmers don't like to do as much like writing tests and comments.


Having used Copilot for a while, I am quite certain it will replace me as a programmer.

It appears to me that when it comes to language models, intelligence = experience * context. Where experience is the amount of what's encoded in the model, and context is the prompt. And the biggest limitation on Copilot currently is context. It behaves as an "advanced autocomplete" because all it has to go on is what regular autocomplete sees, e.g. the last few characters and lines of code.

So, you can write a function name called createUserInDB() and it will attempt to complete it for you. But how does it know what DB technology you're using? Or what your user record looks like? It doesn't, and so you typically end up with a "generic" looking function using the most common DB tech and naming conventions for your language of choice.
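
For example, you might get back something like this generic sketch (purely hypothetical, not actual Copilot output; it just guesses sqlite and guesses the column names):

  import sqlite3

  def create_user_in_db(name, email, db_path="app.db"):
      # hypothetical "generic" completion: assumes a sqlite DB and a users
      # table with name/email columns, because it has no real context
      conn = sqlite3.connect(db_path)
      conn.execute(
          "INSERT INTO users (name, email) VALUES (?, ?)",
          (name, email),
      )
      conn.commit()
      conn.close()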

But now imagine a future version of Copilot that is automatically provided with a lot more context. It also gets fed a list of your dependencies, from which it can derive which DB library you're using. It gets any locatable SQL schema file, so it can determine the columns in the user table. It gets the text of the Jira ticket, so it can determine the requirements.

As a programmer a great deal of time is spent checking these different sources and synthesising them in your head into an approach, which you then code. But they are all just text, of one form or another, and language models can work with them just as easily, and much faster, than you can.

And once the ML coding train gets running, it'll only get faster. Sooner or later GitHub will have a "Copilot bot" that can automatically make a stab at fixing issues, which you then approve, reject, or fix. And as thousands of these issues pile up, the training set will get bigger, and the model will get better. Sooner or later it'll be possible to create a repo, start filing issues, and rely on the bot to implement everything.


Copilot is cool and all.

I didn't find that reading largely correct but still often wrong code was a good experience for me, or that it added any efficiency.

It does do a very good job of intelligently synthesizing boilerplate for you, but whether it's Copilot or this AlphaCode, they still don't understand the coding fundamentals, in the causal sense of how one instruction impacts the space of program states.

Still, those are exciting technologies, but again, there is a big "if" around whether such a machine learning model will happen at all.


I'm skeptical it'll replace programmers, as in no more human programmers, but agree in the sense 100% human programmers -> 50%, 25%, 10% human programmers + computers doing most of the writing of actual code.

I see it continuing to evolve and becoming a far superior auto-complete with full context, but, short of actual general AI, there will always be a step that takes a high-level description of a problem and turns it into something a computer can implement.

So while it will make the remaining programmers MUCH more productive, thereby reducing the needed number of programmers, I can't see it driving that number to zero.


It will probably change the types of things a programmer does, and what it looks like to be a programmer. The nitty gritty of code writing will probably get more and more automated. But the architecture of the code, and establishing and selecting its purpose in the larger scheme of a business, will probably be more what programmers do. Essentially, they might just become managers for automated code writers, similar to the military's idea of future fighter pilots relating to autonomous fighters/drones as described in this article:

https://www.newyorker.com/magazine/2022/01/24/the-rise-of-ai...

Maybe. It might never get to that level though.


Yup, I think that's it exactly. I just described this in another comment as a reverse of the evolution that graphic design has undergone in bringing them into programming front-ends.

I can't wait to see how far we're able to go down that path.


I have a feeling this is the correct read in terms of progression. But I'm skeptical if it'll ever be able to synthesize a program entirely. I imagine that in the future we'll have some sort of computer language more like written language that will be used by some sort of AI to generate software to meet certain demands, but might need some manual connections when requirements are hazy or needs a more human touch in the UI/UX


> But I'm skeptical if it'll ever be able to synthesize a program entirely.

Emotional skepticism carries a lot more weight in worlds where AI isn't constantly doing things that are meant to be infeasible, like placing in the 54th percentile in a competitive programming competition.

People need to remember that AlexNet is 10 years old. At no point in this span have neural networks stopped solving things they weren't meant to be able to solve.


I feel like you're taking that sentence a bit too literally. I read it as "I'm skeptical that AI will ever be able to take a vague human description from a product manager/etc. and solve it without an engineer-type person in the loop." The issue is humans don't know what they want and realistically programs require a lot of iteration to get right, no amount of AI can solve that.

I agree with you; it seems obvious to me that once you get to a well-specified solution a computer will be able to create entire programs that solve user requirements. And that they'll start small, but expand to larger and more complex solutions over time in the same way that no-code tools have done.


Google Ambiguity.


repetitive code like this.foo = foo; this.bar = bar etc...

This sort of boilerplate code is best solved by the programming language, either via better built-in syntax or macros. Using an advanced machine learning model to generate this code is both error-prone and a big source of noise and code bloat. This is not an issue that will go away with better tooling; it will only get worse.
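
For instance (Python here, but many languages have an equivalent), a dataclass removes exactly this kind of ceremony at the language level:

  from dataclasses import dataclass

  @dataclass
  class User:
      # __init__, __repr__ and __eq__ are generated; no self.name = name lines
      name: str
      email: str
      active: bool = True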


I don't think I agree. Most people spend more time reading than writing code so programming languages should be optimized to be easier to read whereas tooling should be made to simplify writing code. New syntax or macros sounds like it would make the language harder to read. I agree that an advanced machine learning model for generating boilerplate code isn't the right approach but I also don't think we should extend languages for this. Tooling like code generators and linters are a good middle ground.


New syntax or macros sounds like it would make the language harder to read.

Often the opposite is true. For example Java records are far easier to read and understand than the pages of boilerplate that they replace.


That sounds like an issue with how Java was designed. There are plenty of languages that solve Java's boilerplate problems without adding new syntax for records.


If you’ll review my original comment, I never said new syntax. I said better syntax. If your language design leads to a lot of boilerplate in idiomatic use then it needs to be better. Adding new syntax is just putting a bandaid on the problem.


FYI+IMO: Both Ruby and Scala have excellent ways to reduce these issues that occur at the language level, and make it easier to both read and write. I don't know either way if that means you should extend languages to handle it, but at least it's definitively possible to write the language that way from the beginning.

Otherwise yup, agree with you; ML for problematic boilerplate isn't the right approach, but other code generators and linters are really good and get you most of the way there.


it is a very similar argument to the one for powerful IDEs and underwhelming languages. to be fair, it’s not necessarily fruitless - e.g. with smalltalk. i fail to see the analogous smalltalk-style empowerment of language using AI but perhaps something is there.

anyway. programming is automation; automation of programming is abstraction. using AI to write your code is just a bad abstraction - we are used to them


I feel like you are very defensive here and I want to be sure we take time to recognize this as a real accomplishment.

Seriously though, I do doubt I can be fully replaced by a robot any time soon, but it may be the case that soon enough I can make high-level written descriptions of programs and hand them off to an AI to do most of the work. This wouldn't completely replace me, but it could make developers 50x more productive. The question is how elastic the market is...can the market grow in step with our increase in productivity?

Also, please remember that as with anything, within 5 years we should see vast improvements to this AI. I think it will be an important thing to watch.


Yesterday, I spent several hours figuring out if the business requirement for "within the next 3 days" meant 3 calendar days or 72 hours from now. Then about 10 minutes actually writing the code. Everyone thought my efforts were very valuable.


100%. What makes us what we are is the mindset (in this case, this kind of "attention to detail"); that didn't change with (first) compilers, (then) scripting languages, or (future?) AI-assisted programming.

PS - Lawyers aren't even as detail-oriented as we are, it's surprising.


Really?

Maybe that's true in general, because making a living as a lawyer depends far less on attention to detail being a core skill than making a living as a programmer does. Still, I wonder if that also holds at the high levels of the profession. I get the impression that at the FAANG-level, lawyers would compare pretty favorably to programmers in detail orientation. In particular, patent and contract law.

That said, it's just my general impression of what lawyers get up to.

...Hmm, thinking about the contract law thing a bit more. Yeah, I do believe you are right. Lawyers aren't writing nearly as many extremely detail-oriented texts as programmers are on a day-to-day basis. Their jobs are much more around finding, reading, and understanding those things and building stories around them.


The GPT family has already shown more than 50x productivity increase by being able to solve not one, but hundreds and perhaps thousands of tasks on the same model. We used to need much more data, and the model would be more fragile, and finding the right architecture would be a problem. Now we plug a transformer with a handful of samples and it works.

I just hope LMs will prove to be just as useful in software development as they are in their own field.


> but it could make developers 50x more productive

More likely it will translate the abstraction level by some vector of 50 elements.


If you make developers 50x more efficient, won't you need 50x fewer developers?


>If you make developers 50x more efficient, won't you need 50x fewer developers?

Developers today are 50X more efficient than when they had to input machine code on punched tape, yet the number of developers needed today is far larger than it was in those times.



But think how large of a jobs program that would have been.

Hundreds of people manually writing assembly and paid middle class wages. Not a compiler in sight.

In the years leading up to the singularity I’d expect to see a lot of Graeberian “Bullshit Jobs”.

Everyone knows they’re BS but as a society we allow them because we aren’t willing to implement socialism or UBI.


There's no reason to believe that we'll need another 50x more developers, though.


There isn't? I feel like there's still a ton of places software hasn't even touched and not because it doesn't make sense, but because no one's gotten to it. It's not the most profitable thing people could write software for.


Even if not, the original claim was that we may see a 50X decrease and I personally don't think that is likely, pre-Singularity anyway :)


Greater efficiency leads to greater consumption unless demand is saturated. Given software’s ability to uncover more problems that are solvable by software, we’re more likely to build 50x more software.


This happened with the introduction of power tools to set building in Hollywood back in the day - literally this same question.

People just built bigger sets, and smaller productions became financially feasible. Ended up creating demand, not reducing it.


Not necessarily. Demand may be much higher than available supply right now. Tech companies will continue to compete, requiring spending on developers to remain competitive. Software is unlike manufacturing, in that the output is a service, not a widget. Worker productivity in general has not decreased the demand for full work weeks, despite projections in the early 20th century to the contrary. Of course, it is possible that fewer developers would be needed, but I don't think it's likely, yet.


To me it's not about its current capabilities. It's the trajectory. This tech wasn't even a thing 2 years ago. There's billions being poured into it and every time someone uses these tools there's more free training data.


The big question seems to be whether par with professional programmers is a matter of increasing training set and flop size, or whether different model or multi-model architectures are required.

It does look like we've entered an era where programmers who don't use AI assistants will be disadvantaged, and that this era has an expiration date.


Relevant blogpost on codeforces.com (the competitive programming site used): https://codeforces.com/blog/entry/99566

Apparently the bot would have a rating of 1300. Although the elo rating between sites is not comparable, for some perspective, Mark Zuckerberg had a rating of ~1k when he was in college on TopCoder: https://www.topcoder.com/members/mzuckerberg


The median rating is not descriptive of median ability, because a large number of Codeforces competitors only do one or a few competitions. A very small number of competitors hone their skills over multiple competitions. If we were to restrict our sample to competitors with more than 20 competitions, the median rating would be much higher than 1300. It's amazing that Alphacode achieved a 1300 rating, but compared to humans who actually practice competitive coding, this is a low rating.

To clarify, this is a HUGE leap in AI and computing in general. I don't mean to play it down.


>> To clarify, this is a HUGE leap in AI and computing in general. I don't mean to play it down.

Sorry, but it's nothing of the sort. The approach is primitive, obsolete, and its results are very poor.

I've posted this three times already but the arxiv preprint includes an evaluation against a formal benchmark dataset, APPS. On that more objective measure of performance, the best performing variant of AlphaCode tested, solved 25% of the easiest tasks ("introductory") and less than 10% of the intermediary ("interview") and advanced ("competition") tasks.

What's more, the approach that AlphaCode takes to program generation is primitive. It generates millions of candidate programs and then it "filters" them by running them against input-output examples of the target programs taken from the problem descriptions. The filtering still leaves thousands of candidate programs (because there are very few I/O examples and the almost random generation can generate too many programs that pass the tests, but still don't solve the problem) so there's an additional step of clustering applied to pare this down to 10 programs that are finally submitted. Overall, that's a brute-force, almost random approach that is ignoring entire decades of program synthesis work.
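
Concretely, the filtering and clustering step amounts to something like this (my own toy paraphrase, not DeepMind's code; the real system generates the extra probe inputs with a separate model and works at vastly larger scale):

  def filter_candidates(candidates, examples):
      # keep only programs that reproduce the few example I/O pairs
      return [p for p in candidates if all(p(x) == y for x, y in examples)]

  def pick_submissions(candidates, probe_inputs, k=10):
      # group the survivors by their behaviour on extra inputs, then submit
      # one representative from each of the k largest clusters
      clusters = {}
      for p in candidates:
          signature = tuple(p(x) for x in probe_inputs)
          clusters.setdefault(signature, []).append(p)
      biggest_first = sorted(clusters.values(), key=len, reverse=True)
      return [group[0] for group in biggest_first[:k]]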

To make an analogy, it's as if DeepMind had just published an article boasting of its invention of a new sorting algorithm... bubblesort.


The APPS benchmark was for a “small 1B parameter model”, fine-tuned “without using clustering, tags, ratings, value conditioning, or prediction”.

> Overall, that's a brute-force, almost random approach that is ignoring entire decades of program synthesis work.

You don't get answers to these questions by random search. Not even close. I have looked at non-neural program synthesis papers. It is not remotely competitive.


Yes, apparently they couldn't use their full approach because "of missing information in the dataset". That points to a further limitation of the approach: it works for Codeforces problems but not for APPS problems (so it's very purpose-specific).

Btw, APPS is not much of a benchmark. It evaluates code generation according to how close it resembles code written by humans. That's standard fare for text generation benchmarks, like evaluating machine translation on some arbitrary set of human translations. There are no good benchmarks for text generation (and there are no good metrics either).

But the comparison against the average competitor on Codeforces is even more meaningless because we have no way to know what is the true coding ability of that average competitor.


> Btw, APPS is not much of a benchmark. It evaluates code generation according to how close it resembles code written by humans.

No, the metric used in this paper was the percentage of questions it could solve against the hidden tests.

> That points to a further limitation of the approach: it works for Codeforces problems but not for APPS problems (so it's very purpose-specific).

This does not matter typically since you'd just pretrain on the data that works. However, “[t]he CodeContests training set has a non-empty intersection with the APPS test set, and therefore CodeContests cannot be used during training when evaluating on the APPS benchmark.” This is purely an evaluation issue; leakage doesn't matter so much in production.


>> No, the metric used in this paper was the percentage of questions it could solve against the hidden tests.

Right, that's my mistake. The APPS dataset has natural language specifications and test cases for evaluation. It actually includes Codeforces problems.

The quote in the second part of your comment is an excuse. If a large language model can complete a code generation task, that's because it's seen an example of the code it's asked to generate before. Any claims to the contrary need very strong evidence to support them and there's typically no such thing in papers like the AlphaCode one.


Your comment is an excuse, not mine: “ignore how good the advertised model is, because a different, much smaller version without all the techniques merely does pretty well on an extremely tough problem set.”

> Any claims to the contrary need very strong evidence to support them and there's typically no such thing in papers like the AlphaCode one.

This is the opposite of how burden of proof works. You are the one making a claim with certainty based off of guesswork, not me. And the paper does actually have a section on this, and finds copying isn't pervasive outside of utility snippets and functions, which also occur in human solutions. It's a weak objection anyway; just the task of translating english prose into the general algorithm you want to apply is already an impressive feat.


>> “ignore how good the advertised model is, because a different, much smaller version without all the techniques merely does pretty well on an extremely tough problem set.”

Where is that quote from? Why are you quoting it? Am I supposed to reply to it?

>> You are the one making a claim with certainty based off of guesswork, not me.

I'm not talking about you. I'm talking about the paper, the team behind it and work on large language models trained on online data, in general.

The paper indeed makes a vague claim of "guarding" against data leakage by a "strict temporal split" which means they ensured that the validation and test data used for fine-tuning was not available to the model. That of course doesn't mean much. What matters is if the data on which the model was trained included programs like the ones the model was asked to generate. Clearly, it did, otherwise the model would not have been able to generate any programs that could be used as solutions to the test problems.

And I think you have the rule on the burden of proof a bit wrong. I don't have to prove anything that is already well-known. For instance, if I said that gravity makes things fall down, I wouldn't bear any burden of proof. Accordingly, there is no doubt that neural nets can only represent what is in their training set. That's how neural nets work: they model their training data. They can't model data that is not in their training set. It wouldn't even be fair to expect a neural net to learn to represent data that it wasn't trained on, and just to be clear, I'm not saying that there should be such an expectation, or that it is even desirable. This modelling ability of neural nets is useful. In fact, this is the real strength of neural nets, they are extremely good at modelling. I mean, duh! Why are we even discussing this?

But this is something that the deep learning community is trying to deny, to itself primarily, it seems. Which is exceedingly strange. Work like the one linked above prefers to make bizarre claims about reasoning ability that we are, presumably, expected to believe arises magickally just by training on lots of data, as if there's a threshold of volume above which data is miraculously transubstantiated into an element with quite different properties, from which reasoning or "critical thinking" (dear god) emerges even in the complete absence of anything remotely like a reasoning mechanism. This is nonsense. Why not admit that in order for a large language model to be able to generate code, it must see code "like" the one it's asked to generate? Then we can talk about what "like" means, which is the interesting question. All this attempt to pussyfoot around what those systems are really doing is so counter-productive.

Again, this is not about anything you specifically say, but a criticism of deep learning research in general. I don't presume you're a deep learning researcher.

>> It's a weak objection anyway; just the task of translating english prose into the general algorithm you want to apply is already an impressive feat.

Not as impressive as you think. The problem descriptions used on CodeForces etc are not arbitrary English prose. They don't ask participants to write a poem about Spring (and I don't mean the old Java library). So it's not "prose" but very precise specifications. They could be represented as a Controlled Natural Language. So something much easier to model than arbitrary English.

And, yet again, the performance of the model is crap.


I was paraphrasing your argument. AFAICT, it remains faithful to what you are saying.

> Accordingly, there is no doubt that neural nets can only represent what is in their training set.

This is not true. If it were true that there was no doubt, the paper wouldn't have challenged it and claimed it false. If you assume your conclusion well obviously your conclusion follows trivially.

> And, yet again, the performance of the model is crap.

It isn't.


>> I was paraphrasing your argument. AFAICT, it remains faithful to what you are saying.

No, it remains faithful to your interpretation of what I said, which is designed to support your opinion rather than mine.

Also, basic manners: if you're not quoting, don't use quotes.

>> It isn't.

Is too!

We could do that all day. Or, we could look at the reported results which are, well, crap.


> It generates millions of candidate programs and then it "filters" them by running them against input-output examples of the target programs taken from the problem descriptions. The filtering still leaves thousands of candidate programs (because there are very few I/O examples and the almost random generation can generate too many programs that pass the tests, but still don't solve the problem) so there's an additional step of clustering applied to pare this down to 10 programs that are finally submitted. Overall, that's a brute-force, almost random approach that is ignoring entire decades of program synthesis work. To make an analogy, it's as if DeepMind had just published an article boasting of its invention of a new sorting algorithm... bubblesort.

This is almost certainly untrue and if this position were true, it would be extremely easy for you to prove it: just write a program-generating algorithm that solves even some of the easiest Codeforces problems. Since you're claiming this feat by Alphacode is comparable in difficulty to writing bubblesort (which you could write in 5 minutes), it shouldn't take you a lot of effort to produce something comparable. Just link your program-generating algorithm here with something like instructions on how to use it, and link a few Codeforces submissions where it got an ACC result.


To clarify, which part of my comment do you think is "almost certainly untrue"? What I describe above is how their approach works. It's summarised in the paper. See Section 4. ("Approach"):

1. Pre-train a transformer-based language model on GitHub code with standard language modelling objectives. This model can reasonably represent the space of human coding, which greatly reduces the problem search space.

2. Fine-tune the model on our dataset of competitive programming data, using GOLD (Pang and He, 2020) with tempering (Dabre and Fujita, 2020) as the training objective. This further reduces the search space, and compensates for the small amount of competitive programming data by leveraging pre-training.

3. Generate a very large number of samples from our models for each problem.

4. Filter the samples to obtain a small set of candidate submissions (at most 10), to be evaluated on the hidden test cases, by using the example tests and clustering to pick samples based on program behaviour.

>> Since you're claiming this feat by Alphacode is comparable in difficulty to writing bubblesort (which you could write in 5 minutes), it shouldn't take you a lot of effort to produce something comparable.

What I meant was that the way they announced AlphaCode is like claiming that bubblesort is a novel approach to sorting lists. Not that the effort needed to create their system is comparable to bubblesort. I think if you read my comment again more carefully you will find that this is the first interpretation that comes to mind. Otherwise, I apologise if my comment was unclear.


Aren't you missing the point that even though the success percentage is low, it is still the same as the estimated average human performance? So it is impressive that the overall system (however clunky it is) is indeed able to match human performance. If you don't find this impressive, do you have any other example of a system that exceeds this performance?


To clarify, the low percentage of 25% of correct solutions is on the APPS dataset, not against human coders. See table 10 (page 21 of the pdf) on the arxiv paper if you are unsure about the difference:

https://storage.googleapis.com/deepmind-media/AlphaCode/comp...

Evaluation against the average competitor on Codeforces is not the "estimated average human performance", it's only the average of the coders on Codeforces, who are an unknown proportion of all human coders with an unknowable level of coding ability. So evaluating against that is actually a pretty meaningless metric.

The benchmarking against APPS is much more meaningful but the results are pretty poor and so they are omitted from the article above.

So, no. I'm not missing the point. Rather, the article above is eliding the point: which is that on the one meaningful evaluation they attempted, their system sucks.

Edit: Here's table 10, for quick reference:

                       Filtered From (k)  Attempts (k)  Introductory  Interview  Competition
                                                        n@k           n@k        n@k
  GPT-Neo 2.7B         N/A                1             3.90%         0.57%      0.00%
  GPT-Neo 2.7B         N/A                5             5.50%         0.80%      0.00%
  Codex 12B            N/A                1             4.14%         0.14%      0.02%
  Codex 12B            N/A                5             9.65%         0.51%      0.09%
  Codex 12B            N/A                1000          25.02%        3.70%      3.23%
  Codex 12B            1000               1             22.78%        2.64%      3.04%
  Codex 12B            1000               5             24.52%        3.23%      3.08%
  AlphaCode 1B         N/A                1000          17.67%        5.24%      7.06%
  AlphaCode 1B         1000               5             14.36%        5.63%      4.58%
  AlphaCode 1B         10000              5             18.18%        8.21%      6.65%
  AlphaCode 1B         50000              5             20.36%        9.66%      7.75%
And its caption:

Table 10 | n@k results on APPS. If there is no filtering, then n = k and the metric is pass@k. Finetuned GPT-Neo numbers reported from Hendrycks et al. (2021), Codex numbers from Chen et al. (2021). We used a time limit of 3 seconds per test to match Codex 12B, and report average numbers over 3 different fine-tuning runs for AlphaCode.

Edit 2: And now that I posted this, I note that the 25% solutions are from Codex. AlphaCode's best result was 20%.


You can find the rating distribution filtered for >5 contests here: https://codeforces.com/blog/entry/71260

I am rated at 2100+ so I do agree that 1300 rating is low. But at the same time it solved https://codeforces.com/contest/1553/problem/D which is rated at 1500 which was actually non-trivial for me already. I had one wrong submit before getting that problem correct and I do estimate that 50% of the regular competitors (and probably the vast majority of the programmers commenting in this thread right now) should not be able to solve it within 2hrs.


1553D is a quite confusing case though.

On the AlphaCode Attention Visualization website [1], the Accepted code shown for 1553D is a O(n^2) Python one, which is supposed to be TLE. It correctly implements a two-pointer solution, but failed to "realize" that list.pop(0) is O(n) in Python. I'm not sure how it passed.

[1] https://alphacode.deepmind.com/#layer=30,problem=34,heads=11...


Likely the Python runtime has a strange string implementation for cases like this, just like JavaScript strings.


It does not. Really the strings just never get long enough that O(n²) would be catastrophic; the maximum possible length is 2e5.


2e5 is enough for making a naive O(n^2) solution to get TLE.

This is likely due to the fact that in AlphaCode's solution the "inner O(n) loop" is actually a memmove(), which is optimized to be insanely fast.


> AlphaCode's solution the "inner O(n) loop" is actually a memmove(), which is optimized to be insanely fast.

Again, it is not. CPython does not do these things.

The web page says, and this is corroborated in the paper,

> Solutions were selected randomly, keeping at most one correct (passes all test cases in our dataset) and one incorrect sample per problem and language. Note that since our dataset only has a limited number of test cases, passing all tests we have cannot completely rule out false positives (~4%), or solutions that are correct but inefficient (~46%).

The “54th percentile” measure did use estimated time penalties, which you can see discussed in Table 4 in the paper, but 1553D was not part of that.


> CPython does not do these things.

Again, it is.

https://github.com/python/cpython/blob/2d080347d74078a55c477...

This is the memmove() I mentioned above. Like, I actually perf-d the code and confirmed this is in the hot loop.

> but 1553D was not part of that.

Someone submitted this 1553D code to Codeforces and it passed: https://codeforces.com/contest/1553/submission/144971343


Apologies, I thought you meant ‘optimized’ in a different sense, not in terms of how list.pop is implemented, as AlphaCode wasn't involved in that. You are entirely correct that list.pop uses memmove.

> Someone submitted this 1553D code to Codeforces and it passed

Ah, well that shows you have a 2 second time limit, which is quite a lot of time! Not quite enough to empty a 200k element list with list.pop(0)s, but not far off; a 140k element list squeaks in under the time limit for me.
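
If you want to check on your own machine, a rough timing sketch (numbers will obviously vary with hardware and Python version):

  import time

  # emptying an n-element list with pop(0) is O(n^2) overall, but the
  # per-pop memmove constant is small
  for n in (100_000, 140_000, 200_000):
      xs = list(range(n))
      t0 = time.perf_counter()
      while xs:
          xs.pop(0)
      print(n, round(time.perf_counter() - t0, 2), "seconds")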


The proposed O(N²) solution contains many unnecessary operations, e.g. the creation of list c or reversal of the input strings. Maybe it has been copied from a related problem? You can easily solve the task with half as many lines in O(N).

    # Greedy from the right: if the last characters match, consume one from
    # each string; otherwise delete the last two characters of a.
    for _ in range(int(input())):
        a = list(input())
        b = list(input())
        while a and b:
            if a[-1] == b[-1]:
                a.pop()
                b.pop()
            else:
                a.pop()
                if a: a.pop()
        print("NO" if b else "YES")


> But at the same time it solved https://codeforces.com/problemset/problem/1553/D

To be fair, it generated a set of (10) possible solutions, and at least one of them solved the problem.


I'm trying to solve this for fun, but I'm stuck! I've got a recursive definition that solves the problem by building a result string. I think it's a dynamic programming problem, but right now I can't see the shared sub-problems so :). Some real sour cherries being experienced from not getting this one!


  from collections import defaultdict
  from random import randint

  def backspace(s1, s2):
      # count how many of each character s1 has beyond s2
      h = defaultdict(lambda: 0)
      for x in s1:
          h[x] = h[x] + 1
      for x in s2:
          h[x] = h[x] - 1
      # greedily walk s1, matching the characters of s2 in order
      j = 0
      maxj = len(s2) - 1
      for x in s1:
          if x != s2[j]:
              h[x] -= 1
          elif j < maxj:
              j += 1
          else:
              break
      return j == maxj and all(y >= 0 for y in h.values())

  def random_backspace(s1):
      # keep each character of s1 with probability 1/2
      res = []
      for x in s1:
          if randint(0, 1) == 0:
              res.append(x)
      return "".join(res)

  def backspaceTest(s1):
      return all(backspace(s1, random_backspace(s1)) for _ in range(100))


For comparison, I used to be a very average, but pretty regular user about 5 years ago. I could reliably solve easiest 2 out of 5 problems, 3 in my lucky days.

My rating is 1562.


I find almost every new advance in deep learning is accompanied by contrasting comments: it's either "AI will soon automate programming/<insert task here>", or "let me know when AI can actually do <some-difficult-task>". There are many views on this spectrum, but these two are sure to be present in every comment section.

IIUC, AlphaCode was trained on Github code to solve competitive programming challenges on Codeforces, some of which are "difficult for a human to do". Suppose AlphaCode was trained on Github code that contains the entire set of solutions on Codeforces, is it actually doing anything "difficult"? I don't believe it would be difficult for a human to solve problems on Codeforces when given access to the entirety of Github (indexed and efficiently searchable).

The general question I have been trying to understand is this: is the ML model doing something that we can quantify as "difficult to do (given this particular training set)"? I would like to compute a number that measures how difficult it is for a model to do task X given a large training set Y. If the X is part of the training set, the difficulty should be zero. If X is obtained only by combining elements in the training, maybe it is harder to do. My efforts to answer this question: https://arxiv.org/abs/2109.12075

In recent literature, the RETRO Transformer (https://arxiv.org/pdf/2112.04426.pdf) talks about "quantifying dataset leakage", which is related to what I mentioned in the above paragraph. If many training samples are also in the test set, what is the model actually learning?

Until deep learning methods provide a measurement of "difficulty", it will be difficult to gauge the prowess of any new model that appears on the scene.


> Suppose AlphaCode was trained on Github code that contains the entire set of solutions on Codeforces, is it actually doing anything "difficult"?

They tested it on problems from recent contests. The implication being: the statements and solutions to these problems were not available when the Github training set was collected.

From the paper [0]: "Our pre-training dataset is based on a snapshot of selected public GitHub repositories taken on 2021/07/14" and "Following our GitHub pre-training dataset snapshot date, all training data in CodeContests was publicly released on or before 2021/07/14. Validation problems appeared between 2021/07/15 and 2021/09/20, and the test set contains problems published after 2021/09/21. This temporal split means that only information humans could have seen is available for training the model."

At the very least, even if some of these problems had been solved exactly before, you still need to go from "all of the code in Github" + "natural language description of the problem" to "picking the correct code snippet that solves the problem". Doesn't seem trivial to me.

> I don't believe it would be difficult for a human to solve problems on Codeforces when given access to the entirety of Github (indexed and efficiently searchable).

And yet, many humans who participate in these contests are unable to do so (although I guess the issue here is that Github is not properly indexed and searchable for humans?).

[0] https://storage.googleapis.com/deepmind-media/AlphaCode/comp...


> They tested it on problems from recent contests. The implication being: the statements and solutions to these problems were not available when the Github training set was collected.

Yes, and I would like to know how similar the dataset(s) were. Suppose the models were trained only on greedy algorithms and then I provided a dynamic programming problem in the test set, (how) would the model solve it?

> And yet, many humans who participate in these contests are unable to do so (although I guess the issue here is that Github is not properly indexed and searchable for humans?).

Indeed, so we don't know what "difficult" means for <human+indexed Github>, and hence we cannot compare it to <model trained on Github>.

My point is, whenever I see a new achievement of deep learning, I have no frame of reference (apart from my personal biases) of how "trivial" or "awesome" it is. I would like to have a quantity that measures this - I call it generalization difficulty.

Otherwise the datasets and models just keep getting larger, and we have no idea of the full capability of these models.


> Suppose the models were trained only on greedy algorithms and then I provided a dynamic programming problem in the test set, (how) would the model solve it?

How many human beings do you personally know who were able to solve a dynamic programming problem at first sight without ever having seen anything but greedy algorithms?

Deepmind is not claiming they have a machine capable of performing original research here.

Many human programmers are unable to solve DP problems even after having them explained several times. If you could get a machine that takes in all of Github and can solve "any" DP problem you describe in natural language with a couple of examples, that is AI above and beyond what many humans can do, which is "awesome" no matter how you put it.


> that is AI above and beyond what many humans can do, which is "awesome" no matter how you put it.

That's not the point being made. The point OP is making is that it is not possible to understand how impressive at "generalizing" to uncertainty a model is if you don't know how different the training set is from the test set. If they are extremely similar to each other, then the model generalizes weakly (this is also why the world's smartest chess bot needs to play a million games to beat the average grandmaster, who has played less than 10,000 games in her lifetime). Weak generalization vs strong generalization.

Perhaps all such published results should contain info about this "difference" so it becomes easier to judge the model's true learning capabilities.


I guess weaker generalisation is why it's better, though. It converges slower, but in the end its knowledge is more subtle. So my bet is more compute, and programming and math are "solved" - not in the research sense, but as a very helpful "copilot".

The real fun will begin once someone discovers how to make any problem differentiable, so the trial-and-error method isn't needed. I suggest watching the recent Yann LeCun interview. This will solve researching as well.


> How many human beings do you personally know who were able to solve a dynamic programming problem at first sight without ever having seen anything but greedy algorithms?

Zero, which is why if a trained network could do it, that would be "impressive" to me, given my personal biases.

>. If you could get a machine that takes in all of Github and can solve "any" DP problem you describe in natural language with a couple of examples, that is AI above and beyond what many humans can do, which is "awesome" no matter how you put it.

I agree with you that such a machine would be awesome, and AlphaCode is certainly a great step closer towards that ideal. However, I would like to have a number measures the "awesomeness" of the machine (not elo rating because that depends on a human reference), so I will have something as a benchmark to refer to when the next improvement arrives.


I understand wanting to look at different metrics to gauge progress, but what is the issue with this?

> not elo rating because that depends on a human reference


The Turing Test (https://en.wikipedia.org/wiki/Turing_test) for artificial intelligence required the machine to convince a human questioner that it was a human. Since then, most AI methods rely on a human reference of performance to showcase their prowess. I don't find this appealing because:

1) It's an imprecise target: believers can always hype and skeptics can always downplay improvements. Humans can do lots of different things somewhat well at the same time, so a machine beating human-level performance in one field (like identifying digits) says little about other fields (like identifying code vulnerabilities).

2) ELO ratings, or similar metrics are measurements of skill, and can be brute-forced to some extent, equivalent to grinding up levels in a video game. Brute-forcing a solution is "bad", but how do we know a new method is "better/more elegant/more efficient"? For algorithms we have Big-O notation, so we know (brute force < bubble sort < quick sort), perhaps there is an analogue for machine learning.

I would like performance comparisons that focus on quantities unique to machines. I don't compare the addition of computer processors with reference to human addition, so why not treat machine intelligence similarly?

There are many interesting quantities with which we can compare ML models. Energy usage is a popular metric, but we can also compare the structure of the network, the code used, the hardware, the amount of training data, the amount of training time, and the similarity between training and test data. I think a combination of these would be useful to look at every time a new model arrives.


Using my previous chess analogy, the world's smartest chess bot has played a million games to beat the average grandmaster, who has played less than 10,000 games in her lifetime. So while they both will have the same elo rating, which is a measure of how good they are at the narrow domain of chess, there is clearly something superior about how the human grandmaster learns from just a few data points, i.e. strong generalization vs the AI's weak generalization. Hence the task-specific elo rating does not give enough context to understand how well a model adapts to uncertainty. For instance - a Roomba would beat a human hands down if there was an elo rating for vacuuming floors.


> The implication being: the statements and solutions to these problems were not available when the Github training set was collected.

But similar ones were, because the amount of code puzzles suitable for such contests is finite. There are differences, but when you have seen a few string-compare problems, you have a pretty good grasp of what they look like, what common paths to solutions are, etc.


The example problem (essentially, is T a subsequence of S with deletions of size N) is a classic problem with no doubt dozens of implementations in AlphaCode's training set.

And yet, what a garbage solution it produces.

To illustrate the difference between intelligence and regurgitation, someone tell me what CoPilot generates for this:

  // A Go function to swap the sixth bit and seventeenth bit of a 32-bit signed integer.
Here is a human solution:

  func swap(x int32) int32 {
      const mask = 1 << 5
      var (
          xor1 = (x>>11 ^ x) & mask
          xor2 = xor1 << 11
      )
      return x ^ xor1 ^ xor2
  }
CoPilot cannot reason numerically like this (understand "seventeenth bit" and "sixth bit" and generate the right code for that combination). It needs to understand the size of the gap between the bits, i.e., 11, and that's too hard.
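
For reference, the same XOR trick in Python (my translation, with the bit positions as parameters so the gap is computed rather than hard-coded):

  def swap_bits_xor(x, i=5, j=16):
      # if the two bits differ, flip both of them; otherwise leave x unchanged
      diff = ((x >> i) ^ (x >> j)) & 1
      return x ^ ((diff << i) | (diff << j))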


Here is what CoPilot wrote in Python:

  # A function to swap the sixth bit and seventeenth bit of a 32-bit signed integer.
  def swap_bits(x):
      # Get the value of the sixth bit.
      bit6 = x & (1 << 5)
  
      # Get the value of the seventeenth bit.
      bit17 = x & (1 << 16)
  
      # Swap the values of the sixth and seventeenth bit.
      bit6 = bit6 >> 5
      bit17 = bit17 >> 16

      # Combine the swapped values of the sixth and seventeenth bit.
      x = x ^ (bit6 << 16)
      x = x ^ (bit17 << 5)

      return x
It actually synthesised about 10 viable solutions. Some interesting ones:

  def swap_six_seventeen(x):
      # Get the binary representation of the integer.
      binary = bin(x)[2:]
      # Add zeros to the beginning of the binary representation.
      binary = '0' * (32 - len(binary)) + binary
      # Swap the sixth and seventeenth bit.
      binary = binary[:5] + binary[17] + binary[5:17] + binary[18:]
      # Convert the binary back to an integer.
      return int(binary, 2)


The first one (swap_bits) sets both bits to the same value, which is the original two bits XORed together. Eg.

    bin(swap_bits(0b_1_0000000000_0_00000))
    '0b10000000000100000'

    bin(swap_bits(0b_0_0000000000_1_00000))
    '0b10000000000100000'

    bin(swap_bits(0b_1_0000000000_1_00000))
    '0b0'

    bin(swap_bits(0b_0_0000000000_0_00000))
    '0b0'
The second one converts the value to a string and uses string operations, which is wildly inefficient and a very common mistake made by inexperienced programmers unaware of bitwise operations (so presumably common in the training set). It also attempts to swap the 6th and 17th most significant bits rather than the 6th and 17th least significant bits, i.e. counts in the opposite direction to the first one (the comment doesn't specify but typically you count from the least significant bit in these situations).

Worse, though, it gets the string manipulation completely wrong. I think it's trying for `binary[:5] + binary[16] + binary[6:16] + binary[5] + binary[17:]`, i.e. characters 1-5, then character 17, then characters 7-16, then character 6, then characters 18-32. The manipulation it does just completely mangles the string.
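(If you want to see that the corrected slicing would actually work, here's a throwaway check against a plain bitwise swap — counting from the most significant bit, as that snippet does:)

    def swap_msb_6_17_str(x):
        b = f"{x:032b}"
        # swap characters 6 and 17 (indices 5 and 16), i.e. bits 26 and 15 from the LSB
        return int(b[:5] + b[16] + b[6:16] + b[5] + b[17:], 2)

    def swap_bits_ref(x, i=26, j=15):
        bi, bj = (x >> i) & 1, (x >> j) & 1
        return x if bi == bj else x ^ ((1 << i) | (1 << j))

    for v in (0, 1 << 15, 1 << 26, (1 << 15) | (1 << 26), 0xDEADBEEF, 0xFFFFFFFF):
        assert swap_msb_6_17_str(v) == swap_bits_ref(v)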

I'm very keen to try Github Copilot if they ever admit me to the beta (I've been waiting forever) and will adopt it enthusiastically if it's useful. However, this is exactly what I've pessimistically expected. Analysing these truly awful implementations to identify the subtle and bizarre misbehaviours has taken me far, far longer than it would have taken me to just write and test a working implementation myself. And I'm supposed to evaluate 10 of these to see if one of them might possibly do the right thing?!?!


The first example is almost correct, conditioned off a sentence description. The second example is the right idea, it just bit off more than it could chew when slicing it all together. Using string ops for binary manipulation in Python isn't even stupid; it can be faster in a lot of cases.

This feels a lot like screaming at a child for imperfect grammar.


You're misunderstanding my point. Nobody's screaming at anything. Whether this thing is impressive isn't at issue. It's utterly astonishing.

I'm trying to figure out whether copilot in its current form is a tool that will be useful to me in my job. (I'd be able to do this evaluation properly if they'd just let me on the damned beta.)

Nearly right isn't good enough for this afaics. In fact, I expect there to be a slightly paradoxical effect where nearly-right is worse than obviously-wrong. An analysis of a piece of code like I did above is time consuming and cognitively taxing. An obviously wrong solution I can just reject immediately. An almost-right (or at least vaguely plausible) one like these takes thought to reject. Much more thought, in this case (for me, at least) than just writing the thing myself in the first place.

Edit: BTW, I don't get what you're saying with

"The first example is almost correct, conditioned off a sentence description. The second example is the right idea, it just bit off more than it could chew when slicing it all together."

The first one is completely (if subtly) wrong. It's supposed to swap two bits but it sets them to the same value. There's no interpretation of the description in which that's correct.

The second one is definitely not "the right idea". It tries to do it with string manipulations, which (regardless of the fact that it does so incorrectly) is completely the wrong approach. This one is actually "better" than the other in the paradoxical sense I mentioned above, because I could reject it the moment I saw it convert the number to a string.


> The second one is definitely not "the right idea". It tries to do it with string manipulations, which (regardless of the fact that it does so incorrectly) is completely the wrong approach. This one is actually "better" than the other in the paradoxical sense I mentioned above, because I could reject it the moment I saw it convert the number to a string.

In this case string ops are a worse idea, but as I said before, this is not generally true of Python, at least when using CPython. E.g. the string method is significantly faster in this example:

    # https://stackoverflow.com/a/20918545/1763356
    def reverse_mask(x):
        x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1)
        x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2)
        x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4)
        x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)
        x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16)
        return x

    # My ver
    def reverse_format(x):
        return int(f"{x:032b}"[::-1], 2)
Python's dynamic object overhead (and to a lesser extent, interpreter overhead) makes a lot of seemingly-expensive operations not matter very much.
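(If you want to check that claim on your own machine, a rough timing sketch — CPython, numbers will obviously vary:)

    import timeit

    # assumes reverse_mask and reverse_format from above are defined in __main__
    setup = "from __main__ import reverse_mask, reverse_format; x = 0xDEADBEEF"
    print(timeit.timeit("reverse_mask(x)", setup=setup, number=1_000_000))
    print(timeit.timeit("reverse_format(x)", setup=setup, number=1_000_000))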


Well, that also seems like the wrong question to ask. Whether it's currently useful to you for writing short algorithms, rather than as the non-programmer's API interface it's primarily marketed as, seems like about the least interesting takeaway. We'll get to smoothing over the cracks later, once it's not a capability we literally just discovered exists. Heck, Codex is already not even SOTA for that; AlphaCode is.


It may not be the question that interests you but who are you to say it's the "wrong question" for me to ask? I want to know if I'm on the verge of having access to a tool that is going to transform the way I do my job, as people keep claiming.


It illustrates that CoPilot is generating maximum likelihood token strings and has no real understanding of the code.

That's what is happening here. There is no intelligence, just regurgitation. Randomization and maximum likelihood completion.

Just like with the competitive programming example, we're asking it to produce solutions that it has seen in its training set. If you ask for a nontrivial twist on one of those solutions, it fails.


>It illustrates that CoPilot is generating maximum likelihood token strings and has no real understanding of the code.

Funny, today I was just thinking of people's tendencies to dismiss AI advances with this very pattern of reasoning: take a reductive description of the system and then dismiss it as obviously insufficient for understanding or whatever the target is. The assumption is that understanding is fundamentally non-reductive, or that there is insufficient complexity contained within the reductive description. But this is a mistake.

The fallacy is that the reductive description is glossing over the source of the complexity, and hence where the capabilities of the model reside. "Generating maximum likelihood token strings" doesn't capture the complexity of the process that generates the token strings, and so an argument that is premised on this reductive description cannot prove the model deficient. For example, the best way to generate maximum likelihood human text is just to simulate a human mind. Genuine understanding is within the solution-space of the problem definition in terms of maximum likelihood strings, thus you cannot dismiss the model based on this reductive description.


The difference between me and you is that I implement neural nets professionally. Here is one of my (non-professional) open source projects: https://NN-512.com

I'm sure if you understood what the transformer was doing, you would be less impressed.


This is the wrong context to go with an appeal to authority. I know what the transformer is doing, I've also developed neural networks before (though not professionally). Your experience is working against you in developing your intuition. There's another common fallacy that because we're somehow "inside" the system, that we understand exactly what is going on, or in this case what isn't going on. Language models are composed of variations of matrix multiplications, but that isn't a complete description of their behavior. It's like saying because we've looked inside the brain and there's just electrical and chemical signals, the mind must reside somewhere else. It's just a specious argument.


It got the value of the sixth and seventeenth bits, moved them into the right positions, and inserted them into the original value. Off a one-line description written in English! I really cannot empathize with the idea that this is not a meaningful capability. If intelligence only means to you “equal in all capabilities to an experienced human”, you are never going to be able to see anything coming ever.


If you ask CoPilot to solve something it hasn't seen, it won't be able to solve it.

It's a transformer. Do you understand what that means? It's just matrix multiplication.

It generates maximum likelihood token strings, based on its training data.

It doesn't "understand" what those token string mean.

You are amazed because you're testing the transformer by asking the transformer to generate human-written code THAT IT WAS TRAINED ON. To make CoPilot fail, all you have to do is ask it to generate something unlikely, something it hasn't seen in training.

Maximum likelihood token strings. Period.


All wrong. These "solutions" are not correct.


The xors are wrong. Stick in the integer 1<<16.


Ask and you shall receive, here is the output it gave me:

    // Write a Go function to swap the sixth bit and seventeenth bit of a 32-bit signed integer.
    package main
    
    import (
     "fmt"
     "math"
    )
    
    func main() {
     var a int32 = 0b1010101010101010101010101010101010101010101010101010101010101010
     fmt.Printf("%b\n", a)
     fmt.Printf("%b\n", swapBits(a, 6, 17))
    }
    
    func swapBits(a int32, i int, j int) int32 {
     // convert to binary
     bin := fmt.Sprintf("%b", a)
     // get the bits
     bit1 := bin[i-1 : i]
     bit2 := bin[j-1 : j]
     // swap the bits
     bin = bin[:i-1] + bit2 + bin[i:]
     bin = bin[:j-1] + bit1 + bin[j:]
     // convert back to int
     return int32(bin2int(bin))
    }
    
    func bin2int(bin string) int64 {
     var sum int64
     for i, v := range bin {
      if v == '1' {
       sum += int64(math.Pow(2, float64(len(bin)-i-1)))
      }
     }
     return sum
    }


Ridiculous. It's a joke, right? Perhaps the most inefficient and naive solution ever?

CoPilot is regurgitating some solution from its training set, the solution of an inept programmer who would manipulate bits via conversion to string... yikes.


The next iteration of code assistant needs to be able to parse responses like your comment and update the code accordingly. Once a human+computer pair can converge on a correct and admissible solution to _any_ tractable programming task through natural language dialogue, we should start worrying about our jobs going away. Until then, for each line of code generated by AI, there will be two jobs created to maintain that code.


Copilot can do that, sorta. You undo the completion and add something like "... but don't convert it to a string" to the comment, then have it try completing again.
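E.g. the revised prompt comment might look something like this (hypothetical — I haven't run this exact prompt):

    # A function to swap the sixth bit and seventeenth bit of a 32-bit signed
    # integer, using only bitwise operations -- but don't convert it to a string.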


Which direction in feature space do you move in response to "you inept POS"?


You can do it without a subtraction

    unsigned int swapbits(unsigned int a) {
        bool bit6 = a & (1 << 5);
        bool bit17 = a & (1 << 16);
        if (bit6 == bit17) return a; // bits are the same, do nothing
        return (a ^ (1 << 5) ^ (1 << 16)); // flip both 6th and 17th bits
    }


And, to be clear, this is a human solution.

Not as efficient as mine, but kudos.


The compiler seems to generate less efficient code than either if you write the most mechanical solution for swapping the bits in C.

gcc and clang give

   swap:                                   # @swap
        mov     ecx, edi
        shr     ecx, 11
        and     ecx, 32
        mov     eax, edi
        and     eax, -65569
        or      eax, ecx
        and     edi, 32
        shl     edi, 11
        or      eax, edi
        ret

   swap:
        mov     eax, edi
        mov     edx, edi
        and     edi, -65569
        sal     eax, 11
        shr     edx, 11
        and     eax, 65536
        and     edx, 32
        or      eax, edx
        or      eax, edi
        ret

    /* only works on little-endian! */
    typedef union {
        struct {
            unsigned bit1: 1;   unsigned bit2: 1;
            unsigned bit3: 1;   unsigned bit4: 1;
            unsigned bit5: 1;   unsigned bit6: 1;
            unsigned bit7: 1;   unsigned bit8: 1;
            unsigned bit9: 1;   unsigned bit10: 1;
            unsigned bit11: 1;  unsigned bit12: 1;
            unsigned bit13: 1;  unsigned bit14: 1;
            unsigned bit15: 1;  unsigned bit16: 1;
            unsigned bit17: 1;  unsigned bit18: 1;
            unsigned bit19: 1;  unsigned bit20: 1;
            unsigned bit21: 1;  unsigned bit22: 1;
            unsigned bit23: 1;  unsigned bit24: 1;
            unsigned bit25: 1;  unsigned bit26: 1;
            unsigned bit27: 1;  unsigned bit28: 1;
            unsigned bit29: 1;  unsigned bit30: 1;
            unsigned bit31: 1;  unsigned bit32: 1;
        };
        unsigned int n;
    } mybits;

    unsigned int swap(unsigned int n)
    {
        mybits foo;
        foo.n = n;
        unsigned tmp = foo.bit6;
        foo.bit6 = foo.bit17;
        foo.bit17 = tmp;
        return foo.n;
    }


Would we be able to generate unit tests? Strikes me that this would be important to verify given that we didn't even "write" the code. At some point we might not even be looking at the generated code? I almost guarantee that's what is going to happen eventually.


You can see it happening already.

Solutions are posted, and they're wrong.

But the CoPilot user can't see the code is wrong.


There's really no need for an 11 in the code. I'd say that makes the code worse, not better.


This is a toy problem to illustrate that CoPilot cannot write code that requires mathematical reasoning. It regurgitates solutions from the training set, via a mixed internal representation.


   unsigned int swapbits(unsigned int a)
   {
       bool bit6 = a & (1 << 5);
       bool bit17 = a & (1 << 16);
       if (bit6 == bit17) return a; //bits are the same, do nothing
       return (a ^ (1 << 5) ^ (1 << 16)); // flip both 6th and 17th bits
   }


Gross and not portable C99.

    #define B6 (1<<5)
    #define B17 (1<<16)

    unsigned swapbits(unsigned a) {
       return ((a & B6 == a & B17) ? a : (a ^ (B6 | B17)));
    }

Here's some BFP:

    unsigned swapbits(unsigned a) {
       unsigned flip = (a & B6 == a & B17);
       return (a ^ ((flip<<5) | (flip<<16)));
    }

int and double are C's implicit lingua francas for underspecified literals and implicit type conversions. Throwing int everywhere is redundant like "ATM machine."


The definition of flip requires parentheses, (a & B6) == (a & B17), as == has higher precedence than &. int is required in C++ but not in C, as you said.


What requires mathematical reasoning? Getting or setting the nth bit? Or swapping two variables? What am I missing?


At the risk of sounding relentlessly skeptical - surely by training the model on GitHub data you're not actually creating an AI that solves problems, but an extremely obfuscated database of coding-puzzle solutions?


We validated our performance using competitions hosted on Codeforces, a popular platform which hosts regular competitions that attract tens of thousands of participants from around the world who come to test their coding skills. We selected for evaluation 10 recent contests, each newer than our training data. AlphaCode placed at about the level of the median competitor, marking the first time an AI code generation system has reached a competitive level of performance in programming competitions.

[edit] Is "10 recent contests" a large enough sample size to prove whatever point is being made?


The test against human contestants doesn't tell us anything because we have no objective measure of the ability of those human coders (they're just the median in some unknown distribution of skill).

There are more objective measures of performance, like a good old-fashioned benchmark dataset. For such an evaluation, see table 10 in the arXiv preprint (page 21 of the PDF), listing the results on the APPS dataset of programming tasks. The best-performing variant of AlphaCode solves 25% of the simplest ("introductory") APPS tasks and less than 10% of the intermediate ("interview") and more advanced ("competition") ones.

So it's not very good.

Note also that the article above doesn't report the results on APPS. Because they're not that good.


Does it need to solve original problems? Most of the code we write is dealing with the same problems in a slightly different context each time.

As others say in the comments, it might be a case of meeting in the middle: us writing some form of tests for AI-produced code to pass.


That’s been a common objection to Copilot and other recent program synthesis papers.

The models regurgitate solutions to problems already encountered in the training set. This is very common with Leetcode problems and seems to still happen with harder competitive programming problems.

I think someone else in this thread even pointed out an example of AlphaCode doing the same thing.


Between this and OpenAI's GitHub Copilot, "programming" will probably slowly start dying. What I mean by that is: sure, you still have to learn how to program, but our time will be spent much more on the design part and on writing detailed documentation/specs, and then we just have one of these AIs generate the code.

It's the next step. Binary code < assembly < C < Python < AlphaCode

Historically it's always been about abstracting and writing less code to do more.


First, if this is correct and AlphaCode succeeded, it will bring about its own demise.

I.e. as soon as it starts replacing humans, it will not have enough human-generated training data, since all programming will be done by models like itself.

Second, AlphaCode was specifically trained for competitive programming:

1. Short programs. 2. Each problem has hundreds of human-generated solutions.

However, commercial programs are:

1. Long. 2. Have no predefined answer, or even a correct answer. 3. Need to use/reuse a lot of legacy code.


> as soon as it starts replacing humans, it will not have enough human generated training data, since all of programming will be done by models like himself.

As a natural born pessimist, I can't help but feel that by the time we get to that point we'll just keep blundering forward and adapting our world around the wild nonsense garbage code the model ends up producing in this scenario.

After all, that's basically what we've done with the entire web stack.


Reinforcement learning and adversarial training can render both of those concerns non-issues in practice.


The phrase "in practice" doesn't really work when you're referring to highly finicky strategies like RL and adversarial training


My bet would be that it will never happen in a reasonable time frame. And by that logic, writing that "documentation/spec" would just mean learning a new programming language the AI engine can parse, making it about as useful as a compiler. Anyone who has been writing and designing software for a while knows the cycle is way more complex than taking some input and writing code.

Let me know when the AI engine is able to do complex refactoring, add features while keeping backwards compatibility, find a bug in a giant codebase by debugging a test case, or write code that's performant but also maintainable.


I agree, from a totally different angle. Let's take something I know better as an example: structural engineering. Structural engineering should be a "solved problem". It seems, ostensibly, relatively simple compared to a more open-ended activity like "programming". (For "technical reasons", it ends up being more similar than you might think.) Still, you are ultimately dealing with the same materials, the same physics, and very similar configurations.

And yet, despite the fact that we have programs to help calculate all the things, test code-required load-combinations, even run simulations and size individual components... it turns out that, it doesn't actually save that much work, and you still need an engineer to do most of it. And not just because of regulatory requirements. It's just, that's not the hard part. The hard part is assembling the components and specifications, specifying the correct loads based on location-specific circumstances, coming up with coherent and sensible design ideas, chasing down every possible creative nook and cranny of code to make something that was originally a mistake actually work, and know when the model is just wrong for some reason and the computer isn't simulating load paths accurately.

Specifying the inputs and interpreting results is still about as much work as it was before you started with all the fancy tools. Those tools still have advantages mind you, and they do make one slightly more efficient. Substantially so in some cases, but most of the time it still comes out as a slight assist rather than a major automation.


As a former structural engineer, I completely agree with this sentiment. For every engineering project I was involved in, the automated components were at most 2 to 5% of the rest of the work.


I hear that.

Machine Learning also has a long way to go before it can take a long, rambling mess of a meeting and somehow generate a halfway usable spec from it. I mean, the customer says they want X, but X is silly in this context, so we'll give them Y and tell them it's "X-like, but faster". For example, SQL is "Blockchain-like, but faster" for a lot of buzzword use-cases of blockchain.


You ever notice how the "let me know when" part of this keeps changing? Let me know when computers can ... play Go/understand a sentence/compose music/write a program/ ...

But surely they'll never be able to do this new reference class you have just now come up with, right?


Not really? I mean I would never say "let me know when computer can do X" when X is something that doesn't require too much creativity and imagination. Like, a computer composing music, doesn't impress me too much because music itself has structure. A computer creating music that would wow a professional composer? That would be impressive. Same with this topic. A computer that solves some (because it failed several) short programming challenges and OP says it will kill programming entirely? Not even close. Pretty cool though.


It keeps changing because our intuitions about which tasks require intelligence are weak. We think that when a computer can do X it will also be able to do Y. But then someone builds a computer that can do X but can't do Y, and we say "oh, so that doesn't require intelligence; let me know when it can do Z and we can talk again." That doesn't mean that Z means the computer is intelligent, just that Z is a point where we can look at it and discuss again whether we've made any progress. What we really want is a computer that can do Y, but we make small mini-tasks that are easier to test against.

The Turing test is a great example of this. Turing thought that a computer needs to be intelligent to solve this task. But it was solved by hard coding a lot of values and better understanding of human psychology and what kind of conversation would seem plausible when most things are hardcoded. That solution obviously isn't AI, I bet you don't think so either, but it still passed the Turing test.


At what point do we give up and realize that there is no one thing called intelligence, just a bunch of hacks that work pretty well for different things sometimes? I think that's probably where people keep failing here. The reason that we keep failing to find the special thing in every new field that AI conquers is because there's nothing special to actually find? I mean, we could keep moving the goalposts, a sort of intelligence of the gaps argument? But this doesn't seem productive.


Possibly interesting trivium: automated debugging was first described in 1982, in Ehud Shapiro's PhD thesis titled "Algorithmic Program Debugging" (it's what it sounds like and it can also generate programs by "correcting" an empty program):

https://en.wikipedia.org/wiki/Algorithmic_program_debugging

Of course all this targeted only Prolog programs so it's not well-known at all.


It's also the starting point for Inductive Logic Programming (as in Shapiro's "Model Inference System"), as I'm sure you know ;)


Let's say I'm aware of it :)


Solving competitive programming problems is essentially solving hard combinatorial optimization problems. Throwing a massive amount of compute and gradient descent at the problem has always been possible. If I'm not mistaken what this does is reduce the representation of the problem to a state where it can run gradient descent and then tune parameters. The real magic is in finding structurally new approaches. If anything I'd say algorithms and math continue to be the core of programming. The particular syntax or level of abstraction don't matter so much.


> Solving competitive programming problems is essentially solving hard combinatorial optimization problems.

True, but if you relax your hard requirements of optimality to admit "good enough" solutions, you can use heuristic approaches that are much more tractable. High quality heuristic solutions to NP-hard problems, enabled by ML, are going to be a big topic over the next decade, I think.
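(To make "good enough" concrete: the classic greedy heuristic for set cover, an NP-hard problem, is guaranteed to be within a ln(n) factor of optimal. The ML angle would be learning which greedy/branching choices to make; this toy sketch doesn't attempt that part:)

    def greedy_set_cover(universe, subsets):
        universe, covered, chosen = set(universe), set(), []
        while covered != universe:
            # pick the subset that covers the most still-uncovered elements
            best = max(subsets, key=lambda s: len(set(s) - covered))
            if not set(best) - covered:
                raise ValueError("universe cannot be covered")
            chosen.append(best)
            covered |= set(best)
        return chosen

    # returns [{1, 2, 3}, {4, 5}], which here happens to be optimal
    print(greedy_set_cover({1, 2, 3, 4, 5}, [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]))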


I should correct myself, this isn't even that. This is just text analysis on Codeforces solutions, which makes it even worse than I thought. Very pessimistic about its generalizability.


> If anything I'd say algorithms and math continue to be the core of programming.

I disagree; I think the core of programming is analyzing things people want and expressing solutions to those wants clearly, unambiguously, and in a way that is easy to change in the future. I'd say algorithms and math are a very small part of this work.


That's not programming, that's called being a good employee. Any person in any role should be doing that. Programming is about algorithms and math. Now a good employee who's in a technical role should have both.


> Programming is about algorithms and math.

You've simply restated your opinion without providing any supporting arguments, and as I already said, I disagree. The vast majority of programming I see (and as a consultant, I see a fairly wide variety) is not about algorithms and math, but instead gluing together systems and expressing domain logic.

Now, I suppose you could argue that domain logic is "algorithms and math," but in my experience, it's less about the specific algorithms and more about precisely describing fuzzy human behavior.

It's that "precisely describing" and "easy to change in the future" parts that makes what programmers do different than what any good employee does.

(I do agree that there is some programming that is focused on algorithms and math, but it's in the minority, in my experience. Perhaps the type of work you do is focused on algorithms and math, but I believe that's a relatively small part of the software development ecosystem.)


No, I'm not talking about programming that requires calculations or programs written to solve mathematical problems. Programming at its core is about defining precise logical relationships between abstract objects and then writing algorithms to understand and modify those objects. This is a mathematical process and you should use mathematical thinking to do it. It may not always seem like it when the objects and relationships appear to be simple, but that is the core of programming.


Creating a higher level abstraction is something people have been trying to do for decades with so-called 4th-generation languages. At some point, abstracting away too much makes a tool too cookie-cutter, and suddenly deviating from it causes more difficulty.


Maybe it's not more abstraction we need, just automating the drudgery. Abstractions are limited - by definition they abstract things away, they are brittle.


Read: Ruby on Rails


I'd note that assembly, C, and Python didn't replace 'programming' but were expected to do so. I'd wager that what you now call 'detailed documentation/specs' will still be called programming in 10 or even 20 years.


If you could change a sentence in the documentation and then run a ~1min compilation to see the resulting software, it would be a very different kind of programming. I suppose it'll give a new meaning to Readme-Driven-Development.


Model-driven development and code generation from UML were once supposed to be the future. It will be interesting to see how much further this approach takes us.

Assuming ANNs resemble the way human brains function, you'd also expect them to introduce bugs. And so actual human beings would partake in debugging too.


I agree, I expect programmers will just move up the levels of abstraction. I enjoyed this recent blog post on the topic: https://eli.thegreenplace.net/2022/asimov-programming-and-th...


The "problem" is that as you move up the levels of abstraction, you need fewer people to do the same amount of work. Unless the complexity of the work scales as well. I've always felt that programmers would be the first class of knowledge workers to be put out of work by automation. This may be the beginning of the end for the programming gravy train.


> The "problem" is that as you move up the levels of abstraction, you need fewer people to do the same amount of work.

This will lower the entry barrier to developing software so more people will go into the field. Before you needed to know a programming language, now you will just have a dialogue with a language model.

> I've always felt that programmers would be the first class of knowledge workers to be put out of work by automation.

We've been automating our work for 70 years, and look how many programmers are employed now. The more we automate, the more capable our field becomes and more applications pop up.


>This will lower the entry barrier to developing software so more people will go into the field.

Indeed. The ideal future of programming is something out of Star Trek. I often notice how everyone on the ship is a programmer of a sort: they whip up a simulation as the problem warrants, regardless of their field. But in this future, the job of programmer basically doesn't exist. As a programmer, I should be allowed to have mixed feelings about that.


Let your imagination fly. We always want more than is possible; our wishes fill up any available volume like an expanding gas. Humans are going to be crucial to orchestrate AI and extract the most utility out of it.


> as you move up the levels of abstraction, you need fewer people to do the same amount of work

Yes, but the total amount of work (and surrounding complexity) also increases with it. Just look at the evolution of the software industry over the last few decades.


History isn't a great guide here. Historically the abstractions that increased efficiency begat further complexity. Coding in Python elides over low-level issues but the complexity of how to arrange the primitives of python remains for the programmer to engage with. AI coding has the potential to elide over all the complexity that we identify as programming. I strongly suspect this time is different.


Yes, this is how you increase prosperity (see: agricultural revolution, industrial revolution, etc). You can now create more with the same number of people.


On the other hand, as the value of an hour of programming increases, the quantity demanded may also increase.


Or you can do things at a faster pace and increase your productivity.


There aren't enough developers either way.


I disagree that programming is dying -- tools like Copilot will lead to a Renaissance in the art of computer programming by enabling a larger population to design programs and explore the implications of their design choices. I wrote a short essay [1] on the history of automated programming and where I think it is heading in the future.

[1]: https://breandan.net/public/programming_with_intelligent_mac...


You also get to specialize harder. You’ll be able to move into more advanced programming styles. I’m thinking of formally verifiable C/C++ programs for safety critical applications, and code using advanced concepts from programming language theory.

The programming languages of the future are going to make Rust look like Python. That'll be in part because you as an individual programmer won't be weighed down by as much boilerplate as you were before Copilot, AlphaCode, and the more advanced coding assistants yet to come.


> writing detailed documentation/specs

That's what code is.


I've been wondering this for a while:

In the future, code-writing AI could be tasked with generating the most reliable and/or optimized code to pass your unit tests. Human programmers will decide what we want the software to do, make sure that we find all the edge cases and define as many unit tests as possible, and let the AI write significant portions of the product. Not only that, but you could include benchmarks that pit AI against itself to improve runtime or memory performance. Programmers can spend more time thinking about what they want the final product to do, rather than getting mired in mundane details, and be guaranteed that portions of software will perform extremely well.
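Very roughly, I'm picturing a loop like this (purely a sketch; how the candidates get generated is the hand-wavy AI part, and all names here are made up):

    import time

    def passes(candidate, tests):
        """tests is a list of (args_tuple, expected_result) pairs."""
        try:
            return all(candidate(*args) == expected for args, expected in tests)
        except Exception:
            return False

    def pick_best(candidates, tests, bench_args):
        """Keep only candidates that pass every test, then prefer the fastest."""
        correct = [c for c in candidates if passes(c, tests)]
        def runtime(c):
            start = time.perf_counter()
            c(*bench_args)
            return time.perf_counter() - start
        return min(correct, key=runtime) if correct else None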

Is this a naive fantasy on my part, or actually possible?


> Is this a naive fantasy on my part, or actually possible?

Possible, yes, desirable, no.

The issue I have with all these end-to-end models is that they're a massive regression. Practitioners fought tooth and nail to get programmers to acknowledge correctness and security aspects.

Mathematicians and computer scientists developed theorem solvers to tackle the correctness part. Practitioners proposed methodologies like BDD and "Clean Code" to help with stability and reliability (in terms of actually matching requirements now and in the future).

AI systems throw all of this out the window by just tossing a black box at the wall and scraping up whatever sticks. Unit tests will never be proof of correctness - they can only show the presence of errors, not their absence.
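A deliberately silly illustration of that last point (made-up function, nothing to do with any real system):

    def absolute(x):
        # wrong for almost every negative input, yet...
        return x if x > -10 else 10

    def test_absolute():
        assert absolute(5) == 5
        assert absolute(0) == 0
        assert absolute(-10) == 10

    test_absolute()  # ...the whole test suite passes; absolute(-3) still returns -3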

You'd only shift the burden from the implementation (i.e. the program) to the tests. What you actually want is a theorem prover that proves functional correctness, in conjunction with integration tests that demonstrate the runtime behaviour if need be (i.e. profiling), and references that link the implementation to requirements.

The danger lies in the fact that we already have a hard time getting security issues and bugs under control with software that we should be able to understand (i.e. fellow humans wrote and designed it). Imagine trying to locate and fix a bug in software that was synthesised by some elaborate black box that emitted inscrutable code in the absence of any documentation and without references to requirements.


It seems to me that writing an exhaustive set of unit test cases is harder than writing the actual code.


Otherwise the AI will just overfit to the subset of cases covered by the unit tests.


First you need really good infra to make it easy to test multiple working solutions from the AI, but I think this will be bleeding edge in 2030.

EDIT: with in-memory DBs I can imagine an AI-assisted mainframe that can solve 90% of business problems.


And a second AI to generate additional test cases similar to yours (which you accept as also in scope) to avoid the first AI gaming the test.
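Property-based testing already gives a taste of this today, minus the AI: randomly generated cases checked against a stated property. A sketch with the hypothesis library (the swapbits function here is just a stand-in for whatever the first AI produced):

    from hypothesis import given, strategies as st

    def swapbits(x, i=5, j=16):  # stand-in implementation under test
        bi, bj = (x >> i) & 1, (x >> j) & 1
        return x if bi == bj else x ^ ((1 << i) | (1 << j))

    @given(st.integers(min_value=0, max_value=2**32 - 1))
    def test_swap_twice_is_identity(x):
        # swapping the same two bits twice must give back the original value
        assert swapbits(swapbits(x)) == x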


How surprising did you guys find this? I'd have said there was a 20% chance of this performing at the median+ level if I had been asked to predict things beforehand.


I am surprised, as OpenAI recently solved ~25% of easy problems and ~2% of competitive problems. Seems like DeepMind is ahead on this topic as well.

Actually I think Meta AI had some interesting discovery recently that could possibly improve NNs in general, so probably this as well.

I am not in the field, but I wonder if some other approaches like Tsetlin machines would be more useful for programming.


Somehow I have never heard of Tsetlin machines before this. Are you talking about this https://ai.facebook.com/blog/the-first-high-performance-self... result by MetaAI?


Probably not. Tsetlin machines have logic expressions instead of weights in NN, so it's easy to interpret them. I guess some meta algorithm could maybe work on top of them.

https://arxiv.org/abs/2102.10952

EDIT: Misread, I meant this from Meta: https://arxiv.org/abs/2105.04906 - not sure how much it's productised


I didn't find it very surprising, but then I tend to be more optimistic than average about the capabilities of transformer models and the prospect of general AI in the relatively near term.


I would have guessed around the same chance, this was surprising to me after playing around with copilot and not being impressed at all.


I would have said there is a ~0% chance of this happening within our lifetimes.


There is a prediction market called Metaculus.

TL;DR: In 2020, the community of 169 forecasters and the best forecasters were assigning ~15% probability that it would happen by July 2021.

More specifically, on Dec 31, 2016 in partnership with Center for the Study of Existential Risk, Machine Intelligence Research Institute, and The Future of Life Institute they asked:

How long until a machine-learning system can take a simple text description and turn it into a program coded in C/Python?

https://www.metaculus.com/questions/405/when-will-programs-w...

First 19 forecasters in March 2017 were predicting mid-2021, the best forecasters were predicting late 2024. When the question closed in 2020 the community was predicting January 2027 and the best forecasters were predicting March 2030.

The question resolved on July 2021 when Codex was published.

The community and the best forecasters were assigning ~15% that it would happen by July 2021.

I'm currently the 14th best forecaster there, and I was predicting 33% before July 2021. That was my last prediction, made in October 2018.

I'm also predicting 75% that we will have AGI by 2040 as defined in this question:

https://www.metaculus.com/questions/3479/when-will-the-first...

20% that it will happen before 2030.

There is also stronger operationalization:

https://www.metaculus.com/questions/5121/when-will-the-first...

My prediction here is 60% before 2040 and 5% before 2030.

I have also "canary in the coal mine" questions:

When will AI achieve competency on multi-choice questions across diverse fields of expertise? Community predicts 50% before 2030, I agree.

https://www.metaculus.com/questions/5276/ai-competence-in-di...

When will AI be able to learn to play Montezuma's Revenge in less than 30 min? Community predicts 50% before 2025, I think 50% before 2027.

https://www.metaculus.com/questions/5460/ai-rapidly-learning...


For some reason I forgot to check metaculus for this. Thanks for the reminder.


This is kind of neat. I wonder if it will one day be possible for it to find programs that maintain invariant properties we state in proofs. This would allow us to feel confident that even though it's generating huge programs that do weird things a human might not think of... well that it's still correct for the stated properties we care about, ie: that it's not doing anything underhanded.


Calling it now: If current language models can solve competitive programming at an average human level, we’re only a decade or less off from competitive programming being as solved as Go or Chess.

Deepmind or openAI will do it. If not them, it will be a Chinese research group on par with them.

I’ll be considering a new career. It will still be in computer science but it won’t be writing a lot of code. There’ll be several new career paths made possible by this technology as greater worker productivity makes possible greater specialization.


The problem is this view continues to treat software engineers as people who write code; that's not what my job is. My job is figuring out how to solve a business problem using technology, getting people on board with that solution, and then updating and refining it.

This viewpoint seems very similar to the idea that 3rd-generation languages would replace developers because programming would become so easy. It isn't about how easy it is to write code: I function as a limited mentat, taking all the possible requirements, tradeoffs and constraints, analyzing them, and then building the model; only then do I write out the code. The code artifact is not the value I add. The artifact is how I communicate the value to the world.

This doesn't make programmers redundant any more than Ruby, PHP, or Java made developers redundant by freeing them from having to manually track memory usage and pointers; it is at most a tool to reduce the friction of getting what is in my head into the world.

I control the code, and whoever controls the code controls the business. I possess the ability to make out the strands of flow control and see the future state of the application. For I am the Sr. Software Engineer, and I have seen where no Project Manager can see.

Apologies to Frank Herbert, I just finished listening to Dune.

EDIT:

I got off track at the end, but my point is that no matter how good the tools for developing the code are, they will never replace a software engineer any more than electric drills and power saws replace home builders. They merely elevate our work.


I actually agree with you on that. I had another comment further down the thread where I said that software engineering can’t be fully automated by anything short of artificial general intelligence.

As humans we have a coherent world model that current AI systems are nowhere near close to having.

That coherent world model is a necessary precondition for both understanding a business goal and implementing a program to solve it. AlphaCode can do the second part but not the first.

AlphaCode doesn’t have that world model and even if it did it still wouldn’t autonomously act on it, just follow orders from humans.

Competitive programming is going to be solved much earlier than programming in a business context will, because it’s completely independent of business requirements. It’s at most half as hard of a problem .


If I am given the ability to produce a program by formalizing the fuzzy requirements I am given, I will not hesitate to abuse this option. I can see a future where there is a "market" for specifications to be composed together.

Analyzing the requirements is a hard problem when we do it with our brains. But our job would be very different if all we had to do is write down the constraints and press a button to see the resulting error: invalid requirements, can't support this and that at the same time.


Three months ago in the Copilot thread I was saying

> in 5 years will there be an AI that's better than 90% of unassisted working programmers at solving new leetcode-type coding interview questions posed in natural language?

and getting pooh-poohed. https://news.ycombinator.com/item?id=29020401 (And writing that, I felt nervous that it might not be aggressive enough.)

There's this general bias in discussions of AI these days, that people forget that the advance they're pooh-poohing was dismissed in the same way as probably way off in the indefinite future, surprisingly recently.


The issue is that these techniques are growing in capability exponentially, while we have a habit of extrapolating linearly. Some saw the glaring deficits in Copilot and reasoned that linear improvement would still leave glaring deficits. I don't know that this bias can ever be corrected. A large number of intelligent people simply will never be convinced general AI is coming soon, no matter what evidence is presented.


> techniques are growing in capabilities exponentially, while we have a habit of extrapolating linearly

What does this even mean? How do you put a number on AI capability? You can say it is growing faster than people expect, but what is even exponential or linear growth in AI capability?


I take your point that the linear/exponential terminology is a bit dubious. But the simple way to make sense of it is just going by various benchmarks. E.g. the power-law relationship between the model accuracy and the model size: https://eliaszwang.com/paper-reviews/scaling-laws-neural-lm/


Yes, for very precise, comprehensive text descriptions of problems.

It will take a far, far more advanced AI to write such descriptions for real-world problems.

Writing requirements for a project is difficult work, and not for technical reasons, but for human reasons (people don't know what they want exactly, people have trouble imagining things they haven't seen yet, people are irrational, people might want something that is different from what they need, etc.)

In this regard, we are safe for a few more decades at least.


Yes, they have been trying to create 'sufficiently formal human-readable text' to spec out projects; not detailed enough to be executed by a computer, but formal and precise enough that humans know exactly what they are getting. That still doesn't work at all, and that is between humans. If the specs are clear enough, the act of programming is already mostly not the issue; however, they never are. I am looking forward to ML helping me write boring code (which CoPilot already does, but again, that's not really where time/energy is spent anyway) and protect against security issues, scalability issues and all kinds of bugs (it could rewrite algos it knows; it could recommend libraries that I should use instead of the crap I rolled myself, etc).


Fully automating software engineering won’t happen until AGI. As a good Yuddite I expect us to have bigger problems when that happens.

You need an agent with a large and coherent world model, in order to understand how your programs relate to the real world, in order to solve business tasks.

This isn’t something any program synthesis tech currently available can do, because none of it has a coherent world model.

GPT-3 comes closest to this, but isn’t able to engage in any kind of planning or abstract modeling, beyond semi coherent extrapolations from training data.

Maybe scaling up GPT by a few more orders of magnitude would work, by generating an emergent world model along the way.


What is a "Yuddite?" I tried Googling for it and got the impression it was LessWrong forum terminology for people who believed too strongly in LessWrong, but I couldn't find many references.


I believe he's referring to "luddites" -- a group of people who resisted technological innovation during the industrial revolution.


Luddite but mixed with "Eliezer Yudkowsky" who is a researcher working on the problem of friendly AI (or whatever they're calling it these days). Basically trying to prevent skynet.

The GP is saying that once we have AGI, then "AGI is going to make the human race irrelevant" outweighs "AGI makes software devs irrelevant".


That’s the idea.


I am a follower of Eliezer Yudkowsky.


I would actually argue the programmer's job has never been 100% writing the code; it's always been interpreting, fixing and decoding the ideas of others.


The older I get the more I see it has not been about programming for most tasks for quite a long time. In the early 80s it was a bit more (but not even much more); at that time as well I spent most of my time debugging and changing behaviour slightly (but in a lot of pages) instead of just cranking out huge bags of code.


I would argue that we figured this out over 50 years ago but oddly enough some people still hold onto the idea.


A programming genie that grants programming wishes to the general public. Since most of what I do on a daily basis is engineering solutions based on tradeoffs, I can only imagine the number of programmers needed to debug solutions given by the programming genie in response to poorly described feature requests.

If we become mechanics of the software AI vehicles of the future, so be it.


AI is being aggressively applied to areas where AI practitioners are domain experts. Think programming, data analysis etc.

Programmers and data scientists might find ourselves among the first half of knowledge workers to be replaced and not among the last as we previously thought.


I'm already anticipating having the job title of "Query Engineer" sometime in the next 30 years, and I do NLP including large scale language model training. :(


One of the big venture capitalists predicted “prompt engineering” as a future high paid and high status position.

Essentially handling large language models.

Early prompt engineers will probably be drawn from "data science" communities and will be similarly high status, well paid (though not as well paid), and require less mathematical knowledge.

I’m personally expecting an “Alignment Engineer” role monitoring AI systems for unwanted behavior.

This will be structurally similar to current cyber security roles but mostly recruited from Machine Learning communities, and embedded in a broader ML ecosystem.


I like this description better, considering that companies like Anthropic are working specifically on alignment and AI safety. Given that the team actually spun out of DeepMind, it is interesting.


Alignment is going to be a giant industry and will also include many people not originally in STEM. The humanities and "civil society" will both have their contributions to make.

It's likely that alignment jobs won't themselves be automated, because no one will trust AI systems to align themselves.


>“Alignment Engineer” role monitoring AI systems for unwanted behavior.

ha, I know people already doing this..


The thing is, competitive programming (CP) is a completely different discipline/subject with its own trivia knowledge and tricks. CP uses computer science the same way as, e.g., biology uses mathematics. It has very little in common with real-world software development.


I said as much in another comment.

Automating the software development profession proper is going to be much harder and will require autonomous agents with coherent world models, because that’s what you need to act in a business context.


This is in line with what other code generation AI's have accomplished.

To reach the average level at Codeforces you need to be able to apply a standard operation like a sort, or apply a standard math formula, as the first 1-2 problems in the easy contests are just that. It is impressive that they managed to get this result in real contests with real, unaltered questions and see that it works. But generalizing this to harder problems isn't as easy, as there you need to start to devise original algorithms instead of just applying standard ones; for such problems the model needs to understand computer science instead of just mapping language to algorithms.


Calling it now: Your prediction is off by an order of magnitude or two (10 years -> 100 years, or 1000 years)


It can be really tempting to think about research progression on a "linear" timescale but more often than not it eventually ends up following an "exponential" curve because of technical debt. And there appears to be a _lot_ of techniques used here which we don't fully understand.

I wouldn't be surprised if a specifically engineered system ten years from now wins an ICPC gold medal but I'm pretty sure that a general purpose specification -> code synthesizer that would actually threaten software engineering would require us to settle a lot of technical debts first -- especially in the area of verifying code/text generation using large language models.


It doesn't even have to be average human.

Let's say AI only gets to 10% (or 20% or 30% or whatever, it doesn't really matter), that's a huge number of jobs being lost.

Imagine having a machine write all the "simple/boring" code for you. Your productivity will go through the roof. The smartest programmer who can most effectively leverage the machine could replace many hundreds of programmers.

I should brush up on my plumbing and apply for a plumbing license soon. (I think plumbing is safer than electrical work, because many CS people have good EE foundations.)


You're extrapolating across very different types of problems. Go and Chess have unlimited training data. Competitive programming does not.


To me, that's actually one of the more interesting questions. It's possible to grade the output of the AI against objective criteria, like whether it runs and what resources it consumes (RAM, CPU time, and, of particular interest to me, parallel scaling, as GPU algorithms are too hard for most programmers). To what extent can you keep training by having the AI generate better and better solutions to a relatively small input pool of problems? I skimmed the paper to see how much they relied on this but didn't get a clear read.


Depending on what you want to do, you can either choose an industry with very fuzzy requirements (to stay near the programming side) or one with very complex but with strict requirements (to benefit from those coding robots). I guess we will need simulators for most of what we do in order to train those robots.


Didn’t we all (collectively) have this discussion the last time someone put the math functions in a library and rendered math calculation programmers obsolete?


>> There’ll be several new career paths made possible by this technology as greater worker productivity makes possible greater specialization.

Can you list a few?


How long before it can write the code without plagiarizing code from online?


How long before the typical human coder can do so?


Are you saying you cannot write code from scratch?


Not the parent comment, but I cannot code from scratch (outside of very simple and small applications). Competitive Programming is at about the limit of what I can do without looking things up, and only because I've had practice specifically for that kind of artificial environment.


I can write some code from scratch, but my ability to write code is improved by an order of magnitude when I can refer to online resources, including example code.


Humans study CS for 5 years, reading code online, to be able to solve these problems.


Don't worry, there are a lot of much simpler jobs, like drivers or cashiers that will surrender to AI before coder's job does. So UBI will be implemented long before that happens.


I wouldn't be so sure. Programmers (and drivers and cashiers) can "survive" in poverty like millions of others already do. This transformation is coming in waves that keep the proverbial frog in the pan.


It reminds me that the median reputation on StackOverflow is 1. All AlphaSO would have to do is register to receive median reputation on SO ;) (kidding aside, AlphaCode sounds like magic)

Inventing relational DBs hasn't replaced programmers, we just write custom DB engines less often. Inventing electronic spreadsheets hasn't deprecated programmers, it just means that we don't need programmers for corresponding tasks (where spreadsheets work well).

AI won't replace programmers until it grows to replace the humanity as a whole.


>AI won't replace programmers until it grows to replace the humanity as a whole.

Yes, but after seeing this progress in the former, my estimate of the time remaining until the latter has just shortened significantly.


Given close to zero chances of a safe AI, I'm optimistic that AI is a much tougher problem and that we are not significantly closer to the solution than, e.g., in the 60s, when computer vision was a summer project.

There is progress in certain domains (such as image recognition), but outside specialized tasks, gigantic language models look like no more than impressive BS generators.


I don’t even think the “will AI replace human programmers” question is that interesting anymore. My prediction is that a full replacement won’t happen until we achieve general artificial intelligence, and have it treat programming as it would any other problem.

Elsewhere ITT I’ve claimed that to fully automate programming you also need a model of the external world that’s on par with a humans.

Otherwise you can’t work a job because you don’t know how to do the many other tasks that aren’t coding.

You need to understand what the business goals are and how your program solves them.


> AlphaCode placed at about the level of the median competitor,

In many programming contests, a large number of people can't solve the problem at all, and drop out without submitting anything. Frequently that means the median scoring solution is a blank file.

Therefore, without further information, this statement shouldn't be taken to be as impressive as it sounds.


> Creating solutions to unforeseen problems is second nature in human intelligence

If this is true then a lot of the people I know lack human intelligence...


I am always surprised by the amount of skepticism towards deep learning on HN. When I joined the field around 10 years ago, image classification was considered a grand challenge problem (e.g. https://xkcd.com/1425/). 5 years ago, only singularity enthusiast types were envisioning things like GPT-3 and Copilot in the short term.

I think many people are uncomfortable with the idea that their own "intelligent" behavior is not that different from pattern recognition.

I do not enjoy running deep learning experiments. Doing resource-hungry empirical work is not why I got into CS. But I still believe it is very powerful.


This scepticism shouldn't surprise you. Not being sceptical is just an indicator that you've not been in the field for long enough.

30 years ago, the end of programming was prophesised, because 5th generation languages (5GL) and visual programming would enable everybody to design and build software.

20 years ago, low-code and application builders were said to revolutionise the industry and allow people in business roles to build their applications using just a few clicks. End-to-end model-driven design and development (e.g. using Rational Rose and friends) were to put an end to bugs and maintenance problems.

10 years ago it was new programming languages (e.g. Rust, Go, Swift, ...) and a shift to functional programming that was advertised as being "the future".

Today it's back to "no code", e.g. tool-(AI-)driven development that's all the rage.

It's not so much being "uncomfortable" or clinging to the exceptionalism of the human mind. It's just experience. Every decade saw its great big hype and technological breakthrough, but the lofty promises didn't hold water.

Note that this doesn't mean nothing changed - model driven development still has its niche, visual programming is widely used in video production, rendering and game development. Features of functional programming have been added to many "legacy" languages and many of the newly introduced programming languages have become mainstream.

The same will happen with AI-generated software: a large portion of the "mechanical" process of programming will be done by AI. Large and complex software systems with changing requirements, however, will still be designed and implemented primarily by people.

Programming is a conversation between humans and machines. AI will in many cases shift the conversation closer to the human side, but fundamentally it'll still be the same thing.

I like to think of it as the difference between writing your program in assembly and writing it in Haskell; different approaches, same basic activity.


I think you and GP are talking about different things.

You're saying a lot of so-called technological breakthrough is more hype than substance. The GP is saying that people tend to dismiss actual breakthroughs as mundane stuff. Once $method is published that solves $hardproblem, people comment as if $hardproblem was never hard in the first place, and moves the goalposts a bit saying "if $harderproblem can be solved, then that would be profound".

I think the truth is (obviously) somewhere in between. Btw, I dare you to go back to a 1980s programming environment and tell me that the programming paradigm shifts are just hype :D My one-liner Python scripts can probably do much more than an average coder writing assembly... and given modern hardware, my code runs faster too!


> Btw, I dare you go back to a 1980s programming environment and tell me that the programming paradigm shifts are just hype

Been there, done that. I did consulting for a huge company a few years back. They ran their entire business on IBM mainframes running an ancient VSE-based OS.

I had the pleasure of maintaining IBM HLASM (high level assembly) programs with change logs dating back to 1982.

Working with those programs (they were excellently documented) using ICCF wasn't much different from using vim really and the language itself is by far the best assembly dialect I've ever worked with (especially the powerful macro system).

Sure, productivity is much higher in higher-level languages, if only because you need to write less code. Your Python one-liner, however, can still be as wrong as 100 lines of assembly or 20 lines of C if you make the wrong assumptions.

That's the part that just doesn't change, no matter the underlying technology: garbage in - garbage out. Someone has to write the problem specification and more often than not, that's the part where things start to go sideways.

It's also one of the reasons model-driven development didn't really catch on: MDD only works if you know your problem domain to a T beforehand, because iterating models is a pain; that's rarely the case, though, as usually code and understanding of the problem evolve side by side.

Explaining a problem precisely, concisely, and correctly so that an AI can synthesise software that hopefully implements it correctly is not as great a leap forward as you might think.

I'd really suggest taking a look at Rational Rose and similar platforms to get a glimpse of what automated code generation looked like 25 years ago - even back then you rarely had to write actual code (provided the problem domain was well-known and well-specified), even without AI.


Seems to me that this accelerates the trend towards a more declarative style of programming, where you tell the computer what you want to do, not how to do it.


Do I understand it correctly that it generated (in the end) ten solutions that then were examined by humans and one picked? Still absolutely amazing though.


No human examination was done.

But it generated 10 solutions which it ran against the example inputs, and picked the one that passed.

Actually I'm not sure if it ran the solutions against the example inputs or the real inputs.


They used the real inputs. The example inputs were used to filter out which candidates to submit for the 10 tries.


No, they gave the algorithm 10 tries and tested all of them, and said that it was solved if any one of them worked.
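
Concretely (as I understand the preprint): the example I/O pairs from the problem statement are only used as a cheap filter to decide which ~10 candidates get submitted; the submissions are then judged against the hidden test data, and the problem counts as solved if any one of them passes. A toy sketch of the filtering side, with hypothetical helper names and assuming each candidate is a standalone Python script:

    import subprocess

    def passes_examples(candidate_path, examples, timeout=2.0):
        # examples: list of (stdin_text, expected_stdout) pairs from the statement
        for stdin_text, expected in examples:
            try:
                run = subprocess.run(["python3", candidate_path],
                                     input=stdin_text, capture_output=True,
                                     text=True, timeout=timeout)
            except subprocess.TimeoutExpired:
                return False
            if run.returncode != 0 or run.stdout.strip() != expected.strip():
                return False
        return True

    def pick_submissions(candidate_paths, examples, budget=10):
        # Keep only candidates that reproduce the example outputs,
        # then submit at most `budget` of them.
        survivors = [p for p in candidate_paths if passes_examples(p, examples)]
        return survivors[:budget]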


It would be interesting if a future 'AlphaZeroCode' with access to a compiler and debugger could learn to code, generating data using self-play. Haven't read the paper yet; it seems like an impressive milestone.


Does this mean that we can all stop grinding leetcode now?


What I always find missing from these Deep Learning showcase examples is an honest comparison to existing work. It isn’t like computers haven’t been able to generate code before.

Maybe the novelty here is working from an English-language specification, but I am dubious about just how useful that really is. Specifications are themselves hard to write well too.

And what if the “specification” were some Lisp code testing a certain goal: is this any better than existing Genetic Programming?

Maybe it is better but in my mind it is kind of suspicious that no comparison is made.

I love Deep Learning but nobody does the field any favors by over promising and exaggerating results.


I have fiddled with genetic programming. I don't think there is a good solution for a useful metric for comparing one code generator against another, so I don't think DeepMind should care.

Most of the code generated by my genetic programming algos doesn't compile. Very occasionally the random conditions exist to allow it to jump over a "local maximum" and come up with a useful candidate source code. Sometimes the candidates compile, run, and produce correct results.

The time it takes to run varies vastly with parameters (like population, how the mutation function works, how the fitness function weights/scores, etc).
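
For concreteness, the generational loop itself is tiny; all of the interesting (and expensive) decisions live in the fitness, mutation, and crossover functions you plug in. A toy sketch (assumes a population of at least a handful of candidates; the fitness call is what compiles and runs each one, which is where all the time goes):

    import random

    def evolve(population, fitness, mutate, crossover, generations=200, elite=2):
        # Toy generational loop: score everyone, carry the elites over unchanged,
        # then refill the next generation with mutated crossovers of parents
        # drawn from the fitter half of the current one.
        for _ in range(generations):
            ranked = sorted(population, key=fitness, reverse=True)
            next_gen = ranked[:elite]
            parents = ranked[:max(2, len(ranked) // 2)]
            while len(next_gen) < len(population):
                a, b = random.sample(parents, 2)
                next_gen.append(mutate(crossover(a, b)))
            population = next_gen
        return max(population, key=fitness)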

Personally I really like that these DeepMind announcements don't get lost in performance comparisons, because inevitably those would get bogged down in complaints like "the other thing wasn't tuned as well as this one was". Let 3rd party researchers who have access to both do that work, independently.


If you think tuning GP parameters is challenging, wait until you try tuning hyperparameters for a DL model!

It is just a press release, to be fair to DeepMind, and I guess they can promote themselves however they wish.

My original comment was more from the context of seeing neural network models in practice perform barely any better, if at all, than classic ML models. Just as those comparisons were revealing, I suspected this use case might look similar when set against another classic technique.

GP is certainly not the shining star of AI right now but it is actively researched and perusing Google scholar on the subject will show you plenty of interesting, but less heralded, results.

There are probably several meaningful metrics for this problem that could be examined. If nothing else, it is a simple matter of grading the solutions of each, like a university assignment. Also, classical techniques are typically less resource-intensive than neural network methods; the energy savings alone, when considered at production scale, would be significant.


Specs are hard to write because the person interpreting them may not understand you, but it's the iteration time and cost that kills you.

Make me a sandwich -> two weeks and $10k isn't viable

Make me a sandwich -> 2 seconds and free, totally viable


To me, coding in imperative languages is one of the hardest things to produce an AI for with current approaches (CNNs, MCTS, and various forms of backpropagation). Something like Cyc would seem to be a lot more promising…

And yet, I am starting to see (with GitHub’s Copilot, and now this) a sort of “GPT-4 for code”. I do see many problems with this, including:

1. It doesn’t actually “invent” solutions on its own like AlphaZero, it just uses and remixes from a huge body of work that humans put together,

2. It isn’t really ever sure if it solved the problem, unless it can run against a well-defined test suite, because it could have subtle problems in both the test suite and the solution if it generated both

This is a bit like readyplayer.me trying to find the closest combination of noses and lips to match a photo (do you know any open source alternatives to that site btw?)

But this isn’t really “solving” anything in an imperative language.

Then again, perhaps human logic is just an approximation built from operations on low-dimensional vectors, able to capture simple “explainable” models, while the AI classifiers and adversarial training produce far bigger vectors that help model the “messiness” of the real world and also find simpler patterns as a side effect.

In this case, maybe our goal shouldn’t be to get solutions in the form of imperative language or logic, but rather to unleash the computer on “fuzzy” inputs and outputs where things are “mostly correct 99.999% of the time”. The only area where this could fail is when some intelligent adversarial network exploits weaknesses in that 0.001% and makes it more common. But for natural phenomena it should be good enough!


Can you write more about how Cyc would help? The idea behind Cyc is cool but I don’t think I’ve seen anyone discuss using it for program synthesis.


If you want some video explanation https://youtu.be/Qr_PCqxznB0


And this is how we reach the technological singularity and how programmers become as out-of-demand as piano tuners: self-programming systems.

AI will eat any and all knowledge work because there's very little special a human can do that a machine won't be able to do eventually, and much faster and better. It won't be tomorrow, but the sands are inevitably shifting this way.


It is obvious to me that computer programming is an interesting AI goal, but at the same time I wonder if I'm biased, because I'm a programmer. The authors of AlphaCode might be biased in this same way.

I guess this makes sense though, from a practical point of view. Verifying correctness would be difficult in other intellectual disciplines like physics and higher mathematics.


Just make it output a proof together with the program.


That won't work because the systems aren't trained on proofs and proper theorem provers don't work that way either.


I am wondering whether this result can create a type of loop that can self-optimize.

We have AI to generate reasonable code from text problem description.

Now what if the problem description text is to generate such a system in the first place?

Would it be possible to close the loop, so to speak, so that over many iterations:

- text description is improved

- output code is improved

Would it be possible to create something that converges to something better?


I am actually trying this. Basically by asking questions to the AI and teaching it to generate code / google when it doesn't know something. Another process checks whether the code is valid and either asks the AI for more context or executes the code and feeds the result back to a file :)
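
Roughly, the checking process is a loop like this (a toy sketch; ask_model and search_web are placeholders for whatever model and search API you wire in):

    import subprocess

    def refine(problem, ask_model, search_web, max_rounds=5):
        # Ask the model for code, try to run it, and feed errors plus
        # search results back in as extra context until something works.
        context = problem
        for _ in range(max_rounds):
            code = ask_model("Write a Python script for:\n" + context)
            try:
                run = subprocess.run(["python3", "-c", code],
                                     capture_output=True, text=True, timeout=30)
            except subprocess.TimeoutExpired:
                context += "\nThe previous attempt timed out."
                continue
            if run.returncode == 0:
                return code, run.stdout      # code ran cleanly; keep it
            context += "\nError:\n" + run.stderr
            context += "\nNotes:\n" + search_web(run.stderr)
        return None, None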


I think one can make the problem "differentiable" via some heuristics: if you have a NN trained to rate code quality, plus some understanding of what should be used for each type of problem (memory and speed), and it can classify a problem into a group and then rate candidate solutions, that should be enough to guide the process (in competitive programming).


Do you have a blog or a github or something? This sounds really neat.


I agree with most of the comments I've read in this thread. Writing code to solve a well defined narrowly scoped problem isn't that hard or valuable. It's determining what the problem actually is and how software could be used to solve it that is challenging and valuable.

I would really like to see more effort in the AI/ML code generation space being put into things like code review, and system observation. It seems significantly more useful to use these tools to augment human software engineers rather than trying to tackle the daunting and improbable task of completely replacing them.

*Note: as a human software engineer I am biased


Next they can train it on kaggle, and we'll start getting closer to the singularity


I just hope that this shows how useless competitive programming is, if it can be replaced by a Transformer model.

Additionally, people should REALLY rethink their coding interviews if they can be solved by a program.


Hey, honest question: how does one get into competitive programming? I imagine it goes far beyond just leetcoding, but honestly I don't even know where to start.


Most people here are programmers (or otherwise involved in the production of software). We shouldn't look at RPA and other job automation trends dispassionately. SaaS valuations aren't where they are (and accounting doesn't treat engineering salaries as cost of goods sold) because investors believe that they will require armies of very well-paid developers in perpetuity.


what?


> In our preprint, we detail AlphaCode, which uses transformer-based language models to generate code at an unprecedented scale, and then smartly filters to a small set of promising programs

if you're using a large corpus of code chunks from working programs as symbols in your alphabet, i wonder how much entropy there actually is in the space of syntactically correct solution candidates.



I suspect these code generating AIs will bring the singularity at some point in the future. Even if we don’t manage to create an artificial general intelligence, they will. I imagine they will learn to code on super human levels through self play just like AlphaGo and AlphaZero did. This will be awesome.


Between developments like this (and Copilot; is there a generally accepted word for this class of things, e.g. "AI Coders"?) and the move toward fully remote, I predict the mean software engineering salary in the United States will be lower in 10 years (in real dollars) than it is today.


I think this is a safe bet, but I would make it with or without the presence of AI Coders. We're clearly in the middle of Tech Bubble 2.0 and it's sure to pop in the next 10 years (and probably much sooner, given the recent crypto and NASDAQ rumblings).


People have been talking about tech bubbles for years; there might be a small financial bubble due to the money printing in recent years, but I'm not seeing a big bust coming like dot-com. Tech compensation is probably more influenced by the discrepancy between locations. Once people figure out how to properly handle remote workers and remote teams (which is happening due to Covid), global compensation levels will probably even out.


Great. Now the only thing remaining is POs being able to come up with a clear spec, and I'm out of a job.


Since they used the tests, this is not something you can do if you don't have a rich battery of tests.

Perhaps many problems are something like finite automata, and the program discovers the structure of the finite automaton and also an algorithm for better performance.


>> AlphaCode ranked within the top 54% in real-world programming competitions, an advancement that demonstrates the potential of deep learning models for tasks that require critical thinking.

Critical thinking? Oh, wow. That sounds amazing!

Let's read further on...

>> At evaluation time, we create a massive amount of C++ and Python programs for each problem, orders of magnitude larger than previous work. Then we filter, cluster, and rerank those solutions to a small set of 10 candidate programs that we submit for external assessment.

Ah. That doesn't sound like "critical thinking", or any thinking. It sounds like massive brute-force guessing.

A quick look at the arxiv preprint linked from the article reveals that the "massive" amount of programs generated is in the millions (see Section 4.4). These are "filtered" by testing them against program input-output (I/O) examples given in the problem descriptions. This "filtering" still leaves a few thousand candidate programs, which are further reduced by clustering to "only" 10 (which are finally submitted).

So it's a generate-and-test approach rather than anything to do with reasoning (as claimed elsewhere in the article) let alone "thinking". But why do such massive numbers of programs need to be generated? And why are there still thousands of candidate programs left after "filtering" on I/O examples?

The reason is that the generation step is constrained by the natural-language problem descriptions, but those are not enough to generate appropriate solutions because the generating language model doesn't understand what the problem descriptions mean; so the system must generate millions of solutions hoping to "get lucky". Most of those don't pass the I/O tests so they must be discarded. But there are only very few I/O tests for each problem so there are many programs that can pass them, and still not satisfy the problem spec. In the end, clustering is needed to reduce the overwhelming number of pretty much randomly generated programs to a small number. This is a method of generating programs that's not much more precise than drawing numbers at random from a hat.
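
For what it's worth, the clustering step is just more of the same behavioural testing: as I read the preprint, additional test inputs are generated, the surviving candidates are grouped by whether they produce identical outputs on those inputs, and one program from each of the largest groups is submitted. A toy version (run_candidate and extra_inputs are placeholders):

    from collections import defaultdict

    def cluster_and_pick(candidates, run_candidate, extra_inputs, k=10):
        # Group programs by their output "signature" on the extra inputs,
        # then take one representative from each of the k largest groups.
        buckets = defaultdict(list)
        for program in candidates:
            signature = tuple(run_candidate(program, x) for x in extra_inputs)
            buckets[signature].append(program)
        largest_first = sorted(buckets.values(), key=len, reverse=True)
        return [group[0] for group in largest_first[:k]]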

Inevitably, the results don't seem to be particularly accurate, hence the evaluation against programs written by participants in coding competitions, which is not an objective measure of program correctness. Table 10 in the arxiv preprint lists results on a more formal benchmark, the APPS dataset, where it's clear that the results are extremely poor (the best-performing AlphaCode variant solves 20% of the "introductory" level problems, though it outperforms earlier approaches).

Overall, pretty underwhelming, and a bit surprising to see such lackluster results from DeepMind.


The year is 2025, Google et al. are now conducting technical on-site interviews purely with AI tools and no human bias behind the camera (aside from GPT-3's quirky emotions). The interview starts with a LC hard, you're given 20 minutes -- good luck!


I think Amazon already tried this and it had surprisingly racist results


I think CoPilot, etc will be revolutionary tools AND I think human coders are needed. Specifically I love CoPilot for the task of "well specified algorithm to solve problem with well-defined inputs and outputs". The kind of problem you could describe as a coding challenge.

BUT, our jobs have a lot more complexity

- Local constraints - We almost always work in a large, complex existing code base with specific constraints

- Correctness is hard - writing lots of code is usually not the hard part, it's proving it correct against amorphous requirements, communicated in a variety of human social contexts, and bookmarked.

- Precision is extremely important - Even if CoPilot can spit out a correct solution 99% of the time, the 1% of the time it doesn't creates a bevy of problems

Are those insurmountable problems? We'll see I suppose, but we begin to verge on general AI if we can gather and understand half a dozen modalities of social context to build a correct solution.

Not to mention much of the skill needed in our jobs has much more to do with soft skills, and the bridge between the technical and the non technical, and less to do with hardcore heads-down coding.

Exciting times!


I think it would be interesting to train a system end-to-end on assembly code instead of various programming languages. This would make it a much more generic compiler.


Oh sweet! When can we skip the bullshit puzzle phone screens?


Ali Group CAPTCHAs or Android unlock?


The interesting stuff happens once AlphaCode gets used to improve the code of AlphaCode.


"And so in 2022 the species programmus programmicus went extinct"


I would stop programming if all we needed to write was unit tests :p


To compensate, lots of people would start programming if that happened though. Many scientists would be interested in solving their field's problems so easily - certainly maths would benefit from it.


Wasn't this the motivation for Prolog?


What about finding bugs, zero-day exploits?


Has nobody yet asked it to write itself?


I am a little bitter that it is trained on stuff that I gave away for free and will be used by a billion dollar company to make more money. I contributed the majority of that code before it was even owned by Microsoft.


Can you elaborate and give some history? What code did you contribute, and how did it end up being used by Microsoft and then DeepMind?


> We pre-train our model on selected public GitHub code and fine-tune it on our relatively small competitive programming dataset.

But since the code was 'selected' you don't know if your code was used. However, they seem to have used Python and C++, so my code is probably not part of it.


Paying it forward, it will help others in turn.


Yes it will help the already powerful players disproportionately.


They open-sourced AlphaFold for anyone to use commercially, despite a big financial incentive to keep it private and use it in their new drug discovery lab. No idea how this works or differs from AlphaFold, but I imagine they'll do the same here if possible.


Only after another lab made their own open source one that was comparable.


The problem is not really that microsoft owns github, or that licenses allow corporations free use, but that the tech giants are so big and have so much power.


Wake me up when an AI creates an operating system on the same level of functionality as early-years Linux.


That will happen faster than you can conceive because you won't be aware of the progress until it is announced.

And, have you tried polling? I hear it keeps the CPU warm in winter. Interrupts are so ... this just in, Nike's stock jump 3% ... Where was I? Did I save my task context properly? Did I reenable interrupts?


Genuine question: what are the reasons to be a software engineer without much ML knowledge in 2022? Seems like a wake-up call for developers.


7 months ago, I asked natfriedman the same question, of which he responded: "We think that software development is entering its third wave of productivity change. The first was the creation of tools like compilers, debuggers, garbage collectors, and languages that made developers more productive. The second was open source where a global community of developers came together to build on each other's work. The third revolution will be the use of AI in coding. The problems we spend our days solving may change. But there will always be problems for humans to solve."

https://news.ycombinator.com/item?id=27676266&p=2


> what are the reasons to be a software engineer without much ML knowledge in 2022.

I'm not quite sure what you're asking, but my reason is that I do not enjoy working on/with ML. I'd personally rather quit the industry.

But I work in embedded/driver development. I do not worry about ML models replacing me yet, but if I were just gluing together API calls I would be a bit worried and try to specialize.


Genuine question: what are the reasons to be a carpenter without much robotics / automation knowledge in 2022? Seems like a wake-up call for carpenters.


Find something that’s hard and interesting. Someone will probably have a business trying to solve it and will hire you.


I hope you are right, but just to answer the question: all those other AI winters.


That's a good meditation. I think the winters were driven more by research dichotomy; for example, Marvin Minsky's critique of the perceptron really slowed the research down by 10 years. The advances made thus far have so much commercial relevance that the companies invested don't look like they are going to stop soon. But it's a valid point. It looks like there is more upside in subsets of computing like quantum computing, web3, the metaverse, etc. than in being a regular front-end engineer.



