I think the biggest problem copilot will have in practice gaining traction is that verifying correctness isn’t any faster than writing the code yourself in many cases. The Easter(y) function is a classic example - it would be way faster to write that than to try and verify that there are no subtle bugs.
Copilot is by design trying to give you something that _looks_ correct without caring whether it actually is - so it optimises for real-looking but subtly buggy code, which is the worst kind of broken code.
Years ago, I ran into a similar problem working on a program that was doing named entity recognition to assist humans with data entry. We found that, for our purposes, there seemed to be no (realistic) accuracy threshold beyond which the tool would save clients money, because double-checking the machine-generated output was inherently more work than doing it by hand.
So we pivoted the product to being something you would run on full auto, for situations where you didn't need a high level of quality. I'm not sure if that option is available to programmers, though.
Maybe Copilot could be turned into a context-aware search engine? That is, invoking it would return a list of examples that it thinks do the same thing as what you're trying to do, based on your work-in-progress code.
I honestly think this is where things are going. Man-machine partnership on creative tasks. Not only is it more amenable to current models, it’s higher leverage and less likely to be completely automated away.
> I think the biggest problem copilot will have in practice gaining traction is that verifying correctness isn’t any faster than writing the code yourself in many cases.
Humorously, this is a similar problem to the one autonomous driving has. Being alert when something goes wrong randomly is more difficult than being alert all of the time.
However in the real world people don't always write bugless code and aren't always alert when driving. Therefore these AI assistants can still have a net positive result as long as they are better than the average performance of a human. Of course three quarters of us probably believe that "I'm not an average programmer so Copilot would only make me worse."
Personally I think the more interesting angle is the trolley problem this creates. People will die in self-driving car accidents and bugs will exist in AI generated code. Those people and bugs are different than the people who will die in human caused accidents and the bugs in human written code. If the number and severity of the results are lessened by the computer, are we willing to forgive the damage directly caused by the AI that falls short of perfection?
I’m a very mediocre developer and if Copilot is any better than me at writing code, then I will have a hard time understanding whatever Copilot throws at me. I cannot just save, commit and push whatever Copilot suggests... so it’s faster if I write the code myself than to review Copilot’s code.
> I cannot just save, commit and push whatever Copilot suggests
I don't think that is the goal just like the goal of the current generation of self-driving cars isn't for you to be able to take a nap in the driver's seat.
Imagine you need some code that would have traditionally taken you an hour to write. I believe the goal of Copilot is to generate the code for you as a starting point. Maybe you don't understand that code immediately and it takes you 20 minutes to figure out what is going on. Then you spend another 20 minutes tweaking it for your exact purpose. If that results in code of similar quality to what you would have written alone, then Copilot makes you more efficient by saving you 20 minutes.
I haven't had a chance to try it yet, but I'm skeptical of the time savings claim of copilot in its current form. At least working on a large code base, the things that take time are:
1) Understanding the data model and logic of the code that interacts with the component I'm working on
2) Refactoring existing code to accommodate my change gracefully
3) Writing and fixing tests
4) Working through the code review process
For a major new piece of functionality, add
5) Put together a design document and review it with relevant stakeholders
The part that is fast is actually writing the code, as once I've done steps 1 and 2 (and sometimes 5), writing the new code itself is nearly trivial. I don't see how copilot could possibly help me in a meaningful way on these kinds of tasks.
The work that seems most amenable to copilot help is things like utility functions for transforming data/calculating things from it, as in the "Easter" example from the article. But here I would rather use a well-tested library, or if one doesn't exist (or I can't use it), write well documented code that I understand thoroughly.
Put another way, the work that copilot seems most adept at is "junior developer" work performed by people operating at a junior level. But if they delegate "figuring things out" to copilot, they're just going to spend way more time in code review. Or worse, they're not going to spend that time, and will learn nothing/stagnate in their professional progression.
Ever since the advent of satellite nav I've become terrible at learning my way around cities. I'm okay with the loss, since I can generally rely on having nav when I need it, and navigating cities isn't one of my core responsibilities. Copilot is not reliable (it won't answer your question every time), and it automates something that is your actual job. A junior dev might be better served by spending the extra 20 minutes muddling through and building their skillset.
> I don't think that is the goal just like the goal of the current generation of self-driving cars isn't for you to be able to take a nap in the driver's seat.
I think the issue is that the MVP from a customer perspective is, effectively, being able to take a nap in the driver's seat. From a research perspective there are obviously intermediate milestones, but that doesn't make it fit for what people would want to use it for. Same goes for Copilot.
>I think the issue is that the MVP from a customer perspective is, effectively, being able to take a nap in the driver's seat.
Maybe that is a requirement for some users, but it isn't a universal one. Plenty of people see a benefit in assistive technology that isn't complete such as adaptive cruise control or boilerplate/scaffolding dev tools.
It also raises the ethical question of whether these creators are responsible for the misuse of their products. Is it enough for them to say "This is how this product should be used. You are on your own if you use it outside these settings."? Holding developers responsible for the misuse of their software could create an actual slippery slope. Where is the line drawn? Do we start punishing people who create encryption algorithms because someone used the encryption to hide evidence of a crime?
> It also raises the ethical question of whether these creators are responsible for the misuse of their products. Is it enough for them to say "This is how this product should be used. You are on your own if you use it outside these settings."?
I don't think you have to answer the ethical question to address the level of readiness that Copilot or self-driving cars are at. It definitely raises the question, but you don't have to answer it to talk about suitability for use cases.
As you say, it might address the requirements of some specific people. My argument is that Copilot is not good enough yet for the bulk of imagined use cases, whether or not you call that MVP, and I think the post makes a good argument about why.
So, you can at least make a theoretical argument for why self-driving cars can do a better job than humans: they are always alert and paying attention, and the set of things they're trying to accomplish is concrete, can reasonably be presumed a priori and baked into the model, and is reasonably well specified, so that we hopefully don't need hard AI to be successful.
By contrast, Copilot doesn't necessarily have any idea what you're trying to do. So it can, to an approximation, pattern match on what you've already written, and spit out valid code that is "inspired" by things it's seen in the past. But it doesn't actually know what you're trying to do. It doesn't know what your acceptance criteria are, or what invariants you're trying to maintain, or anything like that. And, at least in the places I've worked, most of the interesting bugs (by which I mean, the ones that managed to cause trouble in production) happen when the programmer writing the code didn't have a firm idea of what they were trying to do. So, that's what worries me - I would fear that the spots where Copilot can't even theoretically be expected to do a good job happen to be exactly the kinds of things for which people would tend to rely on it the most.
Maybe I'm being overly pessimistic? But that's kind of my job - I work in an area where "move fast and break things" is pretty antithetical. But it would still be a lot more compelling to me if I could see a paper demonstrating that a team using Copilot has fewer production defects than a team that's doing exactly the same work but without Copilot. Or alternatively, if it were repackaged as something that's a bit like a smarter version of IDE refactorings. "Hey, it looks like you're about to spit out a big old mess of boilerplate. Let us get that for you." Or, "Hey, some functions you called can fail, how about I go ahead and suggest a catch block so you don't forget to write one?" Basically, give me something that's a bit more smart cruise control and a bit less Autopilot.
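For instance, here's the kind of narrow, checkable suggestion I have in mind (a hypothetical sketch in Python, not an actual Copilot feature):

    import json

    def load_config(path):
        # A fallible call chain: open() can raise OSError, json.load() can raise
        # json.JSONDecodeError. The assistant I want notices the fallible calls
        # and offers to wrap them, instead of inventing new logic on its own.
        try:
            with open(path) as f:
                return json.load(f)
        except (OSError, json.JSONDecodeError):
            # Suggested handler; the fallback policy is mine to decide, not the tool's.
            return {}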
> However in the real world people don't always write bugless code and aren't always alert when driving. Therefore these AI assistants can still have a net positive result as long as they are better than the average performance of a human.
This is an extremely good analogy -- in both situations, the human will become lazy and stop paying attention (regardless of whether they're supposed to keep their hands on the wheel, literally or metaphorically), and it will be possible to have a net result worse than either human or AI acting alone.
Are we also willing to just start accepting lower quality code (on average) because we have AI guiding us towards it?
What's the point of striving to write better, more correct code, being a safer driver, if all we ever do is rely on the status quo to train models to be average?
I've been using it and that's 100% correct. If it suggests more than a few lines, I might as well do it myself. However, it has been an awesome Intellisense tool for one-liners... it can write out the rest of a comment or a simple map/filter method just fine. I don't think it will ever go further than that, but nor does it need to.
Be sure it doesn't generate any GPL licensed code down to the size of a single letter, to be safe. The reaction to GPL inspired snippet generation has been more fierce than I could have imagined, even though usable snippets are so short.
You can't license a code snippet. You can only copyright/license a complete work, as in a complete app or library. Software patterns might work differently though!?
You cannot take the source code of an app, change a few lines, and then call it your own. You can, however, make your own app by stealing a few lines of code here and there... funnily enough, many companies write code so that all functions depend on all other functions, making it impossible to reuse any parts of the code in another app (but I think that is not deliberate).
Yea, that's the major problem. I'd prefer just some sort of inline helper that could point directly to documentation, topics, or Stack Overflow answers that might be helpful for whatever I'm developing. An enhanced "Intellisense" or something. That, to me, is better, because ultimately it's up to the developer to place scrutiny on the solution. You basically can't blindly accept the implementation, which makes this just a constant code review... of yourself? I dunno. This just seems half-baked.
Agree. It’s way easier to write a function from scratch than to read/evaluate/fix whatever snippet Copilot throws at me. Replace Copilot with “junior engineer” or “senior engineer that knows more than me” and the result is the same (the junior engineer will probably introduce a couple of subtle bugs that are hard to find; the senior engineer would write code in such a way that my mediocre brain won’t understand it).
It looks to you like it should work, but it doesn't, and you can't figure out why.
That's not "mostly working," that's a frustrating waste of time. It's hard enough to notice when you accidentally swap `i` and `j` -- why would you want to make your life even more miserable by spending your time finding all of the instances where a pattern matching robot has done something similar in an unfamiliar block?
And if you do happen to get "mostly working" code, but only want it to stay together long enough for you to fundraise, you're basically stating that you plan on foisting this technical debt onto the poor sod you happen to hire.
Attitudes like yours are the reason this dogpile scares me.
If I understand your argument correctly, it's that GitHub Copilot does not produce functioning code, yes?
If that's the case, I agree with your assessment, that GitHub Copilot isn't delivering on its promise and I would not be using it.
My understanding, however, was that GitHub Copilot does produce functioning code. If you're saying, as I think you are, "No, GitHub is lying about Copilot." I find that claim fascinating (How the hell did we get to the point where a software company could release a product that literally does not do even the most basic version of what it says it does, and only a few people notice?), but I'd need more specific information from you before I'd believe it.
Ouch, but I'm not interested in writing software as much as I'm interested in making enough money to spend the rest of my days sipping piña coladas on a beach in a foreign country.
All I really need is for the product to work well enough that I can fundraise and hire someone who's better at programming than I am, someone who hopefully doesn't write comments about how unimportant other people's work is on HN.
I am interested in the engineering side of things, for sure, but only insofar as their actual outcomes and justification as part of a value-add to a business.
GitHub Copilot, if it can create "mostly working" code, is a huge value-add to an early business trying to find product/market fit because it cuts down on the time it takes to prototype and generate early versions.
Not every piece of software has to work perfectly every time, and I don't think that's the standard to which we should hold any automated coding tool.
I feel like a mix of hand written test cases and copilot generated code might go somewhere, but I think you've got the basic problem sorted out. I'd much rather type an algorithm in from scratch than wrap my head around whatever copilot spits out.
I had an idea long ago that you basically write unit tests (nowadays I would add property-based tests to the mix too) and a genetic algorithm (best I could come up with at the time, nowadays we obviously have much fancier techniques, as evidenced by Copilot) would come up with code to try and make the tests pass.
I could see Copilot used in such a way. I think the interaction would have to change, though: force the user to give it the tests as input, rather than giving it some basic instruction, having it generate code, and then trying to write tests after. The tests should be the spec that Copilot uses to generate its output.
Right now, I'm not excited about Copilot. Like you say, understanding what Copilot spits out is difficult and I suspect more error prone than just writing it yourself (since we often see what we want to see and can overlook even glaring mistakes). I'm also not excited about them ignoring the licenses of the code they trained on. But I can imagine a future iteration that generated code to pass some tests that I could get excited about.
It seems to me that "generate the code that makes these unit tests pass" is actually a much saner engineering task than "go from a comment to an implementation".
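To make that concrete, the input to such a tool would be a spec like this (a hypothetical example; the slugify implementation here is just a stand-in for whatever the generator would have to produce):

    import re
    import unittest

    # Stand-in for the generated code; the tests below are the spec.
    def slugify(text):
        words = re.findall(r"[a-z0-9]+", text.lower())
        return "-".join(words)

    class TestSlugify(unittest.TestCase):
        def test_lowercases_and_joins_with_hyphens(self):
            self.assertEqual(slugify("Hello World"), "hello-world")

        def test_strips_punctuation(self):
            self.assertEqual(slugify("Rock & Roll!"), "rock-roll")

        def test_empty_string(self):
            self.assertEqual(slugify(""), "")

    if __name__ == "__main__":
        unittest.main()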
I can foresee one niche where this doesn't matter: exploratory ad-hoc data science.
In this exploration stage, total correctness doesn't matter since you're just getting a feel for the data. Copilot might help a lot with the associated boilerplate.
I think Copilot-like tools could be excellent for the exploration phase. Marvin Minsky mused on this usage back in 1967:
> The programmer does not even have to be exact in his own ideas‑he may have a range of acceptable computer answers in mind and may be content if the computer's answers do not step out of this range. The programmer does not have to fixate the computer with particular processes. In a range of uncertainty he may ask the computer to generate new procedures, or he may recommend rules of selection and give the computer advice about which choices to make. Thus, computers do not have to be programmed with extremely clear and precise formulations of what is to be executed, or how to do it.
I can't wait for coders who used Copilot for every coding project they did: copying and pasting snippets until it works. No proofs or real exams at bootcamps!
Having offline coding interviews to find Software Engineers will become even more important.
It also isn't giving you any information on the source(s) of the generated code. Which might help determine how much to trust it, whether it could have licensing issues, etc.
It's probably the only way it would work - to show top matching snippets from training data on request, with links to the source and ideally licensing information, if it can be gleaned automatically. This would also clearly show how much it is copying verbatim and what exactly is its contribution.
The funny part will be when all the human programmers who steal code get doxed as a side effect. It would shine a light on lots of skeletons in the closet.
I think my best guess is that this is actually meant to produce broken code, so that Microsoft can sell you additional services (cloud fuzzing?) to find and fix the bugs.
See, when I saw the words 'risk assessment' I figured the presuppositional framework of the author's argument wasn't that Copilot was legally sound. In other words, I didn't expect to jump straight to the technical validity of the product.
Do not ignore the elephant in the room: Copilot is stealing code from projects with open licenses.
I would expect an information security expert to comment on the risks they have a professional background in assessing. More broadly, your comment almost seems to suggest that one should preclude all avenues of criticism beyond whichever singular issue is most "obvious" / "problematic". That strikes me as less than optimal.
Producing code that kinda-mostly works, very quickly, is the behaviour the software industry optimises for. This tool will help do more of that, so it will be very widely adopted.
Developers who do not use this (or similar tools) will not be hired, or only in particular niche domains where correctness matters.
I'm surprised that so much of the discussion around Copilot has centered around licensing rather than this.
You're basically asking a robot that stayed up all night reading a billion lines of questionable source code to go on a massive LSD trip and then use the resulting fever dream to fill in your for loops.
Coming from a hardware background where you often spend 2-8x of your time and money on verification vs. on the actual design, it seems obvious to me that Copilot as implemented today will either not provide any value (best case), will be a net negative (middling case), or will be a net negative, but you won't realize that you've surrounded yourself with a minefield for a few years (worst case).
Having an "autocomplete" that can suggest more lines of code isn't better, it's worse. You still have to read the result, figure out what it's doing, and figure out why it will or will not work. Figuring out that it won't work could be relatively straightforward, as it is today with normal "here's a list of methods" autocomplete. Or it could be spectacularly difficult, as it would be when Copilot decides to regurgitate "fast inverse square root" but with different constants. Do you really think you're going to be able to decipher and debug code like that repeatedly when you're tired? When it's a subtly broken block of code rather than a famous example?
That Easter example looks horrific, but I can absolutely see a tired developer saying "fuck it" and committing it at the end of the day, fully intending to check it later, and then either forgetting or hoping that it won't be a problem rather than ruining the next morning by attempting to look at it again.
I can't imagine ever using it, but I worry about new grads and junior developers thinking that they need to use crap like this because some thought leader praises it as the newest best practice. We already have too much modern development methodology bullshit that takes endless effort to stomp out, but this has the potential to be exceptionally disastrous.
I can't help but think that the product itself must be a PSYOP-like attempt to gaslight the entire industry. It seems so obvious to me that people are going to commit more broken code via Copilot than ever before.
IMHO they built the opposite of what's actually useful for real-world use. Copilot should have been trained to describe what a selected block of code does, not write a block of code from a description. It could be extremely useful when looking at new or under-documented codebases to have an AI that gives you a rough hint as to what some code might be doing. For example if you select some heinous spaghetti code function, press a button, and get a prompt back that says "This code looks like it's parsing HTML using regex (74.2% confidence)" it could be much easier for folks to be productive on big codebases.
No, presumably copilot skirted that need by just analyzing the AST of code they host and using the nearby comments to identify what a section of code is meant to do. This would use the same dataset but solve the opposite problem: generate a description from a block of code AST as input.
> copilot skirted that need by just analyzing the AST of code they host and using the nearby comments to identify what a section of code is meant to do.
I'm curious what it spills out for things like "Todo", or "this is probably broken", etc.
I'm not sure I understand how you envision this working, given the underlying technology. You'd have to have a pretty large cache of such analyses to train on, right?
Github has a huge amount of source code and likely for copilot they already had to transform it into an AST to look at comments and nearby code. This would use the same dataset but build the opposite model--input a block of code AST and get a guess as to what the description (i.e. comment) should be for it.
This is the thing that made no sense to me about it as a premise. Doing correct program synthesis is really hard even when you have really opinionated and well-defined models of the domain (e.g. the Termite project for generating Linux device drivers). The domain model for Copilot is somewhere between non-existent to so open-ended (i.e. all the diverse code on Github, et al.) as to be functionally non-existent.
A bare minimum baseline validation check for Copilot would be to see if it provides you code which won't compile in-context. If it will, then that means it's not even taking into account well-specified domain model of your chosen programming language's semantics. Which, upon satisfaction, is still miles away from taking into account the domain of your actual problem that you're using software to solve.
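The weakest version of that check is easy to sketch for a dynamic language (Python here; it only catches syntax errors, not type or in-context semantic errors, so it's purely a lower bound):

    def compiles(snippet: str) -> bool:
        # Return True if the suggested snippet is at least syntactically valid.
        try:
            compile(snippet, "<suggestion>", "exec")
            return True
        except SyntaxError:
            return False

    # A suggestion that fails even this bar should never be surfaced to the user.
    assert compiles("for i in range(10):\n    print(i)")
    assert not compiles("for i in range(10)\n    print(i)")  # missing colon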
The only place where the approach taken, as-is, makes sense to me is for truly rote boilerplate code. However, that then begs the question... how is this machine learning approach more effective than a targeted heuristic approach already taken by existing IDE tooling, etc.?
FWIW, I don't think any of this is lost on GitHub. I think Copilot is more likely a tremendously marketable half-step and small piece of a larger longer-term strategy unfolding at Microsoft/GitHub to leverage an incredible asset they're holding, i.e... practically everybody's source code. The combination of detailed changelogs, CI results (e.g. GitHub actions), Copilot, and a couple other key pieces makes for a pretty incredible basis for reinforcement learning to multiple ends.
I would hire copilot to write tests for me, that’s about it. Writing tests can be a drag. It’s really a low-risk proposition to have generated code attempt it. If it’s a usable test, maybe it will catch a bug. If not, then kill it and let it generate a few more.
The expectation is entirely different than producing code. Code needs to be correct, secure, performant, and readable. Failure on any of those fronts can be expensive to disastrous. Nobody can reasonably expect a test suite to catch every bug, even if created by the smartest humans. If a copilot-created test does prevent a bug from shipping it provides immediate value. I could see it coming up with some whacky-but-useful test cases that a sane person might not consider. From a training perspective I would think that assertion descriptions contain more consistent lexical value than the average function signature.
It seems like the ambitious data scientists, product marketers, and managers fell in love with a revolutionary idea about AI writing code, and neglected to consult the engineers they are trying to ‘augment’.
Nope. That's like saying, "I might let a machine write the docs for me."
Good tests are documentation that a computer can verify. Because they explain the meaning of parts of the system, they contain information not available in the code. If you try using ML for test generation, you'll have the same problem you do with GPT-3 prose: it might look plausible at first glance, but lacks coherent meaning.
You'd also end up with one of the problems common in big test suites: poorly factored tests that end up being the sort of expressive duplication that is a giant drag on improving existing code. ML is nowhere near advanced enough to say, "Gosh, we're doing the same sort of test setup a bunch; let's extract that into a fixture, and then let's unify some fixtures into an ObjectMother."
For people looking to get the computer to do the work of catching more things with less burdensome test writing, I suggest taking a look at things like Hypothesis: https://hypothesis.readthedocs.io/en/latest/
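For example, a minimal Hypothesis property test looks like this (the dedupe function is a made-up stand-in; swap in whatever invariant you actually care about):

    from hypothesis import given, strategies as st

    # Hypothetical function under test.
    def dedupe_keep_order(items):
        seen = set()
        return [x for x in items if not (x in seen or seen.add(x))]

    # Hypothesis generates many inputs, including nasty edge cases,
    # and shrinks any failure to a minimal counterexample.
    @given(st.lists(st.integers()))
    def test_dedupe(items):
        result = dedupe_keep_order(items)
        assert len(result) == len(set(items))  # no duplicates survive
        assert set(result) == set(items)       # nothing was lost

    if __name__ == "__main__":
        test_dedupe()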
> If you try using ML for test generation, you'll have the same problem you do with GPT-3 prose: it might look plausible at first glance, but lacks coherent meaning.
There is a company in this space of generating "plausible tests" for legacy code bases at very large enterprises (think Goldman Sachs, telcos etc) called Diffblue [0].
They raised funding back in 2017 [1] and it seems their biggest value-add is in creating unit tests for legacy Java code bases that often have little to no unit tests.
Essentially, these AI-generated unit tests help a team "document" all the known behaviors of a legacy code base, such that when a change is introduced that violates the behaviors covered by the generated unit tests, the tool can alert the team to the potential presence of a regression.
Anyway, they offer a fairly basic browser-based demo of their AI product called Diffblue Cover [2].
Is Diffblue AI-based or is it just property-based testing? I assume that since it's limited to Java, they just decompile the opcodes, find which branches each method has, and write a test that calls each method with all possible permutations that go down each branch.
I haven't looked at it. But there are plenty of "magic beans" product targeted at Enterprise companies with legacy code. It's perfectly plausible to me that many of the companies using something generating bad tests wouldn't know the difference, because that's what their code base has already.
> one of the problems common in big test suites: poorly factored tests that end up being the sort of expressive duplication that is a giant drag on improving existing code.
I feel like you just described every developer/codebase where mock testing is stupidly enforced. Where every single unit test mocks every single indirect object. 98% of the testing code is just exhaustive setup and teardown of objects not being tested by each test, and then a bunch of conditional checks to ensure that every deeper/indirect method is being called exactly the right number of times with exactly the right arguments and returning exactly the right value. Almost all of the test code is just hacking mock objects. The actual purpose of each test is buried so deep that it's impossible to even understand the business logic being applied.
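A caricature of the shape, inlined so it's self-contained (the names are made up, but you've seen this test):

    from unittest.mock import MagicMock

    # Hypothetical system under test.
    def place_order(inventory, payment, audit, order_id, amount):
        if inventory.reserve(order_id).ok and payment.charge(order_id, amount).ok:
            audit.record(order_id)
            return True
        return False

    def test_place_order():
        inventory, payment, audit = MagicMock(), MagicMock(), MagicMock()
        inventory.reserve.return_value = MagicMock(ok=True)
        payment.charge.return_value = MagicMock(ok=True)

        assert place_order(inventory, payment, audit, order_id=42, amount=10)

        # Nearly every line below checks plumbing (call counts and arguments),
        # not the business rule the test is nominally about.
        inventory.reserve.assert_called_once_with(42)
        payment.charge.assert_called_once_with(42, 10)
        audit.record.assert_called_once()

    test_place_order()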
Yes. I have been doing unit testing for a long time, and I think this is a clear antipattern. It's cargo-cult testing, not actually a serious effort to improve quality and developer productivity.
These are absolutely the worst tests I have ever seen. They make iterating on the implementation almost impossible. Why people do this I will never understand.
It would be a mistake to say that the output from GPT3 lacks coherent meaning. It's not that the output is gibberish, it's that it's too easy to mistake it for a human's work. This means that it's easy to mistake it for something that was created with understanding and intention, when in fact the author was nothing more than a random number generator. The same risk exists for copilot. [--GPT3]
Well, take it up with GPT3 since it wrote that reply. :P
Though I don't fully disagree with it - 'nothing more' is a bit too strong. The author of a GPT3-written comment like the one here, where the prompt was pretty much just the thread, really is pretty much just the RNG. The language model makes the random choice a draw from the distribution of plausible texts, and the RNG picks the output.
GPT3 could have written your comment-- if only it drew the right random numbers.
What RNG? It definitely doesn't randomly pick words. If the comment I responded to was written by a bot (is that legal? Can I report that?) then it's indistinguishable from a human written comment.
GPT3 works on a compressed representation, with symbols that are (sometimes) smaller than complete words but larger than letters. It takes a set of symbols as context and generates a probability distribution for the next symbol. Then a random number generator is used to sample from that distribution, and the process is repeated with the selected output added to the context. So its output is random, but not uniformly random.
Exclusively selecting the most likely symbol produces pathological behavior outside of extremely short output.
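A toy sketch of that loop (made-up tokens and probabilities, obviously nothing like GPT3's real vocabulary):

    import random

    # Toy next-token distribution the model might emit for some context.
    candidates = ["the", "a", "banana", "</s>"]
    probs = [0.55, 0.30, 0.10, 0.05]

    def sample_next_token(rng: random.Random) -> str:
        # The RNG, not the model, makes the final choice; the model only
        # shapes the odds. Different seeds give different continuations.
        return rng.choices(candidates, weights=probs, k=1)[0]

    def greedy_next_token() -> str:
        # "Always pick the most likely token" -- the strategy that degenerates
        # into repetitive output over longer spans.
        return max(zip(probs, candidates))[1]

    print(sample_next_token(random.Random(0)), greedy_next_token())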
What caused GPT3 to output its comment rather than yours is a product of its random choices. There is a set of choices it could have made which would have caused it to output your comment. You can see this property employed by the GPT2 text compressor: https://bellard.org/libnc/gpt2tc.html to compress text it just writes down the choices, using an entropy coder to represent likely choices with fewer bits.
I assume copilot is the same general structure as GPT-- just trained on different data.
And yes, the comment you responded to was written entirely by GPT3 (with some number of retries and trims). As it said-- it's "easy to mistake it for a human's work". :) There is nothing illegal about it, but I suppose HN would prefer that there be enough human supervision of bot comments such that they're limited to contexts where they are funny/insightful. :P
They didn't say they were going to have Copilot write _all_ the tests. Writing tests for cases you can think of and trying Copilot for the extras doesn't seem like that bad of an idea.
It would be awful to write every test using Copilot, but there is potential there for a certain kind of test. If I'm writing an API, I want fresh eyes on it, not just tests written by the person who understands it most (me). For example, a fresh user might try to apply a common pattern that my API breaks. Copilot might be able to act like such a tester. By writing generic tests, it could mimic developers who haven't understood the API before they start using it (most of them).
If you can find an example of Copilot coming up with a test you wouldn't have thought of, I'd be very interested to see it.
Even if that happened, which I am not expecting, I think the need is much more easily solved via means that are simpler and more effective. E.g., a good tester writing up a list of things they test about APIs: https://www.sisense.com/blog/rest-api-testing-strategy-what-...
copilot as defined would not be "fresh eyes"... it would be "old tired eyes of every code writer who uploaded stuff to github, not knowing if they made one off errors or mistakes in their code"
I mean fresh eyes with respect to my new API. Having seen a lot of other code is a benefit. I expect most tests that Copilot writes to fail, but I would hope some would fail in interesting ways. For example, off-by-one errors might encourage me to document my indexing convention, or to use a generator rather than indexing.
My time is mostly spent not on writing code, but on thinking about what the program has to do, testing (including writing unit tests), and understanding errors. I always tell people, half joking, that programming is not about writing code but about the ability to debug it; understanding requirements, errors, and bugs is hard, while writing code and fixing bugs is relatively easy, in general.
Maybe Copilot 2 will do exactly this; it will generate tests based on half-working code, run them, and suggest improvements. That would increase productivity by something like 100%, but to me this sounds too good to be true.
> Writing tests can be a drag. It’s really a low-risk proposition to have generated code attempt it.
If Copilot can't write the correct code in the first place, you really shouldn't expect a proper test to be written by Copilot.
> Code needs to be correct, secure, performant, and readable.
Most tests should also have at least three of those attributes. Nobody actually wants their tests to be incorrect, slow, or impossible to understand or modify.
On the contrary, it might be interesting writing tests by hand and using the AI to produce code. If the tests are good enough for humans, they should be good enough for AI, given that the AI doesn't try to be actively malicious.
I think the use case for Copilot is a bit misunderstood. The way I see it you have two types of code:
1. Smart Code: Code that you honestly have to think about while you're writing. You write this code slowly and carefully, to make sure that it does what you need it to
2. Dumb Code: This is trivial code, like adding a button to a screen. This is code you really don't have to think about, because you already know exactly how to implement it. At this point your biggest obstacle is how fast can your fingers type on a keyboard.
For me Github Copilot is useless for "Smart Code" but a godsend when writing "Dumb Code". I want to focus more on writing and figuring out the "Smart Code", if I need to throw a form together in HTML or make a trivial helper function, I will gladly let AI take over and do that work for me.
> This is trivial code, like adding a button to a screen.
UX is probably the most important aspect of most software products. Every software product is either "smart code" or "smart ux". No one pays much for "dumb code with bad UX" except in dysfunctional markets.
Adding a button to a screen should be trivial, and if it's not you need better tools. (As in "a not-horribly-misdesigned language and framework", not as in "giant transformer".)
Deciding where to add the button, its shape, its size, what happens when it's clicked, the text on the button, ... is anything but trivial.
But everyone does pay for dumb code. Even in the best-written and most efficient codebases, there's still going to be some amount of tedious glue code and boilerplate that you have to write in order to create a functioning product. It definitely would be better to have better languages and frameworks instead of a giant transformer, but the better languages and frameworks don't exist yet while the giant transformer does.
> there's still going to be some amount of tedious glue code and boilerplate that you have to write in order to create a functioning product.
This is true, but actually tedious glue code is often non-trivial. For example, in one of my hobby projects I have a repo where I have to write a lot of glue code to schlepp data from a CSV file format into an existing database. Doing this correctly requires reading through the (lengthy) documentation for the format of both the CSV data and also the system that ingests the database, since there are a bunch of invariants about the key tables and columns that aren't enforceable in SQL (and obviously not enforceable in the CSV).
This is the sort of glue code/boilerplate where a synthesizer that can understand natural language would be actually helpful.
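A trimmed-down sketch of the kind of glue I mean (the column names and invariants here are invented for illustration):

    import csv
    import sqlite3

    def load_measurements(csv_path: str, conn: sqlite3.Connection) -> None:
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                # Invariants gleaned from two sets of docs, not expressible in SQL:
                # station ids are zero-padded to 5 chars, readings are non-negative,
                # and the ingesting system silently drops rows with empty timestamps.
                station = row["station_id"].zfill(5)
                value = float(row["value"])
                if value < 0 or not row["timestamp"]:
                    raise ValueError(f"bad row: {row!r}")
                conn.execute(
                    "INSERT INTO measurements (station_id, ts, value) VALUES (?, ?, ?)",
                    (station, row["timestamp"], value),
                )
        conn.commit()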
> but the better languages and frameworks don't exist yet
There are certainly some languages that are less verbose than others.
Java and Go are very boilerplate-y languages. Python is also pretty verbose and inexpressive for certain types of code.
The typescript example on the copilot page right now is a perfect example of "ugh, just use a better language".
The examples of boilerplate where copilot shines seem like situations where really simply using "snippets" would work better. E.g., everything on the copilot homepage right now.
And because of the nature of boilerplate, Java has IDEs that will both generate and modify this boilerplate without thought, no AI required.
I don't remember the last time I typed the text 'class' for example. Instead I type "new Foo(someStringVar)", and then hit Alt-Enter and my IDE creates the file, the `class Foo` along with a ctor that takes a String.
Even in different versions of the same project, the glue may change. E.g. my boilerplate functions for making ajax calls have changed half a dozen times since 2005, been gutted and rewritten to be promisified, upgraded to websockets, and all sorts of other options, but I still have projects deployed that use several previous versions. And my typical PHP or Node executor has evolved, too. I myself find it confusing when working new bits into older projects, where I'll occasionally match the wrong gateway code with the wrong frontend.
In other words, a machine looking at only my own glue would be more likely to mismatch or use the wrong version in any given situation.
I'm mostly curious how much overlap there is in use cases for Snippets with parameters, possibly even snippets with conditional parameters vs Copilot.
I imagine there are some situations where Snippets do better and ones where Copilot does better, but the more complex the situation, the less I trust Snippets... but _also_ the less I trust Copilot.
It seems my trust in Copilot is very similar to that of use cases for Snippets. To throw out fake numbers, it feels like Snippets (and tools like them) cover ~70% of Copilot's use case. So I'm really curious to know what that other 30% is, and whether it is ever useful.
Yes I agree, I don't use a snippets feature/plugin, but that's what I thought of when it launched - 'why wouldn't I just'.
I suppose there's a (supposed) advantage that it's automatically finding and suggesting the snippet for you, rather than relying on you to think of it and recall the key binding, or know that it's there to search for.
This is basically the same argument that was made against required boilerplate in Java: “Your IDE can just generate that for you!” (And in sufficiently advanced cases, also keep it up to date.)
Imho, it is just an argument for making better languages and libraries. (These libraries will also make it easier to use with copilot.)
Exactly. The reason we aren't all sitting around hand-writing assembler is that programmers look at tedious processes and find higher-level abstractions that allow us to do more work in less time.
Once we spot a tedious common pattern, we should be finding ways to DRY it up. Configs, libraries, frameworks, DSL, tools, and languages are all great ways to do that. Copy-pasting and machine-generating code are short-term thinking in two ways: they focus on the initial creation of the code at the expense of maintenance, and they give up on increasing abstraction, lock the system into a productivity plateau.
You still have to describe to co-pilot what you want. So that doesn't make much sense. You should work on a higher level of abstraction then. If you aren't, why not spend a few minutes writing some functions instead of generating tons of unmaintainable boilerplate with co-pilot?
The code for a button is trivial. Most of what we used to call wizards handled that bit.
It's what the action of that button does where the real fun comes in.
I once had a project that was a yes/no dialog. Two buttons and some text. I had the dialog up and running in under an hour. The action that happened when you pressed yes took 3 months to finish.
I don't know, we somehow managed to replace point-and-click GUIs for placing buttons (Windows Forms etc.) with frontend developers writing elaborate code to achieve the same result in HTML/CSS. Productivity is far from the first priority for frontend development.
Idk I think it's not as easy to achieve a good front-end with point-and-click as you describe. When you have to do things like adaptive layouts, it seems like code actually manages a bit better than a WYSIWYG. Or there is some point of complexity where you need so many configuration options in a point-and-click editor that the code becomes easier to manage.
It's not that, it's that when you eventually reach the point where you need to do something that can't be handled by the WYSIWYG editor, you're left with 50,000 lines of shitty machine generated code that's almost impossible to work with.
The risk with WYSIWYG editors isn't that there's some tipping point where it becomes 10% more efficient to write code and you lose a bit of productivity or something. It's that something comes up half way through development and the WYSIWYG doesn't have the feature you need[1], and the entire project slams into a brick wall and dies instantly.
You can prevent this by running into the exact same problem CoPilot has, which is that reading code is harder than writing it. If you try to avoid the brick wall by having devs familiarise themselves with the code as the WYSIWYG generates it, those devs would have just been able to build it themselves in less time and with cleaner code.
[1] which will always happen eventually, because they're balancing the feature set for the exact reasons you mentioned. If they can do everything code can then the UX is going to be so bloated and horrible that it'd be trivially worse to use than just writing code.
Or you're describing how you could put copilot in a box and make a really good low code gui programming solution where the complex stuff is good old complex code.
The problem is you still have to go back and read through the "dumb code" to make sure it was written correctly. At a certain point, is that actually faster than just writing it yourself? Maybe a little bit, for some people and for some usecases, but it becomes a much narrower value-proposition.
Personally, I'd rather use snippets or some form of "dumb" code generation over an AI to generate the "dumb code". Sure, I'll probably still have to do some typing using those methods, but it's still less than if I were doing it all by hand.
It's not clear to me how that's better than the traditional solution to generate "Dumb Code", copy-pasting something. And we all know the problems with copy-pasting as a lifestyle.
Yes! Why is everyone so negative about copilot? I think it's a great name for the product. It helps you write, it doesn't write for you. You're still in charge and it can't write the "smart code".
Generally, a copilot is someone you can trust. The whole point of having a copilot is to reduce my cognitive load. If I am a pilot and have my copilot fly the plane while I do something else, I may be in charge, but I trust him to fly safely and alert me if things go wrong. A copilot is also a licensed pilot, able to do almost everything the pilot does, he is just not in charge.
The article shows that I can't trust GitHub copilot. So I don't think it is a representative name. Here, it would be more like a servant.
> These three example pieces of flawed code did not require any cajoling; Copilot was happy to write them from straightforward requests for functional code. The inevitable conclusion is that Copilot can and will write security vulnerabilities on a regular basis, especially in memory-unsafe languages.
If people can copy-paste the most insecure code from Stack Overflow or random tutorials, they will absolutely use Copilot to "write" code and it will become the default, especially since it's so incredibly easy to use. Also, it's just the first-generation tool of its kind; imagine what similar products will accomplish in 20 years.
With the pace of technological innovation, I'm honestly not sure what a similar product will be able to accomplish in 20 years. It'll be crazy for sure. But I'm worried about today.
This is a product by a well-known company (GitHub) which is owned by an even more well-known company (Microsoft). GitHub is going to be trusted a lot more than a random poster on Stack Overflow or someone's blog online. And GitHub is explicitly telling new coders to use Copilot to learn a new language:
> Whether you’re working in a new language or framework, or just learning to code, GitHub Copilot can help you find your way. Tackle a bug, or learn how to use a new framework without spending most of your time spelunking through the docs or searching the web.
This is what differentiates Copilot from Stack Overflow or random tutorials. GitHub has a brand that's trusted more than random content creators on the internet. And it's telling new coders to use Copilot to learn things and not check elsewhere.
That's a problem. Doesn't matter what generation of the program it is. It creates unsafe code after using its brand reputation and recognition to convince new coders to not check elsewhere.
Consider Google Translate, right? Google is a well-known brand that is trusted (outside of a relatively small group of people that doesn't trust Google on principle). Yet every professional translator knows that the text produced by Google Translate is a result of machine translation, Google or no Google. They may marvel at the occasional accuracy, yet expect serious blunders in the text, and would therefore not just trust that translation before submitting it to their clients. They will check. Or at least they should.
Sure. Is Google Translate used only by serious professional translators who have a rigorous translation-checking process? Not at all.
And as you say, it will be the same with programmers. Who's this being targeted at? People "working in a new language or framework, or just learning to code". The whole value prop is, "You don't have to know what's going on!"
The important difference is that the target readers can usually spot an egregiously bad translation. But the target users for software cannot easily spot gaping security holes and other serious issues until something bad happens.
> The important difference is that the target readers can usually spot an egregiously bad translation.
What, no, that's not true at all; that's like the second biggest problem. Google Translate routinely does stuff like invert the meaning of clauses, or drop information, or hallucinate absent context. Target readers can't reasonably be expected to catch any of that.
I think this is an important way we need to frame the use of these tools for junior developers. I'd advise that anyone who is recommending this product to their team also take the time to give this analogy - maybe even going so far as to require explicit comments that notifies reviewers when code was provided by Copilot and similar services.
The difference here is that professional translators often have professional training.
The bar is substantially lower for a 'programmer', especially with an incredibly large bootcamp market that churns out 'professional' 'programmers' in 6-8 weeks.
Been to a bootcamp? Know some leetcode? Someone will hire you. And then you got Copilot advertising its services to you as a way to learn how to code. The implication of 'learn to code' being 'learn to code correctly'.
Google Translate has no similar relationship with professional translators.
> The difference here is that professional translators often have professional training.
You'd be surprised. What you described here for programmers is true for translators as well, and probably for many other specialities in which the ability to deliver the result is more important than any documents certifying that you've had a formal training for how to deliver those results. In case of translators — found an agency? Check. Passed an interview with a test? Check. You are good to go.
Maybe you are right, but where UML created busy work, Copilot will literally do your work for you. I can even imagine a future where management makes it policy to use Copilot first to save time and money.
Most of my work isn't copy-pasting snippets. It isn't even typing code. It's understanding user needs and the existing system, and then figuring out how to make things better for the user while also improving the system. So this does not do my work for me.
I can also imagine clueless bosses mandating Copilot use and that's what scares me. The real costs of most code aren't in the first writing. They're in the long-term maintenance. Copilot does not and cannot understand the whole system, or what makes for maintainability down the road. So it can't make that better, and will likely make worse. In the same way that code generation tools and code wizards made things worse.
I think there is some difference. You don't come across some piece of code by chance; you were actively looking for it, there were probably multiple blogs or SO entries with the needed information, and one of those sources had to be chosen. You know that this is some random blog post or an SO answer given by someone fresh.
Copilot is something different. Code is suggested automatically and, most importantly, suggested by an authority - hey, this is GitHub, a huge project, the largest code repo on the planet, owned by Microsoft, one of the most successful companies ever. Why should you not trust the code they are suggesting to you?
And that's just for starters, before malicious parties start creating intentionally broken code only to hack systems built with it, greedy lawyers chase some innocent code snippet and ask you to pay for using it, etc.
I had not considered the proliferation of terrible open-source code on GitHub. I'd wager that the amount of code in public repositories from students learning to code may outweigh the quality code on GitHub.
I wonder if there was any sort of filter for Copilot's input — only repositories with more than a certain number of stars/forks, only repositories committed to recently etc.
> Ultimately, a human being must take responsibility for every line of code that is committed. AI should not be used for "responsibility washing."
That's the whole point, and the rest is moot because of it. If I choose to let Copilot write code for me, I am responsible for its output, full stop. This is the same as if I let a more junior engineer submit code to prod, but there aren't blog posts about not letting them work or not trusting them with code.
Copilot doesn't seem any better than Tab Nine. Tab Nine is GPT-2 based, works offline, and can produce high quality boilerplate code based on previous lines. It can also generate whole methods, which, when they work, seems mind-blowing, but they are not always correct. Most suggestions are usually mind-blowing anyway, because previously we never had this kind of code completion.
It feels like it wrote the whole line which you were going to write exactly as it should have. But that's all it does. And it seems like Copilot is the same but on much larger scale and online.
This articulates some of the concerns I had trying copilot.
I noticed that I ended up assuming the code reviewer role when I was trying to write code. Context switching between writing and reviewing felt unnatural.
I also think I am less likely to spot a bug than I am to avoid writing it in the first place. Take the off-by-one error in the last example: I don't think I would have made that mistake, but if copilot had presented that code block, I probably wouldn't have noticed the error either.
The moon phase example is illustrative in another way.
It's not technically possible to precisely calculate the moon's phase based on time alone. It's an optical effect that is influenced by parallax, so you have to pay attention to location as well. This is, for example, why Eid al-Adha falls on different days in different parts of the world. So the function signature itself is potentially wrong, depending on my needs. I might find that out if I had to do some Googling to finish the function, but (assuming I didn't already know) I'm not sure if that possibility would ever have occurred to me if I were using Copilot.
Copilot can spit out code that's influenced by what others have written. But can it clue you into design considerations like this? Or should we be worried that it is helping us to write code that does the wrong thing with a higher degree of confidence?
The function signature was determined by Copilot. The author just wrote the comment above the function and the word function, then let Copilot determine what the signature was and how it would be implemented.
These use cases where people are running into trouble are increasingly sounding a lot like "What happens if I engage Autopilot and then take a nap?" That's definitely not how the tool was intended to be used, but it's hard to see how you can reasonably expect that nobody would ever do that.
This is such curious behavior to me. Does someone really @ a corporation hundreds of times about anything? Does this have any effect? Should it?
It makes me doubt the rationality of the author’s post if ve truly did this. Although I suppose maybe their use of Twitter is just completely different from anything I understand.
It's a way to feel better about yourself without changing any comfortable behaviors. The author still uses GitHub, and the ICE tweets are public Hail Marys.
Mostly, this just seemed like a non sequitur to me. Right away, I get the sense that "oh, this author has a bone to pick with GitHub.". Even if it's unrelated to the crux of the piece, I already feel like I'm going to have to take what the author says with a grain of salt.
I'm pretty sure that this conclusion isn't new, but I've come to think that Copilot shouldn't be thought of as a better developer, but merely a quicker one. Obviously its code will be somewhat average, considering that it's been trained on code whose only unifying characteristic is that it's public.
Something like Copilot, but trained explicitly to analyse the code instead of writing it, could be much more useful, imo. Basically a real-time code review tool. There are similar tools already, but I'm talking about something that is able to learn from the actual codebase being worked on, perhaps including the documentation, and give on-the-go feedback.
If you interviewed two developers, one who produces reasonably correct code in a given amount of time, and another one who produces code which is subtly incorrect most of the time, but much faster, which one would you hire?
The problem with your proposal is that it's relatively easy to do what Copilot does at the moment using AI, i.e. guess what code you are looking for and find something that does (or says it does) more or less that. However, which codebase would you use to check against if the generated code is really correct? The same codebase that produced the more-or-less-correct code in the first place?
I like this idea, given that it takes advantage of how git repos are made of bug-fixes. How many git diffs are out there that update a '=' in an if statement to '=='?
So an AI copilot should be watching out for code I write that looks similar to code that was updated in another repo. It could even use the text from issues to synthesize a suggestion of why your code might cause problems!
Copilot doesn't seem like the right word. Maybe first year college student with no previous programming experience? Then it would be clear what level of help you are actually getting.
Impressive, for sure. Unclear whether it's a net-positive tool, though.
Maybe “autopilot” then (as in the Tesla marketing term, not as in a real autopilot)? /s
Both lull you with a false sense of security which will suddenly and unexpectedly cause you to pay dearly for using it. Interestingly both seem to also get a vaguely similar balance of critics, apologists, and advocates on here.
Sample size of two here, but what is it with companies using %pilot to describe product features which are nothing like the actual appropriate use of the term?
People complaining about Copilot should just try it. All the concerns people are bringing up are correct but missing the point. Just think of it as a context-aware autocomplete that can complete more than just properties - it can finish out that one-line comment or map function too! It's really not that intrusive, nor is it going to replace anyone's coding job or even require fewer programmers. It'll just speed you up a bit, similar to auto-completion. I think at least 70% of the time I take its suggestions, which is good enough to keep using it.
This is a fundamental problem with ML and current generation "AI" : it only works in scenarios where a statistical win is acceptable (e.g. ad targeting). It is useless and often worse than useless where you want absolute correctness and false positives are highly problematic (e.g. spam filtering, writing code, not running over pedestrians).
It's interesting how the community that used to fervently argue that, say, the sampling of a few seconds of a musical composition is so obviously fair use doesn't extend the same attitude when it comes to snippets of source code.
Indeed I don't remember any of these complaints ever being made about, for example, AI-generated music or images, even though they work exactly the same and were trained on datasets of copyrighted works, both commercial and CC-licensed.
Compared to the manual sampling that DJs and other musicians do, the AI process almost certainly produces only fairly generic code snippets, since the model always needs a multitude of examples. Some loop-through-file Python snippet is a legal risk, but five seconds of a Disney song doesn't reach the level of creativity needed for copyright to matter? That seems strange to me...
The problem is, we don't know what Copilot is going to do. Sometimes it reproduces entire files verbatim. That is certainly a copyright violation. Sometimes it produces whole functions which are more or less taken verbatim from a copyrighted file. The user cannot see to what degree they are "verbatim", and there are also no clear legal guidelines on what is considered a copyright violation and what is not.
While I think there is some room for retrospective humility, it seems perfectly consistent to believe that, morally, de minimis use of copyrighted material should be considered fair use, while acknowledging that there are practical legal and professional risks to doing so (especially blindly), and that encouraging people to engage in that behaviour in an uncritical manner (especially in the context of engineering) is sociologically reckless.
The Easter Date algorithm was probably someone implementing an algorithm from the Wikipedia ( https://en.wikipedia.org/wiki/Date_of_Easter#Anonymous_Grego... ) without bothering to understand it (because honestly it's not a very interesting problem). No wonder it's uncommented.
As long as the AI just regurgitates lines from repositories like a bad undergrad cheating on his homework, CS jobs should be safe.
The fact that it has picked up the GPL might not mean that much -- it might appear in dual-licensed projects.
> The fact that it has picked up the GPL might not mean that much -- it might appear in dual-licensed projects.
Github have stated that Copilot is trained on all public code on Github, regardless of license. It very trivially follows that it has been trained on a lot of code that is explicitly GPL single licensed. We don't need to do any guessing here.
Chess and Go are closed systems, the whole knowledge about how to play them well can in theory be derived just from a few basic rules. The same is not true for programming.
Plus, in theory Chess can be solved by exploring the whole solution space (which is finite, even though insanely large) and heuristics can make this practical by reasoning about which branches can probably be cut off. At that point, having more and more processing power and memory helps make the task feasible.
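As a rough sketch of what cutting off branches means in code (evaluate, legalMoves, and apply are hypothetical stand-ins for a real game implementation, not actual chess code):
// Minimal alpha-beta search sketch. Whole subtrees are skipped ("pruned") as soon as
// they provably cannot change the final result, which is what keeps the search tractable.
function alphaBeta(state, depth, alpha, beta, maximizing) {
  const moves = legalMoves(state);                  // hypothetical helper
  if (depth === 0 || moves.length === 0) {
    return evaluate(state);                         // hypothetical static heuristic score
  }
  let best = maximizing ? -Infinity : Infinity;
  for (const move of moves) {
    const score = alphaBeta(apply(state, move), depth - 1, alpha, beta, !maximizing);
    if (maximizing) {
      best = Math.max(best, score);
      alpha = Math.max(alpha, best);
    } else {
      best = Math.min(best, score);
      beta = Math.min(beta, best);
    }
    if (beta <= alpha) break;                       // prune: the opponent would never allow this line
  }
  return best;
}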
Not that I want to downplay these achievements, they were certainly very significant, but it's still entirely different from "solving programming" (whatever that means).
You have made a distinction between programming and chess and go. There was a period when people thought chess could be solved by AI, but go couldn't, because go was too highly dimensional. That distinction has been proven not to be meaningful.
Let's maybe start with the fact that in Chess and Go it is straightforward to find out if the game is over and if so, who has won, whereas with programming, it's not even clear what it means for a problem to have been "solved", unless you formalise it to an extent that is basically never done (and which is also usually more expensive than just programming the solution).
Solving Chess/Go and programming are really not much alike.
Watching people play chess has entertainment value. Coding is lucrative because there's a lot of demand and not that many people who can do it well. What if a computer can do it better than almost anyone?
This is something I don't get. You're supposed to be able to integrate BSD-licensed (or even Public Domain) code into GPL works, right? The fact that something shows up in GPL code means what exactly?
This is like: there are scholarly books that quote extensively from original philosophers -- long, third-of-a-page quotations. Still, I should be able to quote something in its original language (translations may be copyrighted) by copying from the derived work. A copyrighted work is not supposed to be able to poison the non-copyrighted work it originates from.
>As a code reviewer, I would want clear indications about which code is Copilot-generated.
I would like to see this tracked behind the scenes. At any time I should be able to get Copilot to spit out a list of suggestions I've accepted. I should be able to generate this report for the lifetime of a project.
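Nothing like this exists today as far as I know, but as a minimal sketch of what I mean, assuming a hypothetical per-project log that the editor plugin appends to:
// Hypothetical sketch only: record accepted suggestions and report on them later.
const fs = require("fs");
const LOG_FILE = ".copilot-accepted.jsonl";          // one JSON record per accepted suggestion
function recordAcceptedSuggestion(file, startLine, endLine, text) {
  const entry = { file, startLine, endLine, text, acceptedAt: new Date().toISOString() };
  fs.appendFileSync(LOG_FILE, JSON.stringify(entry) + "\n");
}
function reportAcceptedSuggestions() {
  if (!fs.existsSync(LOG_FILE)) return [];
  return fs.readFileSync(LOG_FILE, "utf8").split("\n").filter(Boolean).map((line) => JSON.parse(line));
}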
It's kind of funny that this whole thing is built on top of git but uses none of the features, such as git blame. Instead of being an auto-complete, I would want co-pilot to be a git contributor, making pull requests that improve my code.
> This is well-formed and even commented C that sure looks like it parses HTML, and the main function has some useful boilerplate around opening the file. However, the parsing is loaded with issues.
This sounds like a great example of an interview question (where the person is asked to find and fix all of the issues in a chunk of bad code). An unintended use case for Copilot?
If Copilot isn't showing exact copies of code it's seen, how is it able to produce code that mostly works? E.g. Copilot code from the article:
function getPhase() {
var phase = Math.floor((new Date().getTime() - new Date().setHours(0,0,0,0)) / 86400000) % 28;
if (phase == 0) {
return "New Moon";
} else if (phase == 1) {
return "Waxing Crescent";
}
// etc.
It feels like small incorrect modifications to any of the code here would completely break the function.
I've seen stories and articles written by GPT-3 where it loses the plot and context along the way - in comparison, Copilot doesn't seem to suffer from this as much? How?
There are parts of the internet that google doesn't search, it seems. I've just tried searching for a couple of strings from my github repository and I can't find them with google or duckduckgo (I get tons of results but none that's actually my repository or contains my code). To be sure the Copilot generated code is not copied verbatim from somewhere you'd have to search its training corpus. I don't know if that is available publicly.
This is also something that kind of breaks my brain. It's always impressive to see Copilot-"authored" code which seems complex, coherent and (most of the time) functional, then Google it and get zero results.
How is it that a glorified statistical machine is able to put blocks of code so well together?
Code is much more structured than English. Code is built around well-defined ideas, processes, and data structures. Human language is context-dependent and loose, and things like tonality of speech can change the meaning.
It's easier to guess a multiple-choice question when all the choices can be generated by IntelliSense instead of having to look at a dictionary.
Copilot will be interesting in 10-20 years. Right now it's an early-stage ML-driven experiment in a field that hasn't advanced in forever - the strides it makes will be very gradual, incremental, and filled with mistakes along the way.
Huh, so Copilot was trained on a mix of correct and incorrect code and it learned to reproduce both correct and incorrect code. It's a language model, after all. It predicts the most likely next sequence of tokens. It doesn't know anything about correct and incorrect code. If the most likely next sequence of tokens is incorrect, it doesn't care; it's still the most likely next sequence.
And I guess it makes sense that Copilot was trained this way, even if it weren't just a language model. How do you even begin to separate correct from incorrect code on the entire freaking GitHub?
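A toy illustration of that point (this is not how Codex works internally, just the general pick-the-most-likely-continuation principle, with made-up counts):
// Toy "language model": it returns whichever continuation it has seen most often
// for a prefix, with no notion of which continuation is correct for your purpose.
const seenContinuations = {
  "const DAYS_IN_YEAR = ": [
    { text: "365;", count: 500 },      // most common in the (made-up) training data
    { text: "365.2425;", count: 40 },  // sometimes the one you actually need, but rarer
  ],
};
function predictNext(prefix) {
  const options = seenContinuations[prefix] || [];
  let best = null;
  for (const option of options) {
    if (best === null || option.count > best.count) best = option;
  }
  return best ? best.text : null;
}
console.log(predictNext("const DAYS_IN_YEAR = ")); // "365;" regardless of what you meant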
But I think TFA serves best to show the worst way to use Copilot. I haven't tried it, but I suspect it would do a lot better if it were asked to generate small snippets of code rather than entire functions. Not "returns the current phase of the moon" but "concatenates two strings". That would also make it much easier to eyeball the generated code quickly without too many mistakes making it through.
Of course, you could do the same thing with a bunch of macros, rather than a large language model that probably cost a few million to train...
It's been fascinating reading all the responses to Copilot. It's pretty clear to me now that developer habits are far more diverse than I thought.
For me, Copilot seems like it will be useful in the exact same way that StackOverflow is useful: as a means of pointing me in the right direction for code snippets, APIs, or techniques that I haven't memorized and don't really want to.
For example, on my current side project I wanted to know how to create a "unique enough" UUID in pure JS.
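For instance, the answers I'd expect it to surface are the built-in crypto.randomUUID() (available in recent browsers in secure contexts and in recent Node versions) or the well-known Math.random-based v4 fallback:
// Where available: const id = crypto.randomUUID();
// Classic "unique enough" v4 fallback (fine for non-security purposes, not cryptographically strong):
function uuidv4() {
  return "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, (c) => {
    const r = (Math.random() * 16) | 0;
    const v = c === "x" ? r : (r & 0x3) | 0x8;
    return v.toString(16);
  });
}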
Copilot would hopefully save me a couple of google/stackoverflow searches as I can very quickly test what it suggests.
I already rarely take SO answers as gospel, so it's unlikely that I'd take Copilot's suggestions as gospel either, but I think it significantly increases the speed with which I achieve the same results.
Am I the only one thinking that GPT3 and CoPilot can actually work once trained on properly licensed and properly audited code?
Well, it will not be as ubiquitous as having all of GitHub under your fingers, but perhaps it's better anyway not to blindly cite the world's source code.
> Am I the only one thinking that GPT3 and CoPilot can actually work once trained on properly licensed and properly audited code?
Sadly you aren't. The truth is, however, that models like GPT3 and its derivatives like Codex/CoPilot are by design incapable of ever achieving this.
The only way to generate both correct and secure code is to use a combination of proper specs and theorem provers.
Even then this won't help with non-functional requirements, such as performance or platform-dependent resource constraints.
Generative models will always have the potential to yield broken code that doesn't do what you want or contains security flaws even if trained on "proper" code.
If I have to audit the code that CoPilot generates and if the code is as obfuscated as the Easter example, it's probably less useful than it says on the label...
I'm glad the author points out that there are aspects to Copilot that are usable. A lot of other critics haven't been so kind.
Many complaints about Copilot remind me of the old Louis CK sketch where people complain about flying: YOU'RE SOARING THROUGH THE HEAVENS IN AN ALUMINUM TUBE. YOUR ANCESTORS WOULD'VE DIED OF DYSENTERY DOING THIS TRANSCONTINENTAL JOURNEY. Let's have some context here!
Sure, it's not remotely close to perfect and it's going to take a long, long, long time for it to get there. But still, there's something Engelbart-ian about seeing the demo when it works perfectly.
My biggest thing with Copilot is that it was trained on all public code on Github, which includes a lot of bad code that people just put up there (like my own code that I wrote a decade ago).
As long as it is keeping track of when people do or do not accept its suggestions, it should get better over time. But in the meantime the best bet is to treat it like a smart autocomplete, where you still have to at least check that it got it right.
In the future maybe it will be smart enough to be treated like an intern -- trust that the code is right but still verify it yourself if the code is of any importance.
I see copilot as something that could be useful for generating boilerplate or starting code. Often when using a framework or library there's some boilerplate code that has to be there for things to work fine, so instead of having to go to the documentation and find the relevant snippets, maybe copilot can do this for you (although they often come with CLIs that do this for you). Case in point is machine learning code, where you often find lots of boilerplate setting up the models, the training, etc. Maybe copilot was developed by machine learning devs for machine learning devs?
I just read one of the comments below and thought: we should write SQL as an AST and not as a string. That would solve many problems, not only SQL injection but also the lousy understanding of joins etc.
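A tiny, purely illustrative sketch of what that could look like from the application side (closer to a query builder than a full AST):
// Build the query as data and compile it to a parameterised statement,
// so user-supplied values never get spliced into the SQL string.
function select(table, columns, where) {
  const params = [];
  const clauses = Object.entries(where).map(([column, value]) => {
    params.push(value);                       // values travel separately as parameters
    return column + " = $" + params.length;
  });
  const sql = "SELECT " + columns.join(", ") + " FROM " + table +
    (clauses.length ? " WHERE " + clauses.join(" AND ") : "");
  return { sql, params };
}
// select("users", ["id", "name"], { email: "a@example.com" })
// -> { sql: "SELECT id, name FROM users WHERE email = $1", params: ["a@example.com"] }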
If anyone wants to write software to detect the original source of Copilot code, I can explain how to detect variable-length, unaligned matches using fast hash-code lookups (over an unlimited-size corpus).
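One common way to do this (not necessarily what is meant here) is Rabin-Karp-style fingerprinting: hash every fixed-length window of the corpus into a lookup table, then slide the same window over the generated snippet. A rough sketch, with arbitrary window length and hash:
// Index every K-character window of the corpus by a cheap hash, then look up each window
// of the suspect snippet. Candidate hits still need byte-for-byte verification to rule out collisions.
const K = 32;
function hashWindow(text, start) {
  let h = 0;
  for (let i = start; i < start + K; i++) {
    h = (h * 31 + text.charCodeAt(i)) >>> 0;  // simple polynomial hash mod 2^32
  }
  return h;                                   // a real implementation would use a rolling hash
}
function buildIndex(corpus) {
  const index = new Map();                    // hash -> list of corpus offsets
  for (let i = 0; i + K <= corpus.length; i++) {
    const h = hashWindow(corpus, i);
    if (!index.has(h)) index.set(h, []);
    index.get(h).push(i);
  }
  return index;
}
function findCandidateMatches(index, snippet) {
  const hits = [];
  for (let i = 0; i + K <= snippet.length; i++) {
    const h = hashWindow(snippet, i);
    if (index.has(h)) hits.push({ snippetOffset: i, corpusOffsets: index.get(h) });
  }
  return hits;
}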
Has anyone tried to generate private keys from copilot? GitHub suppressed these from search a while ago, but it would be an easy mistake to have included them in their training data.
This seems to imply a need for GPL4: its purpose would be to infect code that was written by a bot that learned from GPL4 code. Otherwise this can be used to circumvent all except the most lenient free software licences.
I would say that GPL3 already probably does that, which means that nobody should be using this for actual code (except if it's GPL3). But it might be helpful to be explicit about this.
I haven't used Co-Pilot (though I have apparently contributed to it) but just based on what I know, it doesn't seem like a finished product ready to be launched, rather an experimental feature that you might want to get some more focus group testing on. The idea that anyone would be actually "using" it in production is puzzling, to say the least.
Are we seriously going to criticize a tab completion engine because it doesn't perfectly calculate what the phases of the moon are? Can you? I'm amazed it even knows what that means and has a vague idea of what such a calculation should look like, even if it fails.
Seems like a programming language that optimizes for "codes well with Copilot" could be fairly successful. Things like memory safety, human readability, verifiability, etc.
You could also have some kind of AI-driven testing/verification program -- Copilot and <other program> could go back and forth multiple times until the program is deemed correct and returned to the user.
People look at the beta release of the software and interpret flaws in an early version as critical, fundamental problems.
I am pretty sure the new releases will contain features like better software license handling (e.g. 3 tiers of license types - permissive, copy-left, hardcore copy-left), a trust score for snippets, and possibly some validation of the code for some languages.
> People look at the beta release of the software and interpret flaws in an early version as critical, fundamental problems.
Maybe because they realize the flaws are fundamentally inherent in the very core of the product. They're using a GPT-3 derivative here. DNN models are not the right tool for this job.
Why wouldn't the following work for licenses? Train 3 models:
1. Only permissive licenses - Only include in the training set repos with permissive licenses - MIT, Apache.
2. Copy-left - Step 1 + GPL, excluding AGPL and other "hardcore copy-left licenses".
3. All - Include all code, even unlicensed and AGPL.
Users could choose which version they prefer based on the profile of their project and their company? The majority of GitHub repos have a LICENSE file, so it doesn't seem implausible? (A rough sketch of the filtering step is below.)
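// Map a repo's license identifier to the lowest-numbered model allowed to train on it.
// The license lists and tier assignments here are illustrative, not a legal claim.
const PERMISSIVE = new Set(["mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"]);
const COPYLEFT = new Set(["gpl-2.0", "gpl-3.0", "lgpl-3.0"]);
function trainingTier(licenseId) {
  const id = (licenseId || "").toLowerCase();
  if (PERMISSIVE.has(id)) return 1;   // model 1: permissive only
  if (COPYLEFT.has(id)) return 2;     // model 2: permissive + copyleft
  return 3;                           // model 3: everything else (AGPL, unlicensed, unknown)
}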
Almost all permissively licensed code still requires preserving copyright notices or other attribution. So where Copilot is creating copyright violations, restricting its training to MIT- or Apache-licensed code will not resolve the issue.
I'd be more optimistic if the beta were crafted with the idea that it might have issues. If so, it would likely have some way of gathering feedback on suggestions that was a little more nuanced than just accepted/rejected.
This could actually be interesting. If it turns out that copy-left-based code completion is better than other options, it will create a strong incentive to spread it.
Can we all just finally realize that Copilot existing doesn't force anyone to use it? You can still write your 'famously better' code. Go on, show us why humans are better.
It seems the more pressing issue is that I will almost definitely question a future pull request, only to have the author tell me, "well, that's what Copilot wrote and it looked OK." If I catch it.
I'm all for AI coding assistance, but there's an abstraction layer in between copilot and myself (humans) that scares me a bit. At least an obvious failure prompts discussion and learning.