Building a no-code toxicity classifier by talking to GitHub Copilot

abeppu · on March 25, 2022

We're all focusing on the weaknesses of co-pilot (the comments can be longer than the code produced; you need to understand code to know when to elaborate your comment, etc).

But also ... what do you need to know to recognize that the concept of a 'toxicity classifier' is likely broken? We can do _profanity_ detection pretty well, and without a huge amount of data. But with 1000 example comments, can you actually get at 'toxicity'? Can you judge toxicity purely from a comment in isolation, or does it need to be considered in the context in which that comment is made?

Maybe you don't need to know about python, but if you're building this, you should probably have spent some time thinking and grappling with ML problems in context, right? You want to know that, for example, the pipeline copilot is suggesting (word counts, TFIDF, naive Bayes) doesn't understand word order? Or to wonder whether it's tokenizing on just whitespace, and whether `'eat sh!t'` will fail to get flagged b/c `'shit'` and `'sh!t'` are literally orthogonal to the model?

More people should be able to create digital stuff that _does_ things, and maybe copilot is a tool to help us move in that direction. Great! But writing a bad "toxicity classifier" by not really engaging with the problem or thinking about how the solution works and where it fails seems potentially net harmful. More people should be able to make physical stuff too, but 3d-printed high-capacity magazines don't really get most of us where we want to go.

ShamelessC · on March 25, 2022

> We're all focusing on the weaknesses of co-pilot (the comments can be longer than the code produced; you need to understand code to know when to elaborate your comment, etc).

See, this tells me you may not have even used copilot. Because while tutorials such as this (and the OpenAI codex tools) have you use comments explicitly to code, the reality is that you're not hammering out plain english requirements for copilot to work. You just code - and sometimes it finishes your thought, sometimes it doesn't. You hit tab to accept autocomplete, just like you would for any other autocomplete. So you are generally reading and evaluating what copilot thinks is a good output and choosing whether it goes in the program or not with the TAB key.

ImprobableTruth · on March 25, 2022

Copilot is great as a 'smart auto-complete' or when you need to do pattern based drudge work... but that's not what this article is about. It's trying to sell people on copilot as a no-code tool.

The leading question is this:

>But as helpful as it is for coders, what if it enabled non-engineers to program too – by merely talking to an AI about their goals?

and it answers this in my opinion deceptively by presenting what amounts to a parlor trick. Whether copilot in general is any good or not is in my mind totally separate to this.

ShamelessC · on March 25, 2022

Yeah, I agree Copilot is absolutely not a no-code tool.

frozencell · on March 25, 2022

Less no-code, more low-code high-tongue tool.

chockchocschoir · on March 25, 2022

It doesn't actually say that at all, because you can use Copilot in different ways. One way is the way you mention, by writing code and letting Copilot finish those off. Another way is the way GP describes it (and, the technique that the article uses) where you write comments and let Copilot fill out the code.

Just because one uses one of the ways doesn't mean they are not aware of the other way too.

ShamelessC · on March 25, 2022

Not logically, no. But it is implied because you actually get both such experiences on-demand in VS Code/vim/emacs. It's a fascinating experience and you find yourself writing more descriptive function names and variable names rather than using handwritten instructions. You quickly realize that comments are just one of many prompt engineering tricks available once you have access to this - and simply generating snippets as the linked article does is quite restricting sometimes.

Basically, the concern that e.g. comment length gets too long is a weird one, because you don't tend to actually use copilot that way if you have access to it through tab-complete.

Perhaps what I really mean is - people should try using copilot for an actual coding project. Its benefits aren't really obvious in contrived examples.

BeefWellington · on March 25, 2022

A few years ago I did some work with IBM's Watson Twitter integration. One of the fun things you could do was sentiment analysis. It was reasonably accurate for the extremes but anything in the gray area would be wildly off. A politely worded tweet that was scathing would come across high on the positive sides of the scale, whereas a perfectly reasonable sentence that included profanity as used in a quote would immediately be high on the negatives.

This part from the article made me chuckle, because IMO the author fell for some of the most basic language processing smoke & mirrors:

    …so we’ll give it some examples. When generating the array, it even creates the ideal variable name and escapes the quotations.

Here, it generates toxic_comments as a variable name, when the instructions were:

   # create an array with the following toxic comments: [etc]

This is pretty basic language parsing stuff that might have been kicking around awhile. I think the most basic english language parser could output something along the lines of what was suggested, given an understanding of what valid Python should look like. While impressive, it's not nearly as interesting or good as the rest of the work being done.

Copilot appears no different to most ML models out there. Poor and incomplete training data will yield ok results for popular things but as soon as you ask for edge cases it will fall apart like Siri trying to understand a Scottish accent.

Eventually it might get there with enough good representative training data but it's unclear to me how long that will take. If it tracks with speech processing models it might take decades plus.

Another consideration is that because the training data is being done using github public repos (at least last I read), it's likely that it's ripe for abuse. If that's still how they're doing it I'm looking forward to the TEDTalk in two years from a researcher who "hacked" the copilot AI by polluting its training data.

visarga · on March 25, 2022

> I think the most basic english language parser could output something along the lines of what was suggested, given an understanding of what valid Python should look like.

OK, I am waiting for you to propose a basic language parser that can do it. There's a reason we're only now having this debate - it was unconceivable 5 years ago, in the era of basic language parsers.

BeefWellington · on March 25, 2022

> OK, I am waiting for you to propose a basic language parser that can do it. There's a reason we're only now having this debate - it was unconceivable 5 years ago, in the era of basic language parsers.

This is really untrue. In fact, making "English as a programming language" was a goal of many older programming languages such as COBOL[1], BASIC, and PASCAL as early as the 60s. It's hardly a new idea and was hardly inconceivable "5 years ago" for something to output a programming language.

The sentence example here could easily be broken down by the ParseTalk model from the mid-90s[2].

Here's a recent ish example (2018) of someone developing a "fully English" programming language:

https://osmosianplainenglishprogramming.blog/2018/05/02/plai...

It's also a source of fun[3][4][5] for people.

These are all examples of either programming languages straight up using English as syntax, or lexical parsers that can break down language and provide you with the programmatic ability to make this kind of output.

The difference here is that while copilot is pulling in python examples based on its training data set, that one thing the author singled out for amazement could easily be done by these older non-ML methods. The value copilot is adding in the example is just outputting python compared to those other methods. The real value is way larger than that, pulling in potentially more complex code to accomplish a complete task.

It's a bit like seeing an all-electric cargo train and being amazed that a train can run on electricity, when electrified light rail has existed for a long time. The impressive part is not that a thing on rails can use electricity to move around, it's the fact that it can pull heavy cargo efficiently enough to make electric power viable.

[1]: https://en.wikipedia.org/wiki/COBOL#COBOL_60

[2]: https://arxiv.org/abs/cmp-lg/9410017

[3]: https://github.com/RockstarLang/rockstar/blob/main/examples/...

[4]: https://en.wikipedia.org/wiki/Shakespeare_Programming_Langua...

[5]: https://github.com/lhartikk/ArnoldC/wiki/ArnoldC

sdoering · on March 25, 2022

Years ago at pyData Berlin I remember a talk trying to classify comments from three major online newspapers with the question if we. Could detect where a comment was made.

One newspaper was left leaning, the other had the reputation of right wing trolls commenting and one was somewhat in the middle ground with a reputation of the audience being pseudo intellectual neoliberalists.

The 'center' (most typical) comment for these three sites totally was in line with these sentiments. The perfect proof (or confirmation bias).

But the classification didn't work. While there were clear cut cases (one has to love stereotypes) most cases were just neutral. Meaning they could have been made on any of these media sites. Either they were just too short or just not extreme enough.

I feel (used explicitly here) that toxicity is not something that is easily classifiable without deeper understanding of the context. Else, if feeling a comment was toxic was the measure one would need to query all walks of life from extreme left to extreme right and afterwards would probably be left with a lot of toxicity that doesn't tell us much except that different people will find different things toxic.

jayceekay · on March 25, 2022

didn't watson turn out to be useless and spaghetti code inside? aka ibm's marketing arm

BeefWellington · on March 25, 2022

I'll preface this by saying that my time working with it was while I was working at IBM, so feel free to take this with a grain of salt. In my time since I've worked in a few Data/ML and Security positions, so I do have a basis for comparison with other systems.

From what I saw, the actual language-processing part of it was top-tier. It's just it's a hard problem to come up with a demo for that people will actually respond positively to, hence the Jeopardy stint. It has limited real applications. It's really good at what it does but what it does isn't really widely useful.

Nobody wants to see "We're going to replace all our online help / support chat stuff with Watson" because people find those systems frustrating already, even if it would make things vastly better than some of the alternatives.

So you end up with weird stuff like Chef Watson, Doctor Watson, and so on -- things in areas where an ML model isn't going to replace a human anytime soon.

Then Marketing gets involved and suddenly anything that uses any kind of ML needs to have Watson slapped on it, even if it's not doing any language processing.

esjeon · on March 25, 2022

Welp, you're downplaying IBM too much. IBM got the product direction right earlier than anyone. Watson is a querying system w/ advanced NLP/IR/KRR capability running on dedicated compute chips, and large corps are more or less following this path. It's just that IBM did it too early and used rather old approaches, which doesn't grow well (thus "spaghetti").

Still, Watson is pretty much the only one in its class. There are good alternatives out there that worked well for many people, but they offer only a subset of Watson's feature set. If an organization need some real bang, Watson is the only option.

joshspankit · on March 25, 2022

We must suspend disbelief a bit regardless: Any “toxicity classifier” has a limited operational life as people who want to say toxic things will simply adapt their language and walk circles around it.

From simple letter substitution (sh!t) to completely different words/concepts (unalive) to “layer 2 sarcasm” (where someone adopts the persona of someone who supports the word view that’s against what they believe in a non-obvious attempt to rally people against that persona).

People have been getting away with being toxic in public for a long time. ML cannot keep up. Humans can’t even keep up.

echen · on March 25, 2022

(Post author here.) Agree with both you and the parent here! We work a lot in the NLP and Trust & Safety space, and many of the models and datasets we see do ignore context -- and so real-world "toxicity models often end up simply as "profanity detectors" (https://www.surgehq.ai/blog/are-popular-toxicity-models-simp...). Which would certainly happen with a Naive Bayes model as well.

Similarly, a lot of the training data/features ML engineers use ignore context -- for example, a Reddit comment may seem hateful in isolation, until you realize the subreddit it's in changes the meaning entirely (https://www.surgehq.ai/blog/why-context-aware-datasets-are-c...).

Regarding your point, we actually do a lot of "adversarial labeling" to try to make ML models robust to countermeasures (e.g., making sure that the ML models train on word letter substitutions), but it's pretty tricky!

IshKebab · on March 25, 2022

The fact that "toxicity" is not well-defined or black and white and you'll never be able to reach 100% accuracy is extremely obvious and not very interesting. That's probably why nobody is talking about it.

qsort · on March 25, 2022

Sure, but we probably can work on that a little more rather than throwing in the towel and saying 'toxicity is when text matches regexp'.

robertlagrant · on March 25, 2022

Well, we probably could throw in the towel. The definition of the word is ever-changing and context-dependent, AND subjective to the receiver. That doesn't sound like something you can train a model for.

achenet · on March 25, 2022

If you had access to the reactions of someone reading the content, you could _possibly_ train an agent to spot textual patterns likely to cause the reader to have negative reactions.

You could do a similar thing with a robot DJ, by feeding it a stream of the dancefloor, and training it to keep that dancefloor grooving.

robertlagrant · on March 25, 2022

But think about how this is trained. As with all of the authoritarian anti-offence rhetoric (i.e. not person to person politeness, but politeness enforcement), the response should be: who gets to decide?

Some concepts become less offensive over time; some more offensive. 20 years ago gay marriage was offensive in many parts of the world. Should that be codified into our communication tools? Offence is in no way objective, and this will never change.

There is genuine, vast utility in advocating for this sort of thing, but only if you want to be the person with the power to decide what everyone else is allowed to talk about.

Spivak · on March 25, 2022

Would you say the same about a less divisive speech pattern like a flamebait/flamewar classifier? Because it’s already double with simple heuristics like upvote/comment ratios and seems like a fine fit for a moderator assisted classifier.

IshKebab · on March 25, 2022

Yeah I would say the same. It's also not well defined or black or white but that doesn't mean you have to just give up. You can do better than nothing in both cases.

tomerv · on March 25, 2022

The first comment asks Copilot to import all the libraries needed for a toxicity classifier, and it imports libraries such as re (regex engine) and nltk (natural language toolkit). But what if I wanted a classifier for toxic chemicals and not toxic speech? That was my first thought when I saw "toxicity" in the title.

I'm now imagining a very frustrated junior developer a few years from now trying to argue with Copilot to write code for a classifier for chemical compounds, but it just spits out code for classifying text.

blackoil · on March 25, 2022

Like googling is an essential skill for developers. In few years if Copilot deliver its promise, navigating it would be treated the same. You may also have an interview round wherein optimization would be how quickly can you get Copilot to write the expected code.

junon · on March 25, 2022

If this is what software engineering turns into, put a bullet through my head please.

caymanjim · on March 25, 2022

I haven't tried Copilot, and based on what I've read, I don't think I'd want to use it in its current form, but I'd love for software development to evolve to the point that I never have to write any boilerplate code again. Even with DSLs, code generation, autocompletion, snippets, and countless libraries and frameworks to draw upon, the bulk of what I do as developer is write the same boring code over and over again. I welcome the day that a predictive IDE allows me to focus on the interesting aspects of problem solving without having to do all the tedious bits, while still allowing me to inspect and modify the tedious bits when needed.

visarga · on March 25, 2022

I am eagerly waiting the day an AI will collaborate with me like that. You want one for writing the tedious bits of code, I want one for managing information in a research project, others want one to automate tasks RPA-style.

tekknik · on March 25, 2022

This will make you less valuable, unless there’s enough interesting problems coming down the pipe to keep us all employed. And then once they expand copilot to also solve the hard problems then what’s the use for you anymore?

This very much seems like cheering on the destruction of your career.

dgb23 · on March 25, 2022

The boilerplate problem has been solved decades ago. You can use macros or code generation. Letting Copilot do that for you seems a bit like an ugly workaround. We sometimes need those, but at some point we should be taking a step back and design a program instead of hacking away.

eurasiantiger · on March 25, 2022

Copilot can already write your boilerplate for you once it has a single example of how to do it.

madrox · on March 25, 2022

It's not that far off today from how junior engineers treat stack overflow. There's just fewer intermediate steps.

KMag · on March 25, 2022

I think we all cargo-cult our way into programming and then over decades get better and better understanding of what we're doing and why.

I remember being 7 years old and not really understanding the AppleSoft BASIC manual. I memorized the variable names in the code examples, not realizing I could name variables whatever I wanted. A$, B$, LEFT_FOOT$. The latter was there to be obvious that one could name variables whatever one wanted, but 7 year-old me didn't understand. I remember talking to an older kid on the school bus whose father worked for IBM, and the confused look on his face when I started rattling off my list of memorized variable names. I can still picture his face, but I forget his name. I'm pretty sure he was Mike. Thanks for straitening me out, Mike (Schmidt?).

At least we're getting more effective cargo cults.

madrox · on March 25, 2022

This maps to my experience as well. Think many senior engineers forget what it's like to learn programming. It is, in fact, incredibly hard.

corobo · on March 25, 2022

Just bill yourself as an artisanal developer and 5x your rates

Aeolun · on March 25, 2022

You can ask copilot to do it for you!

ripper1138 · on March 25, 2022

Funny you point that out. In this specific example, what if I just googled “Python comment toxicity classifier” and looked for a complete solution that way? It’s like a semi-smart googler in your IDE.

visarga · on March 25, 2022

> In few years if Copilot deliver its promise ...

That's ages ago in AI time.

ShamelessC · on March 25, 2022

> I'm now imagining a very frustrated junior developer a few years from now trying to argue with Copilot to write code for a classifier for chemical compounds, but it just spits out code for classifying text.

So just the future version of a junior developer not knowing how to use their tools? Yeah, that scans. Still sounds incredibly useful however. The alternative is, of course, a junior developer fumbling as they try to write said program entirely from their learned skills and experience.

dgb23 · on March 25, 2022

That’s a very important thing to experience.

There’s not always a quick fix or easy path. You can’t always patch existing stuff together or just wait until the problem goes away.

And when a tool helps you too much, then is there really a point in what you’re doing? It’s not even a learning experience anymore.

ShamelessC · on March 25, 2022

> And when a tool helps you too much, then is there really a point in what you’re doing? It’s not even a learning experience anymore.

Have you seen the documentary about AlphaGo? After watching it, and seeing Lee Sedol just utterly devastated by losing to a computer, I felt like I too would surely feel the same thing in my life. I mean, surely Lee Sedol is a far more skilled Go player than I am a skilled programmer.

Anyway, sorry for the rambling. I agree - it's deeply important to _actually_ learn how things work. That's why I wouldn't recommend copilot to a junior dev. Unfortunately, the way things are going - those junior developers are going to use it anyways and I tend to be more of a realist than an idealist.

dgb23 · on March 25, 2022

That’s what computers and copilot are good for. Finding solutions for well defined problems based on data we feed them. They can’t for example design a novel and interesting video game, or build a bespoke tool for an SME.

StopHammoTime · on March 25, 2022

Just to clarify, it's not really no-code: pseudocode is the new bytecode it would seem and this is just compiling that into usable code.

You still need to be able to code and understand what you're doing. You can't just ask simple questions and get complex answers. You still have to be capable of asking complex questions.

A common scenario I can think if is where I struggle to remember the name or API of the exact thing I want to do but I know exactly how it works - typing that in and getting a result would improve my workflow, but it's just saving a trip to Google, we're not talking the difference between doing and not doing, just a saving a minute.

I would rate the value of this more as interesting rather than useful, simply because as another commenter highlighted it's just easier to write code. It could be useful incrementally but not for everything.

CallMeJim · on March 25, 2022

Note that in part of the process, Copilot was the one asking complex questions when the human programmer didn't know how to proceed.

Copilot adds tremendous value for someone who knows what they want, but not how to do it.

For example, I'm not a great programmer. I'm also a lazy programmer. I had to convert a time to a specific format, in a specific timezone in JS, and I couldn't be bothered looking up documentation for Date.toLocaleTimeString (or is that Date.toLocaleString?).

I wrote a comment outlining exactly what I wanted: // given a date in ISO format (and UTC timezone), return the time in hh:mm AM/PM format (and x timezone) and immediately Copilot generated the code I was after.

Making something easier can definitely mean the difference between doing and not doing — I've taken on a lot of projects I wouldn't have attempted without Copilot.

zkldi · on March 25, 2022

> I wrote a comment outlining exactly what I wanted, and immediately Copilot generated the code I was after.

How do you know it was what you were after? Like you said, it could be .toLocaleTimeString or .toLocaleString (or something else).

How do you verify that the AI isn't giving you broken/incorrect code? I guess you could check the docs, or run the code yourself, but at that point what's the value add for copilot?

injidup · on March 25, 2022

The negative comments seem to assume an open loop development strategy where if copilot fails to give the 100% correct result it is a fail. Rather, even if it is wrong it can get you close and if not close it can give you ideas. You have to close the loop and use your own intelligence as well.

For example I can't draw faces but I can recognize a badly drawn face. If I ask an AI: Please draw me a 35 year old man with receding hair and crooked teeth I can quickly validate the result is fit for purpose. If it is not what I want I can modify the query. I then learn quickly how to prompt the AI to give me what I want.

In the example you give we can assume that the AI has produced a plausible option even if wrong. For example a scenario may be:

   # User: Write a comment "Convert the date to the current local for printing"
   # Copilot: generates the method 'toLocaleString'
   # User: Mouse hover over the method to get the documentation for the 'toLocaleString' method
   # User: See that the method produces the wrong output. We realize we don't want the date
   # User: Modify the comment to "Convert the date to the current local for printing time only"
   # Copilot: generates the method 'toLocaleTimeString'
   # User: Yes this is the one I want. Moves on

The key point is you have to know what you want and be able to recognize a correct result. Validating a correct result is often easier than coming up with the correct result. You have multiple strategies to validate the result.

Testcases, compilation, code review, documentation, IDE intellisense

This obviously gets harder the larger the amount of code copilot is being asked to generate. But good software engineering practises still stand. Try to keep your functions and modules small and to the point.

islon · on March 25, 2022

> For example I can't draw faces but I can recognize a badly drawn face. If I ask an AI: Please draw me a 35 year old man with receding hair and crooked teeth I can quickly validate the result is fit for purpose.

But code is not a face: you can't easily judge if it's correct or not, if you could you wouldn't need copilot in the first place, so now you have to trust it's correct and, if it isn't, you need to search for the correct answer anyway.

alright2565 · on March 25, 2022

They provided an example situation right under "For example a scenario may be:".

Their experience matches mine in my use of copilot.

croon · on March 25, 2022

Verifying is inherently both easier and less energy intensive than producing.

I can critique a great book I couldn't write. I can marvel at John Carmack's early iD code without having been able to come up with it. I can be immensely impressed by what golfers produce for a mundane problem.

I'm not saying this is what copilot produces, but the concept could absolutely be useful, in theory.

vicda · on March 25, 2022

I used copilot to learn the crufty parts of bash and to pick up swift from zero. The smaller the problem you're trying to solve the easier it is for copilot to generate it perfectly for you.

Think of it like a snippet engine on steroids. It's a huge value add.

bobsmooth · on March 25, 2022

>How do you know it was what you were after?

By testing it

>but at that point what's the value add for copilot?

Not having to look up the docs and writing it yourself.

mordymoop · on March 25, 2022

Right, I don't know why people are so invested in minimizing how powerful this tool is. I have never properly learned Javascript but I can write Javascript using Copilot because copilot doesn't really make syntax errors, and I can recognize correct semantics, e.g. is it probably doing what I mean for it to do. And then I can, as you say, just test it.

It also bears repeating that Copilot will only get better.

chronolitus · on March 25, 2022

Reminds me of this post by Scott Aaronson: https://scottaaronson.blog/?p=6288

"Forget all that. Judged against where AI was 20-25 years ago, when I was a student, a dog is now holding meaningful conversations in English. And people are complaining that the dog isn’t a very eloquent orator, that it often makes grammatical errors and has to start again, that it took heroic effort to train it, and that it’s unclear how much the dog really understands."

deadbeeves · on March 25, 2022

Because those points are key to whether the technology can (or is worthwhile to) be evolved further. Is a dog that just holds conversations without any understanding what we want, or do we want the dog to eventually take on more interesting tasks? Can we get there with this method or not? How much effort would it take?

Technology that can only make you go ooh and ahh is pretty useless.

bmitc · on March 25, 2022

I don't really understand this. You're not coding directly in the language, but now you're coding in an implicit language provided by Copilot. From what I've seen on Copilot, although it is an impressive piece of tech, all it really points out is that code documentation and discovery is terrible. But I'm not for sure writing implicit code in comments is really a better approach than seeking ways to make discovery of language and library features more discoverable.

And I know it sounds silly and like "I had an idea like that once" (see Office Space), but I actually came up with the idea for or at least a similar one to Copilot in an off comment to a coworker back in like 2014 or so. The idea was that as you wrote code, it would display on the side similar code that had been written by others doing the same or similar thing, and then it would allow you automatically upload small processing functions to some sort of cloud library. Same thing for doing autoformatting, although that's less of a concern now that formatters are becoming popular. The context I was working in was visual languages though. I had even started writing a tool during an "innovation week" (that I never showed) that would start visually classifying whether code written in the visual language was "good" or "clean" or not. I never got anywhere with it and mainly just have some diagrams generated from that project that were buggy so that they kind of look like art.

ShamelessC · on March 25, 2022

You "came up" with the idea for intelligent autocomplete? And are you aware that this project actually required big innovations in language modeling and a supercomputer? Because I would say that is far more central to the concept behind the tech than the interface.

bmitc · on March 25, 2022

An idea is not an implementation, and I clearly mentioned it was an offhand comment in a casual conversation. My "idea" was exactly what I described above. Nothing more. I'm sure several had this idea, and Copilot was probably already in development. My comment was just a way to give a personal anecdote. I'm not sure what your point or complaint is. Did you somehow miss the reference to Office Space? It wasn't a serious claim. Just a segue to some thoughts I had.

ShamelessC · on March 25, 2022

okay yeah - I apologize. In the context of other comments it seemed a little more dismissive of the tech itself. I see now that you were quite clearly going for humility. Should have caught it on the first read however, sorry again.

bmitc · on March 25, 2022

Not a problem. :) Context and tone is hard in text. I felt silly saying that but I did have the idea I mentioned. I have a pretty good track record of having ideas I have no clue how to implement. Haha. Why the visual programming analysis project went nowhere. It's like tech and programming shower thoughts.

The Copilot tech is completely of my league.

kcorbitt · on March 25, 2022

What funny timing! Just this week I've actually been working on an open source VS Code extension that uses OpenAI's new code edit API[1] to let you write or edit code in your IDE by typing instructions.

And as a bonus related to the article title, it literally lets you talk to your editor (ie you can press the keyboard shortcut and then give edit commands by voice[2]). I've been leaning on it heavily for the last few days and the setup feels really productive!

If you want to try it out you can install it here: https://marketplace.visualstudio.com/items?itemName=clippy-a...

You can also find the full source code here: https://github.com/corbt/clippy-ai/tree/main/vs-code-extensi...

I'd love feedback!

[1]: https://openai.com/blog/gpt-3-edit-insert/

[2]: I just wrote the voice command interface yesterday and it's still highly experimental. Relies on having ffmpeg installed on MacOS and doesn't work with all audio setups yet. But there's a clear path to making it more robust.

jwithington · on March 25, 2022

this is cool! to clarify, it's different than the existing Copilot extension because this lets you edit existing code and uses voice commands?

hartator · on March 25, 2022

Notice that the comments used to generate the code via GitHub Copilot are just another very inefficient programming language.

dwohnitmok · on March 25, 2022

There is nonetheless something extremely valuable about being able to write at different levels of abstraction when developing code.

Copilot lets you do that in a way that is way beyond what a normal programming language would let you do, which of course has its own, very rigid, abstractions.

For some parts of the code you'll want to dive in and write every single line in painstaking detail. For others `# give me the industry standard analysis of this dataset` is maybe enough for your purposes. And being able to have that ability, even if you think of it as just another programming language in itself, is huge.

MichaelBurge · on March 25, 2022

Programming languages have syntax and semantics, while text-generators are statistical. So I wouldn't call them a programming language, since "having well-defined semantics" is more fundamental than "is often used in an edit->run loop".

michalhuman · on March 25, 2022

What makes it inefficient? It is verbose and similar to natural language. Given that code is more often read than written, isn't the code that is easier to understand more efficient?

______-_-______ · on March 25, 2022

On the other hand, most code is read more often than it is written, and those comments are very readable!

bobsmooth · on March 25, 2022

Notice that the C used to generate the machine code via the compiler is just another very inefficient programming language.

junon · on March 25, 2022

As with most harmful speech classifiers (even classic models) this most likely won't catch the more passive aggressive remarks. Those worded innocently but imply something terrible. I've had a 100% success rate getting these sorts of models to tell me asking someone to "kindly end their own life" is not rude, toxic or harmful.

esjeon · on March 25, 2022

Not really no-code. Let's be honest. The OP is taking steps just like how an experienced SW developer would. Copilot simply cut the need for reading through documentations. This doesn't really say that Copilot can replace programmers.

p.s. Does anyone know when Copilot will update the insecure example on their website? Or are they just trying to be honest with the possible quality issues with the generated code?

saurik · on March 25, 2022

I mean, I would hope they wouldn't try to "update" it by manually changing the results (as, as you note, that would be horribly dishonest).

Ozzie_osman · on March 25, 2022

This is a game-changer, even if it doesn't work 100% of the time. I only infrequently need to use notebooks and dataframes, I'd say once every few months. Frequently enough that I have a vague idea of what I need to do but not frequently enough that I can remember syntax.

With this, I don't need to memorize the syntax OR be bottlenecked on looking at documentation or stack overflowing the commands I need.

qayxc · on March 25, 2022

> With this, I don't need to memorize the syntax OR be bottlenecked on looking at documentation or stack overflowing the commands I need.

In other words: you're celebrating the fact that a tool allows you to become more and more incompetent.

I don't have much hope for future generations at this point.

Aeolos · on March 25, 2022

People said that during the switch from assembly to C, and then again from C to managed languages. Yet, at each point, there are more and better software engineers, solving more and more challenging problems at each step - the modern web would not exist if all we had at our disposable was 80's era assembly.

Aren't you at least a bit curious what new possibilities this technology could enable? What new discoveries could e.g. an expert doctor or a biologist achieve given access to programming tools without spending decades learning programming?

rob74 · on March 25, 2022

I disagree that the problems are more and more challenging - "modern web" applications mostly only do what desktop applications did in the 80s/90s, just with some added complexity due to client/server and trying to use a technology that grew "organically" and wasn't designed for building such applications (HTML/CSS/JS).

Also, while the programming languages you mentioned did evolve to higher levels of abstraction, one thing that didn't change was that you still were telling the computer what it should do. Of course, you could still run into problems when the abstractions you were relying on didn't do quite what you were expecting, but now you have Copilot giving you globs of code that supposedly do what you want to do. How are you supposed to check if it really does that if you haven't got the slightest idea about programming?

Aeolos · on March 26, 2022

I don't think there's any objective measure by which software has not become more complex over time.

The uefi bootloader contains more code and complexity than your average 90s OS. Your smartphone is running at least two, possibly more operating systems. Desktop software of the past decades did not have to deal with even a fraction of the security considerations of even a simple web app. The base runtime of your typical managed language alone is more complex than complex desktop apps of the past.

> How are you supposed to check if it really does that if you haven't got the slightest idea about programming?

This one is easy: you check if its output matches your expectation. In the same way you don't need to know how to program a calculator in order to use one.

If you want to get fancy, you could even ask it to encode your expectations as a test suite.

You may be surprised to hear that a large amount, maybe even the majority, of academic research does not use source control, unit testing, etc, yet they still manage to get work done.

I've even encountered this in industrial research from teams in large companies that you'd expect to know better...

Ozzie_osman · on March 25, 2022

Well, for starters, I'd love to count myself as a "future generation" but statistically speaking I'm older than most people in our field by a good margin.

More importantly, only about 10% of my job time is spent doing work that is hands-on technical work. I'd say probably 1% of that total time is spent doing notebooks with dataframes. Whether I am competent or not is in no way determined by whether I can memorize the syntax to how to group by and count a dataframe. In fact I'd argue it's probably a poor use of time.

Whether memorizing things like syntax is part of competence or not is highly dependent on context. The ROI of me memorizing that specific syntax would probably be highly negative.

I'd fathom there are countless examples like that. There are people who only rarely need to code. There are people who code a lot but only rarely need to use a certain library or language. For people like that, making the code more accessible is a huge win (that includes IDEs, auto-complete or easy links to documentation, and things like Copilot).

lkschubert8 · on March 25, 2022

Is memorization of an API surface competence? This feels very much like a "back in my day" argument.

pech0rin · on March 25, 2022

echoing a bunch of comments but this seems sort of like a nightmare. its like the classic “dont use comments that are exactly what the code is doing”. basically you are requiring writing this type of boilerplate comments which are completely useless but are now so the machines can write the code for you. i guess if you could have some tool that auto-removes these comments afterwards it wouldn’t be terrible but i just see this as a way to have people completely forget apis and then not actually be able to find more powerful tools in a language just living on the rails that copilot provides for you. overall seems like a step backwards, especially if newer devs use this as a crutch when jumping in. now we have a generation of devs who dont actually understand the way things work.

i guess stack overflow has a similar problem but at least there people provide documentation, explanation, and helpful links. this just force feeds you some code. i dont see this as a positive movement for our industry as a whole

aaaaaaaaaaab · on March 25, 2022

>now we have a generation of devs who dont actually understand the way things work.

Can’t wait for this to be true! I will be treated as a demigod compared to them. Job security for life!

arciini · on March 25, 2022

This is really pretty impressive. I think Copilot for these kinds of one-off analysis tasks where specific data manipulation rather than structuring abstractions makes a lot more sense. Structuring libraries or building UI requires a lot more understanding of potential users - in that case, writing the requirements is honestly the harder part.

softwarebeware · on March 25, 2022

I'm out almost immediately. The first comment is more text than the code that it produces.

______-_-______ · on March 25, 2022

That happens sometimes when you move up an abstraction level. I bet "self.count += 1" is a lot longer than the machine code it generates.

rodiger · on March 25, 2022

Falls apart when we get down to binary, but holds across most levels of abstraction

0xFF0123 · on March 25, 2022

Only if binary is represented as ASCII, which is an unfair comparison

lofties · on March 25, 2022

Think of GitHub copilot as StackOverflow on steroids -- a quick way to write code when you're not sure how to achieve what you're trying to do.

After all, "How to parse a CSV file in Python" is longer than "csv.reader(file)" but without knowing that "csv.reader" exists, you have no other way but to tell Google what you need.

klabb3 · on March 25, 2022

> Think of GitHub copilot as StackOverflow on steroids

This is how I already think of co-pilot, but these steroids seem to be mostly for prototyping.

SO often have comments and context such as "this works with 98% of browsers", "this isn't recommended, try X instead", "this works but can break library code because it changes the global scope", "this stopped working in version X" etc etc. Context like this can be important to take into account depending on what you're building.

MichaelBurge · on March 25, 2022

Start a line with

    // CAVEATS:
    // POTENTIAL ISSUES:
    // Above is deprecated. Use below code instead

and ask Copilot to auto-complete.

klabb3 · on March 25, 2022

// this should never happen

chockchocschoir · on March 25, 2022

    fn classify-toxicity(text: string) {
      do_work(text)
      while true {}
      // this should never happen
    }

neurostimulant · on March 25, 2022

Soon we can all quit our programming jobs and become managers managing copilot by writing copious amount of comments to persuade copilot into generating our software products. Finally no coding required!

da39a3ee · on March 25, 2022

I think you're missing the point: this is starting to open the door to people who can't code.

faddypaddy34 · on March 25, 2022

I think the point is if you cannot/do not know how to code you cannot confirm what co-pilot is doing. Especially when it comes to complex topics like drawing context from natural online language using machine learning.

Aeolos · on March 25, 2022

That's where testing comes in, no?

Entering the programming field then becomes an iterative loop of you instructing an AI to generate code, generate tests and iterate / re-adjust until it does what you want.

"You" as in the next generations of programmers in a decade or two.

ShamelessC · on March 25, 2022

I adore copilot and use it daily, but I'm pretty sure if I had always depended on it, I wouldn't be able to properly parse correct from incorrect programs.

It's a really really cool tool and a lot of these comments are just shallow dismissals from people who haven't actually used it and like to be reactionary on the internet because that's the world we live in apparently. But I think it works best when it's used by people with experience.

Hopefully future models with higher accuracy and research in grounding can get us to that point however.

wojcikstefan · on March 25, 2022

1. This is not “no-code”. You still have to read & understand the code Copilot generates.

2. I’m very skeptical of a small group of people reading a bunch of online comments and deciding what is “toxic” and “non-toxic”, even more so when it’s done with no clear definitions/guidelines. As their GitHub repo [0] says:

> Rather than operating under a strict definition of toxicity, we asked our team to identify comments that they personally found toxic.

[0]: https://github.com/surge-ai/toxicity

stitched2gethr · on March 25, 2022

This is actually pretty impressive. More so than I expected, and I sincerely hope this opens the door to simple solutions for those who are still learning or don't code often.

That said, this isn't the robot that replaces us, obviously. Making the process of getting to 80% faster is better for everyone, but the last 20 is tough and anything further needs real expertise. I like how promising this is for the masses.

vba616 · on March 25, 2022

I thought at first this was a classifier for the toxicity of no-code solutions.

For instance, Microsoft Power Automate should rank highly.

rogue7 · on March 25, 2022

This is impressive, Copilot knows scikit-learn better than the data scientist that I am.

Loeffelmann · on March 26, 2022

I've been using copilot for a bit now and it's honestly really impressive. I was skeptical at first and didn't really believe all the praise but it works so well. You still have to understand what the code is doing but more often then not copilot spits out a out of the box working solution. It is phenomenal at writing tests. I can pretty much tell it "write tests for this function" and it will do it with surprising Quality and maybe even goes through cases I haven't thought about.

I think this technology will really shake up how we code.

TauNeutrino · on March 25, 2022

It's an AI writing another AI, the miracle of guided reproduction! As programmers we should appreciate the subtle meta in that.

It is also highly symbolic that the first AI (copilot) was created to save humans from repeating toil, while the second (classifier) is about controlling and limiting us.

I believe the author chose to apply his method to this particular example intentionally for the two above points, not because of the hype of toxicity.

holografix · on March 25, 2022

This would be awesome for crap I don’t want to learn like CSS

bradleybuda · on March 25, 2022

So, has anyone asked Copilot to write a better Copilot yet?

hoosieree · on March 25, 2022

    // generate 1e13 different versions of bubble sort and add to db

fsargent · on March 25, 2022

I seriously thought that GitHub CoPilot was suggesting how to find new kinds of sarin gas. https://www.theverge.com/2022/3/17/22983197/ai-new-possible-... How long until it does?

eurasiantiger · on March 25, 2022

You could already ask it to do this.

MrYellowP · on March 25, 2022

I'm not sure people understand how utterly dystopian and fascist this is. It's like people believe that this is a good thing, instead of understanding how totalitarianism is spreading literally everywhere.

"In the name of what's Good & Right, you have to behave how we want you to ... or else."

spyremeown · on March 25, 2022

What are you talking about, dude? How is an AI that generates code fascist?

DeathArrow · on March 25, 2022

I would like to see a project adding together capabilities of both Autopilot and Intellicode. Copilot uses GPT-4 and GitHub project for training and is giving suggestions based on few lines, Intellicode is reading the whole project and is giving suggestions based on that.

boredumb · on March 25, 2022

The last thing this world needs is automation around people calling things toxic or problematic.

amelius · on March 25, 2022

This works because there's a lot of ML code out there, and it's all very much the same.

ah27182 · on March 29, 2022

The page is not working anymore, getting a 400 error

linkdd · on March 25, 2022

Can we ask Copilot to write a proof for the collatz conjecture? or P=NP?

achenet · on March 25, 2022

we could ask, but would have no garantee such a proof would be correct.

Metacelsus · on March 25, 2022

From the title I thought it would be about chemical toxicity.

eric4smith · on March 25, 2022

Impressive BUT.

Who is defining toxic speech? Where is that data being taken from?

This is the definition of using AI to set what the edges of “speech” should be based on potentially flawed data.

This is a clown world.

r3trohack3r · on March 25, 2022

> In this example, we’re using the Copilot extension for Visual Studio Code, and a free toxicity dataset that we built;

(Emphasis mine)

Following that link:

> Surge AI is a data labeling platform and workforce. Our labeling team pored over tens of thousands of social media comments to build this toxicity dataset. Each comment was then evaluated by multiple members of our team to determine its severity level.

bobsmooth · on March 25, 2022

I feel so sorry for the labeling team. Hope they were paid well.

Xorlev · on March 25, 2022

I think you missed the forest for the trees. It isn't the model that matters, it's that copilot is building the classifier from intent (comments). It wouldn't matter if it was classifying flowers instead.

eric4smith · on March 25, 2022

No. I did not miss it. The work is pretty good.

My problem is with the dataset and datasets like this overall that sets the tone through AI of what is acceptable and what is not.

hombre_fatal · on March 25, 2022

This is absolutely insane. I had no idea Copilot was this good.

The negativity here just seems like sour grapes or weird goal posts.

Sure, it makes mistakes and needs verification. But know what also makes mistakes and needs verification? All the code I already manually write as I tediously ratchet towards a solution. Removing some cycles from that process is a win.

Just stubbing out close-enough boilerplate is a win by itself, like setting up an NLP pipeline or figuring out which menagerie of classes need to be instantiated and hooked up together to do basic things in some verbose libs/langs.

csomar · on March 25, 2022

Copilot is insanely brilliant and good. Its only issue is that it takes too little context (up to what your cursor is pointing on the file, at least on vim). If it did take all context (your whole project, maybe your shell history, your data files, the imported libraries code, GitHub repo issues/PRs, etc...) and it had some LSP checker for errors, add all of that to GPT-4; and maybe we'll have something that can do complex coding stuff auto-magically.

Mountain_Skies · on March 25, 2022

Even though it is based on mostly human written code, Copilot makes mistakes that are different from the type human coders typically make. It will take a different skill set to detect and correct the errors made by systems like Copilot. The same is true for self-driving cars. This doesn't mean that we shouldn't use these technologies, just that there will be adaptations to our behavior we'll need to make if we want to make use of them.

fn1 · on March 25, 2022

> Even though it is based on mostly human written code, Copilot makes mistakes that are different from the type human coders typically make.

Can you give an example for this?

Satam · on March 25, 2022

Probably not what OP intended but here's what I encountered. When autocompleting code for a Pythonic database query it would finish something like "query.all()" as "query.all().delete()". Meanwhile, as a programmer, I'm usually quite vary of deleting all the records from a database unless I'm very sure that's what I want.

ShamelessC · on March 25, 2022

> The negativity here just seems like sour grapes or weird goal posts.

Indeed. Every negative comment I have seen here has been a shallow dismissal by someone who clearly hasn't engaged with the tool. I'm not sure why people here are so primed to shit all over anything potentially innovative, seemingly even without background knowledge. Like, is there something inherently offensive to coders about a model that threatens to do their job? Or is it just years and years of people getting burned by previous "AI" projects without knowing that this one is actually rather impressive and comes from good research?

Keep shallow dismissals to yourselves people. It's in the site's rules.

xodjmk · on March 25, 2022

You can't imagine how some people might have an adverse reaction to a low-barrier of entry arbitrarily defined self-appointed moral policing 'AI' tool generating framework? Not all ideas are good ideas. It doesn't mean the ops are not talented, just misdirected.

ShamelessC · on March 25, 2022

> low-barrier of entry arbitrarily defined self-appointed moral policing 'AI'

So, a software developer?

Just kidding. Perhaps it would clear things up to know that "toxicity" classification is simply an introductory topic in natural language and machine learning. It is an interesting "problem" to try to solve precisely because of how ambiguous language gets. As far as I can tell, the article is mostly not concerned with the quality of the the classifier and is meant to be a proof of concept.

In any case, people have been coding and using such classifiers all over the internet since their inception. Believe me - this isn't accelerating that and the damage is likely mostly done. My advice? Self-host a web server.

dash2 · on March 25, 2022

You might think it's awesome and well-executed, and still think that an automated toxicity classifier is a terrible idea.

ShamelessC · on March 25, 2022

It's simply a fun toy tutorial and likely good introductory material for people trying to learn language modeling and classification, two important topics in the broader NLP/machine learning scene.

The article isn't making any suggestions about whether or not they are good or bad ideas.

IshKebab · on March 25, 2022

> I'm not sure why people here are so primed to shit all over anything potentially innovative

Maybe jealousy - people often downplay others' achievements to make theirs feel better. Or pride - "I don't need no stinking AI assistant! What are you saying? I couldn't write this myself?". I find the latter is a common reaction to static types too.

ShamelessC · on March 25, 2022

> Maybe jealousy

Somehow, I suspect this the most. It's the defensive tone they tend to strike, I think. Particularly apparent in threads about new research. Lots of "I actually had this idea", and "my concern for this is [slippery slope fallacy here]".

I sincerely hope that I never become so egotistical about my own achievements (or lack thereof) that I instinctively despise those who achieve more. Fuck that.

nimih · on March 25, 2022

This is definitely a very cool tech demo, but I got the same feeling reading this as I do when I read a blog post years ago where a guy walked through using very rigid green-red-green TDD to solve a hairy algorithmic problem[0]: it sort of seems like the person already had the shape of the solution in their head before they started writing the code.

Which is maybe the point! As the article points out, remembering the correct incantation to get matplotlib to spit out a bar chart is hard[1]; I certainly have to look it up literally every time (well, these days, I just use tools which have more intuitive APIs, but that's maybe besides the point). I don't really know what it means to "binarize" a dataset, but apparently the language model did, and apparently seeing the giant stack trace when trying to plot a precision-recall curve was enough to prompt the article writer to realize such an operation might be useful. When you're doing exploratory analysis like this, keeping a train of thought going is extremely important, so avoiding paging back and forth to the scikit-learn documentation is obviously a huge win.

But, on the other hand, this isn't a "no-code" solution in any real sense, because for all intents and purposes the author really did all the difficult parts which would've been necessary for a "fully coded" solution: they knew the technical outcome they wanted and had very good domain knowledge to guide the solution, and, shoot, they still ended up needing to understand semantics of the programming language and abstractions they were working with in that stacktrace at the end. It's still extremely neat (and, presumably, useful) to see the computer was able to correctly guess at all the syntax and API interfaces for the most part[2], but I don't really think you can fault people for wanting to push back against the idea that this is somehow fundamentally transformative, since I think it's pretty obvious that the human is (still) doing the hard and interesting parts and the computer is (still) doing the tedious and boring parts. Maybe people shouldn't be getting flustered about a click-baity title over-promising a hip new technology, but as you say:

> Or is it just years and years of people getting burned by previous "AI" projects without knowing that this one is actually rather impressive and comes from good research?

There's definitely some of this.

---

[0] I wish I could find the link for this, but I'm very bad at google these days.

[1] To risk ascribing agency to a statistical model of github commits, it is sort of funny that the co-pilot pulled in seaborn as a dependency but then did everything directly with calls to plt and DataFrame.plot.

[2] I don't really have the expertise myself to tell you whether that scikit pipeline is at all reasonable, I suppose. It sure sounds fancy, though.

ShamelessC · on March 25, 2022

Thank you! This was researched and informed by the article we're all meant to be discussing. That is all I ask, ha.

I 100% agree that this is not a "no-code" solution as is defined. On the other hand, I wouldn't really mind a definition of "no-code/low-code" that involved this - but indeed, not there yet.

And yes, more-or-less I generally agree that this is a tool that must be used by experienced developers. So I can see how a false claim of "no-code" (basically defined as devs-not-needed) would trigger folks to be a little defensive.

However! If you are a programmer with experience, and you want to make your work just so much easier, then copilot is a great tool. I implore you to try it yourself with VS Code/Vim/emacs rather than using openai/codex as the autocomplete is what makes it great.

skhr0680 · on March 25, 2022

> Sure, it makes mistakes and needs verification. But know what also makes mistakes and needs verification?

The problem is when it makes something that looks OK but does the opposite of what you want it to. See: machine translation

xodjmk · on March 25, 2022

Please add "No-Code" and "Toxicity Classifier" to your toxicity dataset.

ibeckermayer · on March 25, 2022

Banger

dusted · on March 25, 2022

This comment will of course be down voted, I'll attribute this to selection bias caused by the headline of the article.

You can't classify a comment as boolean toxic, toxicity does not exist in a vacuum. To extend the analogy from it's biological counterpart, toxicity depends on the organism. You should never just a piece of text in isolation and draw any conclusion about it. It must understood in context, both that of the subject, the recipient and the sender.

ShamelessC · on March 25, 2022

I mean, what you're saying just isn't really directly on-topic. The article's focus is a a copilot tutorial, clearly meant to be illustrative rather than literally used in production. So it comes across like you're criticizing the article for doing something it isn't really concerned with doing to the degree you are expecting.

Does that make sense?

dusted · on March 25, 2022

It does make sense.

However, the framing of the tutorial is clearly about using automated censorship at scale.

Someone is going to roughly copy-paste this into some forum software and call it a day.

IshKebab · on March 25, 2022

You'll be downvoted because that is obvious, irrelevant and has no practical consequences.

account42 · on March 25, 2022

If I had to guess, most downvotes GP got were for "predicting" downvotes.

dusted · on March 25, 2022

I've found that sometimes, "predicting" downvotes fails, and the comment is upvoted, it seems to correlate more to whether people agree with the sentiment of the rest of the comment, but I agree that the prediction could influence the outcome.

nixpulvis · on March 25, 2022

Fuck people who think they can define speech patterns in datasets like this. Especially since I am required to request permission to view their "Elite" documents.

This is some dystopian shit right here. I don't care what fancy models you train on it, or even what funny jokes you make of it. I'm just so done with this.