This is looking at the wrong metric. I'm not expecting it to be 100% correct when I use it. I expect it to get me in the ballpark faster than I would have on my own. And then I can take it from there.
Sometimes that means I have a follow-on question and iterate from there. That's fine too.
> I expect it to get me in the ballpark faster than I would have on my own.
This is great if you are an experienced developer who can tell the difference between "in the ballpark" and fixable, and "in the ballpark" but hopeless.
There are articles every day about how AI is replacing programmers, coding is dead, etc., including from the Nvidia CEO this week. This kind of thing shows we are not quite there yet. There are lots of folks on Twitter etc. who rave about how genAI built a full app for them, but in my experience that comes with a huge amount of human trial and error, plus the understanding to know what needs to be tweaked.
> This kind of thing shows we are not quite there yet
I think you probably need to consider the time it took to go from 100% wrong, to 90, to 80, etc. My guess is that interval is shrinking from milestone to milestone. This causes me to suspect that folks starting SWE careers in 2024 will likely no longer be SWEs by 2030.
That's why I tell my grandkids they should consider plumbing and HVAC trades instead of college. My bet is that within 10 years nearly every vocation that requires a college degree will be made partially or completely obsolete by AI.
I tell my grandkids that vocational school is a perfectly decent and honorable way to get into a trade that pays better than retail.
I also tell them that a good university is a perfectly decent and honorable way to begin a life of the mind. It's not the only way, and a life of the mind isn't the life everyone wants.
I also tell them that the purpose of a university education is mostly not about training for a job.
> This is great if you are an experienced developer who can tell the difference between "in the ballpark" and fixable, and "in the ballpark" but hopeless.
While true, Stack Overflow wasn't much different. New devs would go there, grab a chunk of code, and move on with their day. The canonical example is the PHP SQL injection advice shared there for more than a decade.
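For anyone who hasn't seen it, that bad advice boiled down to splicing user input straight into the SQL string. The canonical version was PHP's mysql_query, but here's the same anti-pattern sketched in TypeScript for illustration -- buildQueryUnsafe and buildQuerySafe are made-up helpers to show the shape of the problem, not a real driver API:

    // Vulnerable shape: user input is spliced directly into the SQL text.
    function buildQueryUnsafe(username: string): string {
      return `SELECT * FROM users WHERE name = '${username}'`;
    }

    // Safer shape: keep the SQL text fixed and hand values to the driver
    // separately, so it can bind them as parameters instead of splicing them in.
    function buildQuerySafe(username: string): { text: string; values: string[] } {
      return { text: "SELECT * FROM users WHERE name = $1", values: [username] };
    }

    const malicious = "x' OR '1'='1";
    console.log(buildQueryUnsafe(malicious));
    // -> SELECT * FROM users WHERE name = 'x' OR '1'='1'   (matches every row)
    console.log(buildQuerySafe(malicious));
    // -> the malicious input stays inert data in values[0]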
That’s how you, an experienced programmer, use it.
What does this do to beginners who are just learning to program? Is this helping them by forcing them to become critical reviewers or harming them by being a bad role model?
Stack Overflow is a community with an answer-rating system and there is often some level of review from other people commenting on the answer's advantages and shortcomings. You often have multiple answers to choose from too. Those features build trust in an answer or prompt you to look elsewhere.
The UI for an LLM answer would have difficulty replicating that, since every answer is (probably) a new one and you get no input from other people about it either.
Edit: After writing my reply, I saw that roughly four other people (so far) made the exact same point a couple of minutes before. I think your question is a good one (it made me think a little), and apologies if it feels like you're being piled on.
I've never been that fond of SO, but I find ChatGPT very useful. SO tends to be for simpler, one-time questions. My interactions with GPT are conversations working towards a solution.
LLMs totally have a beginner problem. It's much like the problem where a beginner knows they need to look something up but can't figure out the right keywords to search for.
Also, ChatGPT has never skimmed my question, failed to read it properly, assumed it was the same as an existing question, and then called me an idiot for asking a stupid one. I wouldn't ask SO a question these days; the response is more likely to be toxic than helpful.
Because on Stack Overflow you also see feedback from (hopefully) a cross-section of the developer community, several methods of solving the problem, and voting feedback on those solutions, and you can suss out a proper solution rather than just having one answer fed back to you.
Because people on stack overflow don't lie to the person who wrote the question very often. A correct answer to a problem that isn't the same as your problem is a better resource for learning than an incorrect answer to your exact problem.
Agreed; but isn't this on the same continuum as programming in assembly -> IDE autocomplete -> LLM autocomplete?
You’re still writing code, but generally adding abstractions has been net good (unsure of this opinion tbf, but that’s my hunch)
When I studied in Ulaanbaatar I met a professor of linguistics from Eastern Europe. Before he came to Mongolia he had studied a grammar book of Mongolian and tried to teach himself. He was rather proud of how far he had come.
At the first lesson he realised that the characters he thought he knew how to pronounce didn't sound much like what he was used to. Mongolian is generally written in Cyrillic plus a few extra characters, so he expected it to be like Russian or Bulgarian with a few more sounds.
This is not the case. Mongolian is much more closely related to Korean and Tibetan, and commonly sounds something like drunk cats haggling over something deceased.
I find it to be roughly the same with introductory or otherwise shallow learning material about programming. You can read as many tutorials as you want, you'll still suck at it.
When the LLMs can invent books like SICP, The Art of Computer Programming, Purely Functional Data Structures, or the Gang of Four book, then they might become tutors in this area. To me it seems they struggle hard with anything longer than a screenful.
> What does this do to beginners who are just learning to program? Is this helping them by forcing them to become critical reviewers or harming them by being a bad role model?
Harming them.
I told a new grad employee to write some unit tests for his code, explained the high level concepts and what I was looking for, and pointed him at some resources. He spun his wheels for weeks, and it turned out he was trying to get ChatGPT to teach him how to do it, but it would always give him wrong answers.
I eventually had to tell him, point blank, to stop using ChatGPT, read the articles, and ask me (or a teammate) if he needed help.
Beginners tend to write awful code without GPT's help, so I don't think it makes things worse.
Answers don't exist in a vacuum. The chat interface allows feedback and corrections. Users can paste an error they're getting, or even say "it doesn't work", and GPT may correct itself or suggest an alternative.
> Beginners tend to write awful code without GPT's help, so I don't think it makes things worse.
> Answers don't exist in a vacuum. The chat interface allows feedback and corrections. Users can paste an error they're getting, or even say "it doesn't work", and GPT may correct itself or suggest an alternative.
I think you're making the mistake of viewing the job as a black box that produces output.
But what you're proposing is a terrible way to develop someone's skills and judgement. They won't develop if they're getting their hand held all the time (by an LLM or a person); they'll stagnate. The problem with an LLM, unlike a person, is that it will hold your hand forever without complaint, while giving unreliable advice.
That's speculation about a hypothetical person, one who falls into learned helplessness, but there are people with different mindsets.
Getting some results with the help of infinitely-patient GPT may motivate people to learn more, as opposed to losing motivation from getting stuck, having trouble finding the right answers without knowing the right terminology, and/or being told off by Stack Overflow people that it's a homework question.
People who want to grow can also use GPT to ask for more explanations and use it as a tutor. It's much better at recalling general advice.
And not everyone may want to grow into a professional developer. GPT is useful to lots of people who are not programmers, and just need to solve programming-adjacent problems, e.g. write a macro to automate a repetitive task, or customize a website.
> Getting some results with the help of infinitely-patient GPT may motivate people to learn more, as opposed to losing motivation from getting stuck, having trouble finding the right answers without knowing the right terminology,
> ...People who want to grow can also use GPT to ask for more explanations and use it as a tutor. It's much better at recalling general advice.
The psychology there doesn't make sense, since the technology simultaneously takes away a big motivation to actually learn how to get the result on your own. It's like giving a kid a calculator and expecting him to use it to learn mental arithmetic. Instead, you actually just removed the motivation for most kids to do so.
I think there's a common, unstated assumption in tech circles that removing "friction" and making things "easier" is always good. It's false.
Also, a lot of what you said feels like a post-hoc rationalization for applying this particular technology as a solution to a particular problem, which is a big problem with discourse around "AI" (just like it was with blockchain). That stuff is just in the air.
> ...and/or being told off by Stack Overflow people that it's a homework question.
IMHO, that's the one legitimately demotivating thing on your list.
The same could be said of wrong Stack Overflow answers or random Google results. Clearly they'll become critical of the results if the code simply doesn't compile, the same way our generation sharpened our skills by filtering the good Google results from the bad.
If this increases iteration speed for beginner devs and they learn about code quality once their code goes out into the real world, it's not a bad bargain to strike, IMO.
I think we all partly learnt about code quality by having our code break things in the real world.
I've been saying from the start that this is not a tool for beginners and learners. My students use it constantly, and I keep telling them that when they go to ChatGPT for answers, it's like going to a senior for help -- they know a lot, but they are often wrong in subtle and important ways.
That's why classes are taught by professors and not undergrads. Professors are at least supposed to know what they don't know.
When students think of ChatGPT as the drunk frat bro they see doing keg stands at the Friday basement party, rather than as an expert, they use it differently.
> What's especially troubling is that many human programmers seem to prefer the ChatGPT answers. The Purdue researchers polled 12 programmers — admittedly a small sample size — and found they preferred ChatGPT at a rate of 35 percent and didn't catch AI-generated mistakes at 39 percent.
Absolutely, it's especially useful when it suggests which libraries to use if you're not familiar with the ecosystem. Or writing boilerplate for popular frameworks, step by step. It can, to a degree, repair errors if you paste it the output.
Every time I've tried it, it's sent me to completely the wrong ballpark, and after a while spent whacking at its solution I end up dumping it completely and doing it myself.
Exactly! Just because part of the answer isn't right, doesn't mean the entire answer is useless. It's much faster than only doing a Google search when working out the solution to a problem.
Sometimes you are going to lose a lot of time trying to make a ChatGPT solution work when Google would have provided the right answer right away... Just yesterday I asked ChatGPT for an AWS IAM policy. ChatGPT-4o provided an answer that looked OK but was just wrong, and I tried to make it work without success. I just Googled it and the first result gave me the right answer.
I prefer Phind for this type of question since you can see search results that it's likely drawing answers from.
But ChatGPT is often a huge time saver if you know exactly what you want to do and just let it fill in the how: "I have these 3 JSONL files and I want to use jq to do blah blah and then convert them to CSV."
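Since the actual "blah blah" is left unspecified, here is a made-up example of the shape of that kind of one-off script, sketched in Node/TypeScript instead of jq -- the file names and the id/amount fields are invented for illustration:

    // Hypothetical stand-in for that kind of throwaway task: merge three JSONL
    // files and emit two made-up fields as CSV.
    import { readFileSync, writeFileSync } from "node:fs";

    const files = ["a.jsonl", "b.jsonl", "c.jsonl"]; // invented names

    const rows = files.flatMap((path) =>
      readFileSync(path, "utf8")
        .split("\n")
        .filter((line) => line.trim().length > 0)
        .map((line) => JSON.parse(line) as { id: string; amount: number })
    );

    const csv = ["id,amount", ...rows.map((r) => `${r.id},${r.amount}`)].join("\n");
    writeFileSync("merged.csv", csv);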
This. For inexperienced developers, I advise this: don't consume answers you don't understand. If you can't read it, interrogate it, and find a question at your own level. When you accept its output, you're taking responsibility for it, and beyond a certain low level, it can't do your thinking for you.
I agree this is the correct way to use it, and it is incredibly useful in that case, but I think a study like this is valuable in the face of all the hype/fud about how AI Agents can program entire complex applications with just a few prompts and/or will replace software engineers shortly.
Yeah I really don't understand why research is still being published that uses GPT3.5 rather than GPT4 or both models. ~500 programming questions is maybe a few bucks on the API?
> For each of the 517 SO [Stack Overflow] questions, the first two authors manually used the SO question's title, body, and tags to form one question prompt and fed that to the free version of ChatGPT, which is based on GPT-3.5. We chose the free version of ChatGPT because it captures the majority of the target population of this work. Since the target population of this research is not only industry developers but also programmers of all levels, including students and freelancers around the world, the free version of ChatGPT has significantly more users than the paid version, which costs a monthly rate of 20 US dollars.
Note that GPT-4o is now also freely available, although with usage caps. Allegedly the limit is one fifth the turns of paid Plus users, who are said to be limited to 80 turns every three hours. Which would mean 16 free GPT-4o turns per 3 hours. Though there is some indication the limits are currently somewhat lower in practice and overall in flux.
In any case, GPT-4o answers should be far more competent than those by GPT-3.5, so the study is already somewhat outdated.
I use ChatGPT for coding constantly and the 52% error rate seems about right to me. I manually approve every single line of code that ChatGPT generates for me. If I copy-paste 120 lines of code that ChatGPT has generated for me directly into my app, that is because I have gone over all 120 lines with a fine-toothed comb, and probably iterated 3-4 times already. I constantly ask ChatGPT to think about the same question, but this time with an additional caveat.
I find ChatGPT more useful from a software architecture point of view and from a trivial code point of view, and least useful at the mid-range stuff.
It can write you a great regex (make sure you double-check it) and it can explain a lot of high-level concepts in insightful ways, but it has no theory of mind -- so it never responds with "It doesn't make sense to ask me that question -- what are you really trying to achieve here?", which is the kind of thing an actually intelligent software engineer might say from time to time.
I scanned the paper and it doesn't mention which model they were using within ChatGPT. If it was 3.5-turbo, then these results are already meaningless. GPT-4 and 4o are much more accurate.
I just used GPT-4o to refactor 50 files from React classes to React function components and it did so almost perfectly every time. Some of these classes were as long as 500 LOC.
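To give a sense of how mechanical that refactor is (this is an illustrative shape only, not the actual code from those files), the before/after looks roughly like this:

    // Before: a minimal stateful class component (illustrative only).
    import React, { useState } from "react";

    class CounterClass extends React.Component<{}, { count: number }> {
      state = { count: 0 };
      render() {
        return (
          <button onClick={() => this.setState({ count: this.state.count + 1 })}>
            Clicked {this.state.count} times
          </button>
        );
      }
    }

    // After: the equivalent function component, with state moved into a hook.
    function Counter() {
      const [count, setCount] = useState(0);
      return (
        <button onClick={() => setCount(count + 1)}>
          Clicked {count} times
        </button>
      );
    }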
I'd guess that React code is a lot easier for a LLM, since it's a frequent occurrence in its training dataset and frontend code tends to be repetitive and full of boilerplate.
I believe that AI will be a perfect programmer in the future for all niche areas. My point is that frontend will probably be the first niche to be mastered.
> AI will be a perfect programmer in the future for all NON-niche areas
There's going to be a positive/negative feedback loop that makes it hard for new languages and frameworks to gain popularity. And the lack of popularity means a lack of training material being generated for the AI to learn from.
When choosing a tech stack of the future, the ability for AI to pair will be a key consideration.
Ya if it's GPT-3.5, I'm actually surprised the accuracies were so high!
I've been pairing with GPT since 3.5-turbo. I run 20-100 queries a day (have an IDE integration). The improvements for GPT-4 over 3.5 are significant.
So far GPT-4o seems like a step-up for most (not all) queries I've run through it. Based on the pricing and speed, my guess is it's a smaller, more optimized model and there are some tradeoffs in that. I'm guessing we'll see a more expensive flagship model from OpenAI this year.
But honestly, these details don't really matter... Regardless of the performance and accuracy of the models today, the trend is obvious. AI will be the primary interface for writing all but the most cutting edge code.
Two years ago, I thought an AI writing code was 50 years away. Yesterday, I took a picture of an invoice on my phone, and asked GPT to recreate it in HTML and it did so perfectly.
Not meaningless when 99% of people use the free version, which apparently has license to lie to them far more than the paid version does. What a fucking sick joke: pay up or we lie to you even more.
This is way better than I thought. A follow-up question would be: for the times it is wrong, how wrong is it? In other words, is the wrong answer complete rubbish, or can it be a starting point towards the actual correct answer?
ChatGPT was released one and a half years ago. It basically duct-tapes code together from a probability model; the fact that 52% of its coding answers are correct is amazing.
I'm still on the fence about LLMs for coding, but from talking to friends, they primarily use them to define a skeleton of code or generate code that they can then study and restructure. I don't see many developers accepting the generated code without review.
This workflow is very close to being possible. I gave it a try last year by automatically copying exceptions and test output to the clipboard (requires custom code for your stack). Context windows have increased considerably since my last attempt and agents are now a thing (ReAct loop, etc.). The pieces you'd need (a rough sketch of the loop follows the list):
- Integration with your runtime: functions called by the LLM can run your tests, linters, compiler, etc.
- Agents: the LLM can define what to do, execute a few tasks, and keep going with more tasks generated by itself
- Codebase/filesystem access: could be RAG or just the ability to read files in your project
- Graceful integration of the human in the agent loop: this is just another iteration of the agent, but it seems useful for it to ask for input from the programmer. Maybe even something more sophisticated, where the agent waits for the programmer to change things in the codebase
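The rough sketch of that loop mentioned above -- askLLM is a hypothetical stand-in for whatever model client you use, and "npm test" stands in for your runtime checks; this is an outline under those assumptions, not a hardened agent:

    import { execSync } from "node:child_process";
    import { readFileSync, writeFileSync } from "node:fs";

    // Hypothetical stand-in for your model client; assumed to return a full
    // corrected version of the file it is shown. Replace with a real API call.
    async function askLLM(prompt: string): Promise<string> {
      throw new Error(`wire this up to your model API (prompt was ${prompt.length} chars)`);
    }

    // Run the tests, feed any failure output plus the file back to the model,
    // apply its rewrite, and repeat. Give up after a few rounds and hand back
    // to the human.
    async function repairLoop(sourcePath: string, maxRounds = 5): Promise<boolean> {
      for (let round = 0; round < maxRounds; round++) {
        try {
          execSync("npm test", { stdio: "pipe" }); // runtime integration
          return true;                             // tests pass: done
        } catch (err) {
          const e = err as { stdout?: Buffer; stderr?: Buffer };
          const failures = `${e.stdout ?? ""}${e.stderr ?? ""}`;
          const source = readFileSync(sourcePath, "utf8");
          const revised = await askLLM(
            `These tests fail:\n${failures}\n\nCurrent ${sourcePath}:\n${source}\n` +
            `Return a corrected version of the whole file.`
          );
          writeFileSync(sourcePath, revised);
        }
      }
      return false; // still failing: hand back to the programmer
    }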
No way. I send ChatGPT my Haskell code and the unreadable compiler error message, and it tells me what the error means in plain human terms or at least points me in the right direction.
Google and Stack Overflow are useless here; other people's situations are different from mine.
I find it's worse at providing working code (much less good code), but pretty good at telling me why my code doesn't compile, which is 80% of the work anyway!
ChatGPT isn’t the best coding LLM. Claude Opus is.
Also, since you can always tell empirically whether a coding response works, mistakes are much more easily spotted than in other forms of LLM output.
Debugging with AI is more important than prompting. It requires an understanding of the intent, which lets the human prompt the model in a way that allows it to recognize its oversights.
Most code errors from LLMs can be fixed by them. The problem is an incomplete understanding of the objective which makes them commit to incorrect paths.
Being able to run code is a huge milestone. I hope the GPT-5 generation can do this and thus only deliver working code. That will be a quantum leap.
> Q&A platforms have been crucial for the online help-seeking behavior of programmers. However, the recent popularity of ChatGPT is altering this trend. Despite this popularity, no comprehensive study has been conducted to evaluate the characteristics of ChatGPT's answers to programming questions. To bridge the gap, we conducted the first in-depth analysis of ChatGPT answers to 517 programming questions on Stack Overflow and examined the correctness, consistency, comprehensiveness, and conciseness of ChatGPT answers. Furthermore, we conducted a large-scale linguistic analysis, as well as a user study, to understand the characteristics of ChatGPT answers from linguistic and human aspects. Our analysis shows that 52% of ChatGPT answers contain incorrect information and 77% are verbose. Nonetheless, our user study participants still preferred ChatGPT answers 35% of the time due to their comprehensiveness and well-articulated language style. However, they also overlooked the misinformation in the ChatGPT answers 39% of the time. This implies the need to counter misinformation in ChatGPT answers to programming questions and raise awareness of the risks associated with seemingly correct answers.
I guess I know how to ask the right programming questions, because my feeling about it is it’s about 80-90% correct, and the rest just gets me to correct solutions much faster than a search engine.
If someone showed me this solution I'd have quite a few questions. Like, why is there a 'newline in the example section, and why isn't that part in a comment? Why introduce the "helper" to enforce that the execution always begins at 1? Could there be some other way to design the program so that the four conditions don't all end with the same twenty or so characters?
No, the context here is interns. If someone came to me asking for an internship and showed that to me, I'd signal that they need to show me something more impressive, or somehow explain and motivate that code in a way that convinces me it is a decent solution.
I'm not sure how you're discerning "style" from "wrong". Would using some esolang be a matter of "style" as long as the asserts on the output pass?
You seem determined to shift the goalposts, so I think we're done here. It's not difficult:
1) Many (some say most) job applicants cannot write a working implementation of FizzBuzz.
2) ChatGPT can write a working implementation of FizzBuzz (a reference version is sketched after this argument).
∴ ChatGPT is a better programmer than many (most) job applicants, at least on this specific problem.
QED.
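For reference, and just to pin down what "a working implementation of FizzBuzz" means here, it's only a handful of lines (sketched in TypeScript):

    // Reference FizzBuzz: multiples of 3 print "Fizz", of 5 print "Buzz",
    // of both print "FizzBuzz", everything else prints the number.
    for (let i = 1; i <= 100; i++) {
      if (i % 15 === 0) console.log("FizzBuzz");
      else if (i % 3 === 0) console.log("Fizz");
      else if (i % 5 === 0) console.log("Buzz");
      else console.log(i);
    }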
What you are doing is part of a long tradition of AI denial. Take (e.g.) chess. First it was "a computer cannot play chess". Then it was "a computer cannot play chess well enough to beat a human being." Then it was "a computer cannot play chess well enough to beat a grandmaster." (You are somewhere between this stage and the previous one.) Then it was "a computer cannot play chess well enough to beat the world champion." Then it was "playing chess is not a measure of intelligence." Notice how the goalposts gradually move so that eventually the criterion is "it doesn't count unless the computer is better than the best person in the world," followed by "Ehh... those grapes were probably sour anyway."
For some reason, we don't get so defensive about machinery in other areas of expertise. No one tries to deny that a D11 Caterpillar can shift dirt faster than a human being with a shovel. No one tries to deny that the Webb telescope can see distant galaxies better than any human being's naked eye.
But people freak out when it comes to "intellectual" accomplishments.
Why did you write all these words and still not answer my question?
The context is interns, and I responded from the perspective that someone had come to me with that code and asked to be an intern. You seem very convinced that most people who apply for software development jobs can't "write a working implementation of FizzBuzz", but I fail to see the relevance. Why count all the hairdressers and kids who have spent a few weeks on HTML and whatnot, who might apply for one software-related job and be refused?
It's a rhetorical question, don't waste time on it.
I think it's more interesting to look at the output from the machine as if a person had produced it and offered it in an internship process. That's a good way to at least partially neuter the influence of advertising and so on when we evaluate what you got out of it.
As for the chess part, I'm not so sure computers can play chess. Can they carry the board and pieces to the place where a match will be played? Can they unpack it, push the clock button, move the pieces? When the match takes place on e.g. Lichess, can they put a finger on a screen and move a piece, or do they need some special interface that is incompatible with humans to participate? Do they need a human to help them get to the game and initiate participation, or can they do this on their own, because they previously said to someone that they will or because they feel like it?
You're treating simulacra as ultimately real, and see me as stupid because I still have some contact with the material and don't confuse it with the virtual.
IIRC, I saw some other study (or an experiment some random guy had run) where the original GPT-4 vastly outperformed its later incarnations for code generation.
Current OpenAI products either use much lower-parameter models under the hood than they did originally, or maybe it's a side effect of context stretching.
You can always email hn@ycombinator.com if you think a headline is misleading, since the site guidelines call for changing those ("Please use the original title, unless it is misleading or linkbait; don't editorialize." - https://news.ycombinator.com/newsguidelines.html)
Can someone email the author and explain what an LLM is?
People asking for 'right' answers don't really get it. I'm sorry if that sounds abrasive, but these people give LLMs a bad name due to their own ignorance/malice.
I remember some Amazon programmer trashing LLMs for 'not being 100% accurate'. It was really an iD10t error. LLMs aren't used for 100% accuracy. If you are doing that, you don't understand the technology.
There is a learning curve with LLMs, and it seems a few people still don't get it.
The real problem is that it's not marketed that way. We may understand that, but most people, heck, even a large percentage of tech people in my experience, don't. They think there is some kind of true intelligence (it's literally in the name) behind it. Just like I also understand that the top results on Google are not always the best... but my parents don't.