I spent more than 15 minutes trying to think of the first thing to bring up, but there were too many things to go through. I went on a tangent about belief bubbles, then thought replying was too taxing, then I realised that not replying may be contributing to the problem of those bubbles existing in the first place. So I'll try to keep my reply brief and succinct enough not to lose my mind, by responding to only some of the points you brought up.
> Last year, many, probably most experienced programmers were skeptical about using LLMs to help write code.
Firstly, most experienced programmers were sceptical about using LLMs to help write code, and still are.
I am qualifying a programmer as "experienced" not as someone who has held a job title for 15 years, and not as someone who has bounced around between projects until they release (tangent: a project that ships with 10 years of support is not successful just because it launches; it has to have what it takes to actually be supported for the whole 10 years and not require a rewrite within 2 years), but as someone who has worked on enough varied projects, while providing support for 10 years, to understand which trade-offs are worse in the long run, and why.
Most "experienced" programmers did try LLMs, saw the hallucinations, code churn, and logical errors, and decided to stay away.
But individual anecdotes aren't that useful, which is why we need to look at qualitative results over quantitative ones.
Large groups started doing their own studies, which have only begun producing qualitative results in the last year.
The results show that using LLMs in projects leads to more bugs, increased code churn, and less understanding of a codebase's internals, each of which has been shown to reduce the health of a project (for both the codebase and the developers).
And because of those results, experienced programmers who were on a wait-and-see path have moved further away from using LLMs to help write code.
> This year, there are vanishingly few who don't use an LLM every day.
It's worth pointing out that the following two statements are both true:
1. the number of experienced programmers has risen
2. experienced programmers are vanishingly few
1 is true because there are more programmers than there were 10 years ago, and the number of programmers who could be qualified as experienced has also risen.
2 is true because the experience and quality of programmers looks less like a normal distribution and more like a log-normal distribution. (The quantity goes up but the percentage remains low, so be wary of quantitative claims; see the sketch after this list.)
The "vanishingly" few are the experienced programmers mentioned in the previous point.
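To make that distribution point concrete, here's a rough sketch (the distribution parameters and the cutoff are invented for illustration, not taken from any survey): with a fixed log-normal distribution of experience and a fixed "experienced" cutoff, growing the total population raises the absolute count above the cutoff while the share stays small and roughly constant.

```python
# Toy illustration only: made-up log-normal "experience" distribution and an
# arbitrary cutoff. The count of "experienced" programmers grows with the
# population while the percentage stays small and roughly constant.
import numpy as np

rng = np.random.default_rng(0)
cutoff = 25.0  # arbitrary threshold on an arbitrary experience scale

for population in (1_000_000, 3_000_000, 10_000_000):
    experience = rng.lognormal(mean=2.0, sigma=0.5, size=population)
    experienced = int((experience > cutoff).sum())
    print(f"population {population:>10,}: experienced {experienced:>7,} "
          f"({experienced / population:.2%})")
```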
> Words are easiest for LLMs so programming changed first.
Programming isn't just using words; it's using logic. COBOL showed us that spoken languages make poor programming languages.
What LLMs were good at was reciting common boilerplate, since it's always the same words in the same order. They didn't need to understand the logic behind those words.
But as soon as they step out of that wheelhouse, they fall down (more on that in the tangent below).
> Images, audio and 3D are next.
Much like with text generation, images, audio and 3D fail at the same points.
Image generation can recite textures, but has no understanding of composition.
Audio can recite phonemes, but has no understanding of delivery.
3D can recite shapes or details, but has no understanding of why a modeller might want to use certain techniques in specific places, so the shapes end up being problematic to work with.
> It really isn't a question whether the change is coming, it is just a matter of time. And, change is coming faster and faster. It isn't perfect, but it is getting better at an alarming rate.
Actually no: we're spending more resources giving each model more parameters, more training time, and a larger corpus to ingest, and we're getting diminishing returns. Neural scaling laws are not exponential, and they are not linear; they are logarithmic, so each doubling of resources buys a smaller improvement than the last.
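To illustrate what "diminishing returns" looks like here, a toy sketch (the constants are invented, not fitted to any real model or paper): if loss falls off as a power of compute, each doubling of compute buys a smaller absolute improvement than the last.

```python
# Toy scaling curve with invented constants, purely to illustrate diminishing
# returns: each doubling of compute improves the loss by less than the last.
def loss(compute: float, a: float = 10.0, b: float = 0.3) -> float:
    return a * compute ** -b

previous = None
for doubling in range(8):
    c = 2 ** doubling                  # compute budget in arbitrary units
    current = loss(c)
    if previous is not None:
        print(f"compute x{c:4d}: loss {current:6.3f}  "
              f"improvement over previous {previous - current:6.3f}")
    previous = current
```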
-------------------------
Tangent:
Investors were hoping for exponential returns, but when it became obvious that wasn't going to happen, the large AI players suggested that linear increases in compute and training would deliver linear results (the implication being: put all your money in our pot, because it will be the largest and so return the best results).
Every time they announce massive productivity gains in programming, questions come in once the reports have been scrutinised, and they push out a new report to try to drown out those questions, all to keep their own valuations high.
Recent examples with OpenAI:
- OpenAI released a bunch of results stating that their models could solve about 49% of test cases in an external software engineering benchmark (SWE-Bench, based on issues in 12 open source Python codebases). But after an external team manually verified each test, it turned out the models could only solve about 4% of test cases, and would have created more work for programmers had the other suggested changes been applied. See: https://www.youtube.com/watch?v=QnOc_kKKuac
- Then, OpenAI's researchers built a new benchmark (SWE-Lancer, based on SWE tasks on UpWork), tested a bunch of models and checked the results. If you read the paper, it turns out that only half the questions asked the LLM to make a change to the code. The other half were management questions about which proposed solution to go with. (Of course that's easy to game: one dev can say "Hey, this isn't very good but it'll work for now", another dev can say "This will solve all of our problems", and the LLM will pick the second solution every time.) It's not checking the logic in the code, or picking the best solution for what the project needs going forwards; it's just picking the most confident-sounding message put forward by others. See: https://arxiv.org/pdf/2502.12115
and so on...
The other big players are just as bad. (I'm not even going to get into how Anthropic's CEO is inflating the bubble by telling gullible investors that AI will be writing 90% of code in 3-6 months.)
-------------------------
> It is just a new interface, not a replacement. The new interface allows you to iterate in different ways to produce a result. It isn't about providing the right magic incantation and having a perfect result. Like with current tools, you iterate, taking bigger steps each time and as you learn how to better use the system.
Yes it won't give you the perfect result, but that doesn't matter if you can iterate towards it.
The problem is that it's actually worse for iterating towards a quality result.
An LLM will suggest something to fix one issue; then, when it gets to another issue that touches the same file, you end up having to remove a whole chunk and rewrite something that could have just been written that way the first time.
The results from those qualitative studies are showing that diffs are coming in more frequently and they're coming in larger. This is increased code churn (which, as explained before, is bad for the project's health).
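For anyone who wants to eyeball this on their own repositories, here's a rough sketch of one way to approximate churn from git history. The "churn ratio" here is my own crude proxy (how much of what was written ends up deleted again), not the metric any particular study used.

```python
# Crude churn proxy built on `git log --numstat`: totals of lines added and
# deleted per commit, then the ratio of deleted to added lines.
# Illustrative approximation only, not the methodology of a specific study.
import subprocess

def commit_stats(repo_path: str):
    """Yield (commit_hash, lines_added, lines_deleted) for each commit."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--format=%H"],
        capture_output=True, text=True, check=True,
    ).stdout
    commit, added, deleted = None, 0, 0
    for line in out.splitlines():
        parts = line.split("\t")
        if len(parts) == 1 and line.strip():            # a commit hash line
            if commit is not None:
                yield commit, added, deleted
            commit, added, deleted = line.strip(), 0, 0
        elif len(parts) == 3 and parts[0].isdigit():    # "added<TAB>deleted<TAB>path"
            added += int(parts[0])
            deleted += int(parts[1]) if parts[1].isdigit() else 0
    if commit is not None:
        yield commit, added, deleted

if __name__ == "__main__":
    stats = list(commit_stats("."))
    total_added = sum(a for _, a, _ in stats)
    total_deleted = sum(d for _, _, d in stats)
    print(f"commits: {len(stats)}, lines added: {total_added}, deleted: {total_deleted}")
    print(f"churn ratio (deleted/added): {total_deleted / max(total_added, 1):.2f}")
```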
Though we seem to disagree (perhaps less than you might think), I'm very happy that my comments made you think for so long and write so much in response. It will be interesting to see how this all plays out over time.