1. How do you check the output of the voice-to-code step? If reviewing the generated code takes as much expertise as writing it does now, then the voice-to-code step is just a layer that adds confusion.
2. How would debugging work? Again, would you still need to be able to understand the code? Same issue.
3. What if you have to pause and think? This will affect how the voice-to-code interface interprets your speech.
4. How would you make a precise edit to your source audio using a voice interface?
5. How would you make changes which touch multiple components across the project? How would you coordinate this?
6. Precisely defining interfaces between components and making correct references to specific symbols is very difficult in natural speech, which typically relies on context to resolve ambiguous references. The language you speak would still have to approach the strictness of a programming language, yet you would have replaced a reliable, checkable channel (input through a keyboard, transferred as-is to a text buffer, feedback from a visual view of the source) with an unreliable one (input through a microphone, transferred through complex signal processing and multiple neural-network language models, across multiple representations, each of which you have to be able to check for feedback about the structure of your program: the initial speech-to-text step, then text to source). See the sketch after this list.
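To make point 6 concrete, here is a minimal, hypothetical sketch (the symbol names and the fuzzy-matching strategy are invented purely for illustration) of resolving a loosely spoken reference against a project's symbols. The best such a layer can do is return a ranked guess list, which you still have to verify against the real source, rather than the exact, checkable identifier a keyboard gives you.

```python
from difflib import get_close_matches

# Hypothetical symbol table extracted from the project.
SYMBOLS = ["update_user", "update_user_profile", "upsert_user", "get_user"]

def resolve_spoken_reference(phrase: str) -> list[str]:
    """Map a loosely spoken identifier ("update the user") to candidate symbols.

    Natural speech rarely yields an exact match, so the result is a ranked
    list of guesses that the programmer still has to check against the
    actual source code.
    """
    normalized = phrase.lower().replace(" ", "_")
    return get_close_matches(normalized, SYMBOLS, n=3, cutoff=0.4)

print(resolve_spoken_reference("update the user"))
# Several plausible candidates come back; the speaker still has to disambiguate.
```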
It may work, since it may simply be a new "programming language" (somewhat literally), i.e. a new, higher level of abstraction. We already know examples of such transitions to higher abstraction levels: binary code -> assembly languages -> c/lisp/fortran/etc -> c++/javascript/go/python/r -> np/torch/react/whatever frameworks/libraries. For an average programmer nowadays, knowledge of frameworks/libraries is as important as (if not more important than) knowledge of the programming language they use. The only disadvantage is that people will need to adapt to something generated and updated via machine learning. So far there are not many examples of that, except maybe people adapting to Tesla Autopilot with every new release. Where we used to adapt to a new c++/python/framework version, in the future there will be GitHubNext v1, v2 and v3 with known features and bugs.
The only problem with this being the next abstraction level is that it actually leads to more "coding", because general spoken language is less informationally dense than any programming language.
Before, each switch from binary to assembly to higher-level languages to frameworks/libraries generally reduced the amount of "code" being written; with voice programming this seems to be the opposite.
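As a made-up illustration of that density gap, compare a roughly twenty-five-word spoken instruction with the single line of code it corresponds to:

```python
orders = [{"email": "a@example.com", "total": 150},
          {"email": "b@example.com", "total": 40}]

# Spoken: "go through the orders, and for every order whose total is more
# than one hundred, put its email address into a list called contacts"
contacts = [order["email"] for order in orders if order["total"] > 100]
print(contacts)  # ['a@example.com']
```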
You free part of the industry from needing professionals for simple tasks, which is enough to give users a lot of new possibilities, and you let the pros focus on where they are really needed.
"Hey phone, next time mum send me a text about voting, you can send back a 'ok boomer'?" or "hey phone, can you setup a webpage that list my tiktok videos up to last year?"
Most people are not going to hire a pro for that, but we could end up with an AI general enough to do it for them. It's ok if the result is not extensible, maintainable or modular.
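For illustration only, here is the kind of throwaway script such an assistant might generate for the second request. The video list is placeholder data standing in for whatever account export or API a real assistant would use, and nothing about it is meant to be extensible or maintainable; it just has to work once.

```python
from datetime import date

# Placeholder data standing in for the user's real videos.
videos = [
    {"title": "Cat jump", "url": "https://example.com/v/1", "posted": date(2021, 3, 2)},
    {"title": "Recipe",   "url": "https://example.com/v/2", "posted": date(2022, 8, 14)},
]

cutoff = date(date.today().year - 1, 12, 31)  # "up to last year"
rows = "\n".join(
    f'<li><a href="{v["url"]}">{v["title"]}</a> ({v["posted"]})</li>'
    for v in videos if v["posted"] <= cutoff
)

with open("my_videos.html", "w") as f:
    f.write(f"<html><body><h1>My videos</h1><ul>{rows}</ul></body></html>")
```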
The crazy thing is... this probably will work.
In 20 years. But it probably will work.
There is absolutely no reason you cannot use a neural network to transcribe appropriately phrased requirements into an AST.
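You do not even need the neural network to see what that last step looks like. Here is a minimal sketch, assuming one rigid, hypothetical phrasing, of turning an "appropriately phrased" requirement into a Python AST; the model's real job would be to tolerate looser phrasings and still land on a tree like this.

```python
import ast

def requirement_to_ast(phrase: str) -> ast.Module:
    # Expected shape (hypothetical): "define a function <name> that returns <a> plus <b>"
    words = phrase.lower().split()
    name = words[words.index("function") + 1]
    a, b = words[-3], words[-1]  # operands on either side of "plus"
    # Build the source this requirement describes and parse it into an AST.
    return ast.parse(f"def {name}({a}, {b}):\n    return {a} + {b}\n")

tree = requirement_to_ast("define a function total that returns price plus tax")
print(type(tree.body[0]).__name__)  # FunctionDef
print(ast.unparse(tree))            # def total(price, tax):
                                    #     return price + tax
```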