I also find that current voice interfaces are terrible. I only use voice commands to set timers or play music.
That said, voice is the original social interface for humans. We learn to speak much earlier than we learn to read/write.
Better voice UIs will be built to make new workflows with AI feel natural. I'm thinking along the lines of a conversational companion, like the "Jarvis" AI in the Iron Man movies.
That doesn't exist right now, but it seems inevitable that real-time, voice-directed AI agent interfaces will be perfected in the coming years. Companies like [Eleven Labs](https://elevenlabs.io/) are already working on the building blocks.
It doesn't work well at all with ChatGPT. You say something, and in the middle of your sentence, ChatGPT in Voice Mode replies with something completely unrelated.
It works great with my kids sometimes, for instance asking a series of questions about some kid-level science topic. They get to direct it to exactly what they want to know, and you can see they're more actively engaged than when they're watching some YouTube video or whatever.
I'm sure it helps that it's not getting outside of well-established facts, and that they're asking for facts rather than novel design tasks.
I'm not sure, but it also seems to adopt a more intimate tone of voice as they get deeper into a topic, very cozy. The voice itself seems tuned to the conversational context; it probably infers that this is kid stuff too.
Voice is really sub-par and slow, even if you're healthy and abled. And loud and annoying in shared spaces.
I wonder if we'll have smart-lens glasses where our eyes 'type' much faster than we could possibly talk. Predictive text keyboards driven by eye tracking already exist. I wonder if AI and smart glasses are a natural combo for a future form factor. Meta seems to be leaning that way with their Ray-Ban collaboration and rumors of adding a screen to the lenses.
I am also very skeptical about voice, not least because I've been disappointed daily by a decade of braindead idiot "assistants" like Siri, Alexa, and Google Assistant (to be clear I am criticizing only pre-LLM voice assistants).
The problem with voice input, to me, is mainly knowing when to start processing. When humans listen, we stream and process the words constantly and wait until we either detect that the other person expects a response (just enough of a pause, or a questioning tone) or, as an exception, feel we have justification to interrupt (e.g. "Oh yeah, Jane already briefed me on the Johnson project").
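To make the contrast concrete, here's a minimal sketch (my own illustration, not any shipping assistant's actual code) of the crude fixed-silence endpointing these systems seem to rely on, which fires on any thinking pause regardless of whether the sentence is finished:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str        # words recognized in this audio chunk (may be empty)
    silence_ms: int  # trailing silence after the chunk, in milliseconds

END_OF_TURN_SILENCE_MS = 700  # arbitrary fixed threshold

def naive_endpoint(chunks):
    """Accumulate recognized text until the silence gap exceeds the threshold."""
    utterance = []
    for chunk in chunks:
        if chunk.text:
            utterance.append(chunk.text)
        if chunk.silence_ms >= END_OF_TURN_SILENCE_MS:
            break  # fires on any thinking pause, mid-sentence or not
    return " ".join(utterance)

# A mid-sentence pause trips the threshold and the reply starts too early:
stream = [Chunk("so about the Johnson project", 900),
          Chunk("I was thinking we should", 200)]
print(naive_endpoint(stream))  # -> "so about the Johnson project"
```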
Even talking to ChatGPT, which embarrasses those old voice bots, I find it's still very bad at guessing when I'm done if I'm speaking casually. And once it's responded with nonsense based on half a sentence, I feel the context is polluted and I probably need to clear it and repeat myself. I'd rather just type.
I think there's not much need to stream the spoken tokens into the model in real time, given that it can think so fast. I'd rather it just listen, have a specialized model simply try to determine when I'm done, then clean up and abridge my utterance (for instance, when I correct myself), and THEN have the real LLM process the cleaned-up query.
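A rough sketch of that split, with every model call stubbed out by a trivial placeholder (none of these function names are real APIs; in practice they'd be a small fast "done yet?" classifier, a cleanup model, and the main LLM):

```python
def is_turn_complete(transcript_so_far: str) -> bool:
    """Placeholder for a small, fast 'is the speaker done yet?' model."""
    return transcript_so_far.rstrip().endswith((".", "?"))

def clean_up(raw_utterance: str) -> str:
    """Placeholder for a model that drops false starts and self-corrections."""
    # e.g. "email Bob, no wait, email Jane." -> keep only the corrected part
    return raw_utterance.split("no wait,")[-1].strip()

def main_llm(query: str) -> str:
    """Placeholder for the expensive model, which only sees the cleaned-up query."""
    return f"[LLM answers: {query}]"

def handle_voice_turn(transcript_chunks) -> str:
    heard = ""
    for chunk in transcript_chunks:      # streamed speech-to-text output
        heard += chunk
        if is_turn_complete(heard):      # cheap check on every chunk
            break                        # speaker seems done; stop listening
    return main_llm(clean_up(heard))

print(handle_voice_turn(["email Bob, ", "no wait, ", "email Jane."]))
# -> [LLM answers: email Jane.]
```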