Saying this as a user of these tools (openai, Google voice chat etc). These are ...

cyjackx · 2025-05-06T02:17:58 1746497878

I think the problem is that it's also an interesting problem for humans. It's very subjective. Imagine a therapy session, filled with a long pensive pauses. Therapy is one of those things that encourage not interrupting and just letting you talk more, but there's so much subtext and nuance to that. Then make it compared to excited chatter one might have with friends. There's also so much body language that an AI obviously cannot see. At least for now.

joshstrange · 2025-05-05T21:19:28 1746479968

This 100%, yes!

I've found myself putting in filler words or holding a noise "Uhhhhhhhhh" while I'm trying to form a thought but I don't want the LLM to start replying. It's a really hard problem for sure. Similar to the problem of allowing for interruptions but not stopping if the user just says "Right!", "Yes", aka active listening.

One thing I love about MacWhisper (not special to just this STT tool) is it's hold to talk so I can stop talking for as long as I want then start again without it deciding I'm done.

Quizzical4230 · 2025-05-06T05:27:47 1746509267

I recently got to know about this[^1] paper that differentiates between 'uh' and 'um'.

> The proposal examined here is that speakers use uh and um to announce that they are initiating what they expect to be a minor (uh), or major (um), delay in speaking. Speakers can use these announcements in turn to implicate, for example, that they are searching for a word, are deciding what to say next, want to keep the floor, or want to cede the floor. Evidence for the proposal comes from several large corpora of spontaneous speech. The evidence shows that speakers monitor their speech plans for upcoming delays worthy of comment. When they discover such a delay, they formulate where and how to suspend speaking, which item to produce (uh or um), whether to attach it as a clitic onto the previous word (as in “and-uh”), and whether to prolong it. The argument is that uh and um are conventional English words, and speakers plan for, formulate, and produce them just as they would any word.

[1]: https://www.sciencedirect.com/science/article/abs/pii/S00100...

VagabundoP · 2025-05-06T09:52:58 1746525178

I hate when you get "out of sync" with someone for a whole conversation. I imagine sine ways on an occilloscope and there they just slightly out of phase.

You nearly have to do a hard reset to get things comforatble - walk out of the room, ring the back.

But some people are just out of sync with the world.

jiehong · 2025-05-06T01:33:31 1746495211

So they basically train us to worsen our speech to avoid being interrupted.

I remember my literature teacher telling us in class how we should avoid those filler words, and instead allow for some simple silences while thinking.

Although, to be fair, there are quite a few people in real life using long filler words to avoid anyone interfering them, and it’s obnoxious.

Bjartr · 2025-05-06T01:50:50 1746496250

Somehow need to overlap an LLM with vocal stream processing to identify semantically meaningful transition points to interrupt naturally instead of just looking for any pause or sentence boundary.

WhitneyLand · 2025-05-05T20:48:30 1746478110

>>they don't allow talking naturally

Neither do phone calls. Round trip latency can easily be 300ms, which we’ve all learned to adapt our speech to.

If you want to feel true luxury find an old analog PTSN line. No compression artifacts or delays. Beautiful and seamless 50ms latency.

Digital was a terrible event for call quality.

mvdtnz · 2025-05-05T21:32:46 1746480766

I don't know how your post is relevant to the discussion of AI models interrupting if I pause for half a second?

jyoung8607 · 2025-05-06T03:12:46 1746501166

It's genuinely a very similar problem. The max round trip latency before polite humans start having trouble talking over each other has been well studied since the origins of the Bell Telephone system. IIRC we really like it to be under about 300ms.

AI has processing delay even if run locally. In telephony the delays are more speed-of-light dictated. But the impact on human interactive conversation is the same.

genewitch · 2025-05-06T16:12:09 1746547929

Is it because you've never used copper pair telephone networks and only have used digital or cellular networks?

POTS is magical if you get end to end. Which I don't think is really a thing anymore. The last time I made a copper to copper call on POTS was in 2015! At&t was charging nearly $40 for that analog line per month so I shut it off. My VoIP line with long distance and international calling (the pots didn't) is $20/month with two phone numbers. And its routed through a PBX I control.

cwackerfuss · 2025-05-06T04:35:17 1746506117

This is called turn detection, and there are some great tools coming out to solve this recently. (One user mentioned Livekit’s turn detection model). I think in a years time we will see dramatic improvement.

energy123 · 2025-05-06T06:01:18 1746511278

If the turn detection model is small, could you run it on the edge and have like 10-50ms "shut the hell up" latency? That'd be nice.

tjbiddle · 2025-05-06T14:03:15 1746540195

Ha - I have this issue even with non-AI voice assistants like Alexa.

"Hey Alexa, turn the lights to..." thinks for a second while I decide on my mood

"I don't know how to set lights to that setting"

"...blue... damnit."

randomcatuser · 2025-05-06T03:25:23 1746501923

yeah the demo I saw was: https://x.com/livekit/status/1870194686532694417

But searching for "voice detection with pauses", it seems there's a lot of new contenders!

https://x.com/kwindla/status/1897711929617154148

this one is a fun approach too https://x.com/zan2434/status/1753660774541849020

smusamashah · 2025-05-06T09:26:34 1746523594

This is the one I saw https://x.com/kwindla/status/1870974144831275410

qwertox · 2025-05-05T20:36:35 1746477395

Maybe we should settle on some special sound or word which officially signals that we're making a pause for whatever reason, but that we intend to continue with dictating in a couple of seconds. Like "Hmm, wait".

ivape · 2025-05-05T20:41:45 1746477705

Two input streams sounds like a good hacky solution. One input stream captures everything, the second is on the look out for your filler words like "um, aahh, waaiit, no nevermind, scratch that". The second stream can act as the veto-command and cut off the LLM. A third input stream can simply be on the lookout for long pauses. All this gets very resource intensive quickly. I been meaning to make this but since I haven't, I'm going to punish myself and just give the idea away. Hopefully I'll learn my lesson.

twodave · 2025-05-05T20:39:52 1746477592

Alternatively we could pretend it’s a radio and follow those conventions.

flippy_flops · 2025-05-06T04:17:19 1746505039

Need some vocal version of “heredoc”

accrual · 2025-05-06T04:43:54 1746506634

"Hello AI, over", "Hello human, over". :)

Oh, wait: "How do I iterate over a list-", "Iteration is a process where..." :p

genewitch · 2025-05-06T16:15:58 1746548158

We can recreate Shakma while we're at it with all the times we say "... Over."

emtrixx · 2025-05-05T20:40:46 1746477646

Could that not work with simple instructions? Let the AI decide to respond only with a special wait token until it thinks you are ready. Might not work perfectly but would be a start.

scotty79 · 2025-05-06T12:25:59 1746534359

Pauses are good as a first indicator, but when a pause occurs then what's been said so far should be fed to the model to decide if it's time to chip in or wait a bit for more.

LZ_Khan · 2025-05-05T21:00:45 1746478845

Honestly I think this is a problem of over-engineering and simply allowing the user to press a button when he wants to start talking and press it when he's done is good enough. Or even a codeword for start and finish.

We don't need to feel like we're talking to a real person yet.

joshspankit · 2025-05-07T04:14:18 1746591258

My ideal would be a small "stick remote" with a mic button.

The AI listens as long as you hold the button, and the device is efficient enough to carry with you 24/7.

amelius · 2025-05-05T22:47:03 1746485223

Or give the AI an Asian accent. If you're talking on the phone to someone on a different continent you accept the delay, so why not here.

SubiculumCode · 2025-05-05T20:41:18 1746477678

Yeah, when I am trying to learn about a topic, I need to think about my question, you know, pausing mid-sentence. All the products jump in and interrupt, no matter if I tell them not to do so. Non-annoying humans don't jump in to fill the gap, they read my face, they take cues, then wait for me to finish. Its one thing to ask an AI to give me directions to the nearest taco stand, its another to have a dialogue about complex topics.

mmoustafa · 2025-05-06T01:28:34 1746494914

Huge problem space. Usually referred to as “turn taking”