FWIW in my recent experience I've found LLMs are very good at reading through th...

vunderba · 2025-06-11T22:37:52 1749681472

Pairing speech recognition with an LLM acting as a post-processor is a pretty good approach.

I put together a script a while back that converts and normalizes any audio file you pass it (wav, mp3, etc.), hands it to ggerganov's whisper.cpp for transcription, and then forwards the result to an LLM to clean up the text. I've used it with a pretty high rate of success on some of my very old and poorly recorded voice dictation recordings from over a decade ago.

Public gist in case anyone finds it useful:

https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...
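For a rough idea of the shape of such a pipeline, here is a minimal sketch. It is not the gist's code: the function names, the prompt wording, and the choice of ffmpeg's `loudnorm` filter with 16 kHz mono output are all assumptions, and the transcription and LLM steps are left as callables you supply.

```python
from typing import Callable, List


def build_normalize_cmd(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command that loudness-normalizes src into a 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-af", "loudnorm",  # loudness normalization filter
        "-ar", "16000",     # resample to 16 kHz
        "-ac", "1",         # downmix to mono
        dst,
    ]


def build_cleanup_prompt(raw_transcript: str) -> str:
    """Wrap the raw transcript in instructions for the cleanup LLM (hypothetical wording)."""
    return (
        "The following is an automatic speech transcript and may contain "
        "recognition errors. Fix obvious mistakes and punctuation, but do "
        "not add or remove content.\n\n" + raw_transcript
    )


def run_pipeline(
    src: str,
    run_cmd: Callable[[List[str]], object],
    transcribe: Callable[[str], str],
    ask_llm: Callable[[str], str],
) -> str:
    """Normalize, transcribe, then clean. The three callables are supplied by the caller."""
    wav = src + ".norm.wav"
    run_cmd(build_normalize_cmd(src, wav))
    raw = transcribe(wav)
    return ask_llm(build_cleanup_prompt(raw))
```

In real use `run_cmd` could be something like `lambda cmd: subprocess.run(cmd, check=True)`, with `transcribe` and `ask_llm` wrapping whatever whisper build and model you run locally.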

sovok · 2025-06-12T02:33:30 1749695610

An LLM step also works pretty well for diarization. You get a transcript with speaker segmentation (from whisper and pyannote, for example); at some point SPEAKER_01 says "Hi I’m Bob. And here’s Alice", SPEAKER_02 says "Hi Bob", and now the LLM can infer that SPEAKER_01 = Bob and SPEAKER_02 = Alice.
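A minimal sketch of that naming step, assuming the segments arrive as (label, text) pairs and that you ask the LLM to answer in a `LABEL=Name` format. The prompt wording and the helper names are made up for illustration; the actual LLM call is left out.

```python
import re
from typing import Dict, List, Tuple

Segment = Tuple[str, str]  # (speaker label, text)


def build_naming_prompt(segments: List[Segment]) -> str:
    """Render the diarized transcript and ask for a label-to-name mapping."""
    lines = [f"{spk}: {text}" for spk, text in segments]
    return (
        "Below is a diarized transcript. Reply with one line per speaker in "
        "the form LABEL=Name, using only names that can be inferred from the "
        "dialogue.\n\n" + "\n".join(lines)
    )


def parse_mapping(reply: str) -> Dict[str, str]:
    """Parse lines such as 'SPEAKER_01=Bob' out of the LLM reply."""
    mapping = {}
    for m in re.finditer(r"^(SPEAKER_\d+)\s*=\s*(.+)$", reply, re.MULTILINE):
        mapping[m.group(1)] = m.group(2).strip()
    return mapping


def rename_speakers(segments: List[Segment], mapping: Dict[str, str]) -> List[Segment]:
    """Swap in names where known; keep the original label otherwise."""
    return [(mapping.get(spk, spk), text) for spk, text in segments]
```

Keeping the original label as a fallback means a speaker who is never named in the audio stays as SPEAKER_NN instead of getting a guessed name.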

soulofmischief · 2025-06-12T04:53:17 1749703997

Yep, an agent I built years ago worked very well with this approach, using a whisper-pyannote combo. The fun part is knowing when to end transcription in noisy environments like a coffee shop.

Tokumei-no-hito · 2025-06-11T22:44:55 1749681895

thanks for sharing. are some local models better than others? can small models work well or do you want 8B+?

vunderba · 2025-06-11T22:53:17 1749682397

So in my experience smaller models tend to produce worse results, BUT I actually got really good transcription cleanup with CoT (Chain of Thought) models like Qwen, even quantized down to 8b.

dragonwriter · 2025-06-12T14:24:32 1749738272

I think the 8B+ question was about parameter count (8 billion+ parameters), not quantization level (8 bits per weight).
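To make the difference concrete, here is a back-of-the-envelope sketch of how the two numbers interact. It only multiplies parameter count by bits per weight; real quantized files carry extra overhead, so treat the results as rough lower bounds.

```python
def approx_weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone: parameters times bits per weight, in bytes."""
    return n_params * bits_per_weight / 8


GB = 1e9
for bits in (16, 8, 5):
    size = approx_weight_bytes(8e9, bits) / GB
    print(f"8B params at {bits} bits/weight: ~{size:.0f} GB")
# 8B params at 16 bits/weight: ~16 GB
# 8B params at 8 bits/weight: ~8 GB
# 8B params at 5 bits/weight: ~5 GB
```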

vunderba · 2025-06-12T16:41:34 1749746494

Yeah, I should have been more specific - Qwen 8b at a Q5_K_M quant worked very well.

mikepurvis · 2025-06-11T22:19:22 1749680362

I was going to say, ideally you’d be able to funnel alternates to the LLM, because it would be vastly better equipped to judge what is a reasonable next word than a purely phonetic model.

ianbicking · 2025-06-11T23:27:56 1749684476

If you just give the transcript, and tell the LLM it is a voice transcript with possible errors, then it actually does a great job in most cases. I mostly have problems with mistranscriptions saying something entirely plausible but not at all what I said. Because the STT engine is trying to make a semantically valid transcription it often produces grammatically correct, semantically plausible, and incorrect transcriptions. These really foil the LLM.

Even if all you could do was mark the text as suspicious, I think in an interactive application this would give the LLM enough information to confirm what you were saying when a really critical piece of text is low confidence. The LLM doesn't just know what the most plausible words and phrases for the user to say are; it can also evaluate whether the overall gist is high or low confidence, and whether the resulting action is high or low risk.
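Marking suspicious text could look something like this sketch. It assumes the STT engine exposes a per-word confidence in [0, 1]; the `[?word?]` marker syntax and the 0.6 threshold are arbitrary choices for illustration.

```python
from typing import List, Tuple


def mark_suspicious(words: List[Tuple[str, float]], threshold: float = 0.6) -> str:
    """Wrap words whose confidence is below the threshold in [?...?] markers."""
    out = []
    for word, conf in words:
        out.append(f"[?{word}?]" if conf < threshold else word)
    return " ".join(out)


words = [("turn", 0.97), ("off", 0.95), ("the", 0.99), ("oven", 0.41)]
print(mark_suspicious(words))  # turn off the [?oven?]
```

The LLM prompt would then explain that bracketed words are low confidence, so it can ask for confirmation when one of them changes what action gets taken.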

miki123211 · 2025-06-12T12:03:59 1749729839

This is actually something people used to do.

Old ASR systems (even models like Wav2vec) were usually combined with a language model. It wasn't a large language model, since those didn't exist at the time; it was usually something based on n-grams.
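As a toy illustration of the idea (not how any particular ASR system implemented it), here is a sketch that rescores an n-best list with an add-one-smoothed bigram model. The tiny corpus, the acoustic scores, and the weighting are all invented for the example.

```python
import math
from collections import Counter
from typing import List, Tuple


class BigramLM:
    """Tiny bigram language model with add-one smoothing."""

    def __init__(self, corpus: List[str]):
        self.unigrams: Counter = Counter()
        self.bigrams: Counter = Counter()
        for sentence in corpus:
            tokens = ["<s>"] + sentence.lower().split()
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, sentence: str) -> float:
        """Sum of smoothed log P(word | previous word) over the sentence."""
        tokens = ["<s>"] + sentence.lower().split()
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.unigrams[prev] + self.vocab_size
            total += math.log(num / den)
        return total


def rescore(hypotheses: List[Tuple[str, float]], lm: BigramLM, lm_weight: float = 1.0) -> str:
    """Pick the hypothesis with the best acoustic score plus weighted LM score."""
    return max(hypotheses, key=lambda h: h[1] + lm_weight * lm.log_prob(h[0]))[0]


lm = BigramLM(["recognize speech", "it is hard to recognize speech", "speech is hard"])
# (text, acoustic log-score): the acoustic model slightly prefers the wrong one
nbest = [("wreck a nice beach", -4.0), ("recognize speech", -4.2)]
print(rescore(nbest, lm))  # recognize speech
```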

throwawaymaths · 2025-06-11T22:50:36 1749682236

do you know if any current locally hostable public transcribers are good at diarization? for some tasks having even crude diarization would improve QOL by a huge factor. i was looking at a whisper diarization python package for a bit but it was a bitch to deploy.

philipkiely · 2025-06-12T00:49:11 1749689351

WhisperX! https://github.com/basetenlabs/truss-examples/tree/main/whis...

throwawaymaths · 2025-06-12T02:14:08 1749694448

yeah as i said, i couldn't figure out how to deploy whisper-diarization.

genewitch · 2025-06-12T20:07:21 1749758841

so you need python - a full install - and git. Doesn't matter which OS. a python venv (virtual environment) ensures that this folder, once it works, is locked to all the versions inside it, including the python version. this works for any software that uses pip to set up, or any python stuff in general.

  git clone <whisper-diarization.git URL>
  cd whisper-diarization
  python -m venv .
  # then activate it; the script location depends on your OS [0]
  source bin/activate          # Linux/macOS
  # Scripts\activate.bat       # Windows cmd; Scripts\Activate.ps1 in PowerShell

your prompt should change to say

(whisper-diarization) <your OS prompt>$

now you can type

  pip install -c constraints.txt -r requirements.txt
  python ./diarize.py --no-stem --suppress_numerals --whisper-model large-v3-turbo --device cuda -a <FILE>

next time you want to use it, you can just do like

  cd ~/whisper-diarization
  source bin/activate          # or the Windows equivalent [0]
  python ./diarize.py [...]

[0] To activate a Python virtual environment created with venv, use the command

  source venv/bin/activate

on Linux or macOS, or

  venv\Scripts\activate

on Windows. This will change your terminal prompt to indicate that the virtual environment is active.

(the [0] note was 'AI generated' by DDG, and it assumes the venv lives in a folder called venv. with `python -m venv .` as above, linux puts the script in ./bin/activate and windows puts it in ./Scripts/Activate.ps1 (ideally))
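if you'd rather not remember which path applies, here is a small sketch that works it out and also checks whether the current interpreter is already running inside a venv. The helper names are made up; the `sys.prefix` vs `sys.base_prefix` comparison is the standard way to detect an active venv.

```python
import sys
from pathlib import PurePosixPath, PureWindowsPath


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv."""
    return sys.prefix != sys.base_prefix


def activate_script(venv_dir: str, windows: bool) -> str:
    """Path of the activation script for a venv created in venv_dir."""
    if windows:
        return str(PureWindowsPath(venv_dir) / "Scripts" / "Activate.ps1")
    return str(PurePosixPath(venv_dir) / "bin" / "activate")


print(activate_script(".", windows=False))  # bin/activate
print(activate_script(".", windows=True))   # Scripts\Activate.ps1
```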

iainmerrick · 2025-06-11T22:53:02 1749682382

Deepgram does it.

throwawaymaths · 2025-06-11T22:54:18 1749682458

sorry i meant locally hostable public. ill edit parent.