Pairing speech recognition with a LLM acting as a post-processor is a pretty good approach.
I put together a script a while back which converts any passed audio file (wav, mp3, etc.), normalizes the audio, passes it to ggerganov whisper for transcription, and then forwards to an LLM to clean the text. I've used it with a pretty high rate of success on some of my very old and poorly recorded voice dictation recordings from over a decade ago.
An LLM step also works pretty well for diarization. You get a transcript with speaker-segmentation (with whisper and pyannote for example), SPEAKER_01 says at some point „Hi I’m Bob. And here’s Alice“, SPEAKER_02 says „Hi Bob“ and now the LLM can infer that SPEAKER_01 = Bob and SPEAKER_02 = Alice.
Yep, my agent i built years ago worked very well with this approach, using a whisper-pyannote combo. The fun part is knowning when to end transcription in noisy environments like a coffee shop.
So in my experience smaller models tend to produce worse results BUT I actually got really good transcription cleanup with CoT (Chain of Thought models) like Qwen even quantized down to 8b.
I was going to say, ideally you’d be able to funnel alternates to the LLM, because it would be vastly better equipped to judge what is a reasonable next word than a purely phonetic model.
If you just give the transcript, and tell the LLM it is a voice transcript with possible errors, then it actually does a great job in most cases. I mostly have problems with mistranscriptions saying something entirely plausible but not at all what I said. Because the STT engine is trying to make a semantically valid transcription it often produces grammatically correct, semantically plausible, and incorrect transcriptions. These really foil the LLM.
Even if you can just mark the text as suspicious I think in an interactive application this would give the LLM enough information to confirm what you were saying when a really critical piece of text is low confidence. The LLM doesn't just know what are the most plausible words and phrases for the user to say, but the LLM can also evaluate if the overall gist is high or low confidence, and if the resulting action is high or low risk.
old ASR systems (even models like Wav2vec) were usually combined with a language model. It wasn't a large language model, those didn't exist at the time, it was usually something based on n-grams.
do you know if any current locally hostable public transcribers are good at diarization? for some tasks having even crude diarization would improve QOL by a huge factor. i was looking at a whisper diarization python package for a bit but it was a bitch to deploy.
so you need python - a full install, and git. Doesn't matter OS. python venv (virtual environment) ensures that this folder, once it works, is locked to all the versions inside it, including the python version. this works for any software that uses pip to set up, or any python stuff in general.
git clone <whisper-diarization.git URL>
cd whisper-diarization
python -m venv .
cd scripts
# and then depending on your OS it's activate.sh, activate.ps1, activate.bat, etc. so on linux [0]
your prompt should change to say
(whisper-diarization) <your OS prompt>$
now you can type
cd ..
pip install -c constraints.txt -r requirements.txt
python ./diarize.py --no-stem --suppress_numerals --whisper-model large-v3-turbo --device cuda -a <FILE>
next time you want to use it, you can just do like
cd ~/whisper-diarization
scripts/activate.sh (or whatever) [0]
python ./diarize.py [...]
[0]
To activate a Python virtual environment created with venv, use the command
source venv/bin/activate
on Linux or macOS, or
venv\Scripts\activate
on Windows. This will change your terminal prompt to indicate that the virtual environment is active.
(the [0] note was 'AI generated' by DDG, but whatever, linux puts it in ./bin/activate and windows puts it in ./Scripts/activate.ps1 (ideally))
(I've yet to experiment with giving the LLM alternate transcriptions or confidence levels, but I bet they could make good use of that too)