
The feature I want is speaker differentiation - I want to feed in an audio file and get back a transcript with "Speaker 1: ..., Speaker 2: ..." indications.

That plus timestamps would be incredible.

The Google Gemini 2.0 models are showing some promise with this, though I can't speak to their reliability just yet.




I've had good results in the past with pyannote and the following model for that use case: https://huggingface.co/pyannote/speaker-diarization-3.1


I thought Deepgram already did speaker diarization (which is differentiation) pretty well. It can also include timestamps and other metadata.


WhisperX does all of this. I use it all the time to transcribe meeting notes, and it gives both speaker differentiation and individual word timestamps.
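
For anyone wondering how the "Speaker 1: ..., Speaker 2: ..." output comes together once you have both pieces, here is a minimal sketch in plain Python. It assumes you already have timestamped transcript segments from some speech-to-text tool and speaker turns from some diarization tool; the `Turn` and `Segment` types, the function names, and the sample data are all hypothetical and are not the API of pyannote, Deepgram, or WhisperX. Each transcript segment is simply given the speaker whose turn overlaps it the most in time.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """A diarization result: who was speaking between start and end (seconds)."""
    start: float
    end: float
    speaker: str


@dataclass
class Segment:
    """A transcription result: what was said between start and end (seconds)."""
    start: float
    end: float
    text: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Pair each segment with the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "Unknown", 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, seg))
    return labeled


def format_timestamp(seconds):
    """Render whole seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render(labeled):
    """Produce one '[MM:SS] Speaker N: text' line per segment."""
    return "\n".join(
        f"[{format_timestamp(seg.start)}] {speaker}: {seg.text}"
        for speaker, seg in labeled
    )


# Made-up sample data, for illustration only.
turns = [Turn(0.0, 4.0, "Speaker 1"), Turn(4.0, 9.0, "Speaker 2")]
segments = [
    Segment(0.5, 3.5, "Hello there."),
    Segment(3.8, 8.0, "Hi, good to meet you."),
]

transcript = render(assign_speakers(segments, turns))
print(transcript)
```

Maximum overlap is only one possible assignment rule. It breaks down when two people talk over each other or when a single transcript segment spans a speaker change, which is one reason word-level timestamps are useful.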



