Self-hosted offline transcription and diarization service with LLM summary (github.com/transcriptionstream)
200 points by indigodaddy 8 months ago | 37 comments



I've been using this:

https://github.com/bugbakery/transcribee

It's noticeably a work in progress, but it does the job and has a nice UI for editing transcriptions, speakers, etc.

It's running on the CPU for me; it would be nice to have something that can make use of a 4GB Nvidia GPU, which faster-whisper is actually able to do [1]

https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file...
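For reference, getting faster-whisper onto a small GPU is mostly a matter of picking a quantized compute type. A minimal sketch, assuming a CUDA-capable card; the "small" model name, the int8 compute type, and the file name are just illustrative:

  from faster_whisper import WhisperModel

  # int8 weights keep the "small" model comfortably inside 4 GB of VRAM
  model = WhisperModel("small", device="cuda", compute_type="int8_float16")

  segments, info = model.transcribe("meeting.wav", beam_size=5)
  print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
  for seg in segments:
      print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")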


I couldn’t get the container to run for the life of me.


Yeah, the Docker setup is broken; I just worked around it locally:

https://github.com/bugbakery/transcribee/issues/427#issuecom...


That's cool. I've created a website (https://papertube.site) that essentially transcribes video conversations for reading on a Kindle. Right now I'm relying on third-party APIs, but I was thinking about self-hosting to reduce costs.


It's like a reverse audiobook, but how do you tackle issues related to video content, since the visual medium carries more dimensions of information than sound alone?


Not handling video content here. The focus is on cases where no visual content is needed, such as podcasts, TED talks, and conversations.


Oh, that's awesome. I wanted to build that as well. Tiny nitpick: left-aligned text is easier to read than justified, imo.


Are you interested in collaboration or JV?


Not OP, but what's JV?


From the context it seems to mean Joint Venture?


Built something similar for podcasts

https://www.podsnacks.org/


Anybody have recommendations for an easy way to grab "outbound" audio regardless of source? I meet with clients on a wide range of platforms and would love to be able to universally grab their audio to use here, regardless of what platform we're in. I know there's plenty of services, but would love to keep it all local.


BlackHole: macOS Audio Loopback Driver

https://news.ycombinator.com/item?id=40270219

Not sure what you mean by source/platforms; you might find what you need for your operating system in that discussion from three weeks ago, which links to options for macOS, Linux, and Windows.


Keep in mind that BlackHole has to be licensed for commercial use.


Ha, forgot to specify Windows. I'll dig into the discussion, thanks!


I do what you're describing with VB-CABLE on Windows (and BlackHole 2ch on macOS) + OBS

https://vb-audio.com/Cable/index.htm

So, if you wanted to transcribe the audio from one of your physical output devices (say your meeting software outputs to external headphones), you could set the virtual audio device as the monitoring device for those headphones. That way you end up with a virtual audio input device carrying the audio from your meeting software. I also do this to apply a chain of filters to my condenser mic in OBS, because it picks up everything.
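If you'd rather pull that virtual device straight into a script than go through OBS, here's a rough sketch with the sounddevice/soundfile Python libraries. The device names are assumptions (run query_devices() to find yours), as is the file name:

  import sounddevice as sd
  import soundfile as sf

  print(sd.query_devices())  # find the virtual cable's capture side

  DEVICE = "CABLE Output"    # VB-CABLE on Windows; "BlackHole 2ch" on macOS
  SAMPLE_RATE = 48000
  SECONDS = 60

  # record one minute of whatever the meeting software is playing
  audio = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                 channels=2, device=DEVICE)
  sd.wait()  # block until the buffer is full
  sf.write("client_audio.wav", audio, SAMPLE_RATE)

The resulting WAV can then be fed to whatever transcription setup you're running locally.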


I was able to build something that does all this, more or less, in a couple weeks. It works really well.

I wanted to be able to transcribe and diarize in realtime though, which is much harder. Didn't manage to make that happen.


Amazing. I’ll see if I can get this working on Mac too. I have so many use cases for this.

30 years of audio that needs transcribing, summaries, and worksheets made out of them.


Whisper works ... kinda. I'm hoping there's another set of models released at some point. The error rate isn't appalling to me because I am transcribing TV shows and radio shows for personal use, so it's not mission-critical.

There are a few whisper diarization "projects", but I've never been able to get any of them to work. Whisper does have word-level timestamps, so it should be simple to "plug in" diarization.
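In principle that "plug in" step is just an interval join: give each word the speaker whose diarization turn overlaps it most. A minimal sketch, assuming you already have whisper's word timestamps and pyannote-style speaker turns as plain dicts:

  def assign_speakers(words, turns):
      # words: [{"word", "start", "end"}], turns: [{"speaker", "start", "end"}]
      labeled = []
      for w in words:
          best, best_overlap = "UNKNOWN", 0.0
          for t in turns:
              # length of the intersection of the two time intervals
              overlap = min(w["end"], t["end"]) - max(w["start"], t["start"])
              if overlap > best_overlap:
                  best, best_overlap = t["speaker"], overlap
          labeled.append({**w, "speaker": best})
      return labeled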

I don't need an LLM or whatever this project has, but I will see if it's runnable and whether it's any better than what a couple of podcasts I listen to use.

Edit: I see some people mentioning whisperX, which is one of those things that was cool until moving fast broke things:

>As of Oct 11, 2023, there is a known issue regarding slow performance with pyannote/Speaker-Diarization-3.0 in whisperX. It is due to dependency conflicts between faster-whisper and pyannote-audio 3.0.0. Please see this issue for more details and potential workarounds.

which means that the ~3x increase in large-v2 speeds I gain is instantly lost to diarization, unless I track down 8-month-old bug workarounds.

I'll stick with the Python venv whisper install I've been using for the last 16 months, tyvm


Re: diarization, I had decent results testing this on Colab a while ago:

https://github.com/MahmoudAshraf97/whisper-diarization

I remember hitting the usual Python package hell when NeMo was updated at some point, but the repo seems to be decently well maintained, so give it a go.

Edit: I remember reading somewhere that pyannote was a weak link in other repos; that might be why your other tests were not great.


I would love to hear more about your use case!


I was just building out one of these for myself and working through some dependency issues! Cool!!


Great hack, I like it a lot, thank you! Out of curiosity: transcription and diarization are very similar processes; the latter just adds "Speaker 1/2/3" to each paragraph. Why two different workflows?


Is there a transcription engine that works in JavaScript? I'd love to make a browser extension that lets me transcribe WhatsApp audio messages instead of having to listen to them.


There are examples of whisper.cpp that run in the browser. Pretty laggy, but they work.



I thought local LLMs were unable to summarize large documents due to limited token counts or something like that? Can someone ELI5?


You batch them. If the token limit is 32k, for example, you summarize in chunks of up to 32k tokens (including the output), then summarize all the partial summaries.
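A rough sketch of that map-reduce loop; llm_summarize and count_tokens are hypothetical stand-ins for your model call and tokenizer:

  def recursive_summarize(text, llm_summarize, count_tokens,
                          limit=32_000, output_budget=4_000):
      # base case: the whole text fits in one call, with room for the output
      if count_tokens(text) <= limit - output_budget:
          return llm_summarize(text)
      # map: split on paragraph boundaries into window-sized chunks
      # (a real version would also split oversized single paragraphs)
      chunks, current = [], ""
      for para in text.split("\n\n"):
          if current and count_tokens(current + para) > limit - output_budget:
              chunks.append(current)
              current = ""
          current += para + "\n\n"
      chunks.append(current)
      partials = [llm_summarize(c) for c in chunks]
      # reduce: recurse in case the joined partial summaries still overflow
      return recursive_summarize("\n\n".join(partials), llm_summarize,
                                 count_tokens, limit, output_budget)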

It's what we were doing at our company until Anthropic and others released larger-context-window LLMs. We do the STT locally (whisperX) and the summarization via API, though we've tried with local LLMs too.


Well, it'll always depend on the length of the meeting being summarized. But they are using Mistral, which clocks in at 32k context. At an average of 150 spoken words per minute and 1 token ~= 1 word (which is rather pessimistic), that's roughly three and a half hours of meeting. So I guess that's okay?
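Spelling out the back-of-the-envelope math (the 1-token-per-word ratio is the pessimistic assumption above; English usually runs closer to ~1.3 tokens per word):

  context = 32_000        # Mistral's advertised window, in tokens
  tokens_per_word = 1.0   # pessimistic assumption from above
  words_per_minute = 150  # typical speaking rate

  minutes = context / tokens_per_word / words_per_minute
  print(f"{minutes:.0f} min ~= {minutes / 60:.1f} h")  # 213 min ~= 3.6 h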


  mistral which clocks at 32k context
I may be wrong, but my understanding was/is:

- Mistral can handle 32k context, but only using sliding window attention. So it can't really process all 32k tokens at once.

- Mixtral (note the 'x') 8x7B can handle 32k context without resorting to sliding window attention.

I wonder whether Mistral would do a better job summarizing a long (32k token) doc all at once, or using recursive summarization.


Hmm. Interesting question. We had no issues using Mixtral 8x7B for this, perhaps reinforcing your point. We use fine-tuned Mistral-7B instances but not for long context stuff.

Maybe a neat eval to try.


What is the cost compared with something like the Whisper API, assuming one would use commodity cloud GPUs for self-hosting?


WhisperX along with whisper-diarization runs at around 20x real time on a modern GPU, so for that part you're looking at around $1 per twenty hours of content on a g5.xlarge, not counting the time to bring up a node (or around half that at spot prices, assuming you're much luckier than I am at getting stable spot instances these days).
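The arithmetic behind that $1 figure, assuming the g5.xlarge on-demand rate of about $1.01/hr (us-east-1 at the time; check current pricing):

  audio_hours = 20
  speedup = 20                  # WhisperX throughput vs. real time
  rate_per_hour = 1.006         # g5.xlarge on-demand, $/hr (assumed)

  gpu_hours = audio_hours / speedup   # 1.0 GPU-hour
  cost = gpu_hours * rate_per_hour    # ~$1.01 per 20 h of audio
  print(f"{gpu_hours:.1f} GPU-hour(s) -> ${cost:.2f}")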

You can short-circuit that node bring-up time a bit with a prebaked AMI on AWS, but there's still some delay before a new node can run at speed, around 10 minutes in my experience.

I haven't looked at this particular solution yet, but I really find the LLMs to be hit or miss at summarizing transcripts. Sometimes it's impressive; sometimes it's literally "informal conversation between multiple people about various topics".


For $5 per 20 hours of audio, you can try https://deepgram.com.

They give $200 of credit.


Can this translate too? As in transcribe audio and then give output in two languages?


[flagged]


Why does your website just say “ROFLMAO” — pretty hard to take seriously.


I'll refer you to the site guidelines: https://news.ycombinator.com/newsguidelines.html#comments

I will follow the rules and reserve the snark about http vs smtp



