That's cool. I've created a website (https://papertube.site) that essentially transcribes video conversations for reading on Kindle. Right now, I'm relying on third-party APIs, but I was thinking about self-hosting to reduce costs.
It's like a reverse audiobook, but how do you tackle issues related to video content, since the visual medium contains more dimensions of information than just sound?
Anybody have recommendations for an easy way to grab "outbound" audio regardless of source? I meet with clients on a wide range of platforms and would love to be able to universally grab their audio to use here, regardless of which platform we're in. I know there are plenty of services, but I'd love to keep it all local.
Not sure what you mean by source/platforms, but you might find what you need for your operating system in that discussion from 3 weeks ago, which has links to options for macOS, Linux, and Windows.
So, if you wanted to transcribe the audio from one of your physical output devices (say your meeting software is outputting to external headphones), you could set the virtual audio device as the monitoring device for those physical headphones. You then end up with a virtual audio input device carrying the audio from your meeting software. I also do this to apply a chain of filters to my condenser mic in OBS, because it picks up everything.
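If you want to grab that virtual device programmatically, something like this works as a starting point, using the sounddevice and soundfile Python libraries (the device name is just an example of a virtual loopback device such as BlackHole or VB-Cable; swap in whatever yours is called):

    # Record from a virtual loopback device and save a WAV file you can feed to Whisper.
    import sounddevice as sd
    import soundfile as sf

    DEVICE_NAME = "BlackHole 2ch"  # example virtual device; list yours with sd.query_devices()
    SAMPLE_RATE = 16000            # Whisper expects 16 kHz mono
    DURATION = 60                  # seconds to capture

    recording = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                       channels=1, device=DEVICE_NAME)
    sd.wait()  # block until the recording is finished
    sf.write("meeting_audio.wav", recording, SAMPLE_RATE)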
Whisper works... kinda. I'm hoping another set of models gets released at some point; the error rate isn't a dealbreaker for me because I'm transcribing TV shows and radio shows for personal use, so it's not mission critical.
There are a few Whisper diarization "projects", but I've never been able to get them to work. Whisper does have word-level timestamps, so it should be simple to "plug in" diarization.
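Roughly what I mean, as an untested sketch: it assumes openai-whisper plus pyannote.audio 3.x, and the model sizes, file name, and HF token are placeholders:

    import whisper
    from pyannote.audio import Pipeline

    AUDIO = "meeting.wav"  # placeholder input file

    # 1. Word-level transcription with Whisper
    model = whisper.load_model("medium")
    result = model.transcribe(AUDIO, word_timestamps=True)

    # 2. Speaker turns from pyannote
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token="YOUR_HF_TOKEN"
    )
    diarization = pipeline(AUDIO)
    turns = [(t.start, t.end, spk)
             for t, _, spk in diarization.itertracks(yield_label=True)]

    def speaker_at(time):
        # Return the speaker whose turn contains this timestamp, if any.
        for start, end, spk in turns:
            if start <= time <= end:
                return spk
        return "UNKNOWN"

    # 3. Assign each word to the speaker active at its midpoint
    for segment in result["segments"]:
        for word in segment.get("words", []):
            mid = (word["start"] + word["end"]) / 2
            print(speaker_at(mid), word["word"].strip())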
I don't need an LLM or whatever this project has, but I will see if it's runnable and whether it's any better than what a couple of podcasts I listen to use.
edit: I see some people mentioning whisperX, which is one of those things that was cool until moving fast broke things:
>As of Oct 11, 2023, there is a known issue regarding slow performance with pyannote/Speaker-Diarization-3.0 in whisperX. It is due to dependency conflicts between faster-whisper and pyannote-audio 3.0.0. Please see this issue for more details and potential workarounds.
which means that what I gain is a ~3x increase in large-v2 speed, but I instantly lose those gains to diarization unless I track down 8-month-old bug workarounds.
I'll stick with the Python venv Whisper install I've been using for the last 16 months, tyvm.
Great hack, I like it a lot, thank you! Out of curiosity: transcription and diarization are very similar processes, the latter just adds "Speaker 1/2/3" to each paragraph. Why two different workflows?
Is there a transcription engine (?) that works in JavaScript? I'd love to make a browser extension that lets me transcribe WhatsApp audios instead of having to listen to them.
You batch them. If the token limit is 32k, for example, you summarize in batches of 32k tokens (including output), then summarize all the partial summaries.
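Very roughly, as a sketch (summarize() is a stand-in for whatever LLM call you make, and the 4-chars-per-token heuristic is an assumption; use a real tokenizer in practice):

    MAX_TOKENS = 32_000
    RESERVED_FOR_OUTPUT = 2_000

    def count_tokens(text):
        return len(text) // 4  # crude heuristic; swap in a real tokenizer

    def summarize(text):
        raise NotImplementedError("call your LLM of choice here")

    def chunk(transcript, budget):
        # Split the transcript into pieces that fit the context window.
        words, chunks, current = transcript.split(), [], []
        for w in words:
            current.append(w)
            if count_tokens(" ".join(current)) >= budget:
                chunks.append(" ".join(current))
                current = []
        if current:
            chunks.append(" ".join(current))
        return chunks

    def summarize_long(transcript):
        budget = MAX_TOKENS - RESERVED_FOR_OUTPUT
        partials = [summarize(c) for c in chunk(transcript, budget)]
        combined = "\n".join(partials)
        # Recurse if the partial summaries still don't fit in one window
        return summarize(combined) if count_tokens(combined) <= budget else summarize_long(combined)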
It's what we were doing at our company until Anthropic and others released larger-context-window LLMs. We do the speech-to-text locally (whisperX) and the summarization via API, though we've tried local LLMs too.
Well, it'll always depend on the length of the meeting to summarize. But they are using Mistral, which clocks in at 32k context. With an average of 150 spoken words per minute and 1 token ≈ 1 word (which is rather pessimistic), that's roughly 3.5 hours of meeting. So I guess that's okay?
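Back of the envelope with those numbers:

    context_tokens = 32_000
    words_per_minute = 150   # average speaking rate
    tokens_per_word = 1      # pessimistic assumption
    minutes = context_tokens / (words_per_minute * tokens_per_word)
    print(minutes, minutes / 60)  # ~213 minutes, ~3.5 hours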
Hmm. Interesting question. We had no issues using Mixtral 8x7B for this, perhaps reinforcing your point. We use fine-tuned Mistral-7B instances but not for long context stuff.
WhisperX along with whisper-diarization runs at around 20x real time on a modern GPU, so for that part you're looking at around $1 per twenty hours of content on a g5.xlarge, not counting the time to spin up a node (or around half that at spot prices, assuming you're much luckier than I am at getting stable spot instances these days).
You can short-circuit that spin-up time a bit with a prebaked AMI on AWS, but there's still some lag before a new node can start running at speed, around 10 minutes in my experience.
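Back of the envelope, folding in that spin-up overhead (the ~$1/hr on-demand rate for g5.xlarge is what the numbers above imply, not a quoted price):

    def cost_per_audio_hour(hours_of_audio, hourly_rate=1.0, speedup=20.0, spinup_minutes=10.0):
        # Instance-hours = transcription time at ~20x real time + node spin-up
        compute_hours = hours_of_audio / speedup + spinup_minutes / 60
        return hourly_rate * compute_hours / hours_of_audio

    print(cost_per_audio_hour(20))  # ~$0.06 per audio-hour for a 20-hour batch
    print(cost_per_audio_hour(1))   # ~$0.22 per audio-hour for a single hour

So the spin-up time matters a lot more for short one-off jobs than for big batches.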
I haven't looked at this particular solution yet, but I really find the LLMs to be hit or miss at summarizing transcripts. Sometimes it's impressive, sometimes it's literally "informal conversation between multiple people about various topics"
https://github.com/bugbakery/transcribee
It's noticeably a work in progress, but it does the job and has a nice UI for editing transcriptions, speakers, etc.
It's running on the CPU for me; it would be nice to have something that can make use of a 4 GB Nvidia GPU, which faster-whisper is actually able to do [1] (rough example below).
https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file...
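For anyone on a similar small GPU, this is roughly the incantation from [1], with int8 quantization so it fits in 4 GB (model size and file name are just examples):

    from faster_whisper import WhisperModel

    # int8 quantization keeps memory use low enough for a 4 GB card
    model = WhisperModel("small", device="cuda", compute_type="int8_float16")

    segments, info = model.transcribe("audio.mp3", beam_size=5)
    print("Detected language:", info.language, info.language_probability)
    for segment in segments:
        print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))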