That's cool. I've created a website (https://papertube.site) that essentially transcribes video conversations for reading on Kindle. Right now, I'm relying on third-party APIs, but I was thinking about self-hosting to reduce costs.
It's like a reverse audiobook, but how do you tackle issues related to video content, since the visual medium contains more dimensions of information than just sound?
Anybody have recommendations for an easy way to grab "outbound" audio regardless of source? I meet with clients on a wide range of platforms and would love to be able to universally grab their audio to use here, regardless of which platform we're in. I know there are plenty of services, but I'd love to keep it all local.
Not sure what you mean by source/platforms, but you might find what you need for your operating system in that discussion from 3 weeks ago, which has links to options for macOS, Linux, and Windows.
So, if you wanted to transcribe the audio from one of your physical output devices (say your meeting software is outputting to external headphones), you could set the virtual audio device as the monitoring device for those physical headphones. You then end up with a virtual audio input device carrying the audio from your meeting software. I also do this to apply a chain of filters to my condenser mic in OBS, because it picks up everything.
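If you want to grab that virtual device programmatically, something like this works as a starting point, using the sounddevice and soundfile Python libraries (the device name is just an example of a virtual loopback device such as BlackHole or VB-Cable; swap in whatever yours is called):

    # Record from a virtual loopback device and save a WAV file you can feed to Whisper.
    import sounddevice as sd
    import soundfile as sf

    DEVICE_NAME = "BlackHole 2ch"  # example virtual device; list yours with sd.query_devices()
    SAMPLE_RATE = 16000            # Whisper expects 16 kHz mono
    DURATION = 60                  # seconds to capture

    recording = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                       channels=1, device=DEVICE_NAME)
    sd.wait()  # block until the recording is finished
    sf.write("meeting_audio.wav", recording, SAMPLE_RATE)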
Whisper works... kinda. I'm hoping another set of models gets released at some point; the error rate isn't a dealbreaker for me because I'm transcribing TV shows and radio shows for personal use, so it's not mission critical.
There are a few Whisper diarization "projects", but I've never been able to get them to work. Whisper does have word-level timestamps, so it should be simple to "plug in" diarization.
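Roughly what I mean, as an untested sketch: it assumes openai-whisper plus pyannote.audio 3.x, and the model sizes, file name, and HF token are placeholders:

    import whisper
    from pyannote.audio import Pipeline

    AUDIO = "meeting.wav"  # placeholder input file

    # 1. Word-level transcription with Whisper
    model = whisper.load_model("medium")
    result = model.transcribe(AUDIO, word_timestamps=True)

    # 2. Speaker turns from pyannote
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token="YOUR_HF_TOKEN"
    )
    diarization = pipeline(AUDIO)
    turns = [(t.start, t.end, spk)
             for t, _, spk in diarization.itertracks(yield_label=True)]

    def speaker_at(time):
        # Return the speaker whose turn contains this timestamp, if any.
        for start, end, spk in turns:
            if start <= time <= end:
                return spk
        return "UNKNOWN"

    # 3. Assign each word to the speaker active at its midpoint
    for segment in result["segments"]:
        for word in segment.get("words", []):
            mid = (word["start"] + word["end"]) / 2
            print(speaker_at(mid), word["word"].strip())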
I don't need an LLM or whatever this project has, but I will see if it's runnable and whether it's any better than what a couple of podcasts I listen to use.
edit: I see some people mentioning whisperX, which is one of those things that was cool until moving fast broke things:
>As of Oct 11, 2023, there is a known issue regarding slow performance with pyannote/Speaker-Diarization-3.0 in whisperX. It is due to dependency conflicts between faster-whisper and pyannote-audio 3.0.0. Please see this issue for more details and potential workarounds.
which means that what I gain is a ~3x increase in large-v2 speed, but I instantly lose those gains to diarization unless I track down 8-month-old bug workarounds.
I'll stick with the Python venv Whisper install I've been using for the last 16 months, tyvm.
Great hack, I like it a lot, thank you! Out of curiosity: transcription and diarization are very similar processes, the latter just adds "Speaker 1/2/3" to each paragraph. Why two different workflows?
Is there a transcription engine (?) that works in JavaScript? I'd love to make a browser extension that lets me transcribe WhatsApp audios instead of having to listen to them.
You batch them. If the token limit is 32k, for example, you summarize in batches of 32k tokens (including output), then summarize all the partial summaries.
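Very roughly, as a sketch (summarize() is a stand-in for whatever LLM call you make, and the 4-chars-per-token heuristic is an assumption; use a real tokenizer in practice):

    MAX_TOKENS = 32_000
    RESERVED_FOR_OUTPUT = 2_000

    def count_tokens(text):
        return len(text) // 4  # crude heuristic; swap in a real tokenizer

    def summarize(text):
        raise NotImplementedError("call your LLM of choice here")

    def chunk(transcript, budget):
        # Split the transcript into pieces that fit the context window.
        words, chunks, current = transcript.split(), [], []
        for w in words:
            current.append(w)
            if count_tokens(" ".join(current)) >= budget:
                chunks.append(" ".join(current))
                current = []
        if current:
            chunks.append(" ".join(current))
        return chunks

    def summarize_long(transcript):
        budget = MAX_TOKENS - RESERVED_FOR_OUTPUT
        partials = [summarize(c) for c in chunk(transcript, budget)]
        combined = "\n".join(partials)
        # Recurse if the partial summaries still don't fit in one window
        return summarize(combined) if count_tokens(combined) <= budget else summarize_long(combined)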
It's what we were doing at our company until Anthropic and others released larger-context-window LLMs. We do the speech-to-text locally (whisperX) and the summarization via API, though we've tried local LLMs too.
Well, it'll always depend on the length of the meeting to summarize. But they are using Mistral, which clocks in at 32k context. With an average of 150 spoken words per minute and 1 token ≈ 1 word (which is rather pessimistic), that's roughly 3.5 hours of meeting. So I guess that's okay?
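Back of the envelope with those numbers:

    context_tokens = 32_000
    words_per_minute = 150   # average speaking rate
    tokens_per_word = 1      # pessimistic assumption
    minutes = context_tokens / (words_per_minute * tokens_per_word)
    print(minutes, minutes / 60)  # ~213 minutes, ~3.5 hours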
Hmm. Interesting question. We had no issues using Mixtral 8x7B for this, perhaps reinforcing your point. We use fine-tuned Mistral-7B instances but not for long context stuff.
WhisperX along with whisper-diarization runs at around 20x real time on a modern GPU, so for that part you're looking at around $1 per twenty hours of content on a g5.xlarge, not counting the time to spin up a node (or around half that at spot prices, assuming you're much luckier than I am at getting stable spot instances these days).
You can short-circuit that spin-up time a bit with a prebaked AMI on AWS, but there's still some lag before a new node can start running at speed, around 10 minutes in my experience.
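Back of the envelope, folding in that spin-up overhead (the ~$1/hr on-demand rate for g5.xlarge is what the numbers above imply, not a quoted price):

    def cost_per_audio_hour(hours_of_audio, hourly_rate=1.0, speedup=20.0, spinup_minutes=10.0):
        # Instance-hours = transcription time at ~20x real time + node spin-up
        compute_hours = hours_of_audio / speedup + spinup_minutes / 60
        return hourly_rate * compute_hours / hours_of_audio

    print(cost_per_audio_hour(20))  # ~$0.06 per audio-hour for a 20-hour batch
    print(cost_per_audio_hour(1))   # ~$0.22 per audio-hour for a single hour

So the spin-up time matters a lot more for short one-off jobs than for big batches.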
I haven't looked at this particular solution yet, but I really find the LLMs to be hit or miss at summarizing transcripts. Sometimes it's impressive, sometimes it's literally "informal conversation between multiple people about various topics"
https://github.com/bugbakery/transcribee
It's noticeably a work in progress, but it does the job and has a nice UI for editing transcriptions, speakers, etc.
It's running on the CPU for me; it would be nice to have something that can make use of a 4 GB Nvidia GPU, which faster-whisper is actually able to do [1] (rough example below).
https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file...
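For anyone on a similar small GPU, this is roughly the incantation from [1], with int8 quantization so it fits in 4 GB (model size and file name are just examples):

    from faster_whisper import WhisperModel

    # int8 quantization keeps memory use low enough for a 4 GB card
    model = WhisperModel("small", device="cuda", compute_type="int8_float16")

    segments, info = model.transcribe("audio.mp3", beam_size=5)
    print("Detected language:", info.language, info.language_probability)
    for segment in segments:
        print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))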