While that approach may seem simpler, this project uses a more optimized, faster model, so it ends up being noticeably more efficient.
I used whisper.cpp on an Atom netbook to transcribe a short, old UK video from the '60s into written English (I am not a native speaker). I think it took less than an hour.
I was surprised to see there were no ML-related dependencies (neither models nor libraries), so I had a look at the code: the models are downloaded from Hugging Face, and the repo comes with a precompiled whisper.cpp binary to execute them.
I have a question: I have 200-300 hours of audio recordings of interviews. I am using Otter.ai to automate transcription, and for each recording I export a ".vtt" file of the transcript.
What I'd like to do is create a type of ebook of all these transcripts, where if I click on a word, then the corresponding audio will start playing from roughly the same point in time within the interview.
Otter can do this already (if I'm online and logged in to their website), but I don't want to be tied to their website forever. I'd like to have a local copy that can perform similarly. Amazon ebooks can do this as well, I believe, where there is a corresponding verbatim audiobook. However, this project of mine is purely personal. I won't be selling my audio recordings or transcripts.
Any advice? Could software discussed here be helpful in what I'm trying to accomplish?
If you already have a .vtt, this is not a hard exercise to do e.g. entirely in a browser: parse the .vtt (they're simple text), lay out the text as you like with each segment being a clickable element (e.g. a link), and hook that up to seek an `<audio>` element to where you like.
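In case it helps to make that concrete, here's a rough, untested sketch in Python of the same idea: parse the .vtt cues and write out a single HTML page where each cue is a clickable span that seeks an `<audio>` element to the cue's start time. The file names ("interview01.vtt", "interview01.mp3", "interview01.html") are just placeholders for one of your recordings.

```python
import html
import re
from pathlib import Path

# Placeholder file names; point these at one of your own recordings.
VTT_FILE = Path("interview01.vtt")
AUDIO_FILE = "interview01.mp3"
OUT_FILE = Path("interview01.html")

# Matches the start time of a cue line like "00:01:02.345 --> 00:01:05.000"
# (the hours part is optional in .vtt files).
CUE_TIME = re.compile(r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s*-->")

def parse_vtt(path):
    """Return a list of (start_seconds, text) pairs, one per cue text line."""
    cues = []
    start = None
    for line in path.read_text(encoding="utf-8").splitlines():
        m = CUE_TIME.match(line.strip())
        if m:
            h = int(m.group(1) or 0)
            mnt, s, ms = int(m.group(2)), int(m.group(3)), int(m.group(4))
            start = h * 3600 + mnt * 60 + s + ms / 1000.0
        elif not line.strip():
            start = None           # blank line ends the current cue
        elif start is not None:
            cues.append((start, line.strip()))
    return cues

def build_page(cues):
    spans = "\n".join(
        f'<span class="cue" onclick="seek({start:.2f})">{html.escape(text)}</span>'
        for start, text in cues
    )
    return f"""<!doctype html>
<html><head><meta charset="utf-8"><style>
.cue {{ cursor: pointer; margin-right: 0.3em; }}
.cue:hover {{ background: #ffef9e; }}
</style></head><body>
<audio id="player" controls src="{AUDIO_FILE}"></audio>
<p>{spans}</p>
<script>
function seek(t) {{
  const p = document.getElementById("player");
  p.currentTime = t;
  p.play();
}}
</script>
</body></html>"""

if __name__ == "__main__":
    OUT_FILE.write_text(build_page(parse_vtt(VTT_FILE)), encoding="utf-8")
    print(f"Wrote {OUT_FILE}; open it in a browser next to {AUDIO_FILE}.")
```

A contractor could easily extend something like this to loop over all 200-300 recordings and generate an index page, so the whole thing works offline like the ebook you describe.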
Thank you. I'm not technical enough to implement it myself, but I know enough to understand very well what you mean. I'll hire a contractor, and this helps me communicate effectively.
AFAIK Whisper still can't handle multi-language content. If the audio has two languages (different narrators, for example), Whisper transcribes both of them during the first minute or so, and then either entirely skips one of the languages, or translates the foreign language to English, for the rest of the audio.
So, the value proposition of a subtitle-generating wrapper for Whisper would be to have an option to split the audio into ~1 minute segments, transcribe them separately, and somehow join the results accurately. And I don't think this one does that.
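To spell out what I mean, something like this rough, untested sketch: cut the audio into ~60 second pieces with ffmpeg, run Whisper on each piece so language detection happens per chunk, then shift the timestamps back when joining. The file names, chunk length, and the "base" model are just assumptions for illustration.

```python
import glob
import subprocess
import whisper  # pip install openai-whisper

AUDIO = "input.wav"        # assumption: a WAV file with mixed-language speech
CHUNK_SECONDS = 60

# 1. Cut the audio into ~60 s chunks (stream copy is fine for WAV input).
subprocess.run(
    ["ffmpeg", "-y", "-i", AUDIO, "-f", "segment",
     "-segment_time", str(CHUNK_SECONDS), "-c", "copy", "chunk_%03d.wav"],
    check=True,
)

model = whisper.load_model("base")

# 2. Transcribe each chunk on its own, so Whisper re-detects the language
#    per chunk, and 3. offset the timestamps back to the original timeline.
all_segments = []
for i, chunk in enumerate(sorted(glob.glob("chunk_*.wav"))):
    result = model.transcribe(chunk)   # language auto-detected per chunk
    offset = i * CHUNK_SECONDS
    for seg in result["segments"]:
        all_segments.append((seg["start"] + offset, seg["end"] + offset, seg["text"]))

for start, end, text in all_segments:
    print(f"[{start:8.2f} -> {end:8.2f}] {text.strip()}")
```

The hard part is the joining: cutting blindly every 60 seconds can split a word in half, so a real tool would probably want to cut on silence (e.g. with ffmpeg's silencedetect filter) and then reconcile segments that straddle a boundary.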
I could see myself using this; subtitling things is extremely time-consuming and there aren't that many tools that will automate it for you. It looks pretty straightforward to use - just two steps to install (if you already have FFmpeg and Python), and then one command to run the script.
Well done!
I wonder how much more a model would learn about subtitles from including audio AND video in training. Sure, the costs would be way bigger (parsing video even deterministically is 1.5 orders of magnitude worse than audio) but it might help with the edge cases where the speech is so unclear even the subtitle scene can't agree.
I'm not a native English speaker and I tend to use the LiveCaption application on Linux when I attend English-speaking online meetings. I would love to have subtitles in my native language (Greek) too while doing so.
I do the same with tech-oriented podcasts. They have clear speech, so transcribing them correctly is very easy.
Non-native English speaker here, too.