Subtitle is now open-source (vedgupta.in)
117 points by yugtripathi on Nov 24, 2023 | 37 comments



Whisper already generates subtitles[0], supporting VTT and SRT, so this is just a thin wrapper around that.

[0]: https://github.com/openai/whisper/blob/e58f28804528831904c3b...


Yeah, you can just do whisper --output_format vtt, so I have no idea what this wrapper even adds.
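
If you'd rather do it from Python than the CLI, the whole thing is only a few lines anyway. A rough sketch of roughly what --output_format vtt does (the model name and file paths here are placeholders):

    # Transcribe an audio file with openai-whisper and write a .vtt by hand,
    # roughly what `whisper --output_format vtt` does. Paths/model are placeholders.
    import whisper

    def fmt(t: float) -> str:
        # Format seconds as a WebVTT timestamp: HH:MM:SS.mmm
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

    model = whisper.load_model("base")      # small model; pick per your hardware
    result = model.transcribe("talk.mp3")   # returns text, language, and timed segments

    with open("talk.vtt", "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for seg in result["segments"]:
            f.write(f"{fmt(seg['start'])} --> {fmt(seg['end'])}\n{seg['text'].strip()}\n\n")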

I wonder if the whole thing is just an AI-generated project. The "About Me" section is pretty illuminating (unabridged):

> I'm a Developer i will feel the code then write.


While this approach may seem simpler, this project's method uses a more optimized and faster model, resulting in improved efficiency and performance.


I looked through the source and have to agree with ipsum2... where is the part that isn't just wrapper code?


I used whisper-cpp on an Atom netbook to transcribe a short, old UK video from the '60s into written English (I am not a native speaker). I think it took less than an hour.


Do you have any references to back that up? To me this just sounds like a sales pitch.


All three of OP's posts point to the same developer. Just saying.


Yes, this is a sales pitch


I hate to be blunt but it seems like you don't even know what you're talking about.


I was surprised to see there were no ML-related dependencies (neither models nor libraries), so I had a look at the code: The models are downloaded from Huggingface, and the repo comes with a precompiled whisper.cpp binary to execute them.
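
In other words, the whole project more or less boils down to something like this. A rough sketch only: the Hugging Face repo id, file name, binary name, and flags are assumptions and vary between whisper.cpp builds:

    # Sketch of what such a wrapper amounts to: fetch a ggml model from Hugging Face
    # and shell out to a prebuilt whisper.cpp binary. Repo id, file name, binary
    # name, and flags are assumptions and differ between whisper.cpp versions.
    import subprocess
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(
        repo_id="ggerganov/whisper.cpp",   # assumed location of the ggml models
        filename="ggml-base.en.bin",
    )

    subprocess.run(
        ["./main", "-m", model_path, "-f", "input.wav", "--output-vtt"],
        check=True,  # whisper.cpp's example binary writes input.wav.vtt next to the input
    )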


Yes, for more info check the project references


A few things I don't understand...

* What languages are supported? Is there a list?

* What does 'subtitle' do, which 'whisper' doesn't?

* How do I install this system-wide on an apt-based system (in which pip install --system doesn't work)?


it's just whisper and some code that downloads the models from huggingface


I have a question: I have 200-300 hours of audio recordings of interviews. I am using Otter.ai to automate transcription, and for each recording I export a ".vtt" file of the transcript.

What I'd like to do is create a type of ebook of all these transcripts, where if I click on a word, then the corresponding audio will start playing from roughly the same point in time within the interview.

Otter can do this already (if I'm online and logged in to their website), but I don't want to be tied to their website forever. I'd like to have a local copy that can perform similarly. Amazon ebooks can do this as well, I believe, where there is a corresponding verbatim audiobook. However, this project of mine is purely personal. I won't be selling my audio recordings or transcripts.

Any advice? Could software discussed here be helpful in what I'm trying to accomplish?


This software won't help you.

If you already have a .vtt, this is not a hard exercise to do e.g. entirely in a browser: parse the .vtt (they're simple text), lay out the text as you like with each segment being a clickable element (e.g. a link), and hook that up to seek an `<audio>` element to where you like.
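
For instance, a rough sketch as a small Python script that turns the .vtt into a static HTML page with clickable cues (file names are placeholders, the VTT parsing is deliberately naive, and you get cue-level rather than word-level granularity):

    # Turn interview.vtt + interview.mp3 into a static HTML page where clicking
    # a cue seeks the <audio> element to that cue's start time.
    # File names are placeholders; the parsing below is deliberately naive.
    import re
    import html

    def parse_vtt(path):
        """Very naive WebVTT parser: returns (start_seconds, text) per cue."""
        cues, start, text = [], None, []
        ts = re.compile(r"(\d+):(\d+):(\d+)\.(\d+)\s+-->")
        for line in open(path, encoding="utf-8"):
            line = line.strip()
            m = ts.match(line)
            if m:
                h, mnt, s, ms = map(int, m.groups())
                start, text = h * 3600 + mnt * 60 + s + ms / 1000, []
            elif line and start is not None:
                text.append(line)
            elif not line and text:
                cues.append((start, " ".join(text)))
                start, text = None, []
        if text:
            cues.append((start, " ".join(text)))
        return cues

    spans = []
    for start, text in parse_vtt("interview.vtt"):
        seek = f"var a=document.getElementById('a');a.currentTime={start:.3f};a.play()"
        spans.append(f'<span style="cursor:pointer" onclick="{seek}">{html.escape(text)}</span>')

    page = f'<audio id="a" controls src="interview.mp3"></audio>\n<p>{" ".join(spans)}</p>'
    with open("interview.html", "w", encoding="utf-8") as f:
        f.write(page)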


Thank you. I am not technical enough to implement it myself, but I know enough to understand very well what you mean. I'll hire a contractor, and this helps me communicate effectively.


This would be a less-than-an-hour hack/prototype for a competent frontend engineer, FWIW.


AFAIK Whisper still can't handle multi-language content. If the audio has two languages (different narrators, for example), Whisper transcribes both of them during the first minute or so, and then either entirely skips one of the languages, or translates the foreign language to English, for the rest of the audio.

So, the value proposition of a subtitle-generating wrapper for Whisper would be to have an option to split audio into ~1 minute segments, transcribe them separately, and to somehow accurately join them. And I don’t think this one does such a thing.
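
Roughly what I mean, as a sketch (chunk length, paths, and model name are placeholders, and cutting blindly like this can split words at the chunk boundaries):

    # Cut audio into ~60 s pieces with ffmpeg's segment muxer, transcribe each
    # piece independently (so the language is re-detected per chunk), and shift
    # the timestamps back onto the original timeline.
    import glob
    import subprocess
    import whisper

    subprocess.run(
        ["ffmpeg", "-i", "input.mp3", "-f", "segment", "-segment_time", "60",
         "-c", "copy", "chunk_%03d.mp3"],
        check=True,
    )

    model = whisper.load_model("base")
    segments = []
    for i, chunk in enumerate(sorted(glob.glob("chunk_*.mp3"))):
        result = model.transcribe(chunk)          # language detected per chunk
        for seg in result["segments"]:
            segments.append({
                "start": seg["start"] + i * 60,   # shift back into the full timeline
                "end": seg["end"] + i * 60,
                "text": seg["text"].strip(),
            })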


I don't know what you're thinking of, but when I watch a movie I'm happy if all the subtitles are in the same language :) Ideally one that I know.


I could see myself using this; subtitling things is extremely time-consuming and there aren't that many tools that will automate it for you. It looks pretty straightforward to use: just two steps to install (if you already have FFmpeg and Python), and then one command to run the script. Well done!


If you find this project helpful, please consider starring the repo


@dang this person is clearly using sock puppet accounts on HN.


Which person??


Very interesting. Would it be (or is it already) possible to output the subtitles in a different language? For example, English to Icelandic?


I wonder how much more a model would learn about subtitles from including audio AND video in training. Sure, the costs would be way bigger (parsing video even deterministically is 1.5 orders of magnitude worse than audio) but it might help with the edge cases where the speech is so unclear even the subtitle scene can't agree.


[flagged]


this sounds like chatgpt drivel


Nah, Google Bard


There is also WhisperX, a modification of Whisper with accurate word timing and confidence scores.

It gives pretty good subtitles.
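
The basic flow from its README looks roughly like this (from memory, so treat the exact function names and arguments as assumptions and check the repo for the current API):

    # Rough sketch of WhisperX's two-pass flow: transcribe, then align against a
    # phoneme model for word-level timestamps. Names/arguments are from memory.
    import whisperx

    device = "cpu"                                   # or "cuda" with a GPU
    audio = whisperx.load_audio("talk.mp3")          # file name is a placeholder

    model = whisperx.load_model("large-v2", device)
    result = model.transcribe(audio)

    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)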


Could really benefit from an example of what comes out the other end of it, in this article and in the repo.


Maybe an off-topic comment.

I'm not a native English speaker and I tend to use the LiveCaption application on Linux when I attend English-speaking online meetings. Would love to be able to get subtitles in my native language (Greek) too while doing so.


I do the same with tech-oriented podcasts. They have clear speech, so transcribing them correctly is very easy. Non-native English speaker here, too.


Seems not to work - it fails to generate a VTT file:

https://github.com/innovatorved/subtitle/issues/6


Thank you, I was looking for a similar tool and was surprised by how difficult it was to find something with no bloat. Will give it a shot


I've gotten good results with whisperx when I needed to generate captions. https://github.com/m-bain/whisperX

There is currently a problem with diarization, but otherwise, it is SOTA.


Thank you for checking out the project! If you find it useful, please consider starring the repository.


What hardware do I need to run this locally? What languages are supported?


Siri

I hope Siri does something to improve. Its voice-to-text is still horrible for me.



