Show HN: Bulk Creation of Transcripts from YouTube Playlists with Whisper (github.com/dicklesworthstone)
125 points by eigenvalue on Nov 13, 2023 | 43 comments
I know there are various tools that are supposed to make this easy, but I couldn't find anything that did everything I wanted, so I made this today for fun. The web-based offerings all take forever and seem flaky, and you need to process one video at a time, with no control over the transcription settings. In contrast, my script lets you convert a whole playlist in bulk with full control over everything.

It's truly easy to use-- you can clone the repo, install into a venv, and be generating a folder full of high-quality transcript text files in under 5 minutes. All you need to do is supply the URL to a YouTube playlist or to an individual video and this tool does the rest automatically. It uses faster-whisper with a high beam_size, so it's a bit slower than you might expect, but this does result in higher accuracy. The best way to use this is to take an existing playlist, or create a new one on YouTube, start this script up, and come back the next morning with all your finished transcripts. It attempts to "upgrade" the output of Whisper by taking all the transcript segments, gluing them together, and then splitting them back into sentences (it uses spaCy for this, or a simpler regex-based function). For each video in the playlist, you end up with a single text file containing the full transcript, ready to go, with a sensible file name based on the title of the video.
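The sentence "upgrade" step is basically this (a simplified sketch, not the exact code from the repo -- the function and variable names here are illustrative):

  import re
  import spacy

  def segments_to_sentences(segments):
      # Glue the Whisper segment texts into one string, then re-split
      # into sentences (segments is assumed to be a list of strings).
      full_text = " ".join(seg.strip() for seg in segments)
      try:
          nlp = spacy.load("en_core_web_sm")
          return [sent.text.strip() for sent in nlp(full_text).sents]
      except OSError:
          # spaCy model not downloaded; fall back to a simple regex split
          return [s.strip() for s in re.split(r"(?<=[.!?])\s+", full_text) if s]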

If you have CUDA installed, it will try to use it, but as with all things CUDA, it's annoyingly fragile and picky, so don't be surprised if you get a CUDA error even if you know for a fact CUDA is installed on your system. If you're looking for reliability, disable CUDA. But if you need to transcribe a LOT of videos, it does go much, much faster on a GPU.

Even if you don't have a GPU, if you have a powerful machine with a lot of RAM and cores, this script will fully saturate those cores and can download and process multiple videos at the same time. The default settings are pretty good for that situation. But if you have a slower machine, you might want to use a smaller Whisper model (like `base.en` or even `tiny.en`) and dial down the beam_size to 2.
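To give a sense of what those knobs look like, the core faster-whisper usage boils down to something like this (a sketch with illustrative settings; the script's actual structure and defaults differ):

  from faster_whisper import WhisperModel

  try:
      # Try CUDA first; this is the part that fails on fragile setups.
      model = WhisperModel("base.en", device="cuda", compute_type="float16")
  except Exception:
      # Reliable CPU fallback; int8 keeps memory use and latency reasonable.
      model = WhisperModel("base.en", device="cpu", compute_type="int8")

  # A lower beam_size trades some accuracy for speed on slower machines.
  segments, info = model.transcribe("some_video_audio.mp4", beam_size=2)
  for seg in segments:
      print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")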




You might want to look into diarization too; http://gladia.io/ seems to be doing it well.

It makes a great difference to have transcripts with speaker annotation.


Thanks, I haven't seen an easy and reliable way to do this using open source stuff yet. Theoretically, just separating out speakers seems like it wouldn't be that hard: just compute a bunch of FFTs to arrive at a sort of frequency-based "voice fingerprint" for each speaker, and then use something like XGBoost to match up the audio for each second to one of the speakers. The problem then is what do you do with that information? Turning those abstract speaker identifications into actual names would seem to require a fair bit of intelligence and picking up on contextual clues (like if the speaker identifies themselves or introduces another person by name). Anyway, I'll look into it more. If it could be done reliably without overly complicating the setup, I agree that it would be useful.
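Just to sketch the naive version of that idea (hypothetical toy code -- I used unsupervised clustering rather than XGBoost here, since you don't have labeled speakers to train on):

  import numpy as np
  import librosa
  from sklearn.cluster import KMeans

  def naive_diarize(path, n_speakers=2, sr=16000):
      # One FFT-magnitude "fingerprint" per second of audio, clustered
      # into n_speakers groups. Real diarization is much harder than this.
      audio, _ = librosa.load(path, sr=sr, mono=True)
      feats = []
      for i in range(len(audio) // sr):
          spectrum = np.abs(np.fft.rfft(audio[i * sr:(i + 1) * sr]))
          feats.append(spectrum / (np.linalg.norm(spectrum) + 1e-9))
      labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(np.array(feats))
      return [(i, f"Speaker {chr(65 + int(label))}") for i, label in enumerate(labels)]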


I built this for our podcast; I'm sure you can reuse it with YT as a source rather than a raw audio file: https://github.com/FanaHOVA/smol-podcaster


The whisperX project has most of it covered. They are integrating the new v3 model (still waiting on a cpp/GGUF version), and there is a workaround for the latest speaker diarization. It has an active user base working on it.

https://github.com/m-bain/whisperX


Thanks, took a look at it. It seems quite heavy though, with lots of huge dependencies like PyTorch and torchaudio, and it seems like the speaker diarization requires a GPU if I'm not mistaken. And as another poster pointed out, it does require a Hugging Face API key as well.

I wanted to keep my script lighter weight and also GPU optional (i.e., a GPU will work and make it faster, but it also works acceptably with just the CPU). I really feel in my gut that the speaker diarization doesn't need to be so complicated or hard once you already have the accurate timestamps of each transcribed segment and the underlying audio file-- no reason why it shouldn't be able to run fine on a CPU and get good enough accuracy.


I was dismayed to learn that this requires OpenAI API keys for speaker diarization.


Does it? I only see references to Hugging Face API keys so it can do a one-time download of some additional models.


Oh, maybe that's what I saw. I'll have to look again. Thanks for keeping me honest.


That part can be user input, if it needs to be. Sort of like post-processing.

The founder of Gladia shares some information on Twitter, and I think there are some research papers you can find through that (my memory is fuzzy on when). It's not a simple problem, especially when words include filler sounds like "umm" etc.

For me, the main use for such cases would be podcasts. Sometimes I just want to read them without listening.


> Turning those abstract speaker identifications into actual names would seem to require a fair bit of intelligence

As someone looking for this functionality, this is the easiest part for me to do manually. Just give me "Speaker A, Speaker B, Speaker C", and I can change their names. It's the breaking apart of audio into separate speakers that's difficult - I'm trying out a few different tools, and none of them do a great job so far.
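i.e., once a tool emits generic labels, the rename is a trivial post-processing pass (throwaway sketch):

  def rename_speakers(transcript, names):
      # Swap generic diarization labels for real names after a manual review.
      for label, name in names.items():
          transcript = transcript.replace(label, name)
      return transcript

  print(rename_speakers("Speaker A: hi\nSpeaker B: hey",
                        {"Speaker A": "Alice", "Speaker B": "Bob"}))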


To zoom out a bit, this is known as the 'cocktail party problem' and is very much considered unsolved (particularly when there is an unknown number of speakers or overlapping speech).


Wow, why are they so expensive? Even OpenAI's own Whisper API is less expensive.

This is also why I decided to create https://www.betterwhisperapi.com/ . I believe most of the companies are charging pretty insane amounts for transcriptions...


Deepgram is really good and around your price point too. They also have $200 in free credit, which should be more than enough for most hobby projects.


This is awesome man. We attempted to build something similar and wound up giving up and pivoting to transcripts w/ a punctuation model to enhance them.

If this was around at the time, we likely would have been able to make audio work.

Kudos for your work on this. It seems truly well architected and thought out. The spaCy integration is especially awesome.


Thanks! I spent a decent amount of time messing around with regex nonsense before I realized spaCy could work for this. I decided to leave the regex approach in as an option anyway, since it still works reasonably well and is lighter weight.


Hoping this can help me cut down the time I spend watching YT videos for uni. Outputting 20-30 min videos into a .txt and feeding it to ChatGPT for summarizing. Thanks!


Coincidentally, I threw something together this weekend that attempts to do just that. [0] It's really simple: it just extracts subtitles and feeds them to ChatGPT to generate a markdown "article".

[0] https://vreader.va.reichard.io/

[1] https://gitea.va.reichard.io/evan/VReader
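The general recipe is roughly this (a sketch of the approach, not VReader's actual code -- the prompt wording and model settings are guesses):

  from youtube_transcript_api import YouTubeTranscriptApi
  from openai import OpenAI

  def video_to_article(video_id):
      # Grab the existing subtitles (no Whisper involved) and flatten to text.
      lines = YouTubeTranscriptApi.get_transcript(video_id)
      transcript = " ".join(item["text"] for item in lines)
      client = OpenAI()  # reads OPENAI_API_KEY from the environment
      resp = client.chat.completions.create(
          model="gpt-3.5-turbo",
          messages=[
              {"role": "system",
               "content": "Rewrite this video transcript as a clean markdown article."},
              {"role": "user", "content": transcript},
          ],
      )
      return resp.choices[0].message.content

Long videos would need the transcript chunked to fit the model's context window.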


This is interesting. I assume the samples on the website you linked are all precomputed outputs, right?


Well, it's live, so you can throw a YouTube URL into it right now and get an auto-generated article from GPT-3.5 in about 15 seconds. The items on the site right now are from myself and others using it. It depends on the video length, but each generated article costs about $0.005 in API usage.

Edit: Ah just understood your question (coffee just kicked in). Yes, they're precomputed from when others used it. I save the generated articles in markdown format and sort by most recently generated.


Not sure why people are downvoting.

I can confirm for some types of lectures this is a wholly legitimate approach.


You might want to try Scribe, which I developed last week; it adds punctuation to the raw YouTube transcripts so you can read them more easily. It also adds chapters every 3 paragraphs so you can more easily skim the content. It all runs in your browser using 2 models: https://www.appblit.com/scribe


Not for uni, but I was thinking something similar. Find or create a playlist, have this tool transcribe it, feed all that into ChatGPT or similar, and have it output a summary, either brief (i.e., highlights) or covering the full depth & breadth of all the videos.


I don't have any experience with Python. Can someone point me to definitive (and idiot-proof) tutorials for Win 10 Pro and Mac OS? Maybe using Docker?

I just don't know but am willing to try. I'd rather ask here than be subjected to a search and its SEO hell (read: SERPs of questionable results).

Or perhaps there's a way to use DigitalOcean or similar so I'm not tying up my local machine?


Just install Anaconda:

https://www.anaconda.com/download

For Windows or Mac. If you do it on Windows, it's probably easier if you let the installer add Python to your Windows system path. Then you should be able to open PowerShell and type "python" and not get an error. Once you can do that, just run these commands on Windows:

  git clone https://github.com/Dicklesworthstone/bulk_transcribe_youtube_videos_from_playlist
  cd bulk_transcribe_youtube_videos_from_playlist
  python -m venv venv
  .\venv\Scripts\activate
  python -m pip install --upgrade pip
  python -m pip install wheel
  pip install -r requirements.txt
Then edit the file "bulk_transcribe_youtube_videos_from_playlist.py" to set the URL (if you want a playlist, use that playlist's URL and change the `convert_single_video = 1` part to 0 instead), and finally run it with:

  python bulk_transcribe_youtube_videos_from_playlist.py
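For reference, the lines you're editing near the top of the script look roughly like this (`convert_single_video` is the real setting; the URL variable name here is a guess, so check the actual file):

  convert_single_video = 0  # 1 = single video, 0 = whole playlist
  playlist_url = "https://www.youtube.com/playlist?list=..."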


You could ask ChatGPT for help, too!


Yes. But it can be wrong, especially for key details. So instead, I asked here. My hope is to keep friction down and productivity up.


Didn't some well-known AI researcher create a compact C++ version of Whisper, which was posted here some time ago?


Yes, this uses a variant of that called faster-whisper.


This is nice! I like how you've built upon your previous YouTube transcript cleaner project and put together something really compelling.

Have you thought about spinning this together as a Chrome extension?


Thanks. The HTML cleaning tool could easily be a Chrome extension, but I don't see how you could have a Python project with a bunch of heavy binary wheels as part of an extension. But I can't say I know much about it.


FYI, one secret in the world of automatic translation is "the better the transcription the better the translation."

Seems obvious, right?

We've been looking at Whisper to replace AWS Transcribe. It could be that AWS will just roll Whisper in as an engine at some point before we get around to it.


Out of curiosity, how does it compare to YouTube’s own generated transcripts?


I would say that overall they are much, much better than the auto-generated ones from YouTube. If the speaker speaks incredibly clearly and slowly, without slang, etc., then the built-in ones are good enough. But in a tougher situation, the biggest Whisper model achieves near-superhuman accuracy -- way better.


I found the YouTube one to be really bad for my voice and the way I speak - Whisper does it perfectly (using the large model).

English is my second language, and I mumble.


While it seems YouTube's auto-generated transcripts are hit or miss, I wonder if feeding them through an LLM could fix the mistakes and still get the video's idea out of them.


I've found that to be the case. I typically don't want a full transcript -- I want the materials list, or a summary, or a counterargument. I've found it is totally sufficient to just plop the transcript into an LLM and ask for my desired output. No need to clean up the transcript ahead of time.


Whisper is generally better than the built-in one on YouTube.


I could've sworn I had seen a Google/YouTube announcement somewhere that readable/searchable transcripts were coming. Is that what's already been rolled out? The current YouTube transcripts seem almost useless to me: they're limited to a small part of the screen real estate, and seem only searchable using the full-page search built into web browsers.


In the description of videos I see a "Show Transcript" button. I believe this is automatically generated. It's not grouped into sentences though, so there is room for improvement.


These are the automatic transcripts generated by YouTube mentioned by other commenters. Their accuracy leaves something to be desired compared to Whisper based transcripts.


Right, but it opens up a small page element that isn't very aesthetically pleasing to read from, and feels fairly useless.


That’s true. But I coincidentally also made another mini project a few months ago for making those built-in transcripts much nicer to read from. You may find it useful:

https://github.com/Dicklesworthstone/youtube_transcript_clea...

Now that I think about it, this would work equally well for the transcripts generated by my new tool. I should just include that html file in my new repo as an added feature.


I also wrote a web app, but it adds missing punctuation and chapters to further enhance readability. The AI models run locally in your browser; no ChatGPT is used at all.

An iOS app is also coming soon, so you'll be able to listen and read while offline:

https://www.appblit.com/scribe



