I know there are various tools that are supposed to make this easy, but I couldn't find anything that did everything I wanted, so I made this today for fun. The web-based offerings all take forever and seem flaky, and you need to process one video at a time, with no control over the transcription settings. In contrast, my script lets you convert a whole playlist in bulk with full control over everything.
It's truly easy to use-- you can clone the repo, install to a venv, and be generating a folder full of high quality transcript text files in under 5 minutes. All you need to do is supply the URL to a YouTube playlist or to an individual video file and this tool does the rest automatically. It uses faster-whisper with a high beam_size, so it's a bit slower than you might expect, but this does result in higher accuracy. The best way to use this is to take an existing playlist, or create a new one on YouTube, start this script up, and come back the next morning with all your finished transcripts. It attempts to "upgrade" the output of whisper by taking all the transcript segments, gluing them together, and then splitting them back into sentences (it uses Spacy for this, or a simpler regex-based function). You end up with a single text file with the full transcript all ready to go for each video in the playlist, with a sensible file name based on the title of the video.
If you have CUDA installed, it will try to use it, but as with all things CUDA, it's annoyingly fragile and picky, so don't be surprised if you get a CUDA error even if you know for a fact CUDA is installed on your system. If you're looking for reliability, disable CUDA. But if you need to transcribe a LOT of transcripts, it does go much, much faster on a GPU.
Even if you don't have a GPU, if you have a powerful machine with a lot of RAM and cores, this script will fully saturate them and can download and process multiple videos at the same time. The default settings are pretty good for that situation. But if you have a slower machine, you might want to use a smaller Whisper model (like `base.en` or even `tiny.en`) and dial down the beam_size to 2.
It makes a great difference to have transcripts with speaker annotation.