Show HN: Bulk Creation of Transcripts from YouTube Playlists with Whisper

tehlike · on Nov 13, 2023

You might want to look into diarization also http://gladia.io/ seem to be doing it well.

It makes a great difference to have transcripts with speaker annotation.

eigenvalue · on Nov 13, 2023

Thanks, I haven’t seen an easy and reliable way to do this using open source stuff yet. Theoretically just separating out speakers seems like it wouldn’t be that hard; just compute a bunch of FFTs to arrive at a sort of frequency-based “voice fingerprint” for each speaker and then use something like XGboost to match up the audio for each second to one of the speakers. The problem is then what do you with that information? Turning those abstract speaker identifications into actual names would seem to require a fair bit of intelligence and picking up from contextual clues (like if the speaker identifies themselves or introduces another person by name). Anyway, I’ll look into it more. If it could be done reliably without overly complicating the setup, I agree that it would be useful.

FanaHOVA · on Nov 13, 2023

I built it for our podcast, I'm sure you can re-use it for YT as a source rather than raw audio file: https://github.com/FanaHOVA/smol-podcaster

jimmySixDOF · on Nov 13, 2023

The whisperX project has most of it covered they are integrating the new v3 model waiting for a ccp GGUF version also there is a workaround for the latest speaker diarization but it has an active user base working on it

https://github.com/m-bain/whisperX

eigenvalue · on Nov 13, 2023

Thanks, took a look at it. Seems quite heavy though, lots of huge dependencies like pytorch and torchaudio, and seems like the speaker diarization requires a GPU if I'm not mistaken. And as another poster pointed out, it does require a Huggingface API key as well.

I wanted to keep my script lighter weight and also GPU optional (i.e., a GPU will work and make it faster, but it also works acceptably with just the CPU). I really feel in my gut that the speaker diarization doesn't need to be so complicated or hard once you already have the accurate timestamps of each transcribed segment and the underlying audio file-- no reason why it shouldn't be able to run fine on a CPU and get good enough accuracy.

noman-land · on Nov 13, 2023

I was dismayed to learn that this requires OpenAI API keys for speaker diarization.

vsnf · on Nov 13, 2023

Does it? I only see references to HuggingFace api keys so it can do a one time download of some additional models.

noman-land · on Nov 13, 2023

Oh, maybe that's what I saw. I'll have to look again. Thanks for keeping me honest.

tehlike · on Nov 13, 2023

That part can be user input, if it needs to be. Sort of like post processing.

Founder of Gladia shares some information on twitter, and i think there's some research papers you can find through that (my memory is fuzzy on when). IT's not a simple problem, especially when words include space fillers like "umm" etc.

For me main use for such cases would be podcast. sometimes i just want to read them without listening.

pavel_lishin · on Nov 16, 2023

> Turning those abstract speaker identifications into actual names would seem to require a fair bit of intelligence

As someone looking for this functionality, this is the easiest part for me to do manually. Just give me "Speaker A, Speaker B, Speaker C", and I can change their names. It's the breaking apart of audio into separate speakers that's difficult - I'm trying out a few different tools, and none of them do a great job so far.

huac · on Nov 13, 2023

to zoom out a bit, this is known as the 'cocktail party problem' and is very much considered unsolved (particularly when there is an unknown number of speakers or overlapping speech).

BetterWhisper · on Nov 13, 2023

Wow, why are they so expensive? Like even the regular whisperAPI by OpenAI is less expensive.

This is also why I decided to create https://www.betterwhisperapi.com/ . I believe most of the companies are charging pretty insane amounts for transcriptions...

tornato7 · on Nov 13, 2023

Deepgram is really good and around your price point too. They also have $200 free credit which should be more than enough for most hobby protects.

skeptrune · on Nov 13, 2023

This is awesome man. We attempted to build something similar and wound up giving up and pivoting to transcripts w/ a punctuation model to enhance them.

If this was around at the time, we likely would have been able to make audio work.

Kudos for your work on this. Seems truly well architected and thought out. The spacy integration is especially awesome.

eigenvalue · on Nov 13, 2023

Thanks! I spent a decent amount of time messing around with regex nonsense before I realized Spacy could work for this. I decided to leave the regex approach in as an option anyway since it still works reasonably well and is lighter weight.

Demcox · on Nov 13, 2023

Hoping this can help me cut down the time I need to use on watching YT videos for uni. Outputting 20-30 mins into a .txt and feeding it to ChatGPT for summarizing. Thanks!

evanreichard · on Nov 13, 2023

Coincidentally I threw something together this weekend that attempts to do just that. [0] It's really simple - just extracts subtitles and feeds it to ChatGPT to generate a markdown "article".

[0] https://vreader.va.reichard.io/

[1] https://gitea.va.reichard.io/evan/VReader

jdthedisciple · on Nov 14, 2023

This is interesting. I assume the samples on the website you linked are all precomputed outputs, right?

evanreichard · on Nov 14, 2023

Well it's live, so you can throw a YouTube URL into it right now and get an auto generated article from GPT-3.5 in about 15 seconds. The items on the site right now are myself / others using it. Depends on the video length, but each generated article costs about $0.005 in API usage.

Edit: Ah just understood your question (coffee just kicked in). Yes, they're precomputed from when others used it. I save the generated articles in markdown format and sort by most recently generated.

jdthedisciple · on Nov 13, 2023

Not sure why people are downvoting.

I can confirm for some types of lectures this is a wholly legitimate approach.

ldenoue · on Nov 14, 2023

If you want to try scribe I developed last week, it adds punctuations from the raw YouTube transcripts so you can read them more easily. It also adds chapters every 3 paragraphs to more easily skim the content. All runs in your browser using 2 models https://www.appblit.com/scribe

chiefalchemist · on Nov 13, 2023

Not for uni but I was thinking similar. Find or create a playlist, have this tool transcribe it, feed all that into ChatGPT or similar, have to output a summary, either brief (i.e., highlights) or full depth & breadth of all the videos.

chiefalchemist · on Nov 13, 2023

I don't have any experience with Python. Can someone point me to definitive (and idiot-proof) tutorials for Win 10 Pro and Mac OS? Maybe using Docker?

I just don't know but am willing to try. I'd rather ask here than be subjected to a search and its SEO hell (read: SERPs of questionable results).

Or perhaps there's a way to use Digial Ocean or similar so I'm not tying up my local machine?

eigenvalue · on Nov 13, 2023

Just install Anaconda:

https://www.anaconda.com/download

For windows or Mac. If you do it on Windows it’s probably easier if you let it add it to your Windows system path. Then you should be able to open PowerShell and type “python” and not get an error. Once you can do that, just run these commands on Windows:

  git clone https://github.com/Dicklesworthstone/bulk_transcribe_youtube_videos_from_playlist
  cd bulk_transcribe_youtube_videos_from_playlist
  python -m venv venv
  cd venv
  .\Scripts\activate  
  python -m pip install --upgrade pip
  python -m pip install wheel
  cd ..
  pip install -r requirements.txt

Then edit the file " bulk_transcribe_youtube_videos_from_playlist.py" to modify the URL (if you want a playlist, edit that URL and change the `convert_single_video = 1` part to be 0 instead) and finally run it with:

  python bulk_transcribe_youtube_videos_from_playlist.py

mkmk · on Nov 13, 2023

You could ask ChatGPT for help, too!

chiefalchemist · on Nov 13, 2023

Yes. But it can be wrong, especially for key details. So instead, I asked here. My hope is to keep friction down and productivity up.

vfclists · on Nov 13, 2023

Didn't some well known AI researcher create a compact version of Whisper based on C++ which was posted here some time ago?

eigenvalue · on Nov 13, 2023

Yes, this uses a variant of that called faster-whisper.

josephrmartinez · on Nov 14, 2023

This is nice! I like how you've built upon your previous youtube transcript cleaner project and put together something really compelling.

Have you thought about spinning this together as a Chrome extension?

eigenvalue · on Nov 14, 2023

Thanks. The html cleaning tool could easily be a chrome extension, but I don’t see how you could have a python project with a bunch of heavy binary wheels as part of an extension. But I can’t say I know much about it.

mannyv · on Nov 14, 2023

FYI, one secret in the world of automatic translation is "the better the transcription the better the translation."

Seems obvious, right?

We've been looking at whisper to replace AWS transcribe. It could be that AWS will just roll whisper in as an engine at some point before we get around to it.

barefeg · on Nov 13, 2023

Out of curiosity, how does it compare to YouTube’s own generated transcripts?

eigenvalue · on Nov 13, 2023

I would say that overall they are much, much better than the auto generated ones from YouTube. If the speaker speaks incredibly clearly and slowly, without slang, etc, then the built in ones are good enough. But in a tougher situation, the biggest whisper model achieves near superhuman accuracy— way better.

kawsper · on Nov 13, 2023

I found the YouTube one to be really bad for my voice and the way I speak - Whisper does it perfectly (using the large dataset).

English is my second language, and I mumble.

BetterWhisper · on Nov 13, 2023

While it seems YouTube's auto-generated are hit or miss, I wonder if feeding them through an LLM can fix the mistakes and still get the video's idea out of them

josephrmartinez · on Nov 14, 2023

I've found that to be the case. I typically don't want a full transcript -- I want the materials list, or a summary, or a counterargument. I've found it is totally sufficient to just plop the transcript into an LLM and ask for my desired output. No need to clean of the transcript ahead of time.

tsurba · on Nov 13, 2023

Whisper is generally better than the one in youtube.

jfoster · on Nov 13, 2023

I could've sworn I had seen a Google/YouTube announcement somewhere that there was going to be readable/searchable transcripts coming. Is it what's already been rolled out? The current YouTube transcripts seem almost useless to me; limited to a small part of the screen real estate, and seem only searchable using the full page search built into web browsers.

fy20 · on Nov 13, 2023

In the description of videos I see a "Show Transcript" button. I believe this is automatically generated. It's not grouped into sentences though, so there is room for improvement.

eigenvalue · on Nov 13, 2023

These are the automatic transcripts generated by YouTube mentioned by other commenters. Their accuracy leaves something to be desired compared to Whisper based transcripts.

jfoster · on Nov 13, 2023

Right, but it opens up a small page element that isn't very aesthetically pleasing to read from, and feels fairly useless.

eigenvalue · on Nov 13, 2023

That’s true. But I coincidentally also made another mini project a few months ago for making those built-in transcripts much nicer to read from. You may find it useful:

https://github.com/Dicklesworthstone/youtube_transcript_clea...

Now that I think about it, this would work equally well for the transcripts generated by my new tool. I should just include that html file in my new repo as an added feature.

ldenoue · on Nov 14, 2023

I also wrote a web app but it adds missing punctuations and chapters to further enhance readability. The AI models run locally in your browser, no ChatGPT used at all.

An iOS app is also coming soon so you’ll be able to listen and read while offline

https://www.appblit.com/scribe