It would be really neat if after pulling the captions, an LLM was used to reword the content into an idiomatic "blogpost" (since speech is typically different than writing). Using LLMs, we could even choose the level of summarization and the output tone!
As someone who strongly prefers reading to watching instructional videos, I'd pay for this service :)
I made something sorta like that specific to recipe videos. Basically converts recipe into an idiomatic format (inlines ingredients, detects and renders timers) and links each step in the recipe to its timestamp in the video for easy indexing while you're busy in the kitchen. (I spent too much time trying to scrub to that one spot where "how it's supposed to look" is shown while busy making it look that way)
Cool website. Much better than the SEO spam I came across earlier this week when I did a websearch for "pear qwerty horse" after seeing it in the tags under a binging with babish video.
Love the timers and jumping to sections of the video. Though, the second video I tried viewing didn't have linked steps.
The hyperlink "Food Wishes" at the top of the page is broken. It'd be nice too if there was a way on that page to request a new recipe (via video ID or whatever).
As a way to make that easier, maybe it would be nice to support a user-specified set of timestamps? Say recipe A: 0:00-7:46, recipe B: 7:47-15:33, and so on.
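For what it's worth, parsing that kind of spec is small. Here is a minimal sketch, assuming a comma-separated "label: start-end" format like the one in the comment above. The function names and the exact input format are made up for illustration and are not part of any existing site.

```python
import re

def to_seconds(stamp: str) -> int:
    """Convert 'M:SS' or 'H:MM:SS' into a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

# Hypothetical input format: "<label>: <start>-<end>", comma-separated.
SEGMENT = re.compile(
    r"\s*(?P<label>[^:,]+?)\s*:\s*(?P<start>[\d:]+)\s*-\s*(?P<end>[\d:]+)\s*"
)

def parse_segments(spec: str) -> list[tuple[str, int, int]]:
    """Return (label, start_seconds, end_seconds) for each segment in spec."""
    segments = []
    for chunk in spec.split(","):
        match = SEGMENT.fullmatch(chunk)
        if match is None:
            raise ValueError(f"cannot parse segment: {chunk!r}")
        start = to_seconds(match["start"])
        end = to_seconds(match["end"])
        if end <= start:
            raise ValueError(f"segment ends before it starts: {chunk!r}")
        segments.append((match["label"], start, end))
    return segments
```

Each tuple could then be used to cut the transcript (and the step links) per recipe.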
Except for the complaining GPT will do, and some censorship based on the whims of its programming group. No thanks; I'll stick to scripts, where the video dictates the content.
It seems that many "my script can do [something] with [information in a different form]" projects can be superseded by LLMs already or in the near future, and the quality is way better than what the scripts are capable of.
I just wonder what the price of this is. I can run most of these scripts on an old laptop. But for the LLM I need a pricey and beefy computer or (even worse) a paid subscription to a big tech company's service.
Step right up, folks! Gather around and feast your eyes on the magnificent creatures before us – the elephants! Now, what makes these majestic beings so fascinating, you ask? Well, let me tell you – it's all about their incredibly, unbelievably long... um, trunks! Yes, you heard that right. These gentle giants sport trunks that seem to stretch on for ages, and let me tell you, it's nothing short of impressive. So, as we stand here in awe of these marvelous creatures, remember, it's the little (or should I say, not-so-little) things like their remarkable trunks that make them truly stand out. And that, my friends, wraps up the lowdown on our pachyderm pals – fascinating trunks and all!
Is a scripted video significantly different to a written blogpost? It might be a symptom of the type of YT videos I watch, but most of them seem to be essay-style "intro/thesis/points 1, 2, 3/counterpoint/conclusion", and the only thing that hints at speech is the umming-and-arring of the presenter.
"Former chief-of-staff, Mark Meadows asking a federal judge to put his surrender on hold, while deciding whether to move his trial to federal court, and former DOJ official Jeffrey Clark, seeking the same, making a pretty remarkable argument in his filing."
That's someone doing a sort of play-by-play explanation of what viewers are seeing in a video. Compare to a purposefully written story:
"A federal judge in Georgia rejected a request by former White House chief of staff Mark Meadows to postpone his surrender and arrest in Fulton County, Georgia, as an attempt to move the case to federal court is litigated, according to a court order issued Wednesday."
It seems like there could be some value in an LLM that would rewrite the first into something more like the second.
Can I self-promote here? We are not doing exactly the same thing, but we are transcribing videos ourselves (no auto YT captions). If you want to read a high-quality transcript and summarize videos, you can do that at https://alphy.app
On the YouTube website, if you click the "・・・" button next to share/clip/save, there is a "Show transcript" option and you can use your browser's in-page search to search in it.
There’s a website designed for language learning from watching YouTube captions with inline translations and dictionary lookup. It also has support for searching videos by subtitle content. But it has a limited index and isn’t free for all features. I thought its source was available but I can’t find it now…
https://languageplayer.io/
Searching transcripts is really something YouTube itself should be doing as part of just regular search (and fed into Google search too). I have a feeling the regular search already does it to some extent, as the system presumably is tagging videos based on its caption extraction. However, that would only apply to somewhat broad topics, not specific combinations of words and the matching text is not surfaced in the UI.
I wanted to try it, but when I press ENTER, it asks me to register... I click cancel and notice the PRICING page. I click on it and again it asks for a login. That is NOT how one onboards users.
The death of creative thinking is management and processes. Assembly line work doesn't permit creativity. Google is heading the same route as IBM and other once-great corporations made by creative people. When the discussion shifts from ideas to processes and trivialities, it's game over. Long live whatever replaces Google. It's over.
There is a term in business for this I can’t remember. Where basically your money comes from business process X and so you protect X at all costs. Which makes it very difficult to innovate away from supporting X. It’s almost impossible for a company to pivot to making their money from Y. You see this classically in things like Sears, Woolworths etc. unable to keep up with the times.
The solution to this seems to be to basically start a new company inside your company, more or less completely separate from your core business. Meta seems to have initially done this very well with their pivot from Facebook to Instagram. Perhaps less great in their metaverse pivot but we’ll see. Google set themselves up for success at this with Alphabet, but I don’t think we’ve really seen them find something they feel they can really pivot into yet, so business X continues to be the focus.
Hidden away yes, but it’s amazing. It creates a totally new way of consuming informational videos and everyone here should try it. You can scroll through the transcription vertically and tap on any bit and the video instantly jumps to that point - basically the transcription is the new seek bar. And about 1000x better at the job. No more skipping back to somewhere roughly where you stopped paying attention and just letting it play for a bit until you get back into the thread - now you can jump around with surgical precision. It’s like how you can easily skim back and forth in a text article, but with a video. It’s a total game changer and I find it bizarre that it’s so hidden away so most people won’t find it. It also works particularly well when casting from your phone to a TV, using the phone as the navigator. Oh and the transcriptions are about 99% accurate, which is good enough for me.
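The "transcript as seek bar" idea is easy to prototype outside the app too. A minimal sketch, assuming you already have the captions as (start time, text) cues: the `Cue` type and function names are hypothetical, and `VIDEO_ID` is a placeholder. The only YouTube-specific piece is the `t=<seconds>s` URL parameter for jumping to a position.

```python
from dataclasses import dataclass

@dataclass
class Cue:
    start: float  # seconds from the start of the video
    text: str     # caption text shown at that time

def find_cues(cues: list[Cue], query: str) -> list[Cue]:
    """Return every cue whose text contains the query, ignoring case."""
    needle = query.casefold()
    return [cue for cue in cues if needle in cue.text.casefold()]

def seek_url(video_id: str, cue: Cue) -> str:
    """Build a watch URL that starts playback at the cue's timestamp."""
    return f"https://www.youtube.com/watch?v={video_id}&t={int(cue.start)}s"
```

Usage would be something like `[seek_url("VIDEO_ID", c) for c in find_cues(cues, "whisk")]` to get a clickable list of every place a word is spoken.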
That’s really interesting - on my own website I’ve extracted the captions to my videos and I was thinking of wiring it up so you could navigate the videos. I may actually get round to doing this now.
My friend and I made something similar a few years ago as a college hackathon project - it features automatic scene transition detection and a rough editor before publishing the final results.
(The demo site is down, but you can clone the repo and run the code locally)
This is actually potentially helpful for me as a lawyer for generating a paper record, and something I was talking about (and meaning to write up a script for) the other day. Sometimes I want to use a Youtube video in a court filing (for example, as prior art in a patent case), and submitting a rough paper record of the video like this is helpful along with the actual video.
This is interesting. I think the scenario it should be used in is for non-subtle messages, such as sarcasm. I gave it a try with KRAZAM's video and the answer is hilarious when you consider the video intended exactly the opposite.
> In "The Hustle," the narrator shares their jam-packed daily routine that exemplifies their dedication to productivity. From an early morning workout to late-night preparations for the next day, their schedule is filled with various activities. They efficiently manage their time, incorporating work, social media updates, and even a well-deserved happy hour. The narrator's commitment to self-improvement is also evident through their habit of reading before bed and tweeting inspiring quotes. Overall, this video highlights the narrator's hustle and structured approach to maximizing their day.
I just had to write a research report for a funding agency with many subprojects and could not get one input, so I took a short video from a pitch presentation, converted it first to captions using speech-to-text, and then to a research summary. It was really impressive.
I wonder if AI is soon going to be able to pull off the multimedia future I dreamed about as a kid, where no image is just an image, no video is just a video, no text is just text.
Cool to see some Perl in a new repo. The thing I find most appealing about Perl is their backticks for going out to the shell. I know Ruby supports it too. I understand that this can be a security issue but I find the quick shelling out so powerful for a script that also needs to do some text processing or real programming logic that the shell just isn't great for.
Watch out for portability, too. Personally I would consider it a last resort exclusively for local system automation too complex for bash but for some reason python isn't available. I'm not going to pretend I understand everyone else's use case though so I'm glad it exists.
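On the security point: the risk with backtick-style shelling out is that interpolated text gets parsed by the shell. Not Perl, but here is a rough Python sketch of the same "run a command, capture its output" idea that passes arguments as a list so no shell is involved. `run_capture` is a made-up helper name, and the child command is just the current Python interpreter so the example stays portable.

```python
import subprocess
import sys

def run_capture(args: list[str]) -> str:
    """Run a command without a shell and return its stdout as text."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout

# A title containing shell metacharacters is passed through as plain data,
# because each list element becomes exactly one argument of the child process.
title = "my video; rm -rf ~"
output = run_capture(
    [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", title]
)
```

The trade-off is losing pipes and globbing, which then have to be done in the script itself.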
Author here: This was indeed a quick hack. And Perl is still my fastest language for banging out a quick prototype. (I used to be the Perl 5 project lead. I’ve spent a lot of time with Perl.)
It seems "trivial" enough to detect a total shot change, and then in cases of things like fast-moving action sequences you just collapse all short shots into a single longer shot (e.g. all shots 5 seconds or less get collapsed together with any neighboring shot 5 seconds or less).
And then pick a single frame from the exact middle, or else the most "still" frame that shows the least change from neighboring ones.
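The collapsing step described above can be sketched like this, assuming shot detection has already produced a list of (start, end) times in seconds. How the shot changes are detected is not shown, the function names are hypothetical, and only the "exact middle" option is covered, not the "most still frame" one.

```python
Shot = tuple[float, float]  # (start_seconds, end_seconds)

def collapse_short_shots(shots: list[Shot], max_short: float = 5.0) -> list[Shot]:
    """Merge each run of consecutive short shots into one longer shot."""
    merged: list[Shot] = []
    prev_was_short = False
    for start, end in shots:
        is_short = (end - start) <= max_short
        if is_short and prev_was_short:
            # Extend the shot being built from the previous short shot(s).
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
        prev_was_short = is_short
    return merged

def middle_timestamp(shot: Shot) -> float:
    """Timestamp of the frame in the exact middle of the shot."""
    start, end = shot
    return (start + end) / 2
```

Note that a lone short shot between two long ones stays on its own here, since it has no short neighbor to merge with.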
FYI this isn't AI-powered or anything. It's not even doing speech-to-text synthesis. It just uses yt-dlp (youtube-dl) to download the video with its existing [auto-generated] subtitles from YouTube.
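For anyone curious what that step looks like, here is a minimal sketch of calling yt-dlp for captions from a script. It is not the tool's actual code: it is Python rather than Perl, the function names are invented, and it fetches only the subtitles (`--skip-download`), whereas the tool also downloads the video.

```python
import subprocess

def build_subtitle_command(url: str, lang: str = "en") -> list[str]:
    """Build a yt-dlp command line that fetches captions for one video."""
    return [
        "yt-dlp",
        "--skip-download",    # do not download the video itself
        "--write-auto-subs",  # include YouTube's auto-generated captions
        "--sub-langs", lang,  # which caption language(s) to fetch
        "--sub-format", "vtt",
        url,
    ]

def fetch_subtitles(url: str, lang: str = "en") -> None:
    """Run yt-dlp; requires yt-dlp on PATH and network access."""
    subprocess.run(build_subtitle_command(url, lang), check=True)
```

The resulting .vtt file is plain text with timestamps, which is why no speech recognition is needed on your side.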
I'm working on adding this functionality to my iOS/macOS Japanese reading app, Manabi Reader. However since my app is essentially a web browser, it adds this functionality via userscripts instead of "youtube-dl" which Apple doesn't like.
Finally the web could be searchable again. If Google is clever, they should pay big bucks for this as a last resort before getting strangled by Microsoft's most important investment.
I can usually read 10 minutes of spoken content in less than 3 minutes. Accompanying video is often useless, unless there are specific diagrams or illustrations/photos. I don't need to see someone's face moving to absorb the info. Probably lots of people out there with similar gripes.
I didn't create the tool, so I can't claim to know the author's intentions. To me, this would be very useful in a variety of circumstances:
- the 15 minute video containing 3 minutes of useful information
- tutorials that you're trying to follow step-by-step that have been tightly edited such that actually doing each step while following along is impossible
- the video equivalent of listicles in which you're really only interested in the list, not the padding
- quickly getting an idea of whether or not it's worth your time to watch a lengthy video by quickly scrubbing through the content
Cool idea, but it's still pretty minimal and hard to read. My tool YOU-TLDR lets you search through the transcript in your language of choice, side by side with the video: https://www.you-tldr.com/