There are actually several scripts: burn the subtitles into the movie as hard-subs, extend the subtitles by 1 second, make clips of each subtitle, make headings, and combine the clips with the headings.
These are the rough notes I made at the time (you can skip the Pingtype steps if you're not trying to make bilingual language-learning material).
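For anyone wondering what the "extend the subtitles by 1 second" step might look like, here's a minimal Python sketch. It is not the original script: the function names (`to_ms`, `to_stamp`, `extend_srt`) are made up for illustration, and it assumes a plain .srt file where timing lines look like `00:00:01,500 --> 00:00:03,200` with nothing after the end time.

```python
import re

ARROW = " --> "
# SRT timestamps look like HH:MM:SS,mmm (some tools write a dot instead of a comma).
TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_ms(stamp):
    """Convert 'HH:MM:SS,mmm' to integer milliseconds."""
    h, m, s, ms = map(int, TIME_RE.fullmatch(stamp.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def to_stamp(ms):
    """Convert integer milliseconds back to 'HH:MM:SS,mmm'."""
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def extend_srt(text, extra_ms=1000):
    """Push the end time of every cue later by extra_ms; start times are untouched.

    Assumption: only timing lines contain ' --> '.
    """
    out = []
    for line in text.splitlines():
        if ARROW in line:
            start, end = line.split(ARROW)
            line = f"{start.strip()}{ARROW}{to_stamp(to_ms(end) + extra_ms)}"
        out.append(line)
    return "\n".join(out)
```

Note that this can make a cue overlap the next one when the gap between them is under a second; whether that matters depends on what you do with the clips afterwards.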
Here's my attempt at building something for language learning since my listening skills trail so far behind my reading skills: https://www.danneu.com/slow-spanish/
Unfortunately it's really hard to generate the source material (timestamping a transcript).
So my idea was to upload some slow-speaking audio to YouTube and let it auto-generate .srt subtitle files. The subtitles don't come out perfectly, but it's the timestamp data I'm after, since the goal is a UI that makes it easy to replay and scrub around spoken audio.
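To make the "it's the timestamp data I'm after" part concrete, here's a rough Python sketch of pulling cues out of an .srt file and finding which cue to replay for a given playback position. The names `parse_srt` and `cue_at` are hypothetical, and it assumes the usual .srt layout (index line, timing line, text lines, blank line between cues); it is not how the linked site is actually implemented.

```python
import re
from bisect import bisect_right

# Matches an SRT timing line such as "00:00:01,500 --> 00:00:03,200".
CUE_RE = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _ms(h, m, s, ms):
    """Convert timestamp parts (strings) to integer milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(text):
    """Return a list of (start_ms, end_ms, text) tuples, in file order."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_RE.search(line)
            if match:
                g = match.groups()
                caption = " ".join(lines[i + 1:]).strip()
                cues.append((_ms(*g[:4]), _ms(*g[4:]), caption))
                break
    return cues


def cue_at(cues, t_ms):
    """Return the latest cue that starts at or before t_ms, or None.

    Assumes cues are sorted by start time, as they normally are in an .srt file.
    """
    starts = [cue[0] for cue in cues]
    i = bisect_right(starts, t_ms) - 1
    return cues[i] if i >= 0 else None
```

A "replay this sentence" button would then just seek the player back to the start time of `cue_at(cues, current_position)`.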
Using YouTube to generate the timestamps is a really good idea!
I'm manually recording timestamps while I read/listen to the Bible, verse by verse. Every time I click pause in Pingtype's Media Viewer, it logs the time. It's painstaking, but I'm trying to study each verse while I read anyway, so it's good that it makes me pause regularly.
There's a lot of LRC data for songs that are used in KTV/Karaoke. You just need to find a good data source for Spanish. In my opinion, listening to music and singing along in church helped my Chinese much more than textbooks. I still lack confidence speaking, but my listening improved a lot when my regular playlist became majority-Chinese (I listen to iTunes all day).
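If you go the LRC route, the format is simple enough to parse by hand: each lyric line is prefixed with one or more `[mm:ss.xx]` tags. Below is a small Python sketch, with a hypothetical `parse_lrc` function, that assumes the common "simple" LRC layout; it skips metadata tags like `[ti:...]` and does not handle the enhanced word-level variant.

```python
import re

# Time tags look like [mm:ss.xx]; the fractional part is optional.
TAG_RE = re.compile(r"\[(\d+):(\d{2})(?:[.:](\d{1,3}))?\]")


def parse_lrc(text):
    """Return a list of (start_ms, lyric) tuples sorted by time.

    A line with several time tags (a repeated chorus, say) yields one
    entry per tag. Lines without a time tag, such as [ar:...] metadata,
    are ignored.
    """
    entries = []
    for line in text.splitlines():
        tags = list(TAG_RE.finditer(line))
        if not tags:
            continue
        lyric = TAG_RE.sub("", line).strip()
        for tag in tags:
            minutes, seconds, frac = tag.groups()
            # ".5" and ".50" both mean 500 ms, so right-pad to three digits.
            frac_ms = int((frac or "0").ljust(3, "0"))
            start_ms = (int(minutes) * 60 + int(seconds)) * 1000 + frac_ms
            entries.append((start_ms, lyric))
    entries.sort(key=lambda entry: entry[0])
    return entries
```

LRC only gives you start times, so the end of each line has to be inferred, for example from the start of the next entry.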
I wrote a script that cuts out clips of every sentence spoken, and builds them into example sentences to learn Chinese.
https://www.youtube.com/playlist?list=PLhIooD7mFhphhT5nDdhK0...
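As a rough idea of how the "cut out a clip of every sentence" step could work, here's a Python sketch that turns subtitle cues into ffmpeg commands. This is a guess at the general approach, not the actual script: `clip_commands`, the output naming scheme, and the cue format (start and end in milliseconds plus text) are all assumptions. It only builds the argument lists; you'd pass each one to `subprocess.run` to actually cut the clips.

```python
def clip_commands(source, cues, out_dir="clips"):
    """Build one ffmpeg command per cue.

    cues: iterable of (start_ms, end_ms, text) tuples (hypothetical format).
    Uses -ss before -i to seek in the input and -t for the clip duration.
    No '-c copy' is passed, so ffmpeg re-encodes; that is slower, but the
    cut points are not limited to keyframes.
    """
    commands = []
    for index, (start_ms, end_ms, _text) in enumerate(cues, start=1):
        duration_ms = end_ms - start_ms
        if duration_ms <= 0:
            continue  # skip empty or malformed cues
        commands.append([
            "ffmpeg",
            "-y",                                 # overwrite existing output
            "-ss", f"{start_ms / 1000:.3f}",      # seek to the cue start
            "-i", source,
            "-t", f"{duration_ms / 1000:.3f}",    # keep only the cue's duration
            f"{out_dir}/clip_{index:04d}.mp4",
        ])
    return commands
```

For example, `clip_commands("movie.mp4", [(1500, 3200, "Hola")])` gives a single command that seeks to 1.500 s and keeps 1.700 s of video. The output directory has to exist before the commands are run.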