Timeline- and frame-based editing is the end result, but this is more about the elemental creation and, from there, editing those elements into a time-based scene.
I've spent the last few years in the cinematography department, and most directors and directors of photography will write the scenes on flash cards, then move them around and rearrange them, because not every story is linear, even though we have to find a way to present every story in a linear form. From there, each scene requires multiple angles, shots, and motivations, and things change so much on the fly that a screenplay becomes a document dense with non-presented information.
So, I suppose the next step would be to parse a bunch of screenplays from different formats into a single readable format, then train an image model on the frames of the same movies whose screenplays we trained the text model on, to get a cross-reference of what is written down vs. what is displayed visually. We could also break down the visual shots by camera movement (Steadicam, dolly move, etc.) and maybe identify key props in the image model (sounds expensive) and compare them to key props in the script. I don't know, I'm spitballing now, but a multi-modal Hollywood film producer would be kind of fun. Really, though, this is just starting as a way to standardize the script in a granular form, and as something to code since I'm not out on set.
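A minimal sketch of that first parsing step, assuming Fountain-style plain text as input. All the names here are mine; real screenplays arrive as Final Draft XML, PDF, Fountain, etc., so each format would need its own front end feeding a common structure like this:

```python
import re
from dataclasses import dataclass, field

@dataclass
class Scene:
    heading: str                       # e.g. "INT. KITCHEN - NIGHT"
    characters: list = field(default_factory=list)
    action: list = field(default_factory=list)

# Fountain-style scene headings start with INT., EXT., or INT/EXT.
SCENE_HEADING = re.compile(r"^(INT/EXT|INT|EXT)[.\s]", re.IGNORECASE)

def parse_screenplay(text: str) -> list:
    scenes = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if SCENE_HEADING.match(line):
            scenes.append(Scene(heading=line.upper()))
        elif scenes and line.isupper() and len(line.split()) <= 3:
            # Crude heuristic: short all-caps lines are character cues.
            if line not in scenes[-1].characters:
                scenes[-1].characters.append(line)
        elif scenes:
            scenes[-1].action.append(line)
    return scenes
```

The character-cue heuristic is deliberately crude; a real parser would have to handle dual dialogue, transitions, parentheticals, and all the formatting quirks that vary between formats.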
You've given me something to think about, and I think you're right: there is an element of time missing. In fact, there are a few parallel time elements missing from this too.
For example, when we are shooting, there is a rough formula for how long a day we need to get those scenes: usually one hour per page of a scene, plus an extra 30 minutes for each character in the scene. But that doesn't translate to the final product, since that information tells us nothing about how long or how important a scene should be in the final cut.
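That rule of thumb is simple enough to pin down in code (a toy helper, not anything from an actual scheduling tool):

```python
def estimated_shoot_hours(page_count: float, num_characters: int) -> float:
    """Rule-of-thumb schedule estimate: one hour per script page,
    plus half an hour for each character in the scene."""
    return page_count * 1.0 + num_characters * 0.5

# e.g. a 3-page scene with 4 characters:
# 3 * 1.0 + 4 * 0.5 = 5.0 hours of the shooting day
```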
But it's also possible I'm getting ahead of myself here, and maybe there's another object to create that contains the scene, production, and final-product objects, instead of jamming it all into this one object.
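Something like this, maybe, where a containing object just composes the three timelines instead of one object carrying everything (all names hypothetical, purely to make the separation concrete):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScriptScene:          # what's written: the screenplay unit
    number: str
    page_count: float
    characters: list

@dataclass
class ProductionRecord:     # how it was shot: schedule, setups, location
    scheduled_hours: float
    location: str
    shots: list             # camera setups, moves, etc.

@dataclass
class CutScene:             # what survived the edit: final-product timing
    runtime_seconds: float
    order_in_film: int

@dataclass
class SceneRecord:          # the containing object speculated about above
    script: ScriptScene
    production: ProductionRecord
    final: Optional[CutScene]   # None until the film is actually cut
```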
It's not clear what the idea is here, which is why all the questions. I suspect, as you said, that a much more complex data structure would be required to encode all the various aspects of production into their constituent elements and the relations between those elements.
I would guess that the first step would be to establish how the process of production (screenplay, scenes, camera angles, locations, etc.) relates to the final product: the scene frames. Scenes must also share many elements with one another, and that would have to be encoded too.
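One illustrative way to encode those relations is a small graph where scenes and production elements (locations, props, actors, camera setups) are all nodes, and the shared elements between two scenes fall out as their common neighbors. This is just a sketch of the shape of the data, not a proposed schema:

```python
from collections import defaultdict

class ProductionGraph:
    def __init__(self):
        self.edges = defaultdict(set)   # node -> set of related nodes

    def relate(self, a: str, b: str):
        # Undirected "appears in" / "shot at" relation.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def shared_elements(self, scene_a: str, scene_b: str) -> set:
        # Elements (props, locations, actors...) common to two scenes.
        return self.edges[scene_a] & self.edges[scene_b]

g = ProductionGraph()
g.relate("scene_12", "location:kitchen")
g.relate("scene_12", "prop:revolver")
g.relate("scene_47", "prop:revolver")
print(g.shared_elements("scene_12", "scene_47"))  # {'prop:revolver'}
```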