As an outsider, it sounds like the main value lies in the AI extracting detailed and accurate (but heuristic) metadata from video: audio transcriptions, text, people, environment, and objects.

Once that’s there, you can use it for organizing, searching, filtering, or whatever you want. It does not need to be coupled with an LLM-based interface.
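To make that concrete, here is a rough sketch of what I mean, not a description of how any actual product works. The `ClipMetadata` schema, its field names, and the `filter_clips` helper are all made up for illustration; the point is only that once the metadata exists, an ordinary structured filter can query it with no prompt involved.

```python
from dataclasses import dataclass, field


@dataclass
class ClipMetadata:
    """Hypothetical per-clip metadata, as an extraction model might produce it."""
    path: str
    transcript: str = ""
    people: set[str] = field(default_factory=set)
    objects: set[str] = field(default_factory=set)


def filter_clips(clips, person=None, obj=None, phrase=None):
    """Return the clips matching every criterion that was supplied.

    Plain deterministic filtering: no LLM, no prompt.
    """
    results = []
    for clip in clips:
        if person is not None and person not in clip.people:
            continue
        if obj is not None and obj not in clip.objects:
            continue
        # Case-insensitive substring match on the transcript.
        if phrase is not None and phrase.lower() not in clip.transcript.lower():
            continue
        results.append(clip)
    return results


# Made-up sample data standing in for extracted metadata.
clips = [
    ClipMetadata("a.mp4", "Welcome to the harbor tour", {"Alice"}, {"boat"}),
    ClipMetadata("b.mp4", "Let's review the budget", {"Alice", "Bob"}, {"laptop"}),
    ClipMetadata("c.mp4", "The boat leaves at noon", {"Bob"}, {"boat", "dock"}),
]

alice_clips = filter_clips(clips, person="Alice")
bob_boat_clips = filter_clips(clips, person="Bob", obj="boat")
budget_clips = filter_clips(clips, phrase="budget")
```

The same records could just as well feed a database index, a timeline UI, or a saved-search feature; the interface on top is a separate choice from the extraction underneath.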

ML models for e.g. face and object recognition have been deployed in both local and cloud-based photo organization for at least a decade. I very much welcome transformers doing a much better job, but I also very much reject the everything-is-a-prompt hammer as a solution to all problems, especially in deep, professional workflows where details matter.