It seems "trivial" enough to detect a total shot change, and for things like fast-moving action sequences you can just collapse all the short shots into a single longer shot (e.g. any shot of 5 seconds or less gets merged with any neighboring shot of 5 seconds or less).
And then pick a single frame from the exact middle, or else the most "still" frame that shows the least change from neighboring ones.
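To make that concrete, here's a rough sketch of the collapse-then-pick idea. It assumes you already have shot boundaries as (start, end) pairs in seconds and a per-frame "change" score from somewhere. The names `collapse_short_shots`, `middle_timestamp`, and `stillest_frame` are made up for illustration, and the 5-second cutoff is just the example number from above.

```python
from typing import List, Tuple

Shot = Tuple[float, float]  # (start_seconds, end_seconds), assumed contiguous and in order


def collapse_short_shots(shots: List[Shot], max_short: float = 5.0) -> List[Shot]:
    """Merge each run of consecutive short shots into one longer shot."""
    merged: List[Shot] = []
    prev_was_short = False
    for start, end in shots:
        is_short = (end - start) <= max_short
        if is_short and prev_was_short:
            # Extend the previous (short) shot to cover this one too.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
        prev_was_short = is_short
    return merged


def middle_timestamp(shot: Shot) -> float:
    """Timestamp at the exact middle of a shot."""
    start, end = shot
    return (start + end) / 2.0


def stillest_frame(diffs: List[float]) -> int:
    """Index of the interior frame that changes least from both neighbors.

    diffs[i] is an assumed change score between frame i and frame i + 1,
    so a shot with N frames has N - 1 scores.
    """
    n_frames = len(diffs) + 1
    if n_frames < 3:
        return 0  # too few frames to compare both neighbors
    return min(range(1, n_frames - 1), key=lambda i: diffs[i - 1] + diffs[i])
```

Note that a merged run can still be short overall (two 1-second cuts become one 2-second shot); whether to fold that into a long neighbor is a separate design choice.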
What I’ve wanted for a long time is to use scene detection. So instead of taking a screenshot at regular intervals, you’d take one only when the visual scene changes.
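For what it's worth, the simplest version of this is just thresholding the difference between consecutive frames. A toy sketch, assuming each frame has already been decoded and flattened into a list of grayscale pixel values (the decoding step is not shown, and the threshold of 30 is an arbitrary guess, not a tuned value):

```python
from typing import List, Sequence

Frame = Sequence[int]  # flattened grayscale pixels, 0-255, same length for every frame


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def scene_change_indices(frames: List[Frame], threshold: float = 30.0) -> List[int]:
    """Indices of frames that differ sharply from the frame before them."""
    return [
        i
        for i in range(1, len(frames))
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold
    ]
```

This only catches hard cuts. Fades, dissolves, and fast camera motion would need something smarter than a single fixed threshold.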
Maybe if the transcript were turned into paragraphs you could have a little gallery to the side for any scene changes during that paragraph.
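The paragraph-to-gallery mapping could be as simple as bucketing scene-change timestamps by each paragraph's time range. A sketch, assuming the transcript paragraphs come with (start, end) times in seconds; `gallery_for_paragraphs` is a hypothetical helper name:

```python
from bisect import bisect_left
from typing import List, Tuple


def gallery_for_paragraphs(
    paragraphs: List[Tuple[float, float]],
    scene_changes: List[float],
) -> List[List[float]]:
    """For each paragraph, list the scene-change timestamps inside [start, end)."""
    changes = sorted(scene_changes)
    galleries: List[List[float]] = []
    for start, end in paragraphs:
        lo = bisect_left(changes, start)  # first change at or after start
        hi = bisect_left(changes, end)    # first change at or after end (excluded)
        galleries.append(changes[lo:hi])
    return galleries
```

Each inner list would then drive one little gallery next to its paragraph; an empty list means no scene change happened while that paragraph was being spoken.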