Wouldn't it be hard to line up the first frames of these videos, given their different starting times and angles (ignoring camera movement)? It should be easy if the videos have synchronized timestamps, but that might not always be the case.
Any in-frame motion probably lets you align the videos to the nearest frame after the fact. This is existing technology, and it gives you timestamp-to-frame alignment.
If you are reconstructing sound, you can then fuzz the time alignments: search around the frame-level estimate for the offset that gives the maximum signal for the maximum time (uncorrelated signals damp to random noise quickly). This lets you reconstruct the time alignments pairwise.
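For what it's worth, here is a rough sketch of that pairwise step, assuming you already have two 1-D signals sampled at the same rate (say, a recovered audio trace per video). The function name `estimate_lag` and the synthetic test signals are made up for illustration. It uses plain cross-correlation via NumPy, which is one common way to do this and not necessarily what a real pipeline would use.

```python
import numpy as np


def estimate_lag(a, b):
    """Estimate how many samples b is delayed relative to a.

    Hypothetical helper: picks the lag that maximizes the
    cross-correlation of the two mean-removed signals.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    # corr[i] corresponds to lag k = i - (len(a) - 1),
    # where corr at lag k is sum_n b[n + k] * a[n].
    corr = np.correlate(b, a, mode="full")
    return int(np.argmax(corr)) - (len(a) - 1)


if __name__ == "__main__":
    # Synthetic example: two noisy views of the same underlying signal,
    # with the second view starting 37 samples earlier in the source,
    # so the shared content shows up 37 samples later in it.
    rng = np.random.default_rng(0)
    source = rng.standard_normal(2000)
    view_a = source[100:1100] + 0.1 * rng.standard_normal(1000)
    view_b = source[63:1063] + 0.1 * rng.standard_normal(1000)
    print(estimate_lag(view_a, view_b))
```

At the wrong lag the products average out toward zero, which is the "damps to random noise" part; only the true offset adds up coherently. For long recordings you would probably want an FFT-based correlation instead of the direct one used here.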
At that point, you put them all together and run your detailed analysis.
Now, I didn't say this was EASY. :) Or cheap. Or real-time.