I wonder how much more a model would learn about subtitles from including both audio AND video in training. Sure, the costs would be far higher (even parsing video deterministically is roughly 1.5 orders of magnitude more expensive than audio), but it might help with the edge cases where the speech is so unclear that even the subtitle scene can't agree on what was said.