That's a pretty specific case. You can get really good performance on a ton of video tasks (video question answering, object identification and tracking, action recognition, etc.) by sampling just one frame per second, or even less frequently. You definitely can't do that with audio.
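
For a rough sense of what "a frame per second" means in practice, here's a minimal sketch of picking which frame indices to keep. The function name and parameters are just my own illustration, not from any particular library, and it assumes a constant frame rate:

```python
import math


def sample_frame_indices(total_frames: int, fps: float,
                         samples_per_second: float = 1.0) -> list[int]:
    """Return the indices of the frames to keep when sampling a
    constant-frame-rate video at `samples_per_second` (hypothetical helper)."""
    if total_frames <= 0 or fps <= 0 or samples_per_second <= 0:
        return []
    # Number of source frames between two kept frames; never less than 1,
    # so we cannot ask for more frames than the video actually has.
    step = max(fps / samples_per_second, 1.0)
    count = math.ceil(total_frames / step)
    # Clamp to the last valid index to guard against float rounding.
    return [min(int(k * step), total_frames - 1) for k in range(count)]


# A 10-second clip at 30 fps has 300 frames; at one sample per second
# we keep 10 of them and skip the other 290.
kept = sample_frame_indices(total_frames=300, fps=30.0)
print(len(kept), kept)
```

So under those assumptions the model only ever sees about 1 in 30 frames of the clip.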