Good overview of all the parts involved! I was hoping they’d talk a little more about the timing aspects, and keeping audio and video in sync during playback.
What I’ve learned from working on a video editor is that “keeping a/v in sync” is… sort of a misnomer? Or anyway, it sounds very “active”, like you’d have to line up all the frames and carefully set timers to play them or something.
But in practice, the audio and video frames are interleaved in the file, and they naturally come out in order (ish - see replies). The audio plays at a known rate (like 44.1 kHz) and every frame of audio and video has a “presentation timestamp”, and these timestamps (are supposed to) line up between the streams.
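To make the timestamp part concrete, the comparison is basically this (a sketch in FFmpeg/libav terms; the helper is mine, but the pts field, the stream’s time_base, and av_q2d() are real). A PTS is in the stream’s time_base units, so converting both streams to seconds puts them on the same scale:

    #include <libavformat/avformat.h>
    #include <libavutil/frame.h>
    #include <math.h>

    /* Convert a decoded frame's PTS into seconds using the time_base of the
       stream it came from.  Audio and video PTS values only become directly
       comparable once they're both expressed in seconds like this. */
    static double pts_seconds(const AVFrame *frame, const AVStream *stream)
    {
        if (frame->pts == AV_NOPTS_VALUE)
            return NAN;                     /* some frames carry no timestamp */
        return frame->pts * av_q2d(stream->time_base);
    }
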
So you’ve got the audio and video both coming out of the file at way-faster-than-realtime (ideally), and then the syncing ends up being more like: let the audio play, and hold back the next video frame until it’s time to show it. The audio updates a “clock” as it plays (with each audio frame’s timestamp), and a separate loop watches the clock until the next video frame’s time is up.
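In code that ends up looking something like this (just a sketch; the names, the atomic, and the 1 ms poll are mine rather than from any particular player, and a real one also has to deal with pauses, seeks, and late frames):

    #include <stdatomic.h>
    #include <unistd.h>

    /* The "clock" the audio side updates as it plays, in seconds. */
    static _Atomic double audio_clock;

    /* Called from the audio output path each time an audio frame is actually
       handed to the sound device. */
    static void on_audio_frame_played(double frame_pts_seconds)
    {
        atomic_store(&audio_clock, frame_pts_seconds);
    }

    /* Video side: hold the next frame back until the audio clock catches up
       to its PTS, then show it. */
    static void present_when_due(double frame_pts_seconds, void (*show)(void))
    {
        while (atomic_load(&audio_clock) < frame_pts_seconds)
            usleep(1000);                   /* ~1 ms; a condvar would be nicer */
        show();
    }
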
There seems to be surprisingly little material out there on this stuff, but the most helpful things I found were the “how to write a video player in less than 1000 lines” tutorial [0] along with this spinoff [1], in conjunction with a few hours spent poring over the ffplay.c code trying to figure out how it works.
> let the audio play, and hold back the next video frame until it’s time to show it. The audio updates a “clock” as it plays (with each audio frame’s timestamp), and a separate loop watches the clock until the next video frame’s time is up.
Yes... but. They're interleaved within the container, but the muxer doesn't guarantee that they'll be properly interleaved, or even that they'll be particularly temporally close to each other. So if you're operating in "pull" mode, as you should, you may find that in order to reach the next video frame you have to demux (even if you don't fully decode!) a bunch of audio packets you don't need yet, or vice versa.
The alternative is to operate in "push" mode: decode whatever comes off the stream, audio or video, and push the resulting frames into separate ring buffers for output. This is easier to write but tends to err on the side of buffering more than you need.
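Roughly, "push" mode is this (a sketch; the libav calls are real, but queue_push() and all the setup/teardown and EOF flushing around it are placeholders):

    #include <libavformat/avformat.h>
    #include <libavcodec/avcodec.h>

    /* "Push" mode: one loop reads packets in file order, decodes whichever
       stream each packet belongs to, and pushes the frames into per-stream
       queues (ring buffers) that the audio/video outputs drain. */
    void demux_decode_push(AVFormatContext *fmt,
                           AVCodecContext *adec, AVCodecContext *vdec,
                           int audio_idx, int video_idx,
                           void (*queue_push)(int stream_idx, AVFrame *f))
    {
        AVPacket *pkt = av_packet_alloc();
        AVFrame  *frm = av_frame_alloc();

        while (av_read_frame(fmt, pkt) >= 0) {         /* packets in file order */
            AVCodecContext *dec =
                pkt->stream_index == audio_idx ? adec :
                pkt->stream_index == video_idx ? vdec : NULL;
            if (dec && avcodec_send_packet(dec, pkt) >= 0) {
                while (avcodec_receive_frame(dec, frm) >= 0)
                    queue_push(pkt->stream_index, av_frame_clone(frm));
            }                                           /* clone: queue owns it */
            av_packet_unref(pkt);
        }
        av_frame_free(&frm);
        av_packet_free(&pkt);
    }

With a badly interleaved file, one of those queues can grow a lot before the other sees anything, which is where the extra buffering tends to pile up.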
Interesting, I think I just dealt with this problem! I'd heard of the push/pull distinction but had interpreted it as "pull = drive the video based on the audio" and "push = some other way?". I think I saw "pull mode" referenced in the Chromium source and had a hard time finding any solid definition of push vs. pull.
What I was originally doing was "push", then: read packets in file order, decode them into frames, and put the frames into separate audio/video ring buffers. I thought this was fine, and it avoided reading the file twice, which I was happy with.
And then the other day, on some HN thread, I saw an offhand comment about how some files are muxed weird, like <all the audio><all the video> or some other pathological placement that would end up blocking one thread or another.
So I rewrote it so that the audio and video threads are independent, each reading the packets they care about and ignoring the rest. I think that's "pull" mode, then? It seems to be working fine, the code is definitely simpler, and I realized that the OS would probably be doing some intelligent caching on the file anyway.
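Each of those threads is basically doing this (a sketch; thread creation and the decoding side are elided, and wanted_stream is just whichever stream index that thread owns):

    #include <libavformat/avformat.h>

    /* One of these runs per stream: it opens its own handle on the file,
       reads packets in order, feeds the ones for its stream to its decoder,
       and simply drops everything else. */
    void stream_reader(const char *path, int wanted_stream,
                       void (*decode_packet)(const AVPacket *pkt))
    {
        AVFormatContext *fmt = NULL;
        AVPacket *pkt = av_packet_alloc();

        if (pkt &&
            avformat_open_input(&fmt, path, NULL, NULL) == 0 &&
            avformat_find_stream_info(fmt, NULL) >= 0) {
            while (av_read_frame(fmt, pkt) >= 0) {
                if (pkt->stream_index == wanted_stream)
                    decode_packet(pkt);        /* this thread's decoder */
                av_packet_unref(pkt);          /* other streams: skipped */
            }
        }
        av_packet_free(&pkt);
        avformat_close_input(&fmt);            /* no-op if open failed */
    }

The cost is that the file effectively gets read once per thread, which is where the OS caching mentioned above earns its keep.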
Your mention of overbuffering reminds me, though - I still have a decent-sized buffer that's probably overkill now. I'll cut that back.
0: http://dranger.com/ffmpeg/
1: https://github.com/leandromoreira/ffmpeg-libav-tutorial