Good overview of all the parts involved! I was hoping they’d talk a little more about the timing aspects, and keeping audio and video in sync during playback.
What I’ve learned from working on a video editor is that “keeping a/v in sync” is… sort of a misnomer? Or anyway, it sounds very “active”, like you’d have to line up all the frames and carefully set timers to play them or something.
But in practice, the audio and video frames are interleaved in the file, and they naturally come out in order (ish - see replies). The audio plays at a known rate (like 44.1 kHz) and every frame of audio and video has a “presentation timestamp”, and these timestamps (are supposed to) line up between the streams.
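To make the timestamp part concrete, the comparison is basically this (a sketch in FFmpeg/libav terms; the helper is mine, but the pts field, the stream’s time_base, and av_q2d() are real). A PTS is in the stream’s time_base units, so converting both streams to seconds puts them on the same scale:

    #include <libavformat/avformat.h>
    #include <libavutil/frame.h>
    #include <math.h>

    /* Convert a decoded frame's PTS into seconds using the time_base of the
       stream it came from.  Audio and video PTS values only become directly
       comparable once they're both expressed in seconds like this. */
    static double pts_seconds(const AVFrame *frame, const AVStream *stream)
    {
        if (frame->pts == AV_NOPTS_VALUE)
            return NAN;                     /* some frames carry no timestamp */
        return frame->pts * av_q2d(stream->time_base);
    }
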
So you’ve got the audio and video both coming out of the file at way-faster-than-realtime (ideally), and then the syncing ends up being more like: let the audio play, and hold back the next video frame until it’s time to show it. The audio updates a “clock” as it plays (with each audio frame’s timestamp), and a separate loop watches the clock until the next video frame’s time is up.
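In code that ends up looking something like this (just a sketch; the names, the atomic, and the 1 ms poll are mine rather than from any particular player, and a real one also has to deal with pauses, seeks, and late frames):

    #include <stdatomic.h>
    #include <unistd.h>

    /* The "clock" the audio side updates as it plays, in seconds. */
    static _Atomic double audio_clock;

    /* Called from the audio output path each time an audio frame is actually
       handed to the sound device. */
    static void on_audio_frame_played(double frame_pts_seconds)
    {
        atomic_store(&audio_clock, frame_pts_seconds);
    }

    /* Video side: hold the next frame back until the audio clock catches up
       to its PTS, then show it. */
    static void present_when_due(double frame_pts_seconds, void (*show)(void))
    {
        while (atomic_load(&audio_clock) < frame_pts_seconds)
            usleep(1000);                   /* ~1 ms; a condvar would be nicer */
        show();
    }
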
There seems to be surprisingly little material out there on this stuff, but the most helpful things I found were the “how to write a video player in less than 1000 lines” tutorial [0] along with this spinoff [1], in conjunction with a few hours spent poring over the ffplay.c code trying to figure out how it works.
> let the audio play, and hold back the next video frame until it’s time to show it. The audio updates a “clock” as it plays (with each audio frame’s timestamp), and a separate loop watches the clock until the next video frame’s time is up.
Yes... but. They're interleaved within the container, but the muxer doesn't guarantee that they'll be properly interleaved, or even that they'll be particularly temporally close to each other. So if you're operating in "pull" mode, as you should, you may find that in order to reach the next video frame you have to demux (even if you don't fully decode!) a bunch of audio packets you don't need yet, or vice versa.
The alternative is to operate in "push" mode: decode whatever comes off the stream, audio or video, and push the resulting frames into separate ring buffers for output. This is easier to write but tends to err on the side of buffering more than you need.
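Roughly, "push" mode is this (a sketch; the libav calls are real, but queue_push() and all the setup/teardown and EOF flushing around it are placeholders):

    #include <libavformat/avformat.h>
    #include <libavcodec/avcodec.h>

    /* "Push" mode: one loop reads packets in file order, decodes whichever
       stream each packet belongs to, and pushes the frames into per-stream
       queues (ring buffers) that the audio/video outputs drain. */
    void demux_decode_push(AVFormatContext *fmt,
                           AVCodecContext *adec, AVCodecContext *vdec,
                           int audio_idx, int video_idx,
                           void (*queue_push)(int stream_idx, AVFrame *f))
    {
        AVPacket *pkt = av_packet_alloc();
        AVFrame  *frm = av_frame_alloc();

        while (av_read_frame(fmt, pkt) >= 0) {         /* packets in file order */
            AVCodecContext *dec =
                pkt->stream_index == audio_idx ? adec :
                pkt->stream_index == video_idx ? vdec : NULL;
            if (dec && avcodec_send_packet(dec, pkt) >= 0) {
                while (avcodec_receive_frame(dec, frm) >= 0)
                    queue_push(pkt->stream_index, av_frame_clone(frm));
            }                                           /* clone: queue owns it */
            av_packet_unref(pkt);
        }
        av_frame_free(&frm);
        av_packet_free(&pkt);
    }

With a badly interleaved file, one of those queues can grow a lot before the other sees anything, which is where the extra buffering tends to pile up.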
Interesting, I think I just dealt with this problem! I'd heard of the push/pull distinction but had interpreted it as "pull = drive the video based on the audio" and "push = some other way?". I think I saw "pull mode" referenced in the Chromium source and had a hard time finding any solid definition of push vs. pull.
What I was originally doing was "push", then: read packets in file order, decode them into frames, and put the frames into separate audio/video ring buffers. I thought this was fine, and it avoided reading the file twice, which I was happy with.
And then the other day, on some HN thread, I saw an offhand comment about how some files are muxed weird, like <all the audio><all the video> or some other pathological placement that would end up blocking one thread or another.
So I rewrote it so that the audio and video threads are independent, each reading the packets they care about and ignoring the rest. I think that's "pull" mode, then? It seems to be working fine, the code is definitely simpler, and I realized that the OS would probably be doing some intelligent caching on the file anyway.
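Each of those threads is basically doing this (a sketch; thread creation and the decoding side are elided, and wanted_stream is just whichever stream index that thread owns):

    #include <libavformat/avformat.h>

    /* One of these runs per stream: it opens its own handle on the file,
       reads packets in order, feeds the ones for its stream to its decoder,
       and simply drops everything else. */
    void stream_reader(const char *path, int wanted_stream,
                       void (*decode_packet)(const AVPacket *pkt))
    {
        AVFormatContext *fmt = NULL;
        AVPacket *pkt = av_packet_alloc();

        if (pkt &&
            avformat_open_input(&fmt, path, NULL, NULL) == 0 &&
            avformat_find_stream_info(fmt, NULL) >= 0) {
            while (av_read_frame(fmt, pkt) >= 0) {
                if (pkt->stream_index == wanted_stream)
                    decode_packet(pkt);        /* this thread's decoder */
                av_packet_unref(pkt);          /* other streams: skipped */
            }
        }
        av_packet_free(&pkt);
        avformat_close_input(&fmt);            /* no-op if open failed */
    }

The cost is that the file effectively gets read once per thread, which is where the OS caching mentioned above earns its keep.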
Your mention of overbuffering reminds me, though - I still have a decent-sized buffer that's probably overkill now. I'll cut that back.
0: http://dranger.com/ffmpeg/
1: https://github.com/leandromoreira/ffmpeg-libav-tutorial