LOL, I'll try. Two major things to grasp about transport technology are a) the "entry point" and b) the notion of time.
a) multiple types of media data are encoded independently and then bundled together into what essentially looks like an endless file (called a stream file). So when given a chunk of such a file, the decoder needs to quickly identify the nearest offset in it where it can begin decoding all the individual media it needs simultaneously. This is called an "access point". Decoding cannot be started at any random place in the stream, since it generally requires context (so an access point allows decoding to start with the context being empty for all required media -- audio, video, graphics, subtitles, etc.). Stream file formats (called containers) are designed to solve this, i.e. to provide access points to the decoder as easily and frequently as possible.
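To make a) a bit more concrete, here's a rough sketch (Python; the function name is mine, and it assumes the chunk is already aligned to packet boundaries, which a real demuxer wouldn't rely on) of what finding the nearest access point looks like in one real container, MPEG-TS, where a packet can flag itself as a random access point in its adaptation field:

    # Sketch: scan a chunk of an MPEG-TS stream for the next access point.
    # Assumes the chunk starts on a 188-byte packet boundary (a simplification;
    # a real demuxer would resynchronize on the 0x47 sync byte).

    TS_PACKET_SIZE = 188

    def find_access_point(chunk: bytes) -> int | None:
        """Return the offset of the first packet whose adaptation field has
        random_access_indicator set, or None if the chunk has no access point."""
        for off in range(0, len(chunk) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
            pkt = chunk[off:off + TS_PACKET_SIZE]
            if pkt[0] != 0x47:                  # lost sync -- bail out
                return None
            has_adaptation = pkt[3] & 0x20      # adaptation_field_control bit
            if has_adaptation and pkt[4] > 0:   # adaptation_field_length
                flags = pkt[5]
                if flags & 0x40:                # random_access_indicator
                    return off
        return None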
b) a decoder, when driven by a running presentation device -- video screen, audio amplifier, etc. -- is essentially a pump. The encoder can be looked at as a pump too, when the two are separated by a network. If the decoder runs faster than the source feeds it, it will drain the pipe and make the presentation device run idle (which will be noticeable to the consumer). If it runs slower, at some point it will be drowned in data from the source. So the pumping rhythm needs to be kept identical at both ends. The most practical way to synchronize the "piping clock source" is via the stream file itself (which has to carry timing samples for that). Again, different containers solve this differently (some not at all).
EDIT: I didn't mention (it should go under b)) the effort to keep the throughput of the pumping constant -- "constant bit rate". I believe that with the advent of transport schemes which require point-to-point connections (as opposed to multicast streaming), the importance of this is lower now.
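To make the "timing samples in the stream" part of b) concrete: MPEG-TS, for example, carries a Program Clock Reference (PCR) in some packets, and the receiver compares it against its own clock to speed up or slow down its pumping. A rough Python sketch of pulling the PCR out and measuring drift (the drift handling here is just a placeholder -- real receivers run a PLL on this):

    # Sketch: extract the PCR (27 MHz clock samples) from a TS packet and
    # estimate drift between the sender's clock and the local clock.

    def extract_pcr(pkt: bytes) -> int | None:
        """Return the PCR in 27 MHz ticks, or None if this packet carries no PCR."""
        if pkt[0] != 0x47 or not (pkt[3] & 0x20) or pkt[4] == 0:
            return None
        if not (pkt[5] & 0x10):                   # PCR_flag not set
            return None
        b = pkt[6:12]
        base = (b[0] << 25) | (b[1] << 17) | (b[2] << 9) | (b[3] << 1) | (b[4] >> 7)
        ext = ((b[4] & 0x01) << 8) | b[5]
        return base * 300 + ext                   # 90 kHz base + 27 MHz extension

    def drift_seconds(pcr_ticks: int, first_pcr: int, local_elapsed_s: float) -> float:
        """Positive result: the local clock runs ahead of the stream clock."""
        stream_elapsed_s = (pcr_ticks - first_pcr) / 27_000_000
        return local_elapsed_s - stream_elapsed_s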
Re: b)
For video, the encoder contains a model of the decoder, including the amount of buffering available to the decoder.
The bit-rate controller at the encoder uses this model to ensure that the decoder always has the right amount of data in its input buffers.
It also ensures that the information rate of the channel is matched with that of the compressed stream in a live transmission setting.
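A simple way to picture that model is a leaky bucket: the decoder's input buffer fills at the channel bit rate and drains by one compressed picture at every decode instant, and the rate controller has to keep it from ever underflowing or overflowing. A rough Python sketch with made-up numbers (buffer size, bit rate, and frame sizes are purely illustrative):

    # Sketch: leaky-bucket model of the decoder's input buffer, the kind of
    # model a rate controller checks against (all numbers here are made up).

    def simulate_buffer(frame_bits, bit_rate=4_000_000, fps=25.0,
                        buffer_bits=2_000_000, initial_fill=1_000_000):
        """Yield buffer fullness after each frame is removed for decoding.
        Raises if the model overflows or underflows (decoder starved)."""
        fullness = initial_fill
        per_frame_input = bit_rate / fps          # bits arriving per frame period
        for i, size in enumerate(frame_bits):
            fullness += per_frame_input           # channel keeps filling the buffer
            if fullness > buffer_bits:
                raise OverflowError(f"buffer overflow before frame {i}")
            if size > fullness:
                raise RuntimeError(f"buffer underflow at frame {i}")
            fullness -= size                      # decoder pulls one picture
            yield fullness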
The transport scheme, which operates at a layer below the codec, therefore only needs to take care of delay and packet delivery/loss issues over the channel. Media is typically transmitted over UDP.
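For the delay/loss part, the usual approach is an RTP-style header with a sequence number and a timestamp on each UDP datagram, so the receiver can detect gaps and reorder. A toy sketch (the 8-byte header layout here is made up for illustration, not actual RTP):

    # Sketch: toy UDP media framing with a sequence number for loss detection.
    # The header layout is invented for illustration; real systems use RTP.

    import socket
    import struct

    HEADER = struct.Struct("!IHxx")   # 32-bit timestamp, 16-bit sequence, 2 pad bytes

    def send_chunk(sock: socket.socket, addr, seq: int, ts: int, payload: bytes):
        sock.sendto(HEADER.pack(ts, seq) + payload, addr)

    def receive_chunk(sock: socket.socket, expected_seq: int):
        data, _ = sock.recvfrom(65535)
        ts, seq = HEADER.unpack(data[:HEADER.size])
        lost = (seq - expected_seq) & 0xFFFF      # a gap means packets were lost
        return seq, ts, lost, data[HEADER.size:]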