LOL, I'll try. The two major things to grasp about transport technology are a) the "access point" and b) the notion of time.

a) Multiple types of media data are encoded independently and then bundled together into what essentially looks like an endless file (called a stream file). So when handed a chunk of such a file, the decoder needs to quickly find the nearest offset in it where it can start decoding all the individual media it needs at the same time. This is called an "access point". Decoding cannot start at an arbitrary place in the stream, because it generally requires context; an access point is a place where decoding can start with empty context for all the required media (audio, video, graphics, subtitles etc). Stream file formats (called containers) are designed to solve this: they give the decoder access points as easily and as frequently as possible.
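To make a) a bit more concrete, here is a toy sketch in Python. It is not any real container format: the `Packet` record, its `access_point` flag and the 188-byte spacing in the sample data are all assumptions made up for illustration. It only shows the idea of scanning forward from wherever you landed in the stream to the nearest place where decoding may begin.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet record; real containers encode this differently."""
    offset: int          # byte offset of the packet within the chunk
    stream: str          # which medium the packet carries, e.g. "video"
    access_point: bool   # assumed flag: decoding may start here with empty context


def find_access_point(packets: Sequence[Packet], start: int) -> Optional[int]:
    """Return the offset of the first access point at or after `start`, if any."""
    for packet in packets:
        if packet.offset >= start and packet.access_point:
            return packet.offset
    return None  # no access point in this chunk; the decoder has to wait for more data


# Made-up sample chunk: only the packet at offset 376 is marked as an access point.
packets = [
    Packet(0, "video", False),
    Packet(188, "audio", False),
    Packet(376, "video", True),
    Packet(564, "audio", False),
]
```

With this data, a decoder that joins at offset 0 would have to skip ahead to offset 376 before it can show anything, which is why containers try to make such points frequent and cheap to find.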

b) A decoder, when driven by a running presentation device (video screen, audio amplifier etc), is essentially a pump. The encoder can be looked at as a pump too when the two are separated by a network. If the decoder runs faster than the source feeds it, it will drain the pipe and leave the presentation device idle (which the consumer will notice). If it runs slower, at some point it will drown in data from the source. So both ends need to pump at the same rhythm. The most practical way to synchronize the "piping clocksource" is via the stream file itself (which has to carry time sample data for that). Again, different containers solve this differently (some not at all).
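The pump picture in b) can be played with in a few lines of Python. This is a simplified model under my own assumptions, not how any particular decoder works: rates are in arbitrary units per tick, the buffer is a single counter, and `simulate_buffer` and `estimate_clock_ratio` are names invented for this sketch. The second function shows the basic idea behind using time samples carried in the stream: compare how much sender time has passed against how much local time has passed.

```python
from typing import List, Tuple


def simulate_buffer(feed_rate: int, drain_rate: int, capacity: int,
                    initial_fill: int, ticks: int) -> Tuple[str, int]:
    """Run the 'pipe' for `ticks` steps and report what happens to it.

    Returns ("underrun", t) if the decoder drains the pipe at tick t,
    ("overflow", t) if the source overfills it, or ("ok", ticks) otherwise.
    """
    fill = initial_fill
    for t in range(ticks):
        fill += feed_rate - drain_rate
        if fill < 0:
            return ("underrun", t)   # presentation device would run idle
        if fill > capacity:
            return ("overflow", t)   # decoder is drowning in data
    return ("ok", ticks)


def estimate_clock_ratio(samples: List[Tuple[float, float]]) -> float:
    """Estimate sender clock speed relative to the local clock.

    Each sample is (sender_time_from_stream, local_arrival_time), both in the
    same unit. A result above 1.0 means the sender's clock runs faster.
    """
    if len(samples) < 2:
        raise ValueError("need at least two time samples")
    (sender_first, local_first), (sender_last, local_last) = samples[0], samples[-1]
    local_elapsed = local_last - local_first
    if local_elapsed <= 0:
        raise ValueError("local time must advance between samples")
    return (sender_last - sender_first) / local_elapsed
```

Even a 1% mismatch between `feed_rate` and `drain_rate` empties or overfills a finite buffer sooner or later, so buffering alone only postpones the problem; the decoder has to steer its own rate using something like the estimated ratio.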

EDIT: I didn't mention (it belongs under b)) the effort to keep the pumping throughput constant, i.e. "constant bit rate". I left it out because I believe its importance is going down now, with the advent of transport schemes that require point-to-point connections (as opposed to multicast streaming).