Error rates are still high, as you quickly find if you have latency-sensitive information that you need to ship over the public internet, like live video.
Also, you can trade off bandwidth for error rate at a lower level of the networking stack, so if you know your applications will use error-correcting protocols like TCP, you can make your switches and routers talk to each other a little faster.
Protocols like QUIC and SRT (used for video) are great; forward error/erasure correction is something I would also mention as a large part of the rise of UDP-based over TCP-based transfer protocols.
If your video is short enough to encode within the Lambda limits, it is worth considering this approach: https://aws.amazon.com/blogs/media/processing-user-generated...
MediaConvert is expensive; generally the AWS video stack is expensive (MediaConvert, Elemental, MediaConnect).
For what I have done (2K and down, main profile) the quality has been OK. I have also read some complaints about quality at high res, but I am a happy customer.
Same with video parsers and tooling: they frequently expect a whole mp4, or a whole video, to be there before they can parse it, yet the gstreamer/ffmpeg APIs deliver the content as a stream of buffers that you have to process one buffer at a time.
Traditionally, ffmpeg would build the mp4 container while the transcoded media was written to disk (in a single contiguous mdat box after the ftyp) and then put the track descriptions and sample tables in a moov at the end of the file. That's efficient because you can't precisely allocate the moov before you've processed the media (in one pass).
But when you would load the file into a <video> element, it would of course need to buffer the entire file to find the moov box needed to decode the NAL units (in the case of avc1).
A simple solution was then to repackage by simply moving the moov from the end of the file to before the mdat (adjusting the chunk offsets). Back in the day, that would make your video start instantly!
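For reference, that relocation is a remux, not a re-encode. A minimal sketch driving the ffmpeg CLI from Python (file names are placeholders):

```python
import subprocess

# Copy the streams as-is (-c copy); +faststart makes the mp4 muxer write the
# moov at the end as usual, then move it in front of the mdat and patch the
# chunk offsets, so playback can start before the whole file has downloaded.
subprocess.run(
    ["ffmpeg", "-i", "progressive.mp4", "-c", "copy",
     "-movflags", "+faststart", "faststart.mp4"],
    check=True,
)
```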
This is basically what CMAF is: the ftyp and moov get sent at the beginning (and frequently get written out as an init segment), and then the rest is a continuous stream of moofs and mdats, chunked as per gstreamer/ffmpeg specifics.
I was thinking of progressive MP4, with the sample tables in the moov. But yes, CMAF and other fragmented MP4 profiles have the ftyp and moov at the front, too.
Rather than putting the media in a contiguous blob, CMAF interleaves it with moofs that hold the sample byte ranges and timing. Moreover, while this interleaving allows most of the CMAF file to be progressively streamed to disk as the media is created, it has the same catch-22 as the "progressive" MP4 file in that the index (sidx, in the case of CMAF) cannot be written at the start of the file until all the media it indexes has been processed.
When writing CMAF, ffmpeg will usually omit the segment index, which makes fast seeking painful. To insert the `sidx` (after the ftyp+moov but before the moof+mdat pairs) you need to repackage (but not re-encode).
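One hedged sketch of that repackaging step: this assumes an ffmpeg build whose mp4 muxer supports the global_sidx movflag (and a seekable output), and it may not preserve the original fragment boundaries, since frag_keyframe re-fragments at keyframes of the copied stream.

```python
import subprocess

# Remux only (-c copy): write a fragmented MP4 with ftyp+moov up front and a
# single global sidx index ahead of the moof+mdat pairs, so players can seek
# without scanning the whole file. File names are placeholders.
subprocess.run(
    ["ffmpeg", "-i", "cmaf_input.mp4", "-c", "copy",
     "-movflags", "frag_keyframe+empty_moov+default_base_moof+global_sidx",
     "indexed_fragmented.mp4"],
    check=True,
)
```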
It is possible that this is not a fault of the parser or tooling. In some cases, specifically when the video file is not targeted for streaming, the moov atom is at the end of the mp4. The moov atom is required for playback.
That's intentional, and it can be very handy. Zip files were designed so that you could make an archive self-extracting: you could strap a self-extraction binary to the front of the archive, which, rather obviously, could never have been done if the executable code followed the archive.
But the thing is that the executable can be anything, so if what you want to do is to bundle an arbitrary application plus all its resources into a single file, all you need to do is zip up the resources and append the zipfile to the compiled executable. Then at runtime the application opens its own $0 as a zipfile. It Just Works.
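A tiny sketch of that trick in Python (file names are made up): because the zip central directory lives at the end of the file and is located by scanning backwards, the stdlib zipfile module happily opens an archive with arbitrary bytes prepended.

```python
import shutil
import zipfile

# Build a bundle: some executable (or script) followed by a zip of resources.
# "app_binary" and "resources.zip" are hypothetical input files.
with open("bundle", "wb") as out:
    for part in ("app_binary", "resources.zip"):
        with open(part, "rb") as f:
            shutil.copyfileobj(f, out)

# At runtime the application can open its own file as a zip: ZipFile finds
# the end-of-central-directory record from the end of the file, so the
# prepended executable bytes are simply ignored.
with zipfile.ZipFile("bundle") as z:
    print(z.namelist())
```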
Also, it makes it easier to append new files to an existing zip archive. No need to adjust an existing header (and potentially slide the whole archive around if the header size changes), just append the data and append a new footer.
I’ve found the Rust ecosystem to be very good about never assuming you have enough memory for anything and usually supporting streaming styles of widget use where possible.
Ha! I was literally thinking of the libs for parsing h264/5 and mp4 in Rust (so not using unsafe gstreamer/ffmpeg code) when moaning a little here.
Generally I find the Rust libraries and crates to be well designed around readers and writers.
My experience that played out over the last few weeks led me to a similar belief, somewhat. For rather uninteresting reasons I decided I wanted to create mp4 videos of an animation programmatically.
The first solution suggested when googling around is to just create all the frames, save them to disk, and then let ffmpeg do its thing from there. I would have just gone with that for a one-off task, but it's a pretty bad solution if the video is long, or high res, or both. Plus, what I really wanted was to build something more "scalable/flexible".
Maybe I didn't know the right keywords to search for, but there really didn't seem to be many options for creating frames, piping them straight to an encoder, and writing just the final video file to disk. The only one I found that seemed like it could maybe do it the way I had in mind was VidGear[1] (Python). I had figured that with the popularity of streaming, and video in general on the web, there would be so much more tooling for these sorts of things.
I ended up digging way deeper into this than I had intended, and built myself something on top of Membrane[2] (Elixir).
It sounds like a misunderstanding of the MPEG concept. For an encode to be made efficiently, it needs to see more than one frame of video at a time. Sure, I-frame-only encoding is possible, but it's not efficient and the result isn't really distributable. Encoding wants to see multiple frames at a time so that P and B frames can be used. Also, the way to get the best bang for the bandwidth buck is to use multipass encoding. Can't do that if all of the frames don't exist yet.
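For the multipass point, a minimal sketch of the usual two-pass flow, driving the ffmpeg CLI from Python (input/output names and the 2M target bitrate are made up):

```python
import subprocess

# Pass 1 analyzes the whole video and only writes the x264 stats file
# (the output itself is discarded); pass 2 does the real encode, using those
# stats to distribute the bit budget across all frames.
common = ["ffmpeg", "-y", "-i", "input.mp4", "-c:v", "libx264", "-b:v", "2M"]
subprocess.run(common + ["-pass", "1", "-an", "-f", "null", "/dev/null"], check=True)  # use NUL on Windows
subprocess.run(common + ["-pass", "2", "-c:a", "aac", "output.mp4"], check=True)
```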
You have to remember how old the technologies you are trying to use are, and then consider the power of the computers available when they were made. MPEG-2 encoding used to require a dedicated expansion card because CPUs didn't have decent instructions for the encoding. Now that's all native to the CPU, which makes the code base archaic.
No doubt that my limited understanding of these technologies came with some naive expectations of what's possible and how it should work.
Looking into it, and working through it, part of my experience was a lack of resources at the level of abstraction I was trying to work at. It felt like I was missing something: on one end, video editors that power billion-dollar industries; on the other, directly embedding the ffmpeg libs into your project and doing things in a way that requires full understanding of all the parts and how they fit together; and little to nothing in between.
Putting a glorified powerpoint in an mp4 to distribute doesn't feel to me like it is the kind of task where the prerequisite knowledge includes what the difference between yuv420 and yuv422 is or what Annex B or AVC are.
My initial expectation was that there had to be some in-between solution. Before I set out, what I thought would happen is that I'd `npm install` some module, create frames with node-canvas, stream them into this lib, and get an mp4 out the other end that I could send to disk or S3 as I please.* Worrying about the nitty-gritty details like how efficient it is, how many frames it buffers, or how optimized the output is, would come later.
Going through this whole thing, I now wonder how Instagram/TikTok/Telegram and co. handle the initial rendering of their video stories/reels, because I doubt it's anywhere close to the process I ended up with.
* That's roughly how my setup works now, just not in JS. I'm sure it could be another 10x faster at least, if done differently, but for now it works and lets me continue with what I was trying to do in the first place.
This sounds like "I don't know what a wheel is, but if I chisel this square to be more efficient it might work". Sometimes, it's better to not reinvent the wheel, but just use the wheel.
Pretty much everyone serving video uses DASH or HLS so that there are many versions of the encoding at different bit rates, frame sizes, and audio settings. The player determines if it can play the streams and keeps stepping down until it finds one it can use.
Edit:
>Putting a glorified powerpoint in an mp4 to distribute doesn't feel to me like it is the kind of task where the prerequisite knowledge includes what the difference between yuv420 and yuv422 is or what Annex B or AVC are.
This is the beauty of using mature software. You don't need to know this any more. Encoders can now set the profile/level and bit depth to what is appropriate. I don't have the charts memorized for when to use which profile at which level. In the early days, the decoders were so immature that you absolutely needed to know the decoder's abilities to ensure a compatible encode was made. Now the decoder is so mature, and even native to the CPU, that the only limitation is bandwidth.
Of course, all of this is strictly talking about the video/audio. Most people are totally unaware that you can put programming inside of an MP4 container that allows for interaction similar to DVD menus: jumping to different videos, selecting different audio tracks, etc.
> This sounds like "I don't know what a wheel is, but if I chisel this square to be more efficient it might work". Sometimes, it's better to not reinvent the wheel, but just use the wheel.
I'm not sure I can follow. This isn't specific to MP4 as far as I can tell. MP4 is what I cared about, because it's specific to my use case, but it wasn't the source of my woes. If my target had been a more adaptive or streaming friendly format, the problem would have still been to get there at all. Getting raw, code-generated bitmaps into the pipeline was the tricky part I did not find a straightforward solution for. As far as I am able to tell, settling on a different format would have left me in the exact same problem space in that regard.
The need to convert my raw bitmap from rgba to yuv420 among other things (and figuring that out first) was an implementation detail that came with the stack I chose. My surprise lies only in the fact that this was the best option I could come up with, and a simpler solution like I described (that isn't using ffmpeg-cli, manually or via spawning a process from code) wasn't readily available.
> You don't need to know this any more.
To get to the point where an encoder could take over, pick a profile, and take care of the rest was the tricky part that required me to learn what these terms meant in the first place. If you have any suggestions of how I could have gone about this in a simpler way, I would be more than happy to learn more.
Using the example of ffmpeg, you can put things like -f in front of -i to describe what the incoming format is, so your homebrew exporter can write to stdout and pipe into ffmpeg, which reads from stdin with '-i -'. More specifically, '-f bmp_pipe -i -' would expect the incoming data stream to be a sequence of BMP images. You can select any format/codec your build supports ('ffmpeg -formats' and 'ffmpeg -codecs' list them).
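As a variation on the same idea, here is a minimal sketch in Python that skips image containers entirely and pipes raw RGBA frames into ffmpeg, letting it handle the yuv420p conversion and the H.264 encode. Dimensions, frame count, and file name are hypothetical; in a real project the generated frame would come from whatever renders the animation (node-canvas, Cairo, etc.).

```python
import subprocess

WIDTH, HEIGHT, FPS, FRAMES = 640, 360, 30, 90  # made-up parameters

# ffmpeg reads raw RGBA frames from stdin ('-i -') and writes out.mp4;
# nothing but the final file ever touches the disk.
ffmpeg = subprocess.Popen(
    ["ffmpeg", "-y",
     "-f", "rawvideo", "-pixel_format", "rgba",
     "-video_size", f"{WIDTH}x{HEIGHT}", "-framerate", str(FPS),
     "-i", "-",
     "-c:v", "libx264", "-pix_fmt", "yuv420p",
     "-movflags", "+faststart",
     "out.mp4"],
    stdin=subprocess.PIPE,
)

for i in range(FRAMES):
    # Trivial generated frame: a flat gray that brightens over time.
    shade = int(255 * i / FRAMES)
    frame = bytes([shade, shade, shade, 255]) * (WIDTH * HEIGHT)
    ffmpeg.stdin.write(frame)

ffmpeg.stdin.close()
ffmpeg.wait()
```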
In a way, that's good. The few hundred video encoding specialists who exist in the world have, per person, had a huge impact on the world.
Compare that to web developers, who in total have probably had a larger impact on the world, but per head it is far lower.
Part of engineering is to use the fewest people possible to have the biggest benefit for the most people. Video did that well - I suspect partly by being 'hard'.
My office is 7 minutes away on my bicycle. I work from home, where I have a better screen setup and privacy, instead of having to shuffle into a cubicle to make calls. I cycle to the office to have a beer with some colleagues every few weeks on an agreed date, and we meet there when we have to do some deep planning. Day to day my house is so much better.
Why would you want 1? Having the raster frames is surely better for post-production. I agree that models should take a stab at compression, but I think it should be independent. At the end of the day you also don't want to be doing video compression on your GPU; using a dedicated chip for that is so much more efficient.
Lastly, you don't want to compress the same way all the time. For low latency we compress with no B-frames and a smallish GOP; with VOD we have a long GOP, and B-frames are great for compression (roughly the two encoder setups sketched below).
2. As long as we can again port the algos to dedicated hardware, which on mobiles is a must for energy efficiency, for both encode and decode.
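To make the GOP/B-frame point concrete, a hedged sketch of the two x264 setups (standard libx264/ffmpeg options; file names are placeholders):

```python
import subprocess

# Low latency: zerolatency tuning, no B-frames, short GOP (~1 s at 30 fps).
subprocess.run(
    ["ffmpeg", "-i", "in.mp4", "-c:v", "libx264",
     "-tune", "zerolatency", "-bf", "0", "-g", "30",
     "low_latency.mp4"],
    check=True,
)

# VOD: long GOP and B-frames enabled, trading latency for compression.
subprocess.run(
    ["ffmpeg", "-i", "in.mp4", "-c:v", "libx264",
     "-bf", "3", "-g", "250",
     "vod.mp4"],
    check=True,
)
```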
Where do we price in externalities? We surely need to in manufacturing, construction, and such, but I don't see us doing it in any meaningful way now, so I am a bit surprised by your statement!