Seems like a sensible project for them since shaving 4% off YouTube's traffic translates to millions of dollars in savings. But I'm more excited about the possibility of using deep learning image models to get much higher compression ratios. Some of the work I've seen on de-noising and super-resolution suggests that we are barely scratching the surface of what might be possible in terms of high-def video compression. Of course there is something of a time vs. space tradeoff, since these techniques would require way more compute for both encoding and decoding. But compute is pretty cheap and underutilized on the client side now, and Google probably has a huge amount of excess compute power to handle usage spikes that could be used for background processing.
Deep-learning-based image compression is particularly interesting because it turns the problem into a massively parallel one. As GPUs and TPUs become more powerful and more common in commodity hardware, there will be a drive to move many existing algorithms/applications to that hardware.
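As a toy illustration of why this maps so well onto GPUs/TPUs, here is a minimal convolutional autoencoder in PyTorch. The layer sizes are arbitrary and it has nothing to do with any production codec, but the encoder's latent is what you would quantize and transmit, and both halves are pure tensor ops:

    # Toy learned image codec: encoder squeezes an image into a small latent,
    # decoder reconstructs it. All sizes are arbitrary, for illustration only.
    import torch
    import torch.nn as nn

    class TinyImageCodec(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(  # 3x128x128 image -> 32x16x16 latent
                nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 4, stride=2, padding=1),
            )
            self.decoder = nn.Sequential(  # latent -> 3x128x128 reconstruction
                nn.ConvTranspose2d(32, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, x):
            latent = self.encoder(x)  # this is what you would quantize and send
            return self.decoder(latent)

    model = TinyImageCodec()
    batch = torch.rand(8, 3, 128, 128)  # stand-in for a batch of frames
    loss = nn.functional.mse_loss(model(batch), batch)  # real codecs also penalize latent bit cost

Every image in the batch, and every pixel within it, gets processed in parallel, which is exactly the property that makes this attractive as accelerators become commodity hardware.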
It's definitely cool, but the 4% win here is nothing compared to continuing the AV1 rollout. I also agree that YouTube and Google in particular seem well suited to incorporate additional AI models to massively improve compression, particularly in their mobile app. A 300MB, finely tuned model (either as an AV1 upscaler or a pure deep learning model) could plausibly let mobile devices stream 4K at the same bitrate as VP9 480p.
I am also very interested in DL-based lossy image compression!! I am starting research in it. I believe there's a lot of scope for DL to improve lossy image compression, since lossy compression relies on what humans perceive, and DL is really good at modeling what humans perceive.
Are you also interested in it? I have found some other people interested in it. Maybe we can start a Discord channel for this?
It's really a new paradigm in compression, imo. For example with a video conference, rather than trying to compress the pixel data and put it back together accurately, you can just send pose and gaze data and recreate a "fake" video on the other side for a ridiculous compression ratio.
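Back-of-the-envelope version of that ratio, with made-up field names and keypoint counts, and comparing against an uncompressed 720p frame (so take the exact number with a grain of salt):

    # Rough payload comparison: a raw 720p frame vs. only the pose/gaze data
    # needed to re-synthesize a talking head on the receiving side.
    # The keypoint count and fields are invented for illustration.
    import struct

    RAW_720P_BYTES = 1280 * 720 * 3          # uncompressed RGB frame, ~2.7 MB

    keypoints = [(0.42, 0.37)] * 68          # e.g. 68 facial landmarks, normalized coords
    gaze = (0.10, -0.05)                     # gaze direction
    head_pose = (0.00, 0.12, -0.03)          # yaw, pitch, roll

    values = [c for kp in keypoints for c in kp] + list(gaze) + list(head_pose)
    payload = struct.pack(f"<{len(values)}f", *values)

    print(f"{len(payload)} bytes per frame vs {RAW_720P_BYTES} raw")
    print(f"ratio roughly {RAW_720P_BYTES // len(payload)}:1")  # on the order of thousands to one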
For procedurally generated anime there is Waifu2x: https://github.com/nagadomi/waifu2x . After procedural generation it is recommended to run denoising on the resulting image to improve quality.
Does anyone have any details of how they converted a multidimensional problem (size of file + user perceived quality) into a binary win/loss score? It felt like that part of the article jumped from draw an oval to draw the rest of the fucking owl real quick.
Also, how they evaluated user-perceived quality doesn't seem to be elaborated on. That itself is an area of active research, last I looked.
While an interesting use of applied RL, in some sense isn't this just another way to cast the compression/compute tradeoff? I.e. can't we just achieve the same effects by using another compression scheme which trades off local compute for better compression?
Running MuZero online sounds like a fairly computationally expensive prospect...
I think you misunderstood the approach. MuZero is being used to optimize the choices made in the VP9 compression. In modern video encodings there's many ways to encode the same content. As a very simple example, you can vary how often you provide a full encoding of a frame and how often you encode differences between frames. Once this off-line optimization is done, the result is still a valid VP9 encoding, just a smaller one. MuZero is not needed for decompression at all.
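A toy sketch of what that kind of offline search looks like. The "encoder" below is just a fake cost model standing in for libvpx, and the policy is a trivial stand-in for MuZero; the point is only that the learned part picks encode-time choices (keyframe placement, quantization), while the bitstream format and the decoder stay untouched:

    # Not DeepMind's code: a per-frame policy chooses keyframe placement and a
    # quantization level, and a stub "encoder" reports the resulting size/quality.
    # A real setup would call libvpx and still emit ordinary VP9 packets.
    import random

    def stub_vp9_encode(frame_index, keyframe, qp):
        # Pretend cost model: keyframes cost more bits; higher QP saves bits
        # but hurts quality. A real encoder returns an actual VP9 packet.
        size = (5000 if keyframe else 1200) * (64 - qp) / 32
        quality = 1.0 - qp / 64 - (0.0 if keyframe else 0.05)
        return size, quality

    def policy(frame_index):
        # The learned agent (MuZero in the article) would go here; we just place
        # a keyframe every 30 frames and pick a middling QP at random.
        return {"keyframe": frame_index % 30 == 0, "qp": random.choice([30, 35, 40])}

    total_bits = total_quality = 0.0
    for i in range(300):                      # 300 pretend frames
        params = policy(i)
        size, quality = stub_vp9_encode(i, **params)
        total_bits += size
        total_quality += quality

    print(f"total bits: {total_bits:.0f}, mean quality: {total_quality / 300:.3f}")

Swap the stub policy for a learned one and the output shrinks, but any stock VP9 decoder still plays it.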
A cloud stack, from OS kernel settings to TCP/IP to database query optimizers to video codec settings to compiler settings, is made of thousands upon thousands of toggleable options, each of which is usually left at the default because no one on earth understands more than a small fraction of them, much less how to set them all appropriately for each task end-to-end. It's black boxes on top of black boxes all the way down. Collectively, inferior options could be giving up an incredible amount of performance. As has been demonstrated by experts in performance tuning, depending on how pessimal the defaults are, you could easily gain orders of magnitude in performance just by setting them to saner values, let alone truly optimal ones - these sorts of posts turn up routinely on HN, and even in very well-tuned cloud stacks, you have to figure that gains of >10% should be possible.
MuZero here shows that it can work for one piece of the stack. And MuZero is, by design, an insanely general architecture: handles two-player games like chess/Go & handles one-player like ALE, handles continuous action spaces (Sampled-MuZero), reasonably sample-efficient (because it learns an environment model, so using that more is MuZero-Reanalyzed), handles hidden information games against adversaries (Player of Games), and now OP shows self-play in a weird setting. (It still requires problem-specific input layers but even that can be lifted if you're willing to pay for Perceiver inputs which do arbitrary input modalities.)
So you can see the potential here for doing much more of cloud operations (beyond current applications like datacenter cooling control) with DRL agents. Plunk down a MuZero on your entire stack and assign it the goal of optimizing end-to-end for each specific task - DRL is expensive, but cloud-scale is even more so. Needless to say, don't expect any released checkpoints on Github...
Hmm, I don't think I misunderstood? I get that they're using MuZero to decide the bitrate for equivalent perceptual quality as a function of the content. Sure, once they decide on that using MuZero it's a valid compression and the end-user doesn't have to do anything extra... But it's super expensive to run that on the server's end, no? So it ends up being a bit like an asymmetric (in terms of client/server) compute/compression tradeoff, right? And you need to run this for each file you want to compress, hence it's "online".
> MuZero is being used to optimize the choices made in the VP9 compression.
I think "parameter optimization" is a better expression here. Optimize can means many things, but certainly it's not optimizing the algorithm/encoding itself. It's all about being smart at the very last mile.
I doubt this is intended as a real efficiency effort for YouTube, since anything that doesn't fit into their hardware-accelerated video compression framework isn't going to be economical for them.
YouTube has different needs for different videos. The average barely-viewed video can use quick and dirty hardware compression, but a large fraction of views are on a small fraction of videos, which is the sort of thing that more intense AI-assisted optimization would help with.
In that case, they could probably just trigger a re-encoding of videos when they cross into 100+ views from over 50 unique IP addresses or something like that.
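i.e. something as trivial as (thresholds taken from the comment above, purely illustrative):

    # Hypothetical trigger for kicking off an expensive re-encode once a video
    # has proven popular; the thresholds are just the ones suggested above.
    def should_reencode(view_count, unique_ips, already_reencoded=False):
        return (not already_reencoded
                and view_count >= 100
                and len(unique_ips) > 50)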
Kind of confused how this works. Doesn't my browser have to support this compression as well? Like, watching it on DeepMind's website, doesn't there have to be another layer of compression for me to watch it?
Video uses codecs which encode information in frames of various types; in particular, the I-frame is essentially a self-contained image, a bit like JPEG, and the P-frame only encodes the changes from the previous frame.
The choice of when to have an I-frame or a P-frame is arbitrary, but the rendered video will look the same. However, too many I-frames can bloat the filesize, and too few can degrade the appearance significantly as errors add up.
They act on a codec parameter related to the I-frame, to pick better rules for good compression without visible errors in the P-frame.
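A toy model of that tradeoff, with all constants invented for illustration: spacing I-frames further apart saves bits, but the worst-case drift between resets grows.

    # Toy I-frame/P-frame tradeoff: more frequent I-frames cost more bits but
    # reset the error that accumulates across runs of P-frames. Constants are invented.
    I_FRAME_BITS, P_FRAME_BITS = 80_000, 8_000
    ERROR_PER_P_FRAME = 0.4            # pretend perceptual drift per consecutive P-frame

    def cost(keyframe_interval, n_frames=300):
        i_frames = n_frames // keyframe_interval
        p_frames = n_frames - i_frames
        bits = i_frames * I_FRAME_BITS + p_frames * P_FRAME_BITS
        worst_drift = (keyframe_interval - 1) * ERROR_PER_P_FRAME
        return bits, worst_drift

    for interval in (10, 30, 120):
        bits, drift = cost(interval)
        print(f"I-frame every {interval:3d} frames: {bits:>9,} bits, worst drift {drift:.1f}")

The encoder's job is to find a good point on that curve for each specific video, which is where the learned rate control comes in.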
The decoder in your player can be considered a programmable machine, and the job of the encoder is to emit an optimal program that achieves or approximates the desired results. Just like any program compiler, there will be more than one way to do it, depending on how much space and time you are willing to dedicate to the job.
Not really sure, but I can't stop thinking that, perhaps, simpler models could perform similarly, and that this is just an attempt to find a problem that fits the existing solution. In this specific case, the problem space is so well defined that we can just spend more time on training than on building more complicated models. You see, GPUs are much cheaper to roll than researchers.
AGI consists of a set of problems:
1. Finding an algorithm which is capable of learning anything.
2. Building the computers that can run said algorithm.
3. Collecting, sorting and filtering the data that the algorithm learns from.
DeepMind claims (with good cause, IMO) that MuZero can be such an algorithm. Showing that this one algorithm can tackle disparate problems is a way of proving this.
I think the questions that still stand are: is it even possible to build computers that could drive a scaled up MuZero to AGI? And is there a more efficient way to get there? I suspect the answer to both questions is yes.
Still, I think it is pretty incredible that we've managed to build computer programs that can totally adapt to arbitrary datasets and perform arbitrary tasks.
I think they are referring to this part, second paragraph: “Now, in pursuit of DeepMind’s mission to solve intelligence, MuZero has taken a first step towards mastering a real-world task by optimising video on YouTube.”
It doesn’t. We are still very far from anything approaching AGI. However, the technology is impressive and is solving problems that we couldn’t solve before. So there is that.