Extreme video compression with prediction using pre-trained diffusion models (github.com/elesionkyrie)
144 points by john_g 9 months ago | 88 comments



Extreme compression will be when you put in a movie and get a SORA prompt back that regenerates something close enough to the movie.


Where’s that quote? Something like “AI is just compression, and compression is indistinguishable from AI”


I'm not sure of the quote, but you're probably thinking of something related to the Hutter Prize:

https://en.m.wikipedia.org/wiki/Hutter_Prize

A lossless compression contest to encourage research in AI. It's lossless, I think just to standardize scoring, but I always thought a lossy version would be better for AI -- our memories are definitely lossy!


> A lossless compression contest to encourage research in AI. It's lossless, I think just to standardize scoring, but I always thought a lossy version would be better for AI -- our memories are definitely lossy!

Gwern posts about this when people say something like that on here, but I'll do it instead. Lossless encoding is just lossy encoding + error correction of some sort.
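
A minimal sketch of that idea in Python, with coarse quantization standing in for the "lossy" stage (the numbers are made up for illustration):

    import numpy as np

    # "Lossless = lossy + error correction" in miniature: quantize coarsely
    # (the lossy stage), store the residual (the correction), reconstruct exactly.
    original = np.array([3, 7, 12, 100, 101, 103], dtype=np.int64)

    step = 10
    lossy = (original // step) * step     # coarse approximation
    residual = original - lossy           # correction data (small numbers, cheap to store)

    assert np.array_equal(lossy + residual, original)   # bit-exact recovery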


Hah. "error correction of some sort" is doing a lot of heavy lifting there. Bit level correction is what I generally consider under the "error correction" umbrella. Which we have no problem with -- curious the bit error rate that would make text unreadable. In the context of compressing a large part of the English language wiki -- I think lossy also includes loss so significant that you wouldn't be able to reproduce the exact text. So, well beyond what we would generally consider "error correcting". But intuitively understandable as equivalent by humans. Impossible to quantify that objectively, hence lossless only for the competition.


The way we learn exact sentences is usually by getting an intuitive sense and applying corrections.

For example, to memorize "Doggo woofs at kity", we first get the concept of "dog barks at cat", it compresses well because intuitively, we know that dogs bark and cats are common targets. That's our lossy compression and we could stop there but it is only part of the story. It is not a "dog" but a "doggo", and it goes well with the familiar tone, a good compression algorithm will take only a few bits for that. Then there is the typo "kity" vs "kitty", it will take a bit of extra space, but again, a good algorithm will recognize the common typos and compress even that. So it means the entire process to lossless matters, lossy is just stopping halfway.

And if pure random noise remains, there is nothing you can do, but all algorithms are on an equal footing there. The key is to make what the algorithm considers incompressible noise as small as possible.


> AI is just compression, and compression is indistinguishable from AI

Almost. Compression and AI both revolve around information processing, but their core objectives diverge. Compression is focused on efficient representation, while AI is built for flexibility and the ability to navigate the unpredictable aspects of real-world data.

Compression learns a representation from the same data it encodes, like "testing on the training set". AI models have different training and test data. There are no surprises in compression.


Let's say AI is a not-so-smart JPEG that has more parts missing and does more guesswork when producing the restoration.

Compression is most of the time about finding the minimal grammar that unfolds into the same original material.

Interestingly, Fabrice Bellard found a way to use transformers for lossless compression that beats xz by a significant margin: https://bellard.org/nncp/nncp_v2.1.pdf. It uses the "deterministic mode of PyTorch" to make sure both directions behave alike, which I guess means it saves the random tosses made during compression for the decompression to reuse. Note: this paper is still on my to-read list.
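
A rough sketch of the underlying idea, with a static character-frequency model as a hypothetical stand-in for NNCP's transformer (not Bellard's actual code):

    import math
    from collections import Counter

    # Idealized view of model-based lossless compression: with an arithmetic
    # coder, each symbol costs about -log2 p(symbol) bits, where p comes from
    # the predictor. Better predictions mean fewer bits, and encoder and
    # decoder must compute identical probabilities -- hence determinism.
    text = "the cat sat on the mat and the cat sat again"
    counts = Counter(text)

    def p(ch):
        return counts[ch] / len(text)   # static model; a real coder adapts as it goes

    ideal_bits = sum(-math.log2(p(ch)) for ch in text)
    print(f"raw: {8 * len(text)} bits, model-coded: ~{ideal_bits:.0f} bits")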


A lot of current compression techniques use prediction followed by some set of correction data to fix mis-predictions. If the prediction is more accurate, you can have a smaller correction set.

But you're right that the predictor does need to be reproducible - its output must be exactly the same on the encoder and decoder sides. While I don't think this is a big focus for many right now, I don't see a fundamental reason why it couldn't be, though probably at the cost of some performance.
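
A toy sketch of prediction plus correction, with "previous frame" as the predictor and synthetic NumPy frames standing in for real video:

    import numpy as np

    # Prediction + correction: predict each frame as "same as the previous one"
    # and store only the residuals. A better (but reproducible) predictor would
    # shrink the residuals further.
    rng = np.random.default_rng(0)
    frames = np.cumsum(rng.integers(-2, 3, size=(10, 16, 16)), axis=0)

    predictions = np.concatenate([np.zeros_like(frames[:1]), frames[:-1]])
    residuals = frames - predictions      # this is what actually gets entropy-coded

    # Decoder: run the same predictor and add the corrections back.
    prev, decoded = np.zeros_like(frames[0]), []
    for r in residuals:
        prev = prev + r
        decoded.append(prev)
    assert np.array_equal(np.stack(decoded), frames)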


A representation's efficiency increases the more flexibility you give it.


Intelligence is compressing information into an irreducible representation.


Language Modeling Is Compression: https://arxiv.org/abs/2309.10668



How does that make sense? Compression is deterministic (for the same input, the same output is algorithmically guaranteed). AI is only deterministic in corner cases.


AI is always deterministic. We add noise to the models to get "non-deterministic" results, but if the noise and input are the same, the output is also the same.
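
A minimal sketch, assuming PyTorch on CPU (as the sibling comment notes, parallel floating-point kernels can still introduce tiny differences):

    import torch

    # The "noise" is just pseudo-randomness: fix the seed and the sampled
    # noise (and hence the output) repeats.
    torch.manual_seed(42)
    a = torch.randn(2, 3)

    torch.manual_seed(42)
    b = torch.randn(2, 3)

    assert torch.equal(a, b)   # identical noise, identical result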


It's a bit more nuanced than that. Floating point arithmetic is not associative: "(A+B)+C" is not always equal to "A+(B+C)". Because of that, certain mathematical operations used in neural networks, such as parallel reductions, will yield slightly different results if you run them multiple times with the same arguments.

There are some people working hard to provide the means to perform deterministic AI computations like these, but that will come with some performance losses, so I would guess that most AIs will continue to be (slightly) non-deterministic.
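
A two-line demonstration of the non-associativity in plain Python:

    # Floating point addition is not associative, so summing in a different
    # order (e.g. a different parallel reduction) can change the result.
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0 -- the 1.0 is lost when added to -1e16 first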


Does AI in general rely on floating-point calculations?


One hundred percent. It's mostly fancy floating point matrix multiplications.


That assumes an AI that is trained exactly once ever.


The compression competitions include the decompression program size in the size of the output. You'd have to compress a large series of movies to win, then.


If one model can "compress"/"decompress" all movies and series, its fraction of the size becomes negligible, but yes I agree with you since it still has to be distributed.


I can imagine that in under 5 years, the movie's script plus one example still photo for each scene could do the job.


If we get anywhere close to that, coming up with a new economics model is going to be the prompt we'll be giving the AGI when it's ready. We'll need it.


Ah, so you have long timelines then? :P


In what sense? How about reproducibility? Is stored memory, the connection between the prompt and the exact output, really compression, or simply retrieval of a compressed file stored as factual knowledge ingrained in its neural network?

I like your sentiment, it is technically inspiring.


Given the same prompt and the same seed (and algorithm) the resulting movie/output will always be the same. This is the case for AI image generation now.


How big is the SORA model itself?


I can show you an algorithm that compresses an entire 2 hour movie to a single bit, but it only works on one movie.


Ah I got it to work for two movies :)


You just made Disco Stu very happy. If these trends continue... 'Ey!


I think the most impressive movie compression technique is YouTube. It compresses movies into a short sequence of characters known as URLs.


You only need one copy of it - even if it is 100GB. If it's baked into every OS... and storage / RAM keeps getting cheaper... just might work.


Meh, AI doesn't break information theory. The relationship between prompt size and the "similarity" of the result will be such that it doesn't beat traditional compression techniques.

At best we might consider it a new type of lossy (or... replacey?) compression. Of course if storage / RAM / bandwidth keeps increasing, this is quite likely the least energy efficient technique available.


If the compressor can take into account the sub-manifold of potential outputs that people would actually be interested in watching, it can achieve enormously higher compression than if it doesn't know about this.


Unproven - but yes, like I say, a new type of compression.


Not new. If you have a preshared dictionary, even zstd can give you better compression.
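
A sketch of the preshared-dictionary idea using the python-zstandard bindings (the `zstandard` package is assumed to be installed; the shared text and payload are made up):

    import zstandard as zstd

    # Both sides already hold some shared reference data; new payloads are
    # compressed against it instead of standalone.
    shared = (b"GET /api/v1/users HTTP/1.1\r\nHost: example.com\r\n"
              b"User-Agent: client/1.0\r\nAccept: application/json\r\n") * 10
    dictionary = zstd.ZstdCompressionDict(shared, dict_type=zstd.DICT_TYPE_RAWCONTENT)

    payload = b"GET /api/v1/users/1234 HTTP/1.1\r\nHost: example.com\r\n"
    compressed = zstd.ZstdCompressor(dict_data=dictionary).compress(payload)
    restored = zstd.ZstdDecompressor(dict_data=dictionary).decompress(compressed)

    assert restored == payload
    print(len(payload), "->", len(compressed), "bytes")  # smaller than compressing standalone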


It depends on which resolution you want to see.


This isn't really any different from what they've done, is it?


“Alexa show me Star Wars but with Dustin Hoffman as Luke”.


I actually would really like this flexibility. "Star Wars, but in Korean with k-pop stars cast", etc.


Or as a VR game. "Star Wars, but with the Empire and Rebel Alliance teaming up to defeat the latest threat to the galaxy: me as Jar-Jar Binks, Jedi Jester. My abilities include Force Juggle, Failed Comedic Relief, and Turn Into Merchandise. Oh, and Darth Vader is Morgan Freeman, and everyone else is Natalie Portman."


I've invited friends over to watch Harry Potter and the Deathly Weapons (a.k.a. Harry Potter with Guns) in its entirety and it was entertaining.

https://harrypotterwithguns.com/

trailer: https://www.youtube.com/watch?v=xA-ayM5I4Jw


The most fun thought is combining this with that Apple Vision and further steps in that tech tree.

Gonna be able to skip holodeck episodes in real life soon


“I’m sorry Dave. I can’t do that. As an Amazon Large Language model, I need you to up your subscription to Amazon Prime first.”

“On the other hand, I can generate endless amounts of Harlan Coben miniseries… :-P”


"To watch Dustin Hoffman, you need to subscribe to the Classic Stars pack"


The old-style selective copyright infringement.


Ha, I've commented almost exactly this twice now on HN. We'll see how long before it's a reality -- probably better measured in months rather than years.


Ahhh, Sloot's digital coding system [1] is finally here ;).

[1] https://en.m.wikipedia.org/wiki/Sloot_Digital_Coding_System


In the [Sloot Digital Coding System], it is claimed that no movies are stored, only basic building blocks of movies, such as colours and sounds. So, when a number is presented to the SDCS, it uses the number to fetch colours and sounds, and constructs a movie out of them. Any movie. No two different movies can have the same number, otherwise they would be the same movie. Every possible movie gets its own unique number. Therefore, I should be able to generate any possible movie by loading some unique number in the SDCS.

Guy named Borges already patented that, I'm afraid.


You just need an index and length within Pi's digits, duh.


It sounds almost like someone explained content-addressed storage to him and he misunderstood (where you can uniquely identify a movie by a number, down to some hopefully negligible collision likelihood, but you're merely indexing known data).
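
Content addressing in a nutshell (the filename is a placeholder):

    import hashlib

    # The "unique number" is just a hash of the bytes. It lets you look up
    # data you already have; it cannot conjure up data you don't.
    movie_bytes = open("some_movie.mkv", "rb").read()   # hypothetical file
    address = hashlib.sha256(movie_bytes).hexdigest()
    print(address)   # collisions are astronomically unlikely, but the data itself still has to be stored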


How fast is this and how big is the decoder/encoder? The model weights are not accessible.

From the description, it looks like it's only being tested with 128x128 frames, which implies that the speed is very low.


Why would you expect those kinds of details in a paid commercial?


It's a link to a Github repo, not a "paid commercial".


> It can be observed that our model outperforms them at low bitrates

It can? Maybe I'm misunderstanding the graphs but it doesn't look like it to me?


Graphs (especially PSNR) aren't a good way to judge video compression. It's better to just watch the video.

Many older/commercial video codecs optimized for PSNR, which results in the output being blurry and textureless because that's the best way to minimize rate for the same PSNR.
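
For reference, PSNR is just log-scaled mean squared error; a minimal NumPy version (the test arrays are synthetic):

    import numpy as np

    # PSNR for 8-bit frames: 10 * log10(MAX^2 / MSE). Optimizing for it
    # rewards blur, since blur minimizes squared error per bit spent.
    def psnr(reference, distorted, max_val=255.0):
        diff = reference.astype(np.float64) - distorted.astype(np.float64)
        mse = np.mean(diff ** 2)
        return 10.0 * np.log10(max_val ** 2 / mse)   # undefined if mse == 0

    ref = np.random.default_rng(0).integers(0, 256, (64, 64), dtype=np.uint8)
    noisy = np.clip(ref.astype(int) + np.random.default_rng(1).integers(-5, 6, ref.shape), 0, 255)
    print(f"{psnr(ref, noisy):.1f} dB")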


> Many older/commercial video codecs optimized for PSNR, which results in the output being blurry and textureless because that's the best way to minimize rate for the same PSNR.

Even with that, showing H.265 having lower PSNR than H.264 is odd --- it's the former which has often looked blurrier to me.


at equal bitrate?


At equal bitrate H.265 is typically considered twice as efficient as H.264. The graphs look all wrong to me - they show "ours" at a lower PSNR compared to both H.264 and H.265.


Someone should train a model to evaluate video compression quality


Netflix did VMAF for this: https://github.com/Netflix/vmaf

It checks a reference video against an encoded video and returns a score representing how close the encoded video appears to the original from a human perspective.
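
A minimal way to run it, assuming an ffmpeg build with libvmaf enabled (filenames are placeholders):

    import subprocess

    # The first input is the distorted video, the second the reference;
    # the VMAF score ends up in ffmpeg's log output.
    subprocess.run(
        ["ffmpeg", "-i", "encoded.mp4", "-i", "reference.mp4",
         "-lavfi", "libvmaf", "-f", "null", "-"],
        check=True,
    )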


That said, if I understand correctly, SSIMULACRA 2.1 is generally considered a strictly better quality metric.


Citation needed.


https://github.com/cloudinary/ssimulacra2?tab=readme-ov-file... shows a higher correlation with human responses across 4 different datasets and correlation metrics, for one thing.

Also see https://jon-cld.s3.amazonaws.com/test/ahall_of_fshame_SSIMUL... which is an A/B comparison of a lot of images where it gives 2 versions: one preferred by SSIMULACRA, the other preferred by VMAF.


The authors of the metric finding some cases where it works better is not the same thing as it being widely considered better. When it comes to typical video compression and scaling artifacts, VMAF does really well. To prove something is better than VMAF on video compression, it should be compared on datasets like MCL-V, BVI-HD, CC-HD, CC-HDDO, SHVC, IVP, VQEGHD3 and so on (and of course Netflix Public).

TID2013 for example is an image dataset with many artifacts completely unrelated to compression and scaling.

- Additive Gaussian noise
- Additive noise in color components is more intensive than additive noise in the luminance component
- Spatially correlated noise
- Masked noise
- High frequency noise
- Impulse noise
- Quantization noise
- Gaussian blur
- Image denoising
- JPEG compression
- JPEG2000 compression
- JPEG transmission errors
- JPEG2000 transmission errors
- Non eccentricity pattern noise
- Local block-wise distortions of different intensity
- Mean shift (intensity shift)
- Contrast change
- Change of color saturation
- Multiplicative Gaussian noise
- Comfort noise
- Lossy compression of noisy images
- Image color quantization with dither
- Chromatic aberrations
- Sparse sampling and reconstruction

Doing better on TID2013 is not really an indication of doing better on a video compression and scaling dataset (or being more useful for making decisions for video compression and streaming).


That wouldn't work forever due to Goodhart's law.


Back in 2005 there was a colleague at my first job who wrote video format conversion software. He was considered a genius and the stereotype of an introverted software developer. He claimed that one day an entire movie could be compressed onto a single floppy disk. Everybody laughed and thought he was weird. He might be right after all.


Well, as a reality check, even the soundtrack of a 1 hr movie would be ~50x the size of a floppy (~50 MB vs ~1 MB) if MP3-compressed.
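
Back-of-the-envelope version of that check, assuming a 128 kbps MP3:

    # One hour of 128 kbps MP3 audio.
    bitrate_bits_per_s = 128_000
    seconds = 60 * 60
    size_mb = bitrate_bits_per_s * seconds / 8 / 1e6
    print(f"{size_mb:.0f} MB")   # ~58 MB, versus a 1.44 MB floppy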

I guess where this sort of generative video "compression" is headed is that the video would be the prompt, and you'd need a 100GB decoder (model) to render it.

No doubt one could fit a prompt to generate a movie similar to something specific in a floppy's worth of space ("dude gets stuck on Mars, grows potatoes in his own shit"). However, 1 MB is only enough to hold the words of a book, and one could imagine hundreds of movie adaptations (i.e. visualizations of the "prompt") of any given book that would all be radically different, so it seems a prompt of this size would only be enough to generate one of these "prompt movie adaptations".


I used to work with a guy like that in 1997, during the bubble, Higgins was his name. He'd claim you could fit every movie ever onto a CD-ROM, at least one day in the future it would be possible. Higgins was weird. I can still recall old Higgins getting out every morning and nailing a fresh load of tadpoles to that old board of his. Then he'd spin it round and round, like a wheel of fortune, and no matter where it stopped he'd yell out, "Tadpoles! Tadpoles is a winner!" We all thought he was crazy but then we had some growing up to do.


A deep thought.


Here's the research behind this: https://arxiv.org/html/2402.08934v1

As a casual non-scholar, non-AI person trying to parse this though, it's infuriatingly convoluted. I was expecting a table of "given source file X, we got file size Y with quality loss Z", but while quality (SSIM/LPIPS) is compared to standard codecs like H.264, for the life of me I can't find any measure of how efficient the compression is here.

Applying AI to image compression has been tried before though, with distinctly mediocre results: some may recall the Xerox debacle about 10 years ago, when it turned out copiers were helpfully "optimizing" images by replacing digits with others in invoices, architectural drawings, etc.

https://www.theverge.com/2013/8/6/4594482/xerox-copiers-rand...


> [S]ome may recall the Xerox debacle about 10 years ago, when it turned out copiers were helpfully "optimizing" images by replacing digits with others in invoices, architectural drawings, etc.

This is not even AI. JBIG2 allows reuse of once-decoded image patches, which is quite reasonable for bi-level images like fax documents. It is true that similar glyphs may be incorrectly grouped into the same patch, but such errors are not specific to patch-based compression methods (quantization can often lead to the same result). The actual culprit was Xerox's bad implementation of JBIG2, which merged too many glyphs into the same patch.


I believe they're using "bpp" (bits per pixel) to indicate compression efficiency, and in the section about quality they're holding it constant at 0.06 bpp. The charts a bit further down give quality metrics as a function of compression level (however, they seem to indicate that h.264 is outperforming h.265 in their tests which would be surprising to me).
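
To put 0.06 bpp in perspective, a quick calculation (1080p is chosen just for illustration; the paper itself tests 128x128 frames):

    # Bits per pixel to bytes per frame: width * height * bpp / 8.
    width, height, bpp = 1920, 1080, 0.06
    bytes_per_frame = width * height * bpp / 8
    print(f"{bytes_per_frame / 1024:.1f} KiB per frame")   # ~15.2 KiB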


It turns out that compression, especially for media platforms, is a trade-off between file size, quality, and compute. (And typically we care more about compute for decoding.) This is hard to represent in a two-dimensional chart.

Furthermore, it's pretty common in compression research to focus on the size/quality trade-off, and leave optimization of compute for real-world implementations.


It’s uncanny how much of the current stuff was predicted by the sitcom “Silicon Valley”.


Yeah curious to hear what the Weissman score of this latest algorithm is going to be.


that prize goes to zstandard, though.


It's important to remember that any compression gains must include the size of the decompressor which, I assume, will include an enormous diffusion model.


Can’t that be amortized across all videos (e.g. if YouTube had a decompressor they downloaded once)?


Yes, absolutely, it's just important to keep in mind when thinking of these decompressors as "magic". If every laptop shipped with a copy of Wikipedia, then you could compress Wikipedia, and any text that looks similar to Wikipedia, really well.



Can you share example videos?


Googling gave me the article: https://www.arxiv.org/abs/2402.08934

It has examples in it.


Direct link to HTML article: https://arxiv.org/html/2402.08934v1

Unfortunately it only contains still images with teeny thumbnails: https://arxiv.org/html/2402.08934v1/x2.png


Still images are a hard way to evaluate the quality of video compression. Because it's diffusion-based, will it have a bunch of diffusion jitter between frames?


> Extreme video compression with prediction using pre-trained diffusion models

Is this more extreme than YouTube?


I wonder how effective a speed-focused variant could be, quality-wise, compared with H.264, H.265, and AV1.


Middle-out.



