Compressing Images with Neural Networks (mlumiste.com)
171 points by skandium 8 months ago | 69 comments



How badly will its lossiness change critical things? In 2013, there were Xerox copiers with aggressive compression that changed numbers: https://www.theregister.com/2013/08/06/xerox_copier_flaw_mea...


If I zoom all the way with my iPhone, the camera-assisting intelligence will mess up numbers too


The Xerox copier incident mentioned was not an OCR failure; the copier actively changed numbers in the original image because of its image compression algorithm.


Here's some of the context: www.dkriesel.com/blog/2013/0810_xerox_investigating_latest_mangling_test_findings

Learn More: https://www.dkriesel.com/start?do=search&id=en%3Aperson&q=Xe...

Brief: Xerox machines used template matching to recycle the scanned images of individual digits that recur in the document. In 2013, Kriesel discovered this procedure was faulty.

Rationale: This method can create smaller PDFs, advantageous for customers that scan and archive numerical documents.

Prior art: https://link.springer.com/chapter/10.1007/3-540-19036-8_22

Tech Problem: Xerox's template matching procedure was not reliable, sometimes "papering over" a digit with the wrong digit!

PR Problem: Xerox press releases initially claimed this issue did not happen in the factory default mode. Kriesel demonstrated this was not true, by replicating the issue in all of the factory default compression modes including the "normal" mode. He gave a 2015 FrOSCon talk, "Lies, damned lies and scans".
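
For concreteness, here is a minimal sketch (Python with NumPy; the threshold and helper are hypothetical, not Xerox's actual code) of how symbol matching with a too-loose similarity test can paper over one digit with another:

    import numpy as np

    def match_or_store(patch, templates, threshold=0.15):
        # Compare a binarised glyph patch against previously stored templates.
        # If one is "similar enough", reuse its bitmap instead of the new patch.
        # A too-loose threshold is what lets a scanned '6' get papered over
        # with a previously stored '8'.
        for template in templates:
            diff = np.mean(patch != template)   # fraction of differing pixels
            if diff < threshold:
                return template                 # reuse old glyph; may be the wrong digit
        templates.append(patch)
        return patch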

Interesting work!


Any lossy compressor changes the original image for better compression at the expense of perfect accuracy.


Exactly. In practice the alternatives are blocky artifacts (JPEG and most other traditional codecs), blurring everything (learned codecs optimised for MSE), or "hallucinated" patterns (models like GANs). However, even the generative side of compression models is evaluated against the original image rather than on output quality alone, so the outputs tend to be passable.

To see what a lossy generator hallucinating patterns means in practice, I recommend viewing HiFiC vs original here: https://hific.github.io/


Traditional lossy compressors have well-understood artifacts. In particular, they provide guarantees such that you can confidently say that an object in the image could not be an artifact.


The word "perfect" is misplaced; the trade-off is size vs. fidelity (a.k.a. accuracy).


Lossy compression has the same problem it has always had: lossy metadata.

The contextual information surrounding intentional data loss needs to be preserved. Without that context, we become ignorant of the missing data. Worst case, you get replaced numbers. Average case, you get lossy->lossy transcodes, which is why we end up with degraded content.
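
You can see lossy-to-lossy generation loss for yourself with a minimal sketch like this (Python with Pillow; the file names and quality setting are arbitrary placeholders):

    from io import BytesIO
    from PIL import Image

    img = Image.open("photo.png").convert("RGB")   # placeholder input file
    for _ in range(20):
        buf = BytesIO()
        img.save(buf, format="JPEG", quality=75)   # lossy encode
        buf.seek(0)
        img = Image.open(buf).convert("RGB")       # decode and feed the result back in
    img.save("after_20_generations.jpg")           # artifacts accumulate with each pass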

There are only two places to put that contextual information: metadata and watermarks. Metadata can be written to a file, but there is no guarantee it will be copied with that data. Watermarks fundamentally degrade the content once, and may not be preserved in derivative works.

I wish that the generative model explosion would result in a better culture of metadata preservation. Unfortunately, it looks like the focus is on watermarks instead.


The suitable lossiness (of any compression method) is entirely dependent on context. There is no one-size-fits-all approach for all use cases.

One key issue with emerging "AI compression" techniques is that the information loss is not deterministic, which somewhat complicates assessing suitability.


> the information loss is not deterministic

It is technically possible to make it deterministic.

The main reason you don't get deterministic outputs today is that CUDA/GPU optimizations make the calculations run much faster if you let them be non-deterministic.

The internal GPU scheduler will then process things in the order it thinks is fastest.

Since floating point is not associative, you can get different results for (a + (b + c)) and ((a + b) + c).
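
A minimal illustration (Python):

    a, b, c = 0.1, 0.2, 0.3
    print((a + b) + c)   # 0.6000000000000001
    print(a + (b + c))   # 0.6
    # Summed across thousands of GPU threads, the grouping (and hence the rounding)
    # depends on scheduling, so results can differ from run to run.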


The challenge goes beyond rounding errors.

Many core codecs are pretty good at adhering to reference implementations, but are still open to similar issues so may not be bit exact.

With a DCT or wavelet transform, quantisation, chroma subsampling, entropy coding, motion prediction and the suite of other techniques that go into modern media squishing, it's possible to mostly reason about what type of error will come out the other end of the system for a yet-to-be-seen input.

When that system is replaced by a non-linear box of mystery, this ability is lost.
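
As a rough feel for why the traditional pipeline is easy to reason about, here is a minimal JPEG-style sketch (Python with NumPy/SciPy; an illustrative quantisation step only, no zigzag scan or entropy coding):

    import numpy as np
    from scipy.fft import dctn, idctn

    def quantise_block(block, q=20.0):
        # Forward 2-D DCT, uniform quantisation, inverse DCT. The error per
        # coefficient is bounded by q/2, which is what makes the output error
        # easy to reason about analytically.
        coeffs = dctn(block, norm="ortho")
        coeffs = np.round(coeffs / q) * q          # the only lossy step
        return idctn(coeffs, norm="ortho")

    block = np.random.rand(8, 8) * 255
    print(np.abs(quantise_block(block) - block).max())   # small, roughly proportional to q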


That was interesting (info in your link)


This JBIG2 "myth" is too widespread. It is true that Xerox's algorithm mangled some numbers in its JBIG2 output, but that is not an inherent flaw of JBIG2 to begin with, and Xerox's encoder misbehaved almost exclusively at lower DPIs; 300 dpi or more was barely affected. Other artifacts at lower resolutions can exhibit similar mangling as well (the specifics would of course vary), and this or a similar incident hasn't been repeated so far. So I don't feel it is even a worthy concern at this point.


1. No one, at least not OP, ever said it's an inherent flaw of JBIG2. The fact that it's an implementation error on Xerox's end is a good technical detail to know, but it is irrelevant to the topic.

2. "Lower DPI" is extremely common if your cutoff for it is 300 dpi. At my company, all text documents are scanned at 200 dpi by default. And 150 dpi or even lower is perfectly readable if you don't use ridiculous compression ratios.

> Other artifacts at lower resolution can exhibit similar mangling as well (specifics would of course vary)

The majority of traditional compression methods make text unreadable when the compression is too high or the source material is too low-resolution. They don't substitute one number for another in an "unambiguous" way (i.e. clearly showing a wrong number instead of a blurry blob that could be either).

The "specifics" here are exactly what the whole topic is focused on, so you can't really gloss over them.


> 1. No one, at least not OP, ever said it's an inherent flaw of JBIG2. The fact that it's an implementation error on Xerox's end is a good technical detail to know, but it is irrelevant to the topic.

It is relevant only if you assume that lossy compression has no way to control or even know about such critical changes. In reality most lossy compression algorithms use rate-distortion optimization, which is only possible when you have some idea about "distortion" in the first place. Given that the error rarely occurred at higher DPIs, its cause was most likely either a miscalculation of distortion or a misconfiguration of the distortion thresholds for patching.

In any case, a correct implementation should be able to do the correct thing. It would have been much more problematic if similar cases had been repeated, since that would mean it is much harder to write a correct implementation than expected, but that didn't happen.
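
In code terms, the rate-distortion loop is roughly the following (a toy sketch, not any specific codec; lambda and the rate estimate are arbitrary):

    import numpy as np

    def rd_cost(block, q, lam):
        # J = D + lambda * R for a toy "codec": uniform quantization of the block.
        # Rate is crudely estimated as the number of non-zero quantized values.
        quantized = np.round(block / q)
        reconstruction = quantized * q
        distortion = np.mean((reconstruction - block) ** 2)
        rate = np.count_nonzero(quantized)
        return distortion + lam * rate

    block = np.random.rand(8, 8) * 255
    best_q = min([4, 8, 16, 32], key=lambda q: rd_cost(block, q, lam=2.0))
    # A real encoder makes this kind of choice per block or mode, so a substitution
    # with high distortion (a wrong digit) should lose out; the failure mode is D
    # being mis-measured or its threshold mis-set, as apparently happened here.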

> The majority of traditional compression methods make text unreadable when the compression is too high or the source material is too low-resolution. They don't substitute one number for another in an "unambiguous" way (i.e. clearly showing a wrong number instead of a blurry blob that could be either).

Traditional compression methods simply didn't have the computational power to do so. The "blurry blob" by definition contains only lower-frequency components, and there are only a small number of them, so they were easier to preserve even with limited resources. But if you can recognize a similar enough pattern, it should be exploited for further compression. Motion compensation in video codecs was already doing something similar, and either filtering or intelligent quantization that preserves higher-frequency components would be able to do so too.

----

> 2. "Lower DPI" is extremely common if your cutoff for it is 300 dpi. At my company, all text documents are scanned at 200 dpi by default. And 150 dpi or even lower is perfectly readable if you don't use ridiculous compression ratios.

I admit I have generalized too much, but the choice of scan resolution is highly specific to contents, font sizes and even writing systems. If you and your company can cope with lower DPIs, that's good for you, but I believe 300 dpi is indeed the safe minimum.


There was an earlier article (Sep 20, 2022) about using the Stable Diffusion VAE to perform image compression. It uses the VAE to go from pixel space to latent space, dithers the latents down to 256 colors, and then de-noises that when it's time to decompress.

https://pub.towardsai.net/stable-diffusion-based-image-compr...

HN discussion: https://news.ycombinator.com/item?id=32907494


I've done a bunch of experiments on my own on the Stable Diffusion VAE.

Even when going down to 4-6 bits per latent space pixel the results are surprisingly good.

It's also interesting what happens if you ablate individual channels; ablating channel 0 results in faithful color but shitty edges, ablating channel 2 results in shitty color but good edges, etc.

The one thing it fails catastrophically on though is small text in images. The Stable Diffusion VAE is not designed to represent text faithfully. (It's possible to train a VAE that does slightly better at this, though.)
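
For anyone who wants to reproduce this kind of experiment, a rough sketch with the diffusers library (assuming the standard AutoencoderKL checkpoint; the uniform quantization and the [-4, 4] clamp range are my own simplifications):

    import torch
    from diffusers import AutoencoderKL

    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

    @torch.no_grad()
    def roundtrip(pixels, bits=6):
        # pixels: (1, 3, H, W) tensor scaled to [-1, 1]
        z = vae.encode(pixels).latent_dist.mean      # 4-channel latent at H/8 x W/8
        z = z.clamp(-4, 4)                           # assumes latents mostly lie in [-4, 4]
        levels = 2 ** bits - 1
        zq = torch.round((z + 4) / 8 * levels) / levels * 8 - 4
        # zq[:, 0] = 0                               # ablate a latent channel to see its role
        return vae.decode(zq).sample

    x = torch.rand(1, 3, 512, 512) * 2 - 1           # stand-in for a real image in [-1, 1]
    out = roundtrip(x)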


How does the type of image (anime vs. photorealistic vs. painting, etc.) affect the compression results? Is there a noticeable difference?


I haven't noticed much difference between these. They're all well-represented in the VAE training set.


Something similar by Fabrice Bellard:

https://bellard.org/nncp/


If you look at the winners of the Hutter Prize, or especially the Large Text Compression Benchmark, almost every entry uses some kind of machine learning for the adaptive probability model and then either arithmetic coding or rANS to losslessly encode the data.

This is intuitive, as the competition organisers say: compression is prediction.
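
The link is direct: an ideal arithmetic coder spends -log2(p) bits on a symbol the model assigned probability p, so a better predictor means a smaller file. A toy sketch (Python; the order-0 model and 256-symbol alphabet are just illustrative):

    import math
    from collections import Counter

    def code_length_bits(text, predict):
        # Bits an ideal arithmetic coder would spend, given predict(context, ch)
        # = P(next char is ch | context). Better prediction means fewer bits.
        return sum(-math.log2(predict(text[:i], ch)) for i, ch in enumerate(text))

    def order0(context, ch):
        # Toy adaptive order-0 model: character counts so far, add-one smoothing.
        counts = Counter(context)
        return (counts[ch] + 1) / (len(context) + 256)

    print(code_length_bits("aaaaaaaaab" * 10, order0))   # well under 8 bits per character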


Some people are fans of Metallica or Taylor Swift. I think Fabrice Bellard should get the same attention!


And the same money for performance, of course


A first NN-based image compression standard is currently being developed by JPEG. More information can be found here: https://jpeg.org/jpegai/documentation.html

The best overview is probably the "JPEG AI Overview Slides".


Anyone know of open models useful (and good quality) for going the other way? I.e., input is an 800x600 JPG and output is a 4K version.


Magnific.ai (https://magnific.ai) is a paid tool that works well, but it is expensive.

However, this weekend someone released an open-source version which has a similar output. (https://replicate.com/philipp1337x/clarity-upscaler)

I'd recommend trying it. It takes a few tries to get the correct input parameters, and I've noticed anything approaching 4× scale tends to add unwanted hallucinations.

For example, I had a picture of a bear I made with Midjourney. At a scale of 2×, it looked great. At a scale of 4×, it adds bear faces into the fur. It also tends to turn human faces into completely different people if they start too small.

When it works, though, it really works. The detail it adds can be incredibly realistic.

Example bear images:

1. The original from Midjourney: https://i.imgur.com/HNlofCw.jpeg

2. Upscaled 2×: https://i.imgur.com/wvcG6j3.jpeg

3. Upscaled 4×: https://i.imgur.com/Et9Gfgj.jpeg

----------

The same person also released a lower-level version with more parameters to tinker with. (https://replicate.com/philipp1337x/multidiffusion-upscaler)


That magnific.ai thingy takes a lot of liberties with the images, denaturing them.

Their example with the cake is the most obvious. To me, the original image shows a delicious cake, and the modified one shows a cake that I would rather not eat...


Every single one of their before & after photos looks worse in the after.

The cartoons & illustrations lose all of their gradations in feeling & tone, with every outline a harsh edge. The landscapes lose any sense of lushness and atmosphere, instead taking on a high-clarity HDR look. Faces have blemishes inserted that the original actor never had. Fruit is replaced with a wax imitation.

As an artist, I would never run any of my art through anything like this.


Here's a free and open-source alternative that works pretty well:

https://www.upscayl.org/


Both of these Replicate links 404 for me.



Look for super-resolution. These models typically come as a GAN, a normalizing flow (or score model, NODE), or more recently diffusion (or SNODE), or some combination. The one you want will depend on your computational resources, how lossy you are willing to be, and your image domain (if you're unwilling to tune). Real time (>60 fps) is typically going to be a GAN or a flow.

Make sure to test the models before you deploy. Nothing will be lossless when doing super-resolution, but flows can get you lossless compression.



I haven't explored the current SOTA recently, but super-resolution has been pretty good for a lot of tasks for a few years at least. Probably just start with Hugging Face [0] and try a few out, especially diffusion-based models.

[0] https://huggingface.co/docs/diffusers/api/pipelines/stable_d...


The current open-source SOTA is, I believe, SUPIR (example: https://replicate.com/p/okgiybdbnlcpu23suvqq6lufze), but it needs a lot of VRAM. You can run it through Replicate, or here's the repo: https://github.com/Fanghua-Yu/SUPIR


You’re looking for what’s called upscaling, like with Stable Diffusion: https://huggingface.co/stabilityai/stable-diffusion-x4-upsca...
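
Usage is roughly the following (a sketch with the diffusers pipeline; exact arguments vary by version, the paths are placeholders, and large inputs like 800x600 need a lot of VRAM or tiling):

    import torch
    from diffusers import StableDiffusionUpscalePipeline
    from PIL import Image

    pipe = StableDiffusionUpscalePipeline.from_pretrained(
        "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
    ).to("cuda")

    low_res = Image.open("input_800x600.jpg").convert("RGB")    # placeholder path
    upscaled = pipe(prompt="a photo", image=low_res).images[0]  # 4x larger output
    upscaled.save("output_4x.png")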


There are a bunch of great upscaler models, although they tend to hallucinate a bit. I personally use magic-image-refiner:

https://replicate.com/collections/super-resolution


This is called super resolution (SR). 2x SR is pretty safe and easy (so every pixel in becomes 2x2 out, in your example 800x600->1600x1200). Higher scalings are a lot harder and prone to hallucination, weird texturing, etc.


thank you! will enjoy reviewing each of these


All learning is compression


It is not going to take off unless it is significantly better and has browser support. WebP took off thanks to Chrome, while JPEG 2000 floundered. Failing native browser support, maybe the codec could be shipped via WASM or something?

The interesting diagram to me is the last one, for computational cost, which shows the 10x penalty of the ML-based codecs.


The thing about ML models is that the penalty is a function of parameters and precision. It sounds like the researchers cranked them to the max to try to get the very best compression. Maybe later they will take that same model, flatten layers and quantize the weights to get it running 100x faster, and see how well it still compresses. I feel like neural networks have a lot of potential in compression. Their whole job is finding patterns.
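
The kind of post-hoc shrinking described above would look something like this in PyTorch (dynamic int8 quantization of linear layers; a generic sketch with a stand-in model, not what the researchers did):

    import torch
    import torch.nn as nn

    # Stand-in for a trained float32 decoder network.
    model = nn.Sequential(nn.Linear(192, 512), nn.ReLU(), nn.Linear(512, 768))

    quantized = torch.ao.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )
    # Weights are stored as int8 and dequantized on the fly: a smaller, faster model
    # on CPU, at some cost in rate-distortion performance.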


Did JPEG 2000 really flounder? If your conception of it is as a consumer-facing product and a direct replacement for JPEG, then I could see it being unsuccessful in that respect. However, JPEG 2000 has found its place on the professional side of things.


Yes, I do mean broad- rather than niche adoption. I myself used J2K to archive film scans.

One problem is that without broad adoption, support even in niche cases is precarious; the ecosystem is smaller. That makes the codec not safe for archiving, only for distribution.

The strongest use case I see for this is streaming video, where the demand for compression is highest.


> That makes the codec not safe for archiving, only for distribution.

Could you explain what you mean by "not safe for archiving"? The standard is published and there are multiple implementations, some of which are open-source. There is no danger of it being a proprietary format with no publicly available specification.


Not the GP, but for archiving, you want to know that you'll be able to decode the files well into the future. If you adopt a format that's not widely accepted and the code base gets dropped and unmaintained, so that in the future it can no longer run on modern gear, your archive is worthless.

As a counter, J2K is well established in the professional market even if your mom doesn't know anything about it. It has been standardized by the ISO, so it's not something that will be forgotten about. It's a good tool for the right job. It's also true that not all jobs will be the right ones for that tool.


I was not thinking of J2K as being problematic for archiving, but of these new neural codecs. My point is that performance is only one of the criteria used to evaluate a codec.


Royalty costs are often the other.


For archiving, I'd recommend having a WASM decompressor along with some reference output. You could also ship an image viewer as an HTML file with all the code embedded.


Why the need for all things to be browser based? Why introduce the performance hit for something that brings no compelling justification? What problem is this solution solving? Why can't things just be native workflows and not be shoveled into a browser?


Not the parent, but one imagines that WASM could be a good target for decompressing or otherwise decoding less-adopted formats/protocols, because WASM is fairly broadly adopted and seems to be at least holding steady, if not growing, as an executable format: it seems unlikely that WASM will disappear in the foreseeable future.

Truly standard ANSI C, along with a number of other implementation strategies (LLVM IR seems unlikely to be going anywhere), seems just as durable as WASM if not more so, but there are applications where you might not want to need a C toolchain, and WASM can be a fit there.

One example: IIUC, some of the blockchain folks use WASM to do simultaneous rollouts of iterations of consensus logic in distributed systems, since everyone has to upgrade at the same time to stay part of the network.


Wasm is simple, well-defined, small enough that one person can implement the whole thing in a few weeks, and (unlike the JVM) is usable without its standard library (WASI).

LLVM isn't as simple: there's not really such a thing as target-independent LLVM IR, there are lots of very specific keywords with subtle behavioural effects on the code, and it's hard to read. I think LLVM is the only full implementation of LLVM. (PNaCl was a partial reimplementation, but it's dead now.)

ANSI C is a very complicated language and very hard to implement correctly. Once Linux switches to another language or we stop using Linux, C will go the way of Fortran.

Part of archiving information has always been format shifting. Never think you can store information, forget about it for a thousand years (or even five), and have it available later.


I think we probably agree about most things but I’ll nitpick here and there.

ANSI C is among the simpler languages to have serious adoption. It’s a bit tricky to use correctly because much of its simplicity derives from leaving a lot of the complexity burden on the author or maintainer, but the language specification is small enough in bytes to fit on a 3.5” floppy disk, and I think there are conforming implementations smaller than that!

You seem to be alluding to C getting replaced by Rust, as that's the only other language with so much as a device driver to its name in the Linux kernel. Linus is on the record recently saying that it will be decades before Rust has a serious share of the core: not being an active kernel contributor, I'm inclined to trust his forecast more than anyone else's.

But Rust started at a complexity level comparable to where the C/C++ ecosystem ended up after 40 years of maintaining substantial backwards compatibility, and shows no signs of getting simpler. The few bright spots (like syntax for the Either monad) seem to be getting less rather than more popular, the bad habits it learned from C++ (forcing too much into the trait system and the macro mechanism) seem to have all the same appeal that template madness does to C++ hackers who don’t know e.g. Haskell well. And in spite of the fact that like 80% of my user land is written in Rust, I’m unaware of even a single project that folks can’t live without that’s married to Rust.

Rust is very cool, does some things very well, and it wouldn't be hard to do a version of it that is less net-negatively opinionated about memory management, but speaking for myself I'm still watching Nim and V and Zig and Jai and a bunch of other things, because Rust takes after its C++ heritage more than its Haskell heritage, and it's not entrenched enough in real industry to justify its swagger in places like HN.

The game is still on for what comes after C: Rust is in the lead, but it’s not the successor C deserves.


Yeah, Rust may be a suitable C++ replacement, but it's not a suitable C replacement. (Arguably, C isn't a suitable C replacement…)


You've made the point much, much better than I did. Well said.


Huh, one more point for considering J2K for film scan archiving.


It's well past the considering stage. J2K is used more than people think, even if we're not using it to spread cat memes across the interwebs. J2K is used in DCPs sent to movie theaters for digital projection. J2K is used for lossless masters of films. The Library of Congress uses it as well. This isn't even attempting an exhaustive list of uses, but it's not just something being looked into; it's being used every day.


Well, I meant for me personally. Currently using TIFF. :-)


But that's like saying it's difficult to drive your Formula 1 car to work every day. It's not meant for that, so it's not the car's fault. It's a niche thing built to satisfy the requirements of a niche need. I would suggest this is a "you're holding it wrong" type of situation that isn't laughable.


There was absolutely an initiative to make J2K a widespread standard


There was absolutely an initiative to make Esperanto a widespread language. But neither point has anything to do with how things actually are


I think it is an interesting discussion and learning experience (no pun intended). I think this is more of a stop on a research project than a proposal; I could be wrong.


Better or cheaper, e.g. AV1?


How much VRAM is needed? And computing power? To open a webpage you'll soon need 24 GB and two seconds of 1000 W of energy to decompress the images. Bandwidth is reduced from 2 MB to only 20 kB.


> Bandwidth is reduced from 2 MB to only 20 kB.

Plus the entire model, which comes with incorrect cache headers and must be redownloaded all the time.


How do we know we don't get hands with 16 fingers?


Valid point. Conventional codecs draw things on screen that are not in the original too, but we are used to low-quality images and videos and have learned to unconsciously ignore the block edges and smudges. NN models "recover" much more complex and plausible-looking features. It is possible that some future general-purpose image compressor would do the same thing to small numbers that lossy JBIG2 did.


How do we know whether it's an image with 16 fingers, or whether it just looks like 16 fingers to us?

I looked at the bear example above, and I could see it either way: the AI thought there was an animal face embedded in the fur, or we just see a face in the fur. We see all kinds of faces on toast even though neither the bread slicers nor the toasters intend to create them.



