
How badly will its lossiness change critical things? In 2013, there were Xerox copiers whose aggressive compression changed numbers: https://www.theregister.com/2013/08/06/xerox_copier_flaw_mea...



If I zoom all the way with my iPhone, the camera-assisting intelligence will mess up numbers too


The Xerox copier incident mentioned above was not an OCR failure; the copier actively changed the numbers in the original image due to its image compression algorithm.


Here's some of the context: www.dkriesel.com/blog/2013/0810_xerox_investigating_latest_mangling_test_findings

Learn More: https://www.dkriesel.com/start?do=search&id=en%3Aperson&q=Xe...

Brief: Xerox machines used template matching to recycle the scanned images of individual digits that recur in a document (a rough sketch of the idea follows this summary). In 2013, Kriesel discovered this procedure was faulty.

Rationale: This method can create smaller PDFs, advantageous for customers that scan and archive numerical documents.

Prior art: https://link.springer.com/chapter/10.1007/3-540-19036-8_22

Tech Problem: Xerox's template matching procedure was not reliable, sometimes "papering over" a digit with the wrong digit!

PR Problem: Xerox press releases initially claimed this issue did not happen in the factory default mode. Kriesel demonstrated this was not true by replicating the issue in all of the factory-default compression modes, including the "normal" mode. He gave a 2015 FrOSCon talk, "Lies, damned lies and scans".
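
For intuition only, here is a hypothetical sketch (not Xerox's actual code) of the kind of symbol/template matching that JBIG2-style encoders do: if a new glyph's bitmap is "close enough" to an already-stored template, the encoder reuses the template instead of storing new pixels. Measure similarity badly, or set the threshold too loosely at low resolution, and a 6 gets papered over with an 8.

    import numpy as np

    def match_or_add(symbol, templates, max_mismatch=0.06):
        """Return the index of a stored template to reuse for `symbol`,
        adding a new template if nothing matches closely enough.

        symbol:       2D 0/1 numpy array (bitmap of one scanned glyph)
        templates:    list of 2D 0/1 numpy arrays of the same shape
        max_mismatch: fraction of differing pixels allowed for a "match"
                      (illustrative value, not Xerox's)
        """
        for i, tmpl in enumerate(templates):
            # Hamming distance between bitmaps, normalised by area.
            mismatch = np.mean(symbol != tmpl)
            if mismatch <= max_mismatch:
                return i          # reuse existing template: big space savings
        templates.append(symbol)  # no close match: store a new template
        return len(templates) - 1

    # The failure mode: at low resolution, two different digits can differ in
    # fewer pixels than max_mismatch allows, so the encoder silently reuses
    # the wrong template and the decoded page shows the wrong digit.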

Interesting work!


Any lossy compressor changes the original image to get better compression, at the expense of perfect accuracy.


Exactly. In practice the alternatives are either blocky artifacts (JPEG and most other traditional codecs), blurring everything (learned codecs optimised for MSE), or "hallucinating" patterns when using models like GANs. That said, even the generative side of compression models is evaluated against the original image rather than only output quality, so the outputs tend to be passable.

To see what a lossy generator hallucinating patterns means in practice, I recommend viewing HiFiC vs original here: https://hific.github.io/


Traditional lossy compressors have well-understood artifacts. In particular, they provide guarantees that let you confidently say an object in the image could not be an artifact.


The word "perfect" is misplaced; the trade-off is size vs. fidelity (a.k.a. accuracy).


Lossy compression has the same problem it has always had: lossy metadata.

The contextual information surrounding intentional data loss needs to be preserved. Without that context, we become ignorant of the missing data. Worst case, you get replaced numbers. Average case, you get lossy->lossy transcodes, which is why we end up with degraded content.

There are only two places to put that contextual information: metadata and watermarks. Metadata can be written to a file, but there is no guarantee it will be copied with that data. Watermarks fundamentally degrade the content once, and may not be preserved in derivative works.
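
As an illustration of the metadata route (a hypothetical sidecar convention, not any standard): record the codec, quality setting and generation count next to the file, so later tools can at least detect that a lossy->lossy transcode is about to happen. Nothing forces anyone to copy the sidecar along with the image, which is exactly the weakness above.

    import json
    from pathlib import Path

    def write_loss_sidecar(image_path, codec, quality, parent_sidecar=None):
        """Write a JSON sidecar recording how the image was (re)compressed.

        Hypothetical convention: <image>.loss.json next to the image file.
        """
        generation = 1
        if parent_sidecar is not None:
            parent = json.loads(Path(parent_sidecar).read_text())
            generation = parent.get("generation", 0) + 1

        sidecar = {
            "codec": codec,            # e.g. "jpeg", "jbig2", "learned-gan"
            "quality": quality,        # codec-specific quality/ratio setting
            "generation": generation,  # how many lossy steps this image has seen
        }
        out = Path(str(image_path) + ".loss.json")
        out.write_text(json.dumps(sidecar, indent=2))
        return out

    # e.g. write_loss_sidecar("scan.jpg", codec="jpeg", quality=75)
    # A later transcode can read the sidecar and warn once generation >= 2.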

I wish that the generative model explosion would result in a better culture of metadata preservation. Unfortunately, it looks like the focus is on watermarks instead.


The acceptable lossiness (of any compression method) is entirely dependent on context. There is no one-size-fits-all approach for all use cases.

One key issue with emerging 'AI compression' techniques is that the information loss is not deterministic, which somewhat complicates assessing suitability.


> the information loss is not deterministic

It is technically possible to make it deterministic.

The main reason you don't get deterministic outputs today is that CUDA/GPU optimizations make the calculations run much faster if you let them be non-deterministic.

The internal GPU scheduler will then process things in the order it thinks is fastest.

Since floating point is not associative, you can get different results for (a + (b + c)) and ((a + b) + c).
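
A tiny, self-contained demonstration of that point (plain Python on the CPU; a GPU reduction just reorders sums like this at a much larger scale):

    # Floating-point addition is not associative: grouping changes the result.
    a, b, c = 1e16, -1e16, 1.0

    left  = (a + b) + c   # -> 1.0
    right = a + (b + c)   # -> 0.0, because b + c rounds back to -1e16

    print(left, right, left == right)

    # A GPU reduction that sums thousands of such terms in whichever order the
    # scheduler picks can therefore produce slightly different outputs per run.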


The challenge goes beyond rounding errors.

Many core codecs are pretty good at adhering to their reference implementations, but they are still open to similar issues and so may not be bit-exact.

With a DCT or wavelet transform, quantisation, chroma subsampling, entropy coding, motion prediction, and the suite of other techniques that go into modern media squishing, it's possible to mostly reason about what type of error will come out the other end of the system for a yet-to-be-seen input.
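
For instance, with a bare DCT-plus-quantisation stage (a simplified sketch, not any particular codec), the per-coefficient error is bounded by half a quantisation step no matter what the input looks like:

    import numpy as np
    from scipy.fft import dctn, idctn

    rng = np.random.default_rng(0)
    block = rng.random((8, 8)) * 255          # stand-in for an 8x8 image block
    q_step = 16.0                             # illustrative quantisation step

    coeffs = dctn(block, norm="ortho")
    quantised = np.round(coeffs / q_step) * q_step
    recon = idctn(quantised, norm="ortho")

    # Each coefficient error is at most q_step / 2, regardless of the input,
    # so the worst-case distortion is known before any image is seen.
    assert np.max(np.abs(coeffs - quantised)) <= q_step / 2 + 1e-9
    print("max pixel error:", np.max(np.abs(block - recon)))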

When that system is replaced by a non-linear box of mystery, this ability is lost.


That was interesting (info in your link)


This JBIG2 "myth" is too widespread. It is true that Xerox's algorithm mangled some numbers in its JBIG2 output, but it is not an inherent flaw of JBIG2 to start with, and Xerox's encoder misbehaved almost exclusively at lower dpis (300 dpi or more was barely affected). Other artifacts at lower resolution can exhibit similar mangling as well (specifics would of course vary), and this or a similar incident hasn't been repeated since. So I don't feel it is even a worthy concern at this point.


1. No one, at least not OP, ever said it's an inherent flaw of JBIG2. The fact that it's an implementation error on Xerox's end is a good technical detail to know, but it is irrelevant to the topic.

2. "Lower DPI" is extremely common if your definition for that is 300dpi. At my company, all the text document are scanned at 200dpi by default. And 150dpi or even lower is perfectly readable if you don't use ridiculous compression ratios.

> Other artifacts at lower resolution can exhibit similar mangling as well (specifics would of course vary)

The majority of traditional compression methods would make text unreadable when the compression is too high or the source material is too low-resolution. They don't substitute one number for another in an "unambiguous" way (i.e. the output clearly shows a wrong number instead of just a blurry blob that could be either).

The "specifics" here is exactly what the whole topic is focus on, so you can't really gloss over it.


> 1. No one, at least not OP, ever said it's an inherent flaw of JBIG2. The fact that it's an implementation error on Xerox's end is a good technical detail to know, but it is irrelevant to the topic.

It is relevant only if you assume that lossy compression has no way to control or even know of such critical changes. In reality, most lossy compression algorithms use rate-distortion optimization, which is only possible when you have some idea about "distortion" in the first place. Given that the error rarely occurred at higher dpis, its cause should have been either a miscalculation of distortion or a misconfiguration of distortion thresholds for patching.
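
As a minimal illustration of what rate-distortion optimization means here (toy numbers, not any specific codec's cost function): among candidate encodings of a block, the encoder picks the one minimizing J = D + lambda * R, which presupposes that it actually measures the distortion D of every candidate, template substitutions included.

    def rd_choose(candidates, lam):
        """Pick the candidate with the lowest rate-distortion cost.

        candidates: list of (name, distortion, rate_bits) tuples, where
                    distortion measures error against the original block.
        lam:        Lagrange multiplier trading bits against distortion.
        """
        return min(candidates, key=lambda c: c[1] + lam * c[2])

    # Hypothetical numbers for one scanned digit block:
    candidates = [
        ("store pixels verbatim",   0.0, 512),   # perfect, expensive
        ("reuse matching template", 3.0,   8),   # cheap, small distortion
        ("reuse wrong template",   40.0,   8),   # cheap, but visibly wrong
    ]
    print(rd_choose(candidates, lam=0.05))
    # With a sane distortion measure, the wrong template loses; the Xerox bug
    # amounts to mis-measuring (or mis-thresholding) that distortion.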

In any case, a correct implementation should be able to do the right thing. It would have been much more problematic if similar cases had been repeated, since that would mean it is much harder to write a correct implementation than expected, but that didn't happen.

> The majority of traditional compression methods would make text unreadable when the compression is too high or the source material is too low-resolution. They don't substitute one number for another in an "unambiguous" way (i.e. the output clearly shows a wrong number instead of just a blurry blob that could be either).

Traditional compression methods simply didn't have much computational power to do so. The "blurry blob" is, by definition, something with only lower-frequency components, and there are only a small number of them, so they were easier to preserve even with limited resources. But if you can recognize a similar enough pattern, it should be exploited for further compression. Motion compensation in video codecs was already doing a similar thing, and either filtering or intelligent quantization that preserves higher-frequency components would be able to do so too.

----

> 2. "Lower DPI" is extremely common if your definition for that is 300dpi. At my company, all the text document are scanned at 200dpi by default. And 150dpi or even lower is perfectly readable if you don't use ridiculous compression ratios.

I admit I have generalized too much, but the choice of scan resolution is highly specific to the content, font sizes, and even writing systems. If you and your company can cope with lower DPIs, that's good for you, but I believe 300 dpi is indeed the safe minimum.



