The dangers behind image resizing (2021) (zuru.tech)
306 points by qwertyforce on Feb 16, 2023 | 97 comments



Problems with image resizing are a much deeper rabbit hole than this article covers. Some important talking points:

1. The form of interpolation (this article).

2. The colorspace used for doing the arithmetic for interpolation. You most likely want a linear colorspace here.

3. Clipping. Resizing is typically done in two passes, one in the x direction and one in the y direction (in either order). If the kernel used has values outside of the range [0, 1] (like Lanczos) and you only keep the range [0, 1] for the intermediate result, then you might get clipping in the intermediate image, which can cause artifacts.

4. Quantization and dithering.

5. If you have an alpha channel, use pre-multiplied alpha for the interpolation arithmetic (sketched below).

I'm not trying to be exhaustive here. ImageWorsener's page has a nice reading list[1].

[1] https://entropymine.com/imageworsener/
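To make point 5 concrete, here is a minimal sketch with NumPy and Pillow (the filename, target size and Lanczos filter are placeholder assumptions): the RGB channels are multiplied by alpha before the resize, so fully transparent pixels (whose RGB values are arbitrary) don't bleed into their neighbours, and are divided back out afterwards.

  import numpy as np
  from PIL import Image

  img = np.asarray(Image.open("input.png").convert("RGBA")).astype(np.float32) / 255.0
  rgb, a = img[..., :3], img[..., 3:]
  premult = np.concatenate([rgb * a, a], axis=-1)

  def resize_f(channel, size):
      # Resize one channel as a 32-bit float image (PIL mode "F").
      return np.asarray(Image.fromarray(np.ascontiguousarray(channel), mode="F")
                        .resize(size, Image.LANCZOS))

  size = (128, 128)
  out = np.stack([resize_f(premult[..., c], size) for c in range(4)], axis=-1)

  # Un-premultiply, guarding against division by zero where alpha is ~0.
  out_rgb = out[..., :3] / np.maximum(out[..., 3:], 1e-6)
  result = np.clip(np.concatenate([out_rgb, out[..., 3:]], axis=-1), 0.0, 1.0)

A full pipeline would combine this with the linear-light conversion from point 2.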


I've definitely learned a lot about these problems from the viewpoint of art and graphic design. When using Pillow I convert to linear light with high dynamic range and work in that space.
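Something in that spirit, as a minimal sketch with NumPy and Pillow (the filename, target size and Lanczos filter are placeholder assumptions, not necessarily the exact pipeline meant above):

  import numpy as np
  from PIL import Image

  def srgb_to_linear(x):
      return np.where(x <= 0.04045, x / 12.92, ((x + 0.055) / 1.055) ** 2.4)

  def linear_to_srgb(x):
      return np.where(x <= 0.0031308, x * 12.92, 1.055 * x ** (1 / 2.4) - 0.055)

  srgb = np.asarray(Image.open("photo.jpg").convert("RGB")).astype(np.float32) / 255.0
  linear = srgb_to_linear(srgb)

  # Resize each channel as a 32-bit float image so nothing gets quantized mid-pipeline.
  size = (256, 256)
  resized = np.stack(
      [np.asarray(Image.fromarray(np.ascontiguousarray(linear[..., c]), mode="F")
                  .resize(size, Image.LANCZOS)) for c in range(3)], axis=-1)

  # Lanczos can overshoot slightly (see point 3 above), so clamp before re-encoding.
  out = linear_to_srgb(np.clip(resized, 0.0, 1.0))
  Image.fromarray((out * 255 + 0.5).astype(np.uint8)).save("photo_small.jpg")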

One pet peeve of mine is algorithms for making thumbnails. Most of the algorithms from the image processing books don't really apply: they usually interpolate between points based on a small neighborhood, whereas if you are downscaling by a large factor (say 10) the obvious thing to do is average all the pixels in the input image that intersect with each pixel in the output image (100 of them in that case).

That box averaging is a fairly expensive convolution, so most libraries downscale images by powers of 2 and then interpolate from the closest such image, which I think is not quite right and could be done better.
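For integer shrink factors that box average is just a reshape-and-mean in NumPy; a minimal sketch (real libraries also have to handle non-integer factors and edge pixels):

  import numpy as np

  def box_downscale(img, factor):
      # Average each factor x factor block of input pixels into one output pixel.
      h = img.shape[0] // factor * factor
      w = img.shape[1] // factor * factor
      blocks = img[:h, :w].astype(np.float64)
      return blocks.reshape(h // factor, factor, w // factor, factor, -1).mean(axis=(1, 3))

  # A 10x shrink averages 100 source pixels per output pixel.
  small = box_downscale(np.random.rand(1000, 1500, 3), 10)   # -> shape (100, 150, 3)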


If you downscale by a factor of 2 using bandlimited resampling every time, followed by a single final shrink, you'll theoretically get identical results to a single bandlimited shrinking operation. Of course real world image resampling kernels (Lanczos, cubic, magic kernel) are very much truncated compared to the actual sinc kernel (to avoid massive ringing which looks unacceptable in images), so the results won't be mathematically perfect. And linear/area-based resampling is even less mathematically optimal, although they don't cause overshoot.


Isn't this generally addressed by applying a gaussian blur before downsizing? I know this introduces an extra processing step, but I always figured this was necessary.


That's an even more expensive convolution since you're going to average 100 or so points for each of those 100 points!

In practice, people think that box averaging is too expensive (it is pretty much that Gaussian blur, but computed on fewer output points).


Box filtering should be pretty cheap; it is separable, and can be implemented with a moving average. Overall just a handful of operations per pixel.
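A sketch of that moving-average trick in one dimension with NumPy; since the filter is separable, you run it along one axis and then the other:

  import numpy as np

  def box_filter_1d(x, radius):
      # Box blur along the last axis via a running (cumulative) sum:
      # a handful of operations per sample regardless of the window size.
      window = 2 * radius + 1
      pad = [(0, 0)] * (x.ndim - 1) + [(radius + 1, radius)]
      csum = np.cumsum(np.pad(x, pad, mode="edge"), axis=-1, dtype=np.float64)
      return (csum[..., window:] - csum[..., :-window]) / window

  img = np.random.rand(480, 640)
  blurred = box_filter_1d(box_filter_1d(img, 5).T, 5).T   # horizontal pass, then vertical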


Gaussian blur can be pretty cheap too using an IIR approximation[1]. Either way, it's also separable.

[1]: https://www.intel.com/content/dam/develop/external/us/en/doc...


I played a little with FFT Gaussian blur. It works in the frequency domain, so it does not have to average hundreds of points; instead it transforms the image and the blur kernel into the frequency domain, performs a pointwise multiplication there, and transforms the result back. It's way faster than direct convolution.
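Something like this minimal NumPy sketch (circular boundary handling for brevity; a real implementation would pad the image first):

  import numpy as np

  def fft_gaussian_blur(img, sigma):
      # Blur a 2D image by multiplying its spectrum with the kernel's spectrum.
      h, w = img.shape
      # Gaussian kernel centred at index (0, 0) so no fftshift is needed.
      y = (np.fft.fftfreq(h) * h)[:, None]
      x = (np.fft.fftfreq(w) * w)[None, :]
      kernel = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
      kernel /= kernel.sum()
      # Convolution theorem: pointwise product in the frequency domain.
      return np.fft.irfft2(np.fft.rfft2(img) * np.fft.rfft2(kernel), s=img.shape)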


Having to process 100 source pixels per destination pixel to shrink 10x seems like an inefficient implementation. If you downsample each dimension individually you only need to process 20 pixels per pixel. This is the same optimization used for Gaussian blur.


> If you downsample each dimension individually you only need to process 20 pixels per pixel.

If you shrink 10x in one direction, then the other, then you first turn 100 pixels into 10, before turning 10 pixels into 1. You actually do more work for a non-smoothed shrink, sampling 110 pixels total.

To benefit from doing the dimensions separately, the width of your sample has to be bigger than the shrink factor. The best case is a blur where you're not shrinking at all, and that's where 20:1 actually happens.

If you sampled 10 pixels wide, then shrunk by a factor of 3, you'd have 100 samples per output if you do both dimensions at the same time, and 40 samples per output if you do one dimension at a time.

Two dimensions at the same time need width^2 samples

Two dimensions, one after the other, need width*(shrink_factor + 1) samples


You're right, I got confused. I was thinking of Gaussian blur, where the areas to process overlap heavily. Here there's zero overlap.


Yeah I was shocked at how naive this quote is:

>> The definition of scaling function is mathematical and should never be a function of the library being used.

I could just as easily say "hey, why is your NN affected by image artifacts, isn't it supposed to be robust?"


> 3. Clipping. Resizing is typically done in two passes, one in the x direction and one in the y direction (in either order). If the kernel used has values outside of the range [0, 1] (like Lanczos) and you only keep the range [0, 1] for the intermediate result, then you might get clipping in the intermediate image, which can cause artifacts.

Also, gamut clipping and interpolation[0]. That's a real rabbithole.

[0] https://www.cis.rit.edu/people/faculty/montag/PDFs/057.PDF (Downloads a PDF)


Captain D on premultiplication and the alpha channel (with regards to video): https://www.youtube.com/watch?v=XobSAXZaKJ8


Wow, points 2, 3 and 5 wouldn't have occurred to me even if I tried. Thanks. I now have a mental note to look stuff up if my resizing ever gives results I'm not happy with. :)


Point 2 is the most important one, and the most egregious error. Even most browsers implement it wrong (at least the last time I checked; I confirmed it again with Edge).

Here is the most popular article about this problem [1].

Warning: once you start noticing incorrect color blending done in sRGB space, then you will see it everywhere.

[1] http://www.ericbrasseur.org/gamma.html


Browsers (and other tools) can't even agree on the color space for some images, e.g. "Portable" Network Graphics.


Browsers now 'deliberately' do it wrong, because web developers have come to rely on the fact that a 50/50 blend of #000000 and #FFFFFF is #808080
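For reference, the linear-light 50/50 blend of black and white comes out noticeably lighter. A quick check using the standard sRGB encoding:

  mix = 0.5                                  # linear-light average of 0.0 and 1.0
  srgb = 1.055 * mix ** (1 / 2.4) - 0.055    # ~0.735
  print(round(srgb * 255))                   # ~188, i.e. #BCBCBC rather than #808080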


I'm a little bit sympathetic to doing it wrong on gradients (having said that, the SVG spec has an opt-in to do the interpolation in a linear colorspace, and browsers don't implement it). But not for images.


Linear RGB blending also requires >8 bit per channel for the result to avoid noticeable banding.

It is unquestionably superior though.



Did the link go down between when you posted this and now? It now leads to http://suspendeddomain.org/index.php?host=www.ericbrasseur.o...


I imagine that beyond just using linearized sRGB, using a perceptually uniform colorspace such as Oklab would bring further improvement. Although I suppose the effect might be somewhat subtle in most real-world images.


For downscaling, I doubt that. If you literally squint or unfocus your eyes, then the colors you see will be mixed in a linear colorspace. It makes sense for downscaling to follow that.

Upscaling is much more difficult.


When image generating AIs first appeared, the color space interpolations were terribly wrong. One could see hue rainbows practically anywhere blending occurred.


I'd also add speed to that list. Resizing is an expensive operation, and correctness is often traded off for speed. I've written code that deliberately ignored the conversion to a linear color space and back in order to gain speed.


A connected rabbit hole is image decoding of lossy formats such as JPEG: in my experience, depending on the library used (OpenCV vs TensorFlow vs Pillow), you get RGB values that vary by 1-2% from each other with the default decoders.


And also (for humans at least) the rabbit hole coming from effectively displaying the resulting image: various forms of subpixel rendering for screens, various forms of printing... which are likely to have a big influence on what is "acceptable quality" or not.


Another thing I experienced before: a document picture that I had downsized to a mandatory upload size had a character/number randomly changed (6 to b or d, I don't remember which exactly). I had to convert the document to a PDF, which handled it better.


Wouldn't the clipping be solved by using floating point numbers during the filtering process?


It would. It would also avoid accumulating quantization errors in the intermediate result. Having said that, there are precedents for keeping the intermediate image pixels as integral values.

Here is imageworsener's article about this[1]

[1] https://entropymine.com/imageworsener/clamp-int/


I love sites like these. Had never heard of Image Worsener before. Thanks!


If you're doing interpolation you probably don't want a linear colourspace. At least not linear in the way that light works. Interpolation minimizes deviations in the colourspace you're in, so you want it to be somewhat perceptual to get it right.

Of course, if you're not interpolating but downscaling the image (which isn't really interpolation: the value at a particular position in the image does not remain the same), then you do want a linear colourspace to avoid brightening / darkening details, but you need a perceptual colourspace to minimize ringing etc. It's an interesting puzzle.


I'd argue that if your ML model is sensitive to the anti-aliasing filter used in image resizing, you've got bigger problems than that. Unless it's actually making a visible change that spoils whatever it is the model is supposed to be looking for. To use the standard cat / dog example, filter choice or resampling choice is not going to change what you've got a picture of, and if your model is classifying based on features that change with resampling, it's not trustworthy.

If one is concerned about this, one could intentionally vary the resampling or deliberately add different blurring filters during training to make the model robust to these variations.


> I'd argue that if your ML model is sensitive to the anti-aliasing filter used in image resizing, you've got bigger problems than that.

I’ve seen it cause trouble in every model architecture I’ve tried.


What kinds of model architectures? I'm curious to play with it myself


Most object detection models will show variability in bounding box confidences and coordinates.

It’s not a huge instability, but you can absolutely see performance changes.


You say that “if your model is classifying based on features that change with resampling, it’s not trustworthy.”

I say that choice of resampling algorithm is what determines whether a model can learn the rule “zebras can be recognized by their uniform-width stripes” or not; as a bad resample will result in non-uniform-width stripes (or, at sufficiently small scales, loss of stripes!)


> whether a model can learn the rule “zebras can be recognized by their uniform-width stripes” or not

But zebras don't have uniform-width stripes. https://www.animalfactsencyclopedia.com/Zebra-facts.html


  Unless it's actually making a visible change that spoils whatever it is the model is supposed to be looking for


A zebra having stripes that alternate between 5 black pixels, and 4 black pixels + 1 dark-grey pixel, isn’t actually a visible change to the human eye. But it’s visible to the model.


I'm not saying your general argument is wrong, but... zebra stripes are not made out of pixels. A model that requires a photograph of a zebra to align with the camera's sensor grid also has bigger problems.


For those going down this rabbit hole, perceptual downscaling is state of the art, and the closest thing we have to a Python implementation is here (with a citation of the original paper): https://github.com/WolframRhodium/muvsfunc/blob/master/muvsf...

Other supposedly better CUDA/ML filters give me strange results.


There are so many gems in the VapourSynth scene.

I really wish there were some better general-purpose imaging libraries that steadily implement/copy these useful filters, so that more people could use them out of the box.

Most of the languages I've worked with are surprisingly lacking in this regard, despite the huge potential use cases.

Like, in the case of Python, Pillow is fine but it has nothing fancy. You can't even fine-tune the parameters of bicubic, let alone use the billions of new algorithms from the video communities.

OpenCV and the ML tools like to re-invent the wheels themselves, but often only the most basic ones (and badly, as noted in this article).


VapourSynth is great for ML stuff actually, as it can ingest/output numpy arrays or PNGs, and work with native FP32.

A big sticking point is variable resolution, which it technically supports but doesn't really like without some workarounds.

But yeah I agree, it's kinda tragic that the ML community is stuck with the simpler stuff.


Hm, any examples of that?

I found https://dl.acm.org/doi/10.1145/2766891 but I don't like the comparisons. Any designer will tell you, after down-scaling you do a minimal sharpening pass. The "perceptual downscaling" looks slightly over-sharpened to me.

I'd love to compare something I sharpened in photoshop with these results.


That implementation is pretty easy to run! The whole Python block (along with some imports) is something like:

  # imports the comment alludes to (assuming the muvsfunc module from the linked repo)
  import vapoursynth as vs
  import muvsfunc as muf
  core = vs.core

  clip = core.imwri.Read(img)
  clip = muf.ssim_downscale(clip, x, y)
  clip = core.imwri.Write(clip, imgoutput)
  clip.set_output()

> Any designer will tell you, after down-scaling you do a minimal sharpening pass

This is probably wisdom from bicubic scaling, but you usually don't need further sharpening if you use a "sharp" filter like Mitchell.

Anyway, I haven't run butteraugli or SSIM metrics vs other scalers; I just subjectively observed that ssim_downscale was preserving some edges in video frames that Spline36, Mitchell, and Bicubic were not preserving.


> The definition of scaling function is mathematical and should never be a function of the library being used.

Horseshit. Image resizing, or any other kind of resampling, is essentially always about filling in missing information. There is no mathematical model that will tell you for certain what the missing information is.


Not at all. He is correct that those functions are defined mathematically and that the results should therefore be the same using any libraries which claim to implement them.

An example used in the article: https://en.wikipedia.org/wiki/Lanczos_resampling


Arguably downscaling does not fill in missing information, it only throws away information. Still, implementations vary a lot here. There might not be a consensus on a unique correct way to do downscaling, but there are certain things that you certainly don't want to do, like doing naive linear arithmetic on sRGB color values.


Interpolation is still filling in missing information, it's just possible to get a pretty good estimate.


This is wrong. Interpolation below Nyquist (downsampling) results in a subset of the original Information (capital I information theory information).


Images aren't bandlimited so the conditions don't apply for that.

That's why a vector image rendered at 128x128 can look better/sharper than one rendered at 256x256 and scaled down.


They are band-limited. That's why you get aliasing when taking unfiltered photos above Nyquist without AA filters.

In your example the lower res image would be using most of its bandwidth while the higher res image would be using almost none of its bandwidth.

Images are 2D discrete signals. Everything you know about 1D DSP applies to them.


If some of the edges are infinitely sharp, and you know which ones they are by looking at them, as in my example, then it's using more than all its bandwidth at any resolution.


That's true in the 1D case as well. That requires upsampling with information generation before downsampling. Using priors to guess missing information is a task that will never be finished, and it is interesting. It isn't necessary for a satisfactory downsampling result.


One interesting complication for a lot of photos is that the bandwidth of the green channel is twice as high as the red and blue channels due to the Bayer filter mosaic.


Aha, no! Downscaling *into a discrete space by an arbitrary amount* is absolutely filling in missing information.

Take the naive case where you downscale a line of four pixels to two pixels - you can simply discard two of them so you go from `0,1,2,3` to `0,2`. It looks okay.

But what happens if you want to scale four pixels to three? You could simply throw one away but then things will look wobbly and lumpy. So you need to take your four pixels, and fill in a missing value that lands slap bang between 1 and 2. Worse, you actually need to treat 0 and 3 as missing values too because they will be somewhat affected by spreading them into the middle pixel.

So yes, downscaling does have to compute missing values even in your naive linear interpolation!
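For the 4-to-3 case the area overlaps work out as roughly 0.75/0.25, 0.5/0.5 and 0.25/0.75; a small sketch of box/area weights (not tied to any particular library):

  import numpy as np

  def area_weights(n_in, n_out):
      # Fraction of each input pixel covered by each output pixel.
      w = np.zeros((n_out, n_in))
      scale = n_in / n_out                      # each output pixel spans `scale` inputs
      for o in range(n_out):
          left, right = o * scale, (o + 1) * scale
          for i in range(n_in):
              overlap = min(right, i + 1) - max(left, i)
              w[o, i] = max(overlap, 0.0) / scale
      return w

  print(area_weights(4, 3))
  # rows ~ [0.75 0.25 0 0], [0 0.5 0.5 0], [0 0 0.25 0.75]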


>Take the naive case where you downscale a line of four pixels to two pixels - you can simply discard two of them so you go from `0,1,2,3` to `0,2`. It looks okay.

This is already wrong, unless the pixels are band-limited to Nyquist/4. Trivial example where this is not true:

  1 0 1 0
If such a signal is decimated by 2 you get

  1 1
Which is not correct.


For downscaling, area averaging is simple and makes a lot of intuitive sense and gives good results. To me it's basically the definition of downscaling.

Like yeah, you can try to get clever and preserve the artistic intent or something with something like seamcarving but then I wouldn't call it downscaling anymore.



Hmm, maybe I was wrong then!


The article talks about downsampling, not upsampling, just so we are clear about that.

And besides, a ranty blog post pointing out pitfalls can still be useful for someone else coming from the same naïve (in a good/neutral way) place as the author.


Now that's an interesting topic for photographers who like to experiment with anamorphic lenses for panoramas.

An anamorphic lens (optically) "squeezes" the image onto the sensor, and afterwards the digital image has to be "desqueezed" (i.e. upscaled in one axis) to give you the "final" image. Which in turn is downscaled to be viewed on either a monitor or a printout.

But the resulting images I've seen so far nevertheless look good. I think that's because natural images don't have that many pixel-level details, and we mostly see downscaled images on the web or in YouTube videos anyway...


I'm shocked. I didn't even know this was a thing.

By that I mean, I know what bilinear/bicubic/Lanczos resizing algorithms are, and I know they should at least give acceptable results (compared to NN).

But I didn't know famous libraries (especially OpenCV, which is a computer vision library!) could have such poor results.

Also a side note: IIRC bilinear and bicubic have constants in their equations, so technically when you're comparing different implementations you need to make sure this input (the parameters) is the same. But this shouldn't excuse the extremely poor results in some of them.
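For example, the common cubic kernels form a parametrized family. A sketch of the Mitchell-Netravali cubic, where (B, C) = (1/3, 1/3) is Mitchell and (0, 0.5) is Catmull-Rom; different libraries pick different constants, which is one reason their "bicubic" outputs can differ even before antialiasing enters the picture:

  def cubic_kernel(x, B=1/3, C=1/3):
      # Mitchell-Netravali cubic; B and C select the family member
      # (e.g. Catmull-Rom is B=0, C=0.5).
      x = abs(x)
      if x < 1:
          return ((12 - 9*B - 6*C) * x**3 + (-18 + 12*B + 6*C) * x**2 + (6 - 2*B)) / 6
      if x < 2:
          return ((-B - 6*C) * x**3 + (6*B + 30*C) * x**2
                  + (-12*B - 48*C) * x + (8*B + 24*C)) / 6
      return 0.0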


At least bilinear and bicubic have a widely agreed upon specific definition. The poor results are the result of that definition. They work reasonably for upscaling, but downscaling more than a trivial amount causes them to weigh a few input pixels highly and outright ignore most of the rest.


> bicubic have a widely agreed upon specific definition

Not so fast: https://entropymine.com/imageworsener/bicubic/


Fair. To be clear the issue remains no matter the choice of these parameters.


I've seen more than one team find that reimplementing an OpenCV capability that they use gains them both quality and performance.

This isn't necessarily a criticism of OpenCV; often the OpenCV implementation is, of necessity, quite general, and a specific use-case can engage optimizations not available in the general case.


If their worry is the differences between algorithms in libraries across different execution environments, shouldn't they either find a library they like that can be called from all such environments, or, if there is no single library that can be used in all environments, just write their own using their favorite algorithm? Why make all libraries do this the same way? Which one is undeniably correct?


That's basically what they did, which they mention in the last paragraph of the article. They released a wrapper library [0] for Pillow so that it can be called from C++:

> Since we noticed that the most correct behavior is given by the Pillow resize and we are interested in deploying our applications in C++, it could be useful to use it in C++. The Pillow image processing algorithms are almost all written in C, but they cannot be directly used because they are designed to be part of the Python wrapper. We, therefore, released a porting of the resize method in a new standalone library that works on cv::Mat so it would be compatible with all OpenCV algorithms. You can find the library here: pillow-resize.

[0] https://github.com/zurutech/pillow-resize


Hmmm. With respect to feeding an ML system, are visual glitches and artifacts important? Wouldn't the most important thing to use a transformation which preserves as much information as possible and captures relevant structure? If the intermediate picture doesn't look great, who cares if the result is good.

Ooops. Just thought about generative systems. Nevermind.


Just speaking from experience, GAN upscalers pick up artifacts in the training dataset like a bloodhound.

You can use this to your advantage by purposely introducing them into the lowres inputs so they will be removed.


So, what are the dangers? (What's the point of the article?) That you'll get a different model when the same originals are processed by different algorithms?

The comparison of resizing algorithms is nothing new, the importance of adequate input data is obvious, and the difference in which image processing algorithms are available is also understandable. Clickbaity.


A friend of mine decided to take up image resizing on the third lane of a six-lane highway.

And he was hit by a truck.

So it's true about the danger of image resizing.


plot twist: a Tesla truck, with autopilot using bad image resizing algorithms )


If you read to the end, they link to a library they made for solving the problem by wrapping Pillow C functions to be callable in C++


Was hoping to see libvips in the comparison, which is widely used.

I wonder why it's not adopted by any of these frameworks?


I was sort of expecting them to describe this danger to resizing: one can feed a piece of an image into one of these new massive ML models and get back the full image - with things that you didn't want to share. Like cropping out my ex.

Is ML sort of like a universal hologram in that respect?


If you upscale (with interpolation) some sensitive image (think security camera), could that be dismissed in court as it "creates" new information that wasn't there in the original image?


The bigger problem is that the pixel domain is not a very good domain to be operating in. How many hours of training and thousands of images are used to essentially learn Gabor filters?


This article throws a red flag on proving negative(s). This is impossible with maths. The void is filled by human subjectivity. In a graphical sense, "visual taste."


What are some good image upscaler libraries that exist? I'm assuming the high quality ones would need to use some AI model to fill in missing detail.


Waifu2x - I've used the library to upscale both old photos and videos with enough success to be pleased with the results.

https://github.com/nagadomi/waifu2x


Depends on your needs!

Zimg is a gold standard to me, but yeah, you can get better output depending on the nature of your content and hardware. I think ESRGAN is state-of-the-art above 2x scales, with the right community model from upscale.wiki, but it is slow and artifacty. And pixel art, for instance, may look better upscaled with xBRZ.


Image resizing is one of those things that most companies seem to build in-house over and over. There are several hosted services, but obviously sending your users' photos to a 3rd party is pretty weak. For those of us looking for a middle ground: I've had great success with imgproxy (https://github.com/imgproxy/imgproxy), which wraps libvips and is well maintained.


funny that they use tf and pytorch in this context without even mentioning their fantastic upsampling capabilities


Are there any hacks/studies on maximizing the downsampling error?

E.g. an image that looks totally different at the original size vs at 224x224.


There is a "resizing attack" that's been published that does what you're suggesting

https://embracethered.com/blog/posts/2020/husky-ai-image-res...


torch.nn.functional.interpolate has an "antialias" switch that's off by default
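For example (a minimal sketch; the tensor shape and target size are placeholders):

  import torch
  import torch.nn.functional as F

  x = torch.rand(1, 3, 1024, 1024)   # NCHW float tensor
  # Without antialias=True, large bilinear/bicubic downscales alias noticeably.
  y = F.interpolate(x, size=(224, 224), mode="bilinear", antialias=True)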



You're right, looks like it was added in 1.11 on March 10, 2022. Seems like an important feature to have been missing for so long!


downscaling images introduces artifacts and throws away information! news at 5!


Thought this article was going to be about DDOS...


I favored cropping even back in 2021


Came here for a new ImageTragick but got actual resizing problems


Finally someone said it.



