I was doing the same thing at Netflix around the same time as a 20% research project: training GANs end to end directly in JPEG coefficient space (and then rebuilding a JPEG from the generated coefficients using libjpeg to get an image). The pitch was that it not only worked, but you could get fast training by representing each JPEG block as a dense + sparse vector (dense for the low DCT coefficients, sparse for the high ones since they're ~all zeros) and using a neural network library with fast ops on sparse data.
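For anyone curious, the dense + sparse block representation is roughly this (a numpy/scipy sketch with made-up numbers; in the real pipeline the coefficients come out of the entropy decoder already quantized, which is why the high frequencies are mostly zeros):

```python
import numpy as np
from scipy.fft import dctn

# One 8x8 block standing in for a JPEG MCU; real coefficients would come
# straight out of the entropy decoder, already quantized.
block = np.random.rand(8, 8)
coeffs = dctn(block, norm="ortho")

# JPEG zigzag order: low-frequency coefficients first.
order = sorted(((i, j) for i in range(8) for j in range(8)),
               key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else -p[0]))
zz = np.array([coeffs[i, j] for i, j in order])

K = 16                                    # dense/sparse split point (a tunable choice)
dense = zz[:K]                            # low frequencies: almost always nonzero
high = zz[K:]                             # high frequencies: mostly zero after quantization
nz = np.flatnonzero(np.abs(high) > 1e-3)
sparse = (nz + K, high[nz])               # (indices, values) pair, ready for sparse ops
```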
Training on pixels is inefficient. Why have your first layers of CNNs relearn what's already smartly encoded in the JPEG bits in the first place before it's blown into a bloated height x width x 3 float matrix?
As an AI armchair quarterback, I've always held the opinion that the image ML space has a counter-productive bias towards not pre-processing images stemming from a mixture of test purity and academic hubris. "Look what this method can learn from raw data completely independently!" makes for a nice paper. So, they stick with sRGB inputs rather than doing basic classic transforms like converting to YUV420. Everyone learns from the papers, so that's assumed to be the standard practice.
In my experience, staying close to the storage format is very useful because it allows the neural network to correctly deal with clipped/saturated values. If your file is saved in sRGB and you train in sRGB, then when something turns to 0 or 255, the AI can handle it as a special case because most likely it was too bright or too dark for your sensor to capture accurately. If you first transform to a different color space, that clear clip boundary gets lost.
Also, I would prefer sRGB or RGB because it more closely matches the human vision system. That said, the RGB to YUV transformation is effectively a matrix multiplication, so if you use conv features like everyone does, you can merge it into your weights for free.
Typical CNNs can learn a linear transformation of the input at no cost. Since YUV is such a linear transformation of RGB, there is no benefit in converting to it beforehand.
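Concretely, the color matrix just composes with the first conv layer's weights, so you could even bake it in at init. A rough PyTorch sketch, using approximate BT.601 full-range coefficients (exact values depend on the standard and range conventions):

```python
import torch
import torch.nn as nn

# Approximate BT.601 full-range RGB -> YUV matrix.
M = torch.tensor([[ 0.299,    0.587,    0.114  ],
                  [-0.14713, -0.28886,  0.436  ],
                  [ 0.615,   -0.51499, -0.10001]])

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=True)

# Fold the per-pixel color transform into the conv weights:
# y = W * (M x)  ==>  y = (W M) * x, channel-wise.
with torch.no_grad():
    conv.weight.copy_(torch.einsum('ochw,ci->oihw', conv.weight, M))

# conv now "sees" YUV features even though it is fed raw RGB tensors.
x = torch.rand(1, 3, 224, 224)   # RGB input in [0, 1]
y = conv(x)
```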
How is there not a cost associated with forcing the machine to learn how to do something that we already have a simple, deterministic algorithm for? Won't some engineer need to double check a few things with regard to the AI's idea of color space transform?
You could probably derive some smart initialization for the first layer of a NN based on domain knowledge (color spaces, sobel filters, etc.). But since this is such a small part of what the NN has to learn, I expect this to result in a small improvement in training time and have no effect on final performance and accuracy, so it's unlikely to be worth the complexity of developing such a feature.
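For example, something like this hypothetical seeding of the first conv layer with Sobel kernels (PyTorch; the filter count and scaling here are arbitrary):

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)

sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])
sobel_y = sobel_x.t()

with torch.no_grad():
    # Seed two filters with horizontal/vertical edge detectors applied
    # equally to all three input channels; the remaining six stay random.
    conv1.weight[0] = sobel_x.repeat(3, 1, 1) / 3.0
    conv1.weight[1] = sobel_y.repeat(3, 1, 1) / 3.0
```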
Your instincts are correct. Training is faster, more stable, and more efficient that way. In certain cases it "pretty much is irrelevant", but the advantages of the strategy of modelling the knowns and training only on the unknowns become starkly apparent when doing e.g. sensor fusion or other ML tasks on physical systems.
The bitter lesson is about not trying to encode impossible-to-formalize conceptual knowledge, not avoiding data efficiency and the need to scale the model up to ever higher parameter counts.
If we followed this logic, we'd be training LLMs on character-level UTF-32 and just letting it figure everything out by itself, while needing two orders of magnitude bigger contexts and parameter counts.
Converting from RGB to YUV is absolutely subject to the bitter lesson, because it takes a representation that we have seen works for some classical methods and hard codes that knowledge into the AI, which could easily learn (and will anyway) a more useful representation for itself.
> LLMs on character-level UTF-32 and just letting it figure everything out by itself, while needing two orders of magnitude bigger contexts and parameter counts.
This was tried extensively, and honestly it is probably still too early to proclaim the demise of this approach. It's also completely different - you're conflating a representation that literally changes the number of forward passes you have to do (i.e. the amount of computation, which is what the bitter lesson is about) with one that (at most) would just require stacking on a few layers or so.
A better example for your point (imo) would be audio recognition, where we pre-transform from wave amplitudes into log mel spectrogram for ingestion by the model. I think this will ultimately fall to the bitter lesson as well though.
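For reference, that hand-designed front end is just a few lines (a librosa sketch with the usual 25 ms windows / 10 ms hop at 16 kHz; "speech.wav" is a placeholder path):

```python
import librosa

# Raw waveform in, log-mel spectrogram out: the classical preprocessing
# most speech models still ingest instead of raw samples.
y, sr = librosa.load("speech.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel)       # shape: (80 mel bands, num_frames)
```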
Also a key difference is that you are proposing going from methods that already work to try to inject more classical knowledge into them. It is oftentimes the case that you'll have an intermediary fusion between deep + classical, but not if you already have working fully deep methods.
Heck, why even go that far? Given how much text we have in scanned books, just feed it scans of the books and let it dedicate a bunch of layers to learning OCR.
Or given the number of unscanned books, even just give it the controls for a book scanner, the books and probably some robot arms. Then let it figure out the scanning first in some layers. Shouldn't be that hard.
Right... but I don't see how that means that it doesn't fall to the bitter lesson.
The bitter lesson is not saying that the model will always relearn the same representation as the one that has been useful to humans in the past, merely that the model will learn a better representation for the task at hand than the one hand-coded by humans.
If the model could easily learn the representation useful to humans, then it will fall to the bitter lesson because at minimum the model could easily follow our path (it's just an affine transformation to learn) and more probably will learn very different (& better) representations for itself.
LLMs can't reason about spelling, e.g. when asked for a sentence which contains no letter "a"; they can also struggle with rhyming, etc. The most obvious explanation is that they never 'see' the underlying letters/spelling, only tokens.
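You can see it directly by poking at a tokenizer (a quick sketch assuming the tiktoken library; the exact split depends on the vocabulary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print(tokens)                              # a handful of integer IDs
print([enc.decode([t]) for t in tokens])   # the sub-word pieces the model actually sees
```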
Been hearing that for half my adult life. People were 100% sure that multicore in 2005 meant manufacturers were officially signalling the end of single-core scaling and that it was time to invest in auto-parallelizable code.
I don't think it's wrong, but looking at it through a child's eyes, we do keep finding ways to do things we couldn't a couple of years ago: an open mind on hardware and more focus on software are continuing deep innovation cycles.
Leaving aside that we're still far from hitting the limits to growth outlined in that book, and that we can exceed those limits to growth by expanding outside of Earth, what does a book about physical limitations on agriculture and industry have to do with limitations on computing efficiency? There is of course some fundamental limit to computing efficiency, but for all we know we could be many orders of magnitude away from hitting it.
We've clearly fallen behind the exponential curve on clock speed. But the great thing is we can parallelize transformers, so it's not as big of a deal.
It's also probably because a lot of the knowledge about how JPEG works is tied up in signal processing books that usually front-load a bunch of mathematics, as opposed to ML, which often needs a bit of mathematical intuition but in practice is usually empirical.
I was trying to learn how the discrete cosine transform works, so I looked up some code in an open source program. The code said it copied it verbatim from a book from the 90s. I looked up the book and the book said it copied it verbatim from a paper from the 1970s.
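The underlying formula is tiny, though; here's a naive orthonormal DCT-II checked against scipy, just to show there's no magic in it (a sketch, not the optimized fixed-point code those books pass around):

```python
import numpy as np
from scipy.fft import dct

def dct2_naive(x):
    """Orthonormal DCT-II of a 1-D signal, straight from the textbook formula."""
    N = len(x)
    n = np.arange(N)
    X = np.array([np.sum(x * np.cos(np.pi * (n + 0.5) * k / N)) for k in range(N)])
    X *= np.sqrt(2.0 / N)
    X[0] /= np.sqrt(2.0)
    return X

x = np.random.rand(8)
assert np.allclose(dct2_naive(x), dct(x, norm="ortho"))
```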
We used to do a lot more preprocessing for NLP, like stemming, removing stop words, or even adding grammar information (NP, VP, etc.).
Now we just do basic tokenization. The rest turned out to be irrelevant or even counterproductive.
But also, that basic tokenization is essential; training on a raw ASCII stream would be much less efficient. There is a sweet spot of processing & abstraction that should be aimed for.
Short text representations (via good tokenization) significantly reduce the computational cost of a transformer (you need to generate fewer tokens for the same output length, and need fewer tokens to represent the same window size). I think these combine to n^3 scaling (n^2 from window size and n from output size).
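Back-of-envelope version of that combined saving (the compression ratio is a made-up example):

```python
# If tokenization shortens a sequence by a factor r, attention per step is
# ~r^2 cheaper (shorter window) and you need ~r fewer generation steps for
# the same output, so the combined saving is ~r^3 under this rough model.
r = 4                                  # hypothetical characters-per-token ratio
per_step_saving = r ** 2               # attention over a window r times shorter
step_saving = r                        # r times fewer forward passes
print(per_step_saving * step_saving)   # ~64x in this example
```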
For images it's not clear to me if there are any preprocessing methods that do a lot better than resizing the image to a smaller resolution (which is commonly done already).
> rather than doing basic classic transforms like converting to YUV420
What will converting to YUV420 achieve though, except for 4:2:0 chroma subsampling? YUV has little basis in human perception to begin with; it's a color-television legacy model used for compression. There are much better models if you want to extract the perceptual information from the picture.
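For what it's worth, "converting to YUV420" amounts to a per-pixel 3x3 matrix plus throwing away three quarters of the chroma samples, roughly like this (a numpy sketch with approximate BT.601 coefficients):

```python
import numpy as np

rgb = np.random.rand(256, 256, 3)          # stand-in RGB image in [0, 1]
M = np.array([[ 0.299,    0.587,    0.114  ],
              [-0.14713, -0.28886,  0.436  ],
              [ 0.615,   -0.51499, -0.10001]])
yuv = rgb @ M.T                            # per-pixel linear transform
y = yuv[..., 0]                            # full-resolution luma
u = yuv[::2, ::2, 1]                       # chroma kept at half resolution
v = yuv[::2, ::2, 2]                       # in both axes: that's the 4:2:0 part
```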
I would worry that the fixed, non-overlapping block nature of a JPEG would reduce translation invariance - shift an image by 4 pixels and the DCT coefficients may look very different. People have been doing a lot of work to try to reduce the dependence of the image on the actual pixel coordinates - see for example https://research.nvidia.com/publication/2021-12_alias-free-g...
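A quick way to see it (a numpy/scipy sketch): put the same content at two positions four pixels apart inside an 8x8 block and compare the coefficients.

```python
import numpy as np
from scipy.fft import dctn

# The same 4x4 patch of "content", placed at two positions 4 pixels apart
# inside an 8x8 JPEG-style block.
patch = np.arange(16, dtype=float).reshape(4, 4)

block_a = np.zeros((8, 8)); block_a[2:6, 0:4] = patch
block_b = np.zeros((8, 8)); block_b[2:6, 4:8] = patch   # shifted right by 4 px

ca, cb = dctn(block_a, norm="ortho"), dctn(block_b, norm="ortho")
print(np.round(ca, 1))
print(np.round(cb, 1))   # same content, visibly different coefficient pattern
```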
JPEG-2000 uses wavelets as a decomposition basis as opposed to DCT, which in theory makes it possible to treat the whole image as a single block while ensuring high compression. In practice, though, tiles are used, I would guess to improve memory usage and compute parallelism.
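A sketch of that wavelet view with PyWavelets (the wavelet family and level here are arbitrary choices, not what JPEG-2000 actually mandates):

```python
import numpy as np
import pywt

img = np.random.rand(256, 256)             # stand-in grayscale image

# Multi-level 2-D wavelet decomposition: one low-pass approximation plus
# (horizontal, vertical, diagonal) detail bands per level, covering the
# whole image rather than independent 8x8 blocks.
coeffs = pywt.wavedec2(img, wavelet="db2", level=3)
approx = coeffs[0]                          # coarse approximation
details = coeffs[1:]                        # list of (cH, cV, cD) tuples per level
```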
In naive ML scenarios you are right. You can think of JPEG as an input embedding. One of many. The JPEG/spectral embedding is useful because it already provides a miniature variational encoding that "makes sense" in terms of translation, sharpness, color, scale and texture.
But with clever ML you can design better variational characteristics, such as rotation, or nonlinear things like faces, fingers, projections and abstract objects.
Further, JPEG encoding/decoding will be an obstacle for many architectures that require gradients to flow back and forth between pixel space and JPEG space in order to run evaluation steps and loss functions based on pixel space (which would be superior). Not to mention if you need human feedback in generative scenarios to retouch the output and run training steps on the changed pixels.
And finally, there are already picture and video embeddings that are gradient-friendly and reusable.
>And finally, there are already picture and video embeddings that are gradient-friendly and reusable.
I have been thinking about such things for a while and considered things like giving each of R rows and each of C columns a vector, and using the inner product of row_i and col_j as pixel (i, j)'s intensity (in the simplest demonstrative case monochromatic, but reordering the floats in each vector before taking the inner product allows many more channels).
But this is just my quick, shallow concoction. If I look at the KonIQ-10k dataset, there are 10373 images at 1024 x 768, totaling 5.3GB. That's ~511KB per image. 511KB / (1024 + 768) = ~285 bytes for each row or column. Dividing by 4 for standard floats gives each column and each row a vector of ~71 (32-bit) floats. This would use absolutely no prior knowledge about human visual perception, so fitting these float vectors' inner products (and their permutations for different channels) to the image by the most naive metric (average per-pixel residual error) will probably not result in great images. But I'm curious how bad it performs. Perhaps I will try it out in a few hours.
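The fit itself is only a few lines; a minimal sketch of the idea (PyTorch, with a random image standing in for one monochrome picture, dimensions from the back-of-envelope above):

```python
import torch

H, W, d = 768, 1024, 71                    # rows, columns, floats per vector
img = torch.rand(H, W)                     # stand-in for one monochrome image

rows = torch.nn.Parameter(0.1 * torch.randn(H, d))
cols = torch.nn.Parameter(0.1 * torch.randn(W, d))
opt = torch.optim.Adam([rows, cols], lr=1e-2)

for step in range(2000):
    recon = rows @ cols.T                  # pixel (i, j) = <row_i, col_j>
    loss = (recon - img).abs().mean()      # naive average per-pixel residual
    opt.zero_grad()
    loss.backward()
    opt.step()
```

(Which also makes clear this is just a rank-71 factorization of the image, so quality will be limited by how low-rank natural images happen to be.)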
Do you have any references for such or similar simplistic embeddings? I don't want to force you to dig for me, but if you happen to know of a few such papers or perhaps even a review paper that would be welcome!
Not aware of this type of simplistic embedding. I think taking the first few layers of a large pretrained vision model will get you better results. Blindly learning from a generic image would probably steer it toward the dominant textures and shapes rather than linear operations like camera moves.
The simplest embeddings for vision should focus on camera primitives and invariants. Translation, rotation, scale, skew, projections, lighting. It doesn't matter that much what you use in the layers, but you should steer the training with augmented data. Like rotate and skew the objects in the batches to make sure the layers are invariant to these things.
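Something along these lines with torchvision, where the invariances come from the data pipeline rather than the architecture (parameters are placeholder choices):

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomAffine(degrees=15, translate=(0.1, 0.1),
                   scale=(0.9, 1.1), shear=10),   # rotation, shift, scale, skew
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
# Applied to each PIL image before it reaches the model.
```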
Next are some depth-mapping embeddings which go beyond flat camera awareness.
The best papers I've seen are face embeddings. You can get useful results with smaller models. There are of course deeper embeddings that focus on the whole scene and depth maps but those are huge.
I wondered a while back about using EGA or CGA for doing image recognition and/or for Stable Diffusion. Seems like there should be more than enough image data for that resolution and color depth (16 colors).
I remember hearing somewhere that our retina encodes the visuals it receives into a compressed signal before forwarding it to the visual cortex. If true, this may actually be how it's done "for real". ;)
> Why have your first layers of CNNs relearn what's already smartly encoded in the JPEG bits in the first place before it's blown into a bloated height x width x 3 float matrix?
Because data augmentation is much easier in the latter representation.
Also, if you rotate images as part of data augmentation, then that is already so expensive that any speedup from going directly to JPEG becomes negligible in comparison.