I was doing the same thing at Netflix around the same time as a 20% research project: training GANs end to end directly in JPEG coefficient space (and then rebuilding a JPEG from the generated coefficients using libjpeg to get an image). The pitch was that it not only worked, but you could get fast training by representing each JPEG block as a dense + sparse vector (dense for the low DCT coefficients, sparse for the high ones since they're ~all zeros) and using a neural network library with fast ops on sparse data.
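For anyone curious, the dense + sparse block representation is roughly this (a numpy/scipy sketch with made-up numbers; in the real pipeline the coefficients come out of the entropy decoder already quantized, which is why the high frequencies are mostly zeros):

```python
import numpy as np
from scipy.fft import dctn

# One 8x8 block standing in for a JPEG MCU; real coefficients would come
# straight out of the entropy decoder, already quantized.
block = np.random.rand(8, 8)
coeffs = dctn(block, norm="ortho")

# JPEG zigzag order: low-frequency coefficients first.
order = sorted(((i, j) for i in range(8) for j in range(8)),
               key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else -p[0]))
zz = np.array([coeffs[i, j] for i, j in order])

K = 16                                    # dense/sparse split point (a tunable choice)
dense = zz[:K]                            # low frequencies: almost always nonzero
high = zz[K:]                             # high frequencies: mostly zero after quantization
nz = np.flatnonzero(np.abs(high) > 1e-3)
sparse = (nz + K, high[nz])               # (indices, values) pair, ready for sparse ops
```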
Training on pixels is inefficient. Why have your first layers of CNNs relearn what's already smartly encoded in the JPEG bits in the first place before it's blown into a bloated height x width x 3 float matrix?
As an AI armchair quarterback, I've always held the opinion that the image ML space has a counter-productive bias towards not pre-processing images stemming from a mixture of test purity and academic hubris. "Look what this method can learn from raw data completely independently!" makes for a nice paper. So, they stick with sRGB inputs rather than doing basic classic transforms like converting to YUV420. Everyone learns from the papers, so that's assumed to be the standard practice.
In my experience, staying close to the storage format is very useful because it allows the neural network to correctly deal with clipped/saturated values. If your file is saved in sRGB and you train in sRGB, then when something turns to 0 or 255, the AI can handle it as a special case because most likely it was too bright or too dark for your sensor to capture accurately. If you first transform to a different color space, that clear clip boundary gets lost.
Also, I would prefer sRGB or RGB because it more closely matches the human vision system. That said, the RGB to YUV transformation is effectively a matrix multiplication, so if you use conv features like everyone does, you can merge it into your weights for free.
Typical CNNs can learn a linear transformation of the input at no cost. Since YUV is such a linear transformation of RGB, there is no benefit in converting to it beforehand.
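Concretely, the color matrix just composes with the first conv layer's weights, so you could even bake it in at init. A rough PyTorch sketch, using approximate BT.601 full-range coefficients (exact values depend on the standard and range conventions):

```python
import torch
import torch.nn as nn

# Approximate BT.601 full-range RGB -> YUV matrix.
M = torch.tensor([[ 0.299,    0.587,    0.114  ],
                  [-0.14713, -0.28886,  0.436  ],
                  [ 0.615,   -0.51499, -0.10001]])

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=True)

# Fold the per-pixel color transform into the conv weights:
# y = W * (M x)  ==>  y = (W M) * x, channel-wise.
with torch.no_grad():
    conv.weight.copy_(torch.einsum('ochw,ci->oihw', conv.weight, M))

# conv now "sees" YUV features even though it is fed raw RGB tensors.
x = torch.rand(1, 3, 224, 224)   # RGB input in [0, 1]
y = conv(x)
```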
How is there not a cost associated with forcing the machine to learn how to do something that we already have a simple, deterministic algorithm for? Won't some engineer need to double check a few things with regard to the AI's idea of color space transform?
You could probably derive some smart initialization for the first layer of a NN based on domain knowledge (color spaces, sobel filters, etc.). But since this is such a small part of what the NN has to learn, I expect this to result in a small improvement in training time and have no effect on final performance and accuracy, so it's unlikely to be worth the complexity of developing such a feature.
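For example, something like this hypothetical seeding of the first conv layer with Sobel kernels (PyTorch; the filter count and scaling here are arbitrary):

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)

sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])
sobel_y = sobel_x.t()

with torch.no_grad():
    # Seed two filters with horizontal/vertical edge detectors applied
    # equally to all three input channels; the remaining six stay random.
    conv1.weight[0] = sobel_x.repeat(3, 1, 1) / 3.0
    conv1.weight[1] = sobel_y.repeat(3, 1, 1) / 3.0
```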
Your instincts are correct. Training is faster, more stable, and more efficient that way. In certain cases it "pretty much is irrelevant", but the advantages of the strategy of modelling the knowns and training only on the unknowns become starkly apparent when doing e.g. sensor fusion or other ML tasks on physical systems.
The bitter lesson is about not trying to encode impossible-to-formalize conceptual knowledge, not avoiding data efficiency and the need to scale the model up to ever higher parameter counts.
If we followed this logic, we'd be training LLMs on character-level UTF-32 and just letting it figure everything out by itself, while needing two orders of magnitude bigger contexts and parameter counts.
Converting from RGB to YUV is absolutely subject to the bitter lesson, because it takes a representation that we have seen works for some classical methods and hard codes that knowledge into the AI, which could easily learn (and will anyway) a more useful representation for itself.
> LLMs on character-level UTF-32 and just letting it figure everything out by itself, while needing two orders of magnitude bigger contexts and parameter counts.
This was tried extensively, and honestly it is probably still too early to proclaim the demise of this approach. It's also completely different - you're conflating a representation that literally changes the number of forward passes you have to do (i.e. the amount of computation, which is what the bitter lesson is about) with one that (at most) would just require stacking on a few layers or so.
A better example for your point (imo) would be audio recognition, where we pre-transform from wave amplitudes into log mel spectrogram for ingestion by the model. I think this will ultimately fall to the bitter lesson as well though.
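For reference, that hand-designed front end is just a few lines (a librosa sketch with the usual 25 ms windows / 10 ms hop at 16 kHz; "speech.wav" is a placeholder path):

```python
import librosa

# Raw waveform in, log-mel spectrogram out: the classical preprocessing
# most speech models still ingest instead of raw samples.
y, sr = librosa.load("speech.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel)       # shape: (80 mel bands, num_frames)
```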
Also a key difference is that you are proposing going from methods that already work to try to inject more classical knowledge into them. It is oftentimes the case that you'll have an intermediary fusion between deep + classical, but not if you already have working fully deep methods.
Heck, why even go that far? Given how much text we have in scanned books, just feed it scans of the books and let it dedicate a bunch of layers to learning OCR.
Or given the number of unscanned books, even just give it the controls for a book scanner, the books and probably some robot arms. Then let it figure out the scanning first in some layers. Shouldn't be that hard.
Right... but I don't see how that means that it doesn't fall to the bitter lesson.
The bitter lesson is not saying that the model will always relearn the same representation as the one that has been useful to humans in the past, merely that the model will learn a better representation for the task at hand than the one hand-coded by humans.
If the model could easily learn the representation useful to humans, then it will fall to the bitter lesson because at minimum the model could easily follow our path (it's just an affine transformation to learn) and more probably will learn very different (& better) representations for itself.
LLMs can't reason about spelling, e.g. when asked for a sentence which contains no letter "a"; they can also struggle with rhyming, etc. The most obvious explanation is that they never 'see' the underlying letters/spelling, only tokens.
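You can see it directly by poking at a tokenizer (a quick sketch assuming the tiktoken library; the exact split depends on the vocabulary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print(tokens)                              # a handful of integer IDs
print([enc.decode([t]) for t in tokens])   # the sub-word pieces the model actually sees
```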
Been hearing that for half my adult life. People were 100% sure that multicore in 2005 meant manufacturers were officially signalling the end of single-core scaling and that it was time to invest in auto-parallelizable code.
I don't think it's wrong, but looking at it through a child's eyes, we do keep finding ways to do things we couldn't a couple of years ago: an open mind on hardware and more focus on software are continuing deep innovation cycles.
Leaving aside that we're still far from hitting the limits to growth outlined in that book, and that we can exceed those limits to growth by expanding outside of Earth, what does a book about physical limitations on agriculture and industry have to do with limitations on computing efficiency? There is of course some fundamental limit to computing efficiency, but for all we know we could be many orders of magnitude away from hitting it.
We've clearly fallen behind the exponential curve on clock speed. But the great thing is we can parallelize transformers, so it's not as big of a deal.
It's also probably because a lot of the knowledge about how JPEG works is tied up in signal processing books that usually front-load a bunch of mathematics, as opposed to ML, which often needs a bit of mathematical intuition but in practice is usually empirical.
I was trying to learn how the discrete cosine transform works, so I looked up some code in an open source program. The code said it copied it verbatim from a book from the 90s. I looked up the book and the book said it copied it verbatim from a paper from the 1970s.
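The underlying formula is tiny, though; here's a naive orthonormal DCT-II checked against scipy, just to show there's no magic in it (a sketch, not the optimized fixed-point code those books pass around):

```python
import numpy as np
from scipy.fft import dct

def dct2_naive(x):
    """Orthonormal DCT-II of a 1-D signal, straight from the textbook formula."""
    N = len(x)
    n = np.arange(N)
    X = np.array([np.sum(x * np.cos(np.pi * (n + 0.5) * k / N)) for k in range(N)])
    X *= np.sqrt(2.0 / N)
    X[0] /= np.sqrt(2.0)
    return X

x = np.random.rand(8)
assert np.allclose(dct2_naive(x), dct(x, norm="ortho"))
```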
We used to do a lot more preprocessing for NLP, like stemming, removing stop words, or even adding grammar information (NP, VP, etc.).
Now we just do basic tokenization. The rest turned out to be irrelevant or even counterproductive.
But also, that basic tokenization is essential; training on a raw ASCII stream would be much less efficient. There is a sweet spot of processing & abstraction that should be aimed for.
Short text representations (via good tokenization) significantly reduce the computational cost of a transformer (you need to generate fewer tokens for the same output length, and need fewer tokens to represent the same window size). I think these combine to n^3 scaling (n^2 from window size and n from output size).
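Back-of-envelope version of that combined saving (the compression ratio is a made-up example):

```python
# If tokenization shortens a sequence by a factor r, attention per step is
# ~r^2 cheaper (shorter window) and you need ~r fewer generation steps for
# the same output, so the combined saving is ~r^3 under this rough model.
r = 4                                  # hypothetical characters-per-token ratio
per_step_saving = r ** 2               # attention over a window r times shorter
step_saving = r                        # r times fewer forward passes
print(per_step_saving * step_saving)   # ~64x in this example
```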
For images it's not clear to me if there are any preprocessing methods that do a lot better than resizing the image to a smaller resolution (which is commonly done already).
> rather than doing basic classic transforms like converting to YUV420
What will converting to YUV420 achieve though, except for 4:2:0 chroma subsampling? YUV has little basis in human perception to begin with; it's a color-television legacy model used for compression. There are much better models if you want to extract the perceptual information from the picture.
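For what it's worth, "converting to YUV420" amounts to a per-pixel 3x3 matrix plus throwing away three quarters of the chroma samples, roughly like this (a numpy sketch with approximate BT.601 coefficients):

```python
import numpy as np

rgb = np.random.rand(256, 256, 3)          # stand-in RGB image in [0, 1]
M = np.array([[ 0.299,    0.587,    0.114  ],
              [-0.14713, -0.28886,  0.436  ],
              [ 0.615,   -0.51499, -0.10001]])
yuv = rgb @ M.T                            # per-pixel linear transform
y = yuv[..., 0]                            # full-resolution luma
u = yuv[::2, ::2, 1]                       # chroma kept at half resolution
v = yuv[::2, ::2, 2]                       # in both axes: that's the 4:2:0 part
```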
I would worry that the fixed, non-overlapping block nature of a JPEG would reduce translation invariance - shift an image by 4 pixels and the DCT coefficients may look very different. People have been doing a lot of work to try to reduce the dependence of the image on the actual pixel coordinates - see for example https://research.nvidia.com/publication/2021-12_alias-free-g...
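A quick way to see it (a numpy/scipy sketch): put the same content at two positions four pixels apart inside an 8x8 block and compare the coefficients.

```python
import numpy as np
from scipy.fft import dctn

# The same 4x4 patch of "content", placed at two positions 4 pixels apart
# inside an 8x8 JPEG-style block.
patch = np.arange(16, dtype=float).reshape(4, 4)

block_a = np.zeros((8, 8)); block_a[2:6, 0:4] = patch
block_b = np.zeros((8, 8)); block_b[2:6, 4:8] = patch   # shifted right by 4 px

ca, cb = dctn(block_a, norm="ortho"), dctn(block_b, norm="ortho")
print(np.round(ca, 1))
print(np.round(cb, 1))   # same content, visibly different coefficient pattern
```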
JPEG-2000 uses wavelets as a decomposition basis as opposed to DCT, which in theory makes it possible to treat the whole image as a single block while ensuring high compression. In practice, though, tiles are used, I would guess to improve memory usage and compute parallelism.
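A sketch of that wavelet view with PyWavelets (the wavelet family and level here are arbitrary choices, not what JPEG-2000 actually mandates):

```python
import numpy as np
import pywt

img = np.random.rand(256, 256)             # stand-in grayscale image

# Multi-level 2-D wavelet decomposition: one low-pass approximation plus
# (horizontal, vertical, diagonal) detail bands per level, covering the
# whole image rather than independent 8x8 blocks.
coeffs = pywt.wavedec2(img, wavelet="db2", level=3)
approx = coeffs[0]                          # coarse approximation
details = coeffs[1:]                        # list of (cH, cV, cD) tuples per level
```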
In naive ML scenarios you are right. You can think of JPEG as an input embedding. One of many. The JPEG/spectral embedding is useful because it already provides a miniature variational encoding that "makes sense" in terms of translation, sharpness, color, scale and texture.
But with clever ML you can design better variational characteristics, such as rotation, or nonlinear things like faces, fingers, projections and abstract objects.
Further, JPEG encoding/decoding will be an obstacle for many architectures that require gradients to flow back and forth between pixel space and JPEG space in order to run evaluation steps and loss functions based on pixel space (which would be superior). Not to mention if you need human feedback in generative scenarios to retouch the output and run training steps on the changed pixels.
And finally, there are already picture and video embeddings that are gradient-friendly and reusable.
>And finally, there are already picture and video embeddings that are gradient-friendly and reusable.
I have been thinking about such things for a while and considered things like giving each of R rows and each of C columns a vector, and using the inner product of row_i and col_j as pixel (i, j)'s intensity (in the simplest demonstrative case monochromatic, but reordering the floats in each vector before taking the inner product allows many more channels).
But this is just my quick, shallow concoction. If I look at the KonIQ-10k dataset, there are 10373 images at 1024 x 768, totaling 5.3GB. That's ~511KB per image. 511KB / (1024 + 768) = ~285 bytes for each row or column. Dividing by 4 for standard floats gives each column and each row a vector of ~71 (32-bit) floats. This would use absolutely no prior knowledge about human visual perception, so fitting these float vectors' inner products (and their permutations for different channels) to the image by the most naive metric (average per-pixel residual error) will probably not result in great images. But I'm curious how bad it performs. Perhaps I will try it out in a few hours.
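The fit itself is only a few lines; a minimal sketch of the idea (PyTorch, with a random image standing in for one monochrome picture, dimensions from the back-of-envelope above):

```python
import torch

H, W, d = 768, 1024, 71                    # rows, columns, floats per vector
img = torch.rand(H, W)                     # stand-in for one monochrome image

rows = torch.nn.Parameter(0.1 * torch.randn(H, d))
cols = torch.nn.Parameter(0.1 * torch.randn(W, d))
opt = torch.optim.Adam([rows, cols], lr=1e-2)

for step in range(2000):
    recon = rows @ cols.T                  # pixel (i, j) = <row_i, col_j>
    loss = (recon - img).abs().mean()      # naive average per-pixel residual
    opt.zero_grad()
    loss.backward()
    opt.step()
```

(Which also makes clear this is just a rank-71 factorization of the image, so quality will be limited by how low-rank natural images happen to be.)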
Do you have any references for such or similar simplistic embeddings? I don't want to force you to dig for me, but if you happen to know of a few such papers or perhaps even a review paper that would be welcome!
Not aware of this type of simplistic embedding. I think taking the first few layers of a large pretrained vision model will get you better results. Blindly learning from a generic image would probably steer it toward the dominant textures and shapes rather than linear operations like camera moves.
The simplest embeddings for vision should focus on camera primitives and invariants. Translation, rotation, scale, skew, projections, lighting. It doesn't matter that much what you use in the layers, but you should steer the training with augmented data. Like rotate and skew the objects in the batches to make sure the layers are invariant to these things.
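Something along these lines with torchvision, where the invariances come from the data pipeline rather than the architecture (parameters are placeholder choices):

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomAffine(degrees=15, translate=(0.1, 0.1),
                   scale=(0.9, 1.1), shear=10),   # rotation, shift, scale, skew
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
# Applied to each PIL image before it reaches the model.
```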
Next are some depth-mapping embeddings which go beyond flat camera awareness.
The best papers I've seen are face embeddings. You can get useful results with smaller models. There are of course deeper embeddings that focus on the whole scene and depth maps but those are huge.
I wondered a while back about using EGA or CGA for doing image recognition and/or for Stable Diffusion. Seems like there should be more than enough image data for that resolution and color depth (16 colors).
I remember hearing somewhere that our retina encodes the visuals it receives into a compressed signal before forwarding it to the visual cortex. If true, this may actually be how it's done "for real". ;)
> Why have your first layers of CNNs relearn what's already smartly encoded in the JPEG bits in the first place before it's blown into a bloated height x width x 3 float matrix?
Because data augmentation is much easier in the latter representation.
Also, if you rotate images as part of data augmentation, then that is already so expensive that any speedup from going directly to JPEG becomes negligible in comparison.