I was doing the same thing at Netflix around the same time as a 20% research project: training GANs end-to-end directly in JPEG coefficient space (and then rebuilding a JPEG from the generated coefficients using libjpeg to get an image). The pitch was that it not only worked, but you could get fast training by representing each JPEG block as a dense + sparse vector (dense for the low DCT coefficients, sparse for the high ones since they're ~all zeros) and using a neural network library with fast ops on sparse data.
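A minimal sketch of that per-block split (the zigzag table is the standard JPEG scan order; the cutoff of 10 dense coefficients and the helper name are illustrative, not how the actual project was coded):

    import numpy as np

    # Standard JPEG zigzag scan order for an 8x8 block, as flat indices into the block.
    ZIGZAG = np.array([
         0,  1,  8, 16,  9,  2,  3, 10, 17, 24, 32, 25, 18, 11,  4,  5,
        12, 19, 26, 33, 40, 48, 41, 34, 27, 20, 13,  6,  7, 14, 21, 28,
        35, 42, 49, 56, 57, 50, 43, 36, 29, 22, 15, 23, 30, 37, 44, 51,
        58, 59, 52, 45, 38, 31, 39, 46, 53, 60, 61, 54, 47, 55, 62, 63,
    ])

    def split_block(coeffs_8x8, n_dense=10):
        """Split one quantized 8x8 DCT block into a dense low-frequency vector and
        a sparse (index, value) representation of the mostly-zero high frequencies."""
        zz = coeffs_8x8.flatten()[ZIGZAG]          # zigzag scan: low -> high frequency
        dense = zz[:n_dense].astype(np.float32)    # low coefficients: usually nonzero
        high = zz[n_dense:]
        idx = np.nonzero(high)[0]                  # high coefficients: ~all zeros after quantization
        return dense, (idx, high[idx].astype(np.float32))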
Training on pixels is inefficient. Why have your first layers of CNNs relearn what's already smartly encoded in the JPEG bits in the first place before it's blown into a bloated height x width x 3 float matrix?
As an AI armchair quarterback, I've always held the opinion that the image ML space has a counter-productive bias towards not pre-processing images stemming from a mixture of test purity and academic hubris. "Look what this method can learn from raw data completely independently!" makes for a nice paper. So, they stick with sRGB inputs rather than doing basic classic transforms like converting to YUV420. Everyone learns from the papers, so that's assumed to be the standard practice.
In my experience, staying close to the storage format is very useful because it allows the neural network to correctly deal with clipped/saturated values. If your file is saved in sRGB and you train in sRGB, then when something turns to 0 or 255, the AI can handle it as a special case because most likely it was too bright or too dark for your sensor to capture accurately. If you first transform to a different color space, that clear clip boundary gets lost.
Also, I would prefer sRGB or RGB because it more closely matches the human visual system. That said, the RGB to YUV transformation is effectively a matrix multiplication, so if you use conv features like everyone does, then you can merge it into your weights for free.
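For the record, here is a minimal sketch of that merge in PyTorch (the matrix is the usual BT.601-style RGB to YCbCr approximation; the exact constants and the helper name are illustrative):

    import torch

    # Approximate BT.601-style RGB -> YCbCr matrix (illustrative constants).
    M = torch.tensor([[ 0.299,  0.587,  0.114],
                      [-0.169, -0.331,  0.500],
                      [ 0.500, -0.419, -0.081]])

    def fold_color_matrix(conv):
        """Fold the per-pixel 3x3 color transform into an existing first conv layer, so the
        network consumes raw RGB but effectively 'sees' YUV at zero runtime cost. Any fixed
        offsets (e.g. +128 on the chroma channels) would similarly fold into conv.bias."""
        with torch.no_grad():
            w = conv.weight  # (out_ch, 3, kH, kW), as if it operated on YUV input
            conv.weight.copy_(torch.einsum('ochw,cr->orhw', w, M))
        return conv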
Typical CNNs can learn a linear transformation of the input at no cost. Since YUV is such a linear transformation of RGB, there is no benefit in converting to it beforehand.
How is there not a cost associated with forcing the machine to learn how to do something that we already have a simple, deterministic algorithm for? Won't some engineer need to double check a few things with regard to the AI's idea of color space transform?
You could probably derive some smart initialization for the first layer of a NN based on domain knowledge (color spaces, sobel filters, etc.). But since this is such a small part of what the NN has to learn, I expect this to result in a small improvement in training time and have no effect on final performance and accuracy, so it's unlikely to be worth the complexity of developing such a feature.
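Purely as an illustration of what such an initialization could look like (the helper and the filter choice are hypothetical, not an established recipe):

    import torch
    import torch.nn as nn

    def init_first_layer_with_sobel(conv):
        """Seed two filters of the first conv layer with Sobel edge detectors;
        the remaining filters keep their random initialization."""
        sobel_x = torch.tensor([[-1., 0., 1.],
                                [-2., 0., 2.],
                                [-1., 0., 1.]])
        sobel_y = sobel_x.t()
        with torch.no_grad():
            conv.weight[0] = sobel_x.expand_as(conv.weight[0])  # horizontal gradients
            conv.weight[1] = sobel_y.expand_as(conv.weight[1])  # vertical gradients
        return conv

    layer = init_first_layer_with_sobel(nn.Conv2d(3, 64, kernel_size=3, padding=1))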
Your instincts are correct. Training is faster, more stable, and more efficient that way. In certain cases it "pretty much is irrelevant", but the advantages of the strategy of modelling the knowns and training only on the unknowns become starkly apparent when doing e.g. sensor fusion or other ML tasks on physical systems.
The bitter lesson is about not trying to encode impossible-to-formalize conceptual knowledge, not avoiding data efficiency and the need to scale the model up to ever higher parameter counts.
If we followed this logic, we'd be training LLMs on character-level UTF-32 and just letting it figure everything out by itself, while needing two orders of magnitude bigger contexts and parameter counts.
Converting from RGB to YUV is absolutely subject to the bitter lesson, because it is trying to generalize from a representation that we have seen works for some classical methods and hard-code that knowledge into the AI, which could easily learn (and will anyway) a more useful representation for itself.
> LLMs on character-level UTF-32 and just letting it figure everything out by itself, while needing two orders of magnitude bigger contexts and parameter counts.
This was tried extensively and honestly it is probably still too early to proclaim the demise of this approach. It's also completely different - you're conflating a representation that literally changes the number of forward passes you have to do (ie. the amount of computation - what the bitter lesson is about) vs. one that (at most) would just require stacking on a few layers or so.
A better example for your point (imo) would be audio recognition, where we pre-transform from wave amplitudes into log mel spectrogram for ingestion by the model. I think this will ultimately fall to the bitter lesson as well though.
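For reference, that pre-transform is tiny compared to the model; a typical front-end looks roughly like this (parameter values are common defaults, not prescriptive):

    import torch
    import torchaudio

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=80)

    waveform = torch.randn(1, 16000)           # one second of fake mono audio
    log_mel = torch.log(mel(waveform) + 1e-6)  # (1, 80, ~101) time-frequency features fed to the model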
Also a key difference is that you are proposing going from methods that already work to try to inject more classical knowledge into them. It is oftentimes the case that you'll have an intermediary fusion between deep + classical, but not if you already have working fully deep methods.
Heck, why even go that far? Given how much text we have in scanned books, just feed it scans of the books and let it dedicate a bunch of layers to learning OCR.
Or given the number of unscanned books, even just give it the controls for a book scanner, the books and probably some robot arms. Then let it figure out the scanning first in some layers. Shouldn't be that hard.
Right... but I don't see how that means that it doesn't fall to the bitter lesson.
The bitter lesson is not saying that the model will always relearn the same representation as the one that has been useful to humans in the past, merely that the model will learn a better representation for the task at hand than the one hand-coded by humans.
If the model could easily learn the representation useful to humans, then it will fall to the bitter lesson because at minimum the model could easily follow our path (it's just an affine transformation to learn) and more probably will learn very different (& better) representations for itself.
LLMs can't reason about spelling, e.g. asking for a sentence which contains no letter "a"; and can also struggle with rhyming, etc. The most obvious explanation is that they never 'see' the underlying letters/spelling, only tokens.
Been hearing that for half my adult life. People were 100% sure multicore in 2005 meant manufacturers were officially signalling it and it was time to invest in auto-parallelizable code.
I don't think it's wrong, but looking at it through a child's eyes, we do keep finding ways to do things we couldn't a couple of years ago: an open mind on hardware and more focus on software are keeping deep innovation cycles going.
Leaving aside that we're still far from hitting the limits to growth outlined in that book, and that we can exceed those limits to growth by expanding outside of Earth, what does a book about physical limitations on agriculture and industry have to do with limitations on computing efficiency? There is of course some fundamental limit to computing efficiency, but for all we know we could be many orders of magnitude away from hitting it.
We've clearly fallen behind the exponential curve on clock speed. But the great thing is we can parallelize transformers, so it's not as big of a deal.
It's also probably because a lot of the knowledge about how JPEG works is tied up in signal processing books that usually front-load a bunch of mathematics, as opposed to ML, which often needs a bit of mathematical intuition but in practice is usually empirical.
I was trying to learn how the discrete cosine transform works, so I looked up some code in an open source program. The code said it copied it verbatim from a book from the 90s. I looked up the book and the book said it copied it verbatim from a paper from the 1970s.
We used to do a lot more preprocessing to NLP, like stemming, removing stop words, or even adding grammar information (NP, VP, etc.).
Now we just do basic tokenization. The rest turned out to be irrelevant or even counter productive.
But also, that basic tokenization is essential; training it on a raw ascii stream would be much less efficient. There is a sweet spot of processing & abstraction that should be aimed for.
Short text representations (via good tokenization) significantly reduce the computational cost of a transformer (you need to generate fewer tokens for the same output length, and need fewer tokens to represent the same window size). I think these combine to n^3 scaling (n^2 from window size and n from output size).
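A back-of-the-envelope illustration of that claim, using the comment's own accounting and an assumed ~4 characters per token:

    def cost(n):
        # the comment's accounting: ~n^2 attention over the window, times n generated symbols
        return n**2 * n

    chars_per_token = 4                     # a typical BPE tokenizer averages a few characters per token
    n_tokens = 1_000                        # output length in tokens
    n_chars = n_tokens * chars_per_token    # the same output generated character by character

    print(cost(n_chars) / cost(n_tokens))   # 64x: a 4x shorter representation under cubic scaling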
For images it's not clear to me if there are any preprocessing methods that do a lot better than resizing the image to a smaller resolution (which is commonly done already).
> rather than doing basic classic transforms like converting to YUV420
What will converting to YUV420 achieve though, except for 4:2:0 chroma subsampling? YUV has little basis in human perception to begin with, it's a color television legacy model used for compression. There are much better models if you want to extract the perceptual information from the picture.
I would worry that the fixed, non-overlapping block nature of a JPEG would reduce translation invariance - shift an image by 4 pixels and the DCT coefficients may look very different. People have been doing a lot of work to try to reduce the dependence of the image on the actual pixel coordinates - see for example https://research.nvidia.com/publication/2021-12_alias-free-g...
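You can see the effect in a few lines (synthetic data, purely illustrative): shift an image by half a block and the per-block DCT coefficients change substantially.

    import numpy as np
    from scipy.fft import dctn

    rng = np.random.default_rng(0)
    img = rng.random((16, 16))

    def first_block_dct(x):
        # 2-D DCT-II of the top-left 8x8 block, as JPEG computes it
        return dctn(x[:8, :8], norm='ortho')

    orig = first_block_dct(img)
    shifted = first_block_dct(np.roll(img, 4, axis=1))           # shift 4 pixels horizontally
    print(np.abs(orig - shifted).mean() / np.abs(orig).mean())   # typically O(1): big relative change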
JPEG-2000 uses wavelets as a decomposition basis as opposed to DCT which in theory makes it possible to treat the whole image as a single block while ensuring high compression. In practice though tiles are used, I would guess to improve on memory and compute parallelism.
In naive ML scenarios you are right. You can think of JPEG as an input embedding. One of many. The JPEG/spectral embedding is useful because it already provides miniature variational encoding that "makes sense" in terms of translation, sharpness, color, scale and texture.
But with clever ML you can design better variational characteristics such as rotation, or nonlinear things like faces, fingers, projections and abstract objects.
Further, JPEG encoding/decoding will be an obstacle for many architectures that require gradients going back and forth between pixel space and JPEG in order to do evaluation steps and loss functions based on the pixel space (which would be superior). Not to mention if you need human feedback in generative scenarios to retouch the output and run training steps on the changed pixels.
And finally, there are already picture and video embeddings that are gradient-friendly and reusable.
>And finally, there are already picture and video embeddings that are gradient-friendly and reusable.
I have been thinking about such things for a while and considered things like giving each of R rows and each of C columns a vector, and using the inner product of row_i and col_j as the intensity of pixel (i, j) (in the simplest demonstrative case monochromatic, but reordering the floats in each vector before taking the inner product allows many more channels).
But this is just my quick, shallow concoction. If I look at the konicq10k dataset, there are 10373 images at 1024 x 768 totaling 5.3GB. That's ~511KB per image. 511KB / (1024 + 768) = 285 bytes for each row or column. Dividing by 4 for standard floats, that gives each column and each row a vector of 71 (32-bit) floats. This would use absolutely no prior knowledge about human visual perception, so fitting these float vectors' inner products (and their permutations for different channels) to the image by the most naive metric (average per-pixel residual error) will probably not result in great images. But I'm curious how bad it performs. Perhaps I will try it out in a few hours.
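If it helps, here is a minimal sketch of that experiment (the sizes match the budget above; everything else, including the random stand-in image, is a placeholder):

    import torch

    H, W, D = 768, 1024, 71
    image = torch.rand(H, W)                   # stand-in for a real grayscale image in [0, 1]

    rows = torch.nn.Parameter(0.1 * torch.randn(H, D))
    cols = torch.nn.Parameter(0.1 * torch.randn(W, D))
    opt = torch.optim.Adam([rows, cols], lr=1e-2)

    for step in range(2000):
        recon = rows @ cols.T                  # (H, W): pixel (i, j) = <row_i, col_j>
        loss = ((recon - image) ** 2).mean()   # the naive average per-pixel residual
        opt.zero_grad()
        loss.backward()
        opt.step()

Note that in the monochrome case this is exactly a rank-71 factorization of the image, so a truncated SVD already gives the MSE-optimal answer; the learned version only becomes interesting once you add the channel permutations or a non-quadratic loss.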
Do you have any references for such or similar simplistic embeddings? I don't want to force you to dig for me, but if you happen to know of a few such papers or perhaps even a review paper that would be welcome!
Not aware of this type of simplistic embeddings. I think taking the first few layers of a large pretrained vision model will get you better results. Blindly learning from a generic image would probably steer it toward the dominant textures and shapes rather than the linear operations of camera motion.
The simplest embeddings for vision should focus on camera primitives and invariants. Translation, rotation, scale, skew, projections, lighting. It doesn't matter that much what you use in the layers, but you should steer the training with augmented data. Like rotate and skew the objects in the batches to make sure the layers are invariant to these things.
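In practice that steering often amounts to a stack of random geometric transforms on the input batches; a rough torchvision sketch (the specific ranges are illustrative):

    import torchvision.transforms as T

    augment = T.Compose([
        T.RandomAffine(degrees=25, translate=(0.1, 0.1), scale=(0.8, 1.2), shear=10),
        T.RandomHorizontalFlip(),
        T.ColorJitter(brightness=0.2, contrast=0.2),  # crude stand-in for lighting changes
        T.ToTensor(),
    ])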
Next are some depth-mapping embeddings which go beyond flat camera awareness.
The best papers I've seen are face embeddings. You can get useful results with smaller models. There are of course deeper embeddings that focus on the whole scene and depth maps but those are huge.
I wondered a while back about using EGA or CGA for doing image recognition and/or for stable diffusion. Seems like there should be more than enough image data for that resolution and color depth (16 colors).
I remember I've heard somewhere that our retina encodes the visuals it receives into a compressed signal before forwarding it to the visual cortex. If true, this may actually be how it's done "for real". ;)
> Why have your first layers of CNNs relearn what's already smartly encoded in the JPEG bits in the first place before it's blown into a bloated height x width x 3 float matrix?
Because data augmentation is much easier in the latter representation.
Also, if you rotate images as part of data augmentation, then that is already so expensive that any speedup from going directly to JPEG becomes negligible in comparison.
I'm one of the authors of this CVPR paper -- cool to see our work mentioned on HN!
The Uber paper from 2018 is one that has been floating around in the back of my head for a while. Decoding DCT to RGB is essentially an 8x8 stride 8 convolution -- it seems wasteful to perform this operation on CPU for data loading, then immediately pass the resulting decoded RGB into convolution layers that probably learn similar filters as those used during DCT decoding anyway.
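To make that equivalence concrete, here is a hedged sketch (not the paper's code): lay out the 64 DCT coefficients of each 8x8 luma block as 64 channels at 1/8 resolution, and the inverse DCT is a fixed 8x8, stride-8 transposed convolution whose filters are the orthonormal DCT basis images.

    import math
    import torch
    import torch.nn as nn

    def dct_basis():
        """The 64 orthonormal 8x8 DCT-II basis images, one per (u, v) frequency pair."""
        alpha = lambda u: math.sqrt(1 / 8) if u == 0 else math.sqrt(2 / 8)
        basis = torch.empty(64, 1, 8, 8)
        for u in range(8):
            for v in range(8):
                for x in range(8):
                    for y in range(8):
                        basis[u * 8 + v, 0, x, y] = (alpha(u) * alpha(v)
                            * math.cos((2 * x + 1) * u * math.pi / 16)
                            * math.cos((2 * y + 1) * v * math.pi / 16))
        return basis

    decode = nn.ConvTranspose2d(64, 1, kernel_size=8, stride=8, bias=False)
    with torch.no_grad():
        decode.weight.copy_(dct_basis())    # weight shape: (in=64, out=1, 8, 8)

    coeffs = torch.randn(1, 64, 28, 28)     # block DCTs of a 224x224 grayscale image
    pixels = decode(coeffs)                 # (1, 1, 224, 224)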
Compared to the earlier Uber paper, our CVPR paper makes two big advances:
(1) Cleaner architecture: The Uber paper uses a CNN, while we use a ViT. It's kind of awkward to modify an existing CNN architecture to accept DCT instead of RGB since the grayscale data is 8x lower resolution than RGB, and the color information is 16x lower than RGB. With a CNN, you need to add extra layers to deal with the downsampled input, and use some kind of fusion mechanism to fuse the luma/chroma data of different resolution. With a ViT it's very straightforward to accept DCT input; you only need to change the patch embedding layer, and the body of the network is unchanged.
(2) Data augmentation: The original Uber paper only showed speedup during inference. During training they need to perform data augmentation, so they convert DCT to RGB, augment in RGB, then convert back to DCT to feed the augmented data to the model. This means that their approach will be slower during training vs an RGB model. In our paper we show how to perform all standard image augmentations directly in DCT, so we get speedups during both training and inference.
This makes sense in theory, but is hard to get working in practice.
We tried using nvjpeg to do JPEG decoding on GPU as an additional baseline, but using it as a drop-in replacement in a standard training pipeline gives huge slowdowns for a few reasons:
(1) Batching: nvjpeg isn't batched; you need to decode one at a time in a loop. This is slow but could in principle be improved with a better GPU decoder.
(2) Concurrent data loading / model execution: In a standard training pipeline, the CPU is loading and augmenting data on CPU for the next batch in parallel with the model running forward / backward on the current batch. Using the GPU for decoding blocks it from running the model concurrently. If you were careful I think you could probably find a way to interleave JPEG decoding and model execution on the GPU, but it's not straightforward. Just naively swapping out to use nvjpeg in a standard PyTorch training pipeline gives very bad performance.
(3) Data augmentation: If you do DCT -> RGB decoding on the GPU, then you have to think about how and where to do data augmentation. You can augment in DCT either on CPU or on GPU; however DCT augmentation tends to be more expensive than RGB augmentation (especially for resize operations), so if you are already going through the trouble of decoding to RGB then it's probably much cheaper to augment in RGB. If you augment in RGB on GPU, then you are blocking parallel model execution for both JPEG decoding and augmentation, and problem (2) gets even worse. If you do RGB augmentation on CPU, you end up with an extra GPU -> CPU -> GPU round trip on every model iteration which again reduces performance.
I'm just a low tier ML engineer, but I'd say you generally want to avoid splitting GPU resources over many libraries, to the extent it's even practically possible.
Ha, I remember the poster from the conference, it was quite crowded when I passed by. This one seemed to have a big focus on data augmentation in the DCT space. I was asking myself (and the author) whether you couldn’t eke out a little more efficiency by trying to quantize your network similarly to the default JPEG quantization table. As I understood, currently all weights are quantized uniformly, which does not make sense when your inputs are heavily quantized, does it? Maybe I should dive a little deeper into the Uber paper, they were focusing a bit more in the quantization part. Sorry if I’m talking nonsense, this is absolutely not my area, but I found the topic captivating.
Thank you. First published 2022-11-29 on arxiv [0] and updated one month ago.
Interesting line: "With these two improvements -- ViT and data augmentation -- we show that our ViT-Ti model achieves up to 39.2% faster training and 17.9% faster inference with no accuracy loss compared to the RGB counterpart."
> Accuracy gains are due primarily to the specific use of a DCT representation, which turns out to work curiously well for image classification.
It would seem quantization is a useful tool for any sort of NN-style application.
If the expected output is intended to be human-like, why not feed it information that a typical human could not distinguish from a lossless representation? Seems like a simple game of expectations and information theory.
That's kind of the key theory behind why JPEG (and other lossy encodings) work at all. A perfect being would see a JPEG next to a PNG or TIFF and find the first repugnantly error-ridden.
But we tend to ignore high-frequency data's specifics most of the time, so it psychologically works.
I often wonder though, what do my cat and dog hear when I'm playing compressed music? Does it sound like a muddy phone call to them?
Audio is decidedly less "compressible" in human perceptual terms. The brain is amazingly skilled at detecting time delay and frequency deviations, so this perceptual baseline likely extends (mostly) to your pets.
You can fool the eyes a lot more easily. You can take away 50% or more of the color information before even a skilled artist will start noticing.
There are real differences in audio perception, though. Frequency range and sensitivity to different frequencies is a big difference in other animals; I would expect cats (who chase rodents, which often have very high-pitched or even ultrasonic vocalizations) to be more sensitive to high frequencies than humans, and thus low-passed / low-sample-rate audio could sound 'bad.'
Another aspect is time resolution. Song birds can have 2-4x the time resolution of human hearing, which helps distinguish sounds in their very fast, complex calls. This may lead to better perception of artifacts in lossy coding schemes, but it's hard to say for sure.
True but hearing is logarithmic in both volume and frequency domains. Double the power does not equate to anything near double the loudness. Similarly each doubling of frequency is only one octave higher. Hearing up to 80khz doesn't mean hearing 4x more than humans... 10 octaves for humans, 12 octaves for cats. In a musical sense it probably isn't noticeable.
The extreme upper limit of human hearing is around 20khz, so cats really are hearing things that we don't, and for good reasons.
Sensitivity to different frequency ranges is more or less independent of anything else. Birds have heightened frequency response in the range they vocalize in, which helps them hear others of their species. Same for us; we vocalize at relatively low frequencies, so most of our hearing ability is focused on that range. There is also a range below which we don't hear: infrasound, which is utilized by elephants.
Logarithmic perception is certainly real, but the tuning of which frequency ranges an animal is more or less sensitive to is certainly species dependent.
As a comparison with removing the top two octaves from a cat's hearing, try removing the top two octaves from an audio file compared to your hearing range (lowpass at 5 kHz or less if you have hearing range loss, and/or resample to 10 kHz/ksps or less) and see if the results are musically noticeable. (At least for humans, the result is intelligible but heavily muffled, I can't speak for my pet cats though.)
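One quick way to run that experiment, assuming torchaudio and placeholder filenames:

    import torchaudio
    import torchaudio.functional as F

    # Knock out roughly the top two octaves of a 44.1 kHz file by resampling
    # down to 10 kHz (Nyquist ~5 kHz) and back up again.
    waveform, sr = torchaudio.load("music.wav")
    down = F.resample(waveform, orig_freq=sr, new_freq=10_000)
    muffled = F.resample(down, orig_freq=10_000, new_freq=sr)
    torchaudio.save("music_muffled.wav", muffled, sr)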
That's fair... I'm just saying you shouldn't compare 20hz-20khz vs 20hz-80khz then decide cats can hear "three times" as much as humans. Two octaves is more than zero but a lot less than 3x.
while work has been done to characterize frequency sensitivity across species (which does vary quite a bit, especially in the higher ranges (>20khz)), i haven't seen any work that has been done to explore frequency domain perceptual masking curves in a cross species context.
since some species use their auditory systems for spatial localization, i would guess that the perceptual system would be totally different in those contexts.
No, audio compression doesn't filter out high frequencies, that's just what computer audio as a whole does. And I don't think there's enough of those high frequency components in what humans typically record for a cat or dog to notice the difference. As far as compression, the tricks that work on us should work on them.
the early xing mp3 codec famously cut everything off above 18khz, but that was out of spec. :)
instead perceptual audio compression typically filters out frequencies that neighbor other frequencies with lots of power. deleting these neighbors is called perceptual masking and to the best of my knowledge, we do not actually know if it works the same way in animal auditory systems.
>MP3 compression works by reducing (or approximating) the accuracy of certain components of sound that are considered (by psychoacoustic analysis) to be beyond the hearing capabilities of most humans.
-via Wikipedia
This holds true for most other audio compression as well.
Now, it's true that max recording frequency is bounded by sample rate via the Nyquist theorem, but that doesn't mean we're incapable of recording at higher fidelity - we just don't bother most of the time, because on consumer hardware it's going to be filtered out eventually anyway (or just not reproduced well enough, due to low-quality physical hardware). Recording studios will regularly produce masters that far exceed that normal hearing range though.
That is legitimate, but that's not the point here. The point is Uber's hubris. A hubris very useful in pumping up its stock price ahead of an IPO. If they had quietly planned to license it from the get-go, nobody would have mentioned it.
Any reason to prefer Waymo over Cruise? I saw a driverless Cruise taxi in SF just the other day. And are there any other competitors still in this space worth watching?
I’ve tried this idea in the past and found it not very useful in practice. It breaks when you want to add image augmentation during training, and JPEG is anyway a pretty lousy format for storing training samples if you care about saving space or retaining image quality.
Finally someone who did this, I've always thought this was a low hanging fruit. I wonder if you could make interesting and quickly trained diffusion models with this trick.
Note that it is from 2018. As someone here already mentioned there is a paper that applies the same idea to Vision Transformers published this year [1].
The quantization used for JPEG is optimized to throw away information in the frequency space that doesn't matter much to human perception, but I wonder if that is also optimal for training neural networks?
Also, as far as I know, the human eye doesn't process images in blocks. I also wonder how blockless encoders such as JPEG 2000 would fare in this approach.
Might be related: https://arxiv.org/abs/2211.16421
It's a paper about directly learning via JPEG encodings, which works well with vision transformers' patch mechanism.