Faster neural networks straight from JPEG (2018) (uber.com)
253 points by Anon84 on July 13, 2023 | 104 comments



I was doing the same thing at Netflix around the same time as a 20% research project. Training GANs end2end directly in JPEG coeffs space (and then rebuild a JPEG from the generated coeffs using libjpeg to get an image). The pitch was that it not only worked, but you could get fast training by representing each JPEG block as a dense + sparse vector (dense for the low DCT coeffs, sparse for the high ones since they're ~all zeros) and using a neural network library with fast ops on sparse data.

Training on pixels is inefficient. Why have your first layers of CNNs relearn what's already smartly encoded in the JPEG bits in the first place before it's blown into a bloated height x width x 3 float matrix?
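
A minimal sketch of the dense + sparse split (assuming the coefficients for each 8x8 block are already extracted in zigzag order; the cut-off k and the SciPy sparse format are just illustrative choices, not what we actually shipped):

    import numpy as np
    import scipy.sparse as sp

    # blocks_zigzag: (n_blocks, 64) DCT coefficients per 8x8 block,
    # in zigzag order so the low frequencies come first.
    def split_blocks(blocks_zigzag, k=10):
        dense = np.asarray(blocks_zigzag[:, :k])      # low freqs: mostly non-zero
        sparse = sp.csr_matrix(blocks_zigzag[:, k:])  # high freqs: mostly zeros
        return dense, sparse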


As an AI armchair quarterback, I've always held the opinion that the image ML space has a counter-productive bias towards not pre-processing images stemming from a mixture of test purity and academic hubris. "Look what this method can learn from raw data completely independently!" makes for a nice paper. So, they stick with sRGB inputs rather than doing basic classic transforms like converting to YUV420. Everyone learns from the papers, so that's assumed to be the standard practice.


In my experience, staying close to the storage format is very useful because it allows the neural network to correctly deal with clipped/saturated values. If your file is saved in sRGB and you train in sRGB, then when something turns to 0 or 255, the AI can handle it as a special case because most likely it was too bright or too dark for your sensor to capture accurately. If you first transform to a different color space, that clear clip boundary gets lost.

Also, I would prefer sRGB or RGB because it more closely matches the human visual system. That said, the RGB to YUV transformation is effectively a matrix multiplication, so if you use convolutional features like everyone does, you can merge it into your weights for free.
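
For example, a minimal sketch of folding the (approximate BT.601) RGB-to-YUV matrix into a first conv layer's weights; the function name and shapes are just for illustration:

    import torch

    # Approximate BT.601 RGB -> YUV matrix (rows: Y, U, V; columns: R, G, B).
    M = torch.tensor([[ 0.299,  0.587,  0.114],
                      [-0.169, -0.331,  0.500],
                      [ 0.500, -0.419, -0.081]])

    def fold_yuv_into_conv(weight_yuv):
        # weight_yuv: (out_ch, 3, kh, kw) conv weight trained on YUV input.
        # Returns an equivalent weight applied directly to RGB:
        # W_rgb[o, r, h, w] = sum_c W_yuv[o, c, h, w] * M[c, r]
        return torch.einsum('ochw,cr->orhw', weight_yuv, M)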


Would this apply to music formats as well?


Typical CNNs can learn a linear transformation of the input at no cost. Since YUV is such a linear transformation of RGB, there is no benefit in converting to it beforehand.


How is there not a cost associated with forcing the machine to learn how to do something that we already have a simple, deterministic algorithm for? Won't some engineer need to double check a few things with regard to the AI's idea of color space transform?


You could probably derive some smart initialization for the first layer of a NN based on domain knowledge (color spaces, sobel filters, etc.). But since this is such a small part of what the NN has to learn, I expect this to result in a small improvement in training time and have no effect on final performance and accuracy, so it's unlikely to be worth the complexity of developing such a feature.
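
For example, a rough sketch of seeding a couple of first-layer filters with Sobel kernels in PyTorch (layer shape and scaling are arbitrary, purely illustrative):

    import torch
    import torch.nn as nn

    sobel_x = torch.tensor([[-1., 0., 1.],
                            [-2., 0., 2.],
                            [-1., 0., 1.]])

    conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)  # remaining filters stay random
    with torch.no_grad():
        conv1.weight[0] = sobel_x.expand(3, 3, 3)        # horizontal-gradient filter
        conv1.weight[1] = sobel_x.t().expand(3, 3, 3)    # vertical-gradient filter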


Absolutely this.

Seems like on HN people are still learning 'the bitter lesson'.


Amdahl’s law?



Thank you!


Sorry - should have included a cite.

That said, Amdahl's law is also probably related in some degree - I would view YUV conversion as an unnecessary optimization.


Your instincts are correct. Training is faster, more stable, and more efficient that way. In certain cases it "pretty much is irrelevant" but the advantages of the strategy of modelling the knowns and training only on the unknowns becomes starkly apparent when doing e.g. sensor fusion or other ML tasks on physical systems.


In deep ML, people are pretty familiar with the bitter lesson and don't want to waste time on this.


The bitter lesson is about not trying to encode impossible-to-formalize conceptual knowledge; it is not an argument for abandoning data efficiency and just scaling the model up to ever higher parameter counts.

If we followed this logic, we'd be training LLMs on character-level UTF-32 and just letting it figure everything out by itself, while needing two orders of magnitude bigger contexts and parameter counts.


Converting from RGB to YUV is absolutely subject to the bitter lesson, because it takes a representation that we have seen work for some classical methods and hard-codes that knowledge into the AI, which could easily learn (and will anyway) a more useful representation for itself.

> LLMs on character-level UTF-32 and just letting it figure everything out by itself, while needing two orders of magnitude bigger contexts and parameter counts.

This was tried extensively, and honestly it is probably still too early to proclaim the demise of this approach. It's also completely different - you're conflating a representation that literally changes the number of forward passes you have to do (i.e. the amount of computation - what the bitter lesson is about) with one that (at most) would just require stacking on a few layers or so.

A better example for your point (imo) would be audio recognition, where we pre-transform from wave amplitudes into log mel spectrogram for ingestion by the model. I think this will ultimately fall to the bitter lesson as well though.

Also a key difference is that you are proposing going from methods that already work to try to inject more classical knowledge into them. It is oftentimes the case that you'll have an intermediary fusion between deep + classical, but not if you already have working fully deep methods.


Heck, why even go that far? Given how much text we have in scanned books, just feed it scans of the books and let it dedicate a bunch of layers to learning OCR.


Or given the number of unscanned books, even just give it the controls for a book scanner, the books and probably some robot arms. Then let it figure out the scanning first in some layers. Shouldn't be that hard.


RGB->YUV is literally an affine transform, of course it falls to the bitter lesson.


Does it? Because I'm not sure the model has any intrinsic incentive to learn to follow how human perception works.


Right... but I don't see how that means that it doesn't fall to the bitter lesson.

The bitter lesson is not saying that the model will always relearn the same representation as the one that has been useful to humans in the past, merely that the model will learn a better representation for the task at hand than the one hand-coded by humans.

If the model could easily learn the representation useful to humans, then it will fall to the bitter lesson because at minimum the model could easily follow our path (it's just an affine transformation to learn) and more probably will learn very different (& better) representations for itself.


This will absolutely be the case N doublings of Moore's law from here. Tokens are information loss.


Information loss, or the result of useful computation? VAEs exist after all.


LLMs can't reason about spelling, e.g. asking for a sentence which contains no letter "a"; and can also struggle with rhyming, etc. The most obvious explanation is that they never 'see' the underlying letters/spelling, only tokens.


Keep in mind Moore's law is coming to its end.


Been hearing that for half my adult life. People were 100% sure multicore in 2005 meant manufacturers were officially signalling it and it was time to invest in auto-parallelizable code.

I don't think it's wrong, but looking at it through a child's eyes, we do keep finding ways to do things we couldn't a couple of years ago: an open mind on hardware and more focus on software are continuing deep innovation cycles.


There are limits to growth[1]. God-like tech utopia isn't and won't be real.

1: https://www.clubofrome.org/publication/the-limits-to-growth/


Leaving aside that we're still far from hitting the limits to growth outlined in that book, and that we can exceed those limits to growth by expanding outside of Earth, what does a book about physical limitations on agriculture and industry have to do with limitations on computing efficiency? There is of course some fundamental limit to computing efficiency, but for all we know we could be many orders of magnitude away from hitting it.


The original study has been revisited and its projections have held up so far. An analysis: https://medium.com/@CollapseSurvival/overshoot-why-its-alrea... Humanity likely won't ever be able to permanently settle outside Earth.


Do I need to repeat myself? What do limits on agriculture have to do with limits on computing?


^ equivalent of ideological salesman ringing my doorbell. Absolutely nothing to do with anything I said.


We've clearly fallen behind the exponential curve on clock speed. But the great thing is we can parallelize transformers, so it's not as big of a deal.


If you want to play by the bitter lesson, why don't you just feed the raw JPEG bits into your neural network?


It's also probably because a lot of knowledge about how JPEG works is tied up in signal processing books that usually front-load a bunch of mathematics, as opposed to ML, which often needs a bit of mathematical intuition but in practice is usually empirical.


I was trying to learn how the discrete cosine transform works, so I looked up some code in an open source program. The code said it copied it verbatim from a book from the 90s. I looked up the book and the book said it copied it verbatim from a paper from the 1970s.


The code is irrelevant, in the eyes of the theoretician anyway.


I am theoretically challenged.


End-to-end ML models are the goal regardless of efficiency, just like software engineers aim for higher-level interfaces.


That's something that surprises me too, given the preprocessing applied to NLP.


We used to do a lot more preprocessing for NLP, like stemming, removing stop words, or even adding grammar information (NP, VP, etc.). Now we just do basic tokenization. The rest turned out to be irrelevant or even counterproductive.


But also, that basic tokenization is essential; training on a raw ASCII stream would be much less efficient. There is a sweet spot of processing & abstraction that should be aimed for.


Short text representations (via good tokenization) significantly reduce the computational cost of a transformer (you need to generate fewer tokens for the same output length, and need fewer tokens to represent the same window size). I think these combine to n^3 scaling (n^2 from window size and n from output size).
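
(Rough worked example: under that n^3 estimate, a tokenizer that halves the sequence length for the same text cuts the cost by roughly 2^3 = 8x.)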

For images it's not clear to me if there are any preprocessing methods that do a lot better than resizing the image to a smaller resolution (which is commonly done already).


> rather than doing basic classic transforms like converting to YUV420

What will converting to YUV420 achieve though, except for 4:2:0 chroma subsampling? YUV has little basis in human perception to begin with, it's a color television legacy model used for compression. There are much better models if you want to extract the perceptual information from the picture.


I would worry that the fixed, non-overlapping block nature of a JPEG would reduce translation invariance - shift an image by 4 pixels and the DCT coefficients may look very different. People have been doing a lot of work to try to reduce the dependence of the image on the actual pixel coordinates - see for example https://research.nvidia.com/publication/2021-12_alias-free-g...
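
A quick toy illustration of the point (made-up data, nothing special about the numbers): compare the 8x8 DCT of a block with the DCT of the same content shifted by 4 pixels.

    import numpy as np
    from scipy.fft import dctn

    rng = np.random.default_rng(0)
    strip = rng.random((8, 16))              # toy 8x16 strip of "pixels"

    block         = strip[:, 0:8]            # one 8x8 block
    block_shifted = strip[:, 4:12]           # same content, shifted 4 pixels

    c1 = dctn(block, norm='ortho')
    c2 = dctn(block_shifted, norm='ortho')
    print(np.abs(c1 - c2).mean())            # coefficients differ substantially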


On the other hand, ViT uses non-overlapping patches anyway, so the impact may be minor. Example code: https://nn.labml.ai/transformers/vit/index.html


Does JPEG-2000 fix that? From what I gathered, it doesn't use blocks.


JPEG-2000 uses wavelets as a decomposition basis as opposed to DCT, which in theory makes it possible to treat the whole image as a single block while ensuring high compression. In practice, though, tiles are used, I would guess to improve memory use and compute parallelism.


In naive ML scenarios you are right. You can think of JPEG as an input embedding. One of many. The JPEG/spectral embedding is useful because it already provides miniature variational encoding that "makes sense" in terms of translation, sharpness, color, scale and texture.

But with clever ML you can design better variational characteristics, such as rotation or nonlinear things like faces, fingers, projections and abstract objects.

Further, JPEG encoding/decoding will be an obstacle for many architectures that require gradients going back and forth between pixel space and JPEG in order to do evaluation steps and loss functions based on pixel space (which would be superior). Not to mention if you need human feedback in generative scenarios to retouch the output and run training steps on the changed pixels.

And finally, there are already picture and video embeddings that are gradient-friendly and reusable.


>And finally, there are already picture and video embeddings that are gradient-friendly and reusable.

I have been thinking about such things for a while and considered things like giving each of R rows and each of C columns a vector, and using the inner product of row_i and col_j as pixel (i, j)'s intensity (in the simplest demonstrative case monochromatic, but reordering the floats in each vector before taking the inner product allows many more channels).

But this is just my quick shallow concoction. If I look at the konicq10k dataset, there are 10373 images at 1024 x 768, totaling 5.3 GB. That's ~511 KB per image. 511 KB / (1024 + 768) = 285 bytes for each row or column. Dividing by 4 for standard floats, that gives each column and each row a vector of 71 (32-bit) floats. This would use absolutely no prior knowledge about human visual perception, so fitting these float vectors' inner products (and their permutations for different channels) to the image by the most naive metric (average per-pixel residual error) will probably not result in great images. But I'm curious how bad it performs. Perhaps I will try it out in a few hours.
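
Roughly what I have in mind, as a toy PyTorch sketch for the monochrome case (the random image, rank and learning rate are placeholders):

    import torch

    H, W, d = 768, 1024, 71
    img = torch.rand(H, W)                        # stand-in for a real grayscale image

    rows = (0.1 * torch.randn(H, d)).requires_grad_()
    cols = (0.1 * torch.randn(W, d)).requires_grad_()
    opt = torch.optim.Adam([rows, cols], lr=1e-2)

    for step in range(2000):
        recon = rows @ cols.t()                   # recon[i, j] = <row_i, col_j>
        loss = ((recon - img) ** 2).mean()        # naive per-pixel residual error
        opt.zero_grad()
        loss.backward()
        opt.step()

Of course this is just a rank-71 factorization of the image, so a truncated SVD would give the least-squares optimum directly.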

Do you have any references for such or similar simplistic embeddings? I don't want to force you to dig for me, but if you happen to know of a few such papers or perhaps even a review paper that would be welcome!


Not aware of this type of simplistic embeddings. I think taking the first few layers of a large pretrained vision model will get you better results. Blindly learning from a generic image would probably steer it toward the dominant textures and shapes rather than the linear operations of camera moves.

The simplest embeddings for vision should focus on camera primitives and invariants. Translation, rotation, scale, skew, projections, lighting. It doesn't matter that much what you use in the layers, but you should steer the training with augmented data. Like rotate and skew the objects in the batches to make sure the layers are invariant to these things.

Next are some depth-mapping embeddings which go beyond flat camera awareness.

The best papers I've seen are face embeddings. You can get useful results with smaller models. There are of course deeper embeddings that focus on the whole scene and depth maps but those are huge.


I wondered a while back about using EGA or CGA for doing image recognition and/or for stable diffusion. Seems like there should be more than enough image data for that resolution and color depth (16 colors).


I remember hearing somewhere that our retina encodes the visuals it receives into a compressed signal before forwarding it to the visual cortex. If true, this may actually be how it's done "for real". ;)


The retina does a lot of processing.


> Why have your first layers of CNNs relearn what's already smartly encoded in the JPEG bits in the first place before it's blown into a bloated height x width x 3 float matrix?

Because data augmentation is much easier in the latter representation.

Also, if you rotate images as part of data augmentation, then that is already so expensive that any speedup from going directly to JPEG becomes negligible in comparison.


Fascinating.

How did this approach handle the same image being encoded in different ways by different JPEG libraries? Or just with different quality settings?


Seems like you are using tricks similar to what compressed sensing people do: work with data in a sparser domain.


This is really interesting. Have you published your research anywhere that I could read?


For those interested, a modern version (vision transformers) was just published this year at CVPR https://openaccess.thecvf.com/content/CVPR2023/html/Park_RGB...


I'm one of the authors of this CVPR paper -- cool to see our work mentioned on HN!

The Uber paper from 2018 is one that has been floating around in the back of my head for a while. Decoding DCT to RGB is essentially an 8x8 stride 8 convolution -- it seems wasteful to perform this operation on CPU for data loading, then immediately pass the resulting decoded RGB into convolution layers that probably learn similar filters as those used during DCT decoding anyway.
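
Here's a rough sketch of that equivalence in PyTorch - just an illustration, not the code from our paper, and the 64-channel coefficient layout is an arbitrary convention:

    import math
    import torch
    import torch.nn as nn

    def idct_as_conv_transpose():
        # weight[u*8+v, 0, x, y] = alpha(u) * alpha(v)
        #   * cos((2x+1) u pi / 16) * cos((2y+1) v pi / 16)
        def alpha(k):
            return math.sqrt(1.0 / 8) if k == 0 else math.sqrt(2.0 / 8)
        w = torch.zeros(64, 1, 8, 8)
        for u in range(8):
            for v in range(8):
                for x in range(8):
                    for y in range(8):
                        w[u * 8 + v, 0, x, y] = (alpha(u) * alpha(v)
                            * math.cos((2 * x + 1) * u * math.pi / 16)
                            * math.cos((2 * y + 1) * v * math.pi / 16))
        conv = nn.ConvTranspose2d(64, 1, kernel_size=8, stride=8, bias=False)
        with torch.no_grad():
            conv.weight.copy_(w)
        return conv

    # coeffs: (N, 64, H/8, W/8) per-block DCT coefficients -> (N, 1, H, W) pixels
    # pixels = idct_as_conv_transpose()(coeffs)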

Compared to the earlier Uber paper, our CVPR paper makes two big advances:

(1) Cleaner architecture: The Uber paper uses a CNN, while we use a ViT. It's kind of awkward to modify an existing CNN architecture to accept DCT instead of RGB since the grayscale data is 8x lower resolution than RGB, and the color information is 16x lower than RGB. With a CNN, you need to add extra layers to deal with the downsampled input, and use some kind of fusion mechanism to fuse the luma/chroma data of different resolution. With a ViT it's very straightforward to accept DCT input; you only need to change the patch embedding layer, and the body of the network is unchanged.

(2) Data augmentation: The original Uber paper only showed speedup during inference. During training they need to perform data augmentation, so they convert DCT to RGB, augment in RGB, then convert back to DCT to feed the augmented data to the model. This means that their approach will be slower during training vs an RGB model. In our paper we show how to perform all standard image augmentations directly in DCT, so we can get speedups during both training and inference.
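
As one concrete example of a DCT-domain augmentation (a standard identity, not necessarily our exact implementation): a horizontal flip only needs the block positions mirrored and the odd horizontal frequencies negated.

    import numpy as np

    # coeffs: (blocks_y, blocks_x, 8, 8) DCT coefficients of a single channel
    def hflip_dct(coeffs):
        flipped = coeffs[:, ::-1].copy()   # mirror block positions across the width
        flipped[..., 1::2] *= -1           # negate coefficients with odd horizontal frequency
        return flipped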

Happy to answer any questions about the project!


> Decoding DCT to RGB is essentially an 8x8 stride 8 convolution -- it seems wasteful to perform this operation on CPU for data loading

Then why not do it on the GPU? Feels like exactly the sort of thing it was designed to do.

Or alternatively, use nvjpeg?


This makes sense in theory, but is hard to get working in practice.

We tried using nvjpeg to do JPEG decoding on GPU as an additional baseline, but using it as a drop-in replacement in a standard training pipeline gives huge slowdowns for a few reasons:

(1) Batching: nvjpeg isn't batched; you need to decode one at a time in a loop. This is slow but could in principle be improved with a better GPU decoder.

(2) Concurrent data loading / model execution: In a standard training pipeline, the CPU is loading and augmenting data on CPU for the next batch in parallel with the model running forward / backward on the current batch. Using the GPU for decoding blocks it from running the model concurrently. If you were careful I think you could probably find a way to interleave JPEG decoding and model execution on the GPU, but it's not straightforward. Just naively swapping out to use nvjpeg in a standard PyTorch training pipeline gives very bad performance.

(3) Data augmentation: If you do DCT -> RGB decoding on the GPU, then you have to think about how and where to do data augmentation. You can augment in DCT either on CPU or on GPU; however DCT augmentation tends to be more expensive than RGB augmentation (especially for resize operations), so if you are already going through the trouble of decoding to RGB then it's probably much cheaper to augment in RGB. If you augment in RGB on GPU, then you are blocking parallel model execution for both JPEG decoding and augmentation, and problem (2) gets even worse. If you do RGB augmentation on CPU, you end up with an extra GPU -> CPU -> GPU round trip on every model iteration, which again reduces performance.


I'm just a low tier ML engineer, but I'd say you generally want to avoid splitting GPU resources over many libraries, to the extent it's even practically possible.


Could you parallelise your parallel processors? ie. offload this work to a separate, (perhaps not even as beefy) GPU.

Akin to streamers having one GPU that they use for gaming and a second GPU used for encoding their stream.


Ha, I remember the poster from the conference, it was quite crowded when I passed by. This one seemed to have a big focus on data augmentation in the DCT space. I was asking myself (and the author) whether you couldn’t eke out a little more efficiency by trying to quantize your network similarly to the default JPEG quantization table. As I understood, currently all weights are quantized uniformly, which does not make sense when your inputs are heavily quantized, does it? Maybe I should dive a little deeper into the Uber paper, they were focusing a bit more in the quantization part. Sorry if I’m talking nonsense, this is absolutely not my area, but I found the topic captivating.


There is some work on using JPEG style DCT blocks for quantization: https://dl.acm.org/doi/10.1109/ISCA45697.2020.00075

(Disclaimer: not mine, but a friend's work)


Thank you. First published 2022-11-29 on arxiv [0] and updated one month ago.

Interesting line: "With these two improvements -- ViT and data augmentation -- we show that our ViT-Ti model achieves up to 39.2% faster training and 17.9% faster inference with no accuracy loss compared to the RGB counterpart."

[0] https://arxiv.org/abs/2211.16421v2


Transformers are taking the cake in the deep learning community.


> Accuracy gains are due primarily to the specific use of a DCT representation, which turns out to work curiously well for image classification.

It would seem quantization is a useful tool for any sort of NN-style application.

If the expected output is intended to be human-like, why not feed it information that a typical human could not distinguish from a lossless representation? Seems like a simple game of expectations and information theory.


That's kind of the key theory behind why JPEG (and other lossy encodings) work at all. A perfect being would see a JPEG next to a PNG or TIFF and find the first repugnantly error-ridden.

But we tend to ignore high-frequency data's specifics most of the time, so it psychologically works.

I often wonder though, what do my cat and dog hear when I'm playing compressed music? Does it sound like a muddy phone call to them?


> Does it sound like a muddy phone call to them?

Likely no.

Audio is decidedly less "compressible" in human perceptual terms. The brain is amazingly skilled at detecting time delay and frequency deviations, so this perceptual baseline likely extends (mostly) to your pets.

You can fool the eyes a lot more easily. You can take away 50% or more of the color information before even a skilled artist will start noticing.


There are real differences in audio perception, though. Frequency range and sensitivity to different frequencies are big differences across animals; I would expect cats (who chase rodents, which often have very high pitched or even ultrasonic vocalizations) to be more sensitive to high frequencies than humans, and thus low-passed / low sample rate audio could sound 'bad.'

Another aspect is time resolution. Song birds can have 2-4x the time resolution of human hearing, which helps distinguish sounds in their very fast, complex calls. This may lead to better perception of artifacts in lossy coding schemes, but it's hard to say for sure.

Edit: reference on cat hearing: https://pubmed.ncbi.nlm.nih.gov/4066516

The hearing range of the cat for sounds of 70 dB SPL extends from 48 Hz to 85 kHz, giving it one of the broadest hearing ranges among mammals.


True but hearing is logarithmic in both volume and frequency domains. Double the power does not equate to anything near double the loudness. Similarly each doubling of frequency is only one octave higher. Hearing up to 80khz doesn't mean hearing 4x more than humans... 10 octaves for humans, 12 octaves for cats. In a musical sense it probably isn't noticeable.


The extreme upper limit of human hearing is around 20khz, so cats really are hearing things that we don't, and for good reasons.

Sensitivity to different frequency ranges is more or less independent of anything else. Birds have heightened frequency response in the range they vocalize in, which helps them hear others of their species. Same for us; we vocalize at relatively low frequencies, so most of our hearing ability is focused on that range. There is also a range below which we don't hear: infrasound, which is utilized by elephants.

Logarithmic perception is certainly real, but the tuning of which frequency ranges an animal is more or less sensitive to is certainly species dependent.


As a comparison with removing the top two octaves from a cat's hearing, try removing the top two octaves from an audio file compared to your hearing range (lowpass at 5 kHz or less if you have hearing range loss, and/or resample to 10 kHz/ksps or less) and see if the results are musically noticeable. (At least for humans, the result is intelligible but heavily muffled, I can't speak for my pet cats though.)


That's fair... I'm just saying you shouldn't compare 20hz-20khz vs 20hz-80khz then decide cats can hear "three times" as much as humans. Two octaves is more than zero but a lot less than 3x.


while work has been done to characterize frequency sensitivity across species (which does vary quite a bit, especially in the higher ranges (>20khz)), i haven't seen any work that has been done to explore frequency domain perceptual masking curves in a cross species context.

since some species use their auditory systems for spatial localization, i would guess that the perceptual system would be totally different in those contexts.


No, audio compression doesn't filter out high frequencies, that's just what computer audio as a whole does. And I don't think there's enough of those high frequency components in what humans typically record for a cat or dog to notice the difference. As far as compression, the tricks that work on us should work on them.


the early xing mp3 codec famously cut everything off above 18khz, but that was out of spec. :)

instead perceptual audio compression typically filters out frequencies that neighbor other frequencies with lots of power. deleting these neighbors is called perceptual masking and to the best of my knowledge, we do not actually know if it works the same way in animal auditory systems.


>MP3 compression works by reducing (or approximating) the accuracy of certain components of sound that are considered (by psychoacoustic analysis) to be beyond the hearing capabilities of most humans.

-via Wikipedia

This holds true for most other audio compression as well.

Now, it's true that max recording frequency is bounded by sample rate via the Nyquist theorem, but that doesn't mean we're incapable of recording at higher fidelity - we just don't bother most of the time, because on consumer hardware it's going to be filtered out eventually anyway (or just not reproduced well enough, due to low-quality physical hardware). Recording studios will regularly produce masters that far exceed that normal hearing range though.


Add "(2018)"


Back when Uber was thinking it could make self driving taxis


Uber now has a deal with Waymo - so in effect they are still trying to do self-driving taxis, but now via a complex business relationship.


Ah, just like I "make hamburgers" when I go through the McDonald's drive-through, although it's via a business transaction.


God this website is cynical. Licensing technology from another company to commercialize it is completely legitimate.


That is legitimate, but that's not the point here. The point is Uber's hubris. A hubris very useful in pumping up its stock price ahead of an IPO. If they had quietly planned to license it from the get-go, nobody would have mentioned it.


Uber literally killed people trying to develop their own first. Tesla continues to do so. I wish everyone would license Waymo instead.


Any reason to prefer Waymo over Cruise? I saw a driverless Cruise taxi in SF just the other day. And are there any other competitors still in this space worth watching?


They had the Uber ATG group - they laid everyone off and gave up on their AV research. How is that cynical? That's a fact.


More like making hamburgers by buying frozen burger patties at the supermarket.


I was wondering about "jpeg turned 25 in 2017." Seemed like peculiar phrasing.


IIRC, “neural network” style systems design started in the realm of computer vision with “Perceptrons”:

https://en.wikipedia.org/wiki/Perceptrons_(book)

Makes sense that image processing would be a good fit in some cases.


That seems just in line with an earlier front page submission that used knn and gzipped text for categorization.


I’ve tried this idea in the past and found it not very useful in practice. It breaks when you want to add image augmentation during training, and JPEG is anyway a pretty lousy format for storing training samples if you care about saving space or retaining image quality.


Finally, someone who did this; I've always thought this was low-hanging fruit. I wonder if you could make interesting and quickly trained diffusion models with this trick.


Note that it is from 2018. As someone here already mentioned there is a paper that applies the same idea to Vision Transformers published this year [1].

[1] https://openaccess.thecvf.com/content/CVPR2023/papers/Park_R...


Sort of? Latent diffusion models operate in a compressed latent space, which is just a richer / learnable representation than DCT.


The quantization used for JPEG is optimized to throw away information in the frequency space that doesn't matter much to human perception, but I wonder if that is also optimal for training neural networks?
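
For concreteness, quantization is just an element-wise divide-and-round of each 8x8 DCT block against a table; the one below is (if I remember it right) the baseline luminance table from the JPEG spec, and the larger entries toward the bottom-right are what discard the high frequencies.

    import numpy as np

    Q_LUMA = np.array([
        [16, 11, 10, 16,  24,  40,  51,  61],
        [12, 12, 14, 19,  26,  58,  60,  55],
        [14, 13, 16, 24,  40,  57,  69,  56],
        [14, 17, 22, 29,  51,  87,  80,  62],
        [18, 22, 37, 56,  68, 109, 103,  77],
        [24, 35, 55, 64,  81, 104, 113,  92],
        [49, 64, 78, 87, 103, 121, 120, 101],
        [72, 92, 95, 98, 112, 100, 103,  99]])

    def quantize(dct_block):                 # (8, 8) float DCT coefficients
        return np.round(dct_block / Q_LUMA).astype(np.int32)

    def dequantize(q_block):                 # what the decoder (and a network) sees
        return q_block * Q_LUMA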

Also, as far as I know, the human eye doesn't process images in blocks. I also wonder how blockless encoders such as JPEG 2000 would fare in this approach.


i guess an interesting question is: can you coax a network into learning a better perceptual compression transform than the dcts in jpeg?


Since convolution in the spatial domain is simply multiplication in the frequency domain it seems that this would be wayyy faster.
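
A quick numerical sanity check of that identity, for circular convolution and the DFT (JPEG's DCT has a related but messier convolution property, so this is the idealized version):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.random(64)
    k = rng.random(64)

    # circular convolution computed directly...
    direct = np.array([sum(x[(n - m) % 64] * k[m] for m in range(64)) for n in range(64)])
    # ...and via point-wise multiplication in the frequency domain
    via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

    print(np.allclose(direct, via_fft))      # True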


Might be related: https://arxiv.org/abs/2211.16421 It's a paper about directly learning from JPEG encodings, which works well with visual transformers' patch mechanism.



A lot of generative audio works very similar these days: it's much faster to predict and generate a codebook than a raw waveform.


I've heard of this long before 2018, was this really seen as novel in 2018?


The next wave of ai magic will come from domain expertise.



