
I was doing the same thing at Netflix around the same time as a 20% research project: training GANs end2end directly in JPEG coeff space (then rebuilding a JPEG from the generated coeffs with libjpeg to get an image back). The pitch was that it not only worked, but that you could get fast training by representing each JPEG block as a dense + sparse vector (dense for the low DCT coeffs, sparse for the high ones since they're ~all zeros) and using a neural network library with fast ops on sparse data.
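
For the curious, a rough sketch of what that dense + sparse split per 8x8 block can look like (not the actual code; the 16-coefficient cutoff and the flat quantization step are arbitrary choices for illustration):

    import numpy as np
    from scipy.fft import dctn

    def zigzag_indices(n=8):
        # JPEG zigzag order: walk the anti-diagonals, low frequencies first.
        return sorted(((i, j) for i in range(n) for j in range(n)),
                      key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]))

    def split_block(pixels, n_dense=16, q=16):
        # DCT-II of one 8x8 block plus a crude flat quantization step,
        # roughly what JPEG does (real JPEG uses a per-frequency quant table).
        coeffs = np.round(dctn(pixels.astype(np.float32) - 128, norm='ortho') / q)
        zz = np.array([coeffs[i, j] for i, j in zigzag_indices()])
        dense = zz[:n_dense]                  # low-frequency coeffs, usually non-zero
        tail = zz[n_dense:]                   # high-frequency coeffs, mostly zero
        nz = np.nonzero(tail)[0]
        return dense, (nz, tail[nz])          # dense vector + (indices, values) sparse pair

    block = np.add.outer(np.arange(8), np.arange(8)) * 8   # smooth ramp, like most photo blocks
    dense, (idx, vals) = split_block(block)
    print(len(dense), len(idx))               # 16 dense coeffs, few or no non-zero tail entries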

Training on pixels is inefficient. Why have your first layers of CNNs relearn what's already smartly encoded in the JPEG bits in the first place before it's blown into a bloated height x width x 3 float matrix?




As an AI armchair quarterback, I've always held the opinion that the image ML space has a counter-productive bias against pre-processing images, stemming from a mixture of test purity and academic hubris. "Look what this method can learn from raw data completely independently!" makes for a nice paper. So they stick with sRGB inputs rather than doing basic classic transforms like converting to YUV420. Everyone learns from the papers, so that's assumed to be the standard practice.


In my experience, staying close to the storage format is very useful because it allows the neural network to correctly deal with clipped/saturated values. If your file is saved in sRGB and you train in sRGB, then when something turns to 0 or 255, the AI can handle it as a special case because most likely it was too bright or too dark for your sensor to capture accurately. If you first transform to a different color space, that clear clip boundary gets lost.
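
A toy numpy illustration of that clipping point (the matrix is just an example BT.601-style RGB -> YCbCr transform): two pixels that both saturated at 255 in the red channel land at unrelated coordinates after the conversion, so there's no single sentinel value left for the network to key on.

    import numpy as np

    # Example BT.601-style RGB -> YUV matrix (full range), purely for illustration.
    M = np.array([[ 0.299,  0.587,  0.114],
                  [-0.169, -0.331,  0.500],
                  [ 0.500, -0.419, -0.081]])

    clipped_a = np.array([255, 10, 10])     # red channel saturated, dark elsewhere
    clipped_b = np.array([255, 200, 200])   # red channel saturated, bright elsewhere

    print(M @ clipped_a)    # ~[ 83.3, -41.4, 122.5]
    print(M @ clipped_b)    # ~[216.4,  -9.3,  27.5]
    # In sRGB both pixels share the value 255 in one channel (an easy "clipped" flag);
    # after the transform the saturation shows up at completely different coordinates.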

Also, I would prefer sRGB over RGB because it more closely matches the human visual system. That said, the RGB to YUV transformation is effectively a matrix multiplication, so if you use conv layers like everyone does, you can merge it into your weights for free.
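
To make the folding trick concrete, a small PyTorch sketch (the 3x3 matrix is again just an example BT.601-style one): a first conv layer trained on YUV input is exactly equivalent to a conv on RGB whose weights have absorbed the matrix.

    import torch

    M = torch.tensor([[ 0.299,  0.587,  0.114],
                      [-0.169, -0.331,  0.500],
                      [ 0.500, -0.419, -0.081]])        # per-pixel RGB -> YUV

    rgb = torch.rand(1, 3, 32, 32)
    yuv = torch.einsum('oc,bchw->bohw', M, rgb)         # color conversion as a 1x1 linear map

    W = torch.rand(16, 3, 3, 3)                         # some first-layer conv weights (on YUV)
    out_yuv = torch.nn.functional.conv2d(yuv, W, padding=1)

    W_folded = torch.einsum('ocij,cd->odij', W, M)      # absorb the matrix into the weights
    out_rgb = torch.nn.functional.conv2d(rgb, W_folded, padding=1)

    print(torch.allclose(out_yuv, out_rgb, atol=1e-4))  # True: same layer, no preprocessing step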


Would this apply to music formats as well?


Typical CNNs can learn a linear transformation of the input at no cost. Since YUV is such a linear transformation of RGB, there is no benefit in converting to it beforehand.


How is there not a cost associated with forcing the machine to learn how to do something that we already have a simple, deterministic algorithm for? Won't some engineer need to double check a few things with regard to the AI's idea of color space transform?


You could probably derive some smart initialization for the first layer of a NN based on domain knowledge (color spaces, sobel filters, etc.). But since this is such a small part of what the NN has to learn, I expect this to result in a small improvement in training time and have no effect on final performance and accuracy, so it's unlikely to be worth the complexity of developing such a feature.
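
Something like this toy PyTorch initialization is what I have in mind (purely illustrative; only the first two filters are touched, the rest keep their random init):

    import torch
    import torch.nn as nn

    conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)

    sobel_x = torch.tensor([[-1., 0., 1.],
                            [-2., 0., 2.],
                            [-1., 0., 1.]])

    with torch.no_grad():
        # Hand the first two output channels horizontal/vertical edge detectors
        # on the green channel; the remaining 14 filters stay randomly initialized.
        conv1.weight[0].zero_()
        conv1.weight[0, 1] = sobel_x
        conv1.weight[1].zero_()
        conv1.weight[1, 1] = sobel_x.t()

First-layer filters tend to converge to edge/Gabor-like detectors on their own anyway, which is why I'd expect any gain to be limited to early training.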


Absolutely this.

Seems like on HN people are still learning 'the bitter lesson'.


Amdahl’s law?



Thank you!


Sorry - should have included a cite.

That said, Amdahl's law is also probably related to some degree; I would view YUV conversion as an unnecessary optimization.


Your instincts are correct. Training is faster, more stable, and more efficient that way. In certain cases it "pretty much is irrelevant", but the advantages of the strategy of modelling the knowns and training only on the unknowns become starkly apparent when doing e.g. sensor fusion or other ML tasks on physical systems.


In deep ML, people are pretty familiar with the bitter lesson and don't want to waste time on this.


The bitter lesson is about not trying to encode impossible-to-formalize conceptual knowledge; it is not about avoiding data efficiency and accepting ever higher context and parameter counts as the price.

If we followed this logic, we'd be training LLMs on character-level UTF-32 and just letting it figure everything out by itself, while needing two orders of magnitude bigger contexts and parameter counts.


Converting from RGB to YUV is absolutely subject to the bitter lesson, because it takes a representation that we have seen work for some classical methods and hard-codes that knowledge into the AI, which could easily learn (and will anyway) a more useful representation for itself.

> LLMs on character-level UTF-32 and just letting it figure everything out by itself, while needing two orders of magnitude bigger contexts and parameter counts.

This was tried extensively, and honestly it is probably still too early to proclaim the demise of this approach. It's also completely different: you're conflating a representation that literally changes the number of forward passes you have to do (i.e. the amount of computation, which is what the bitter lesson is about) with one that would, at most, just require stacking on a few extra layers.

A better example for your point (imo) would be audio recognition, where we pre-transform the wave amplitudes into a log mel spectrogram for ingestion by the model. I think this will ultimately fall to the bitter lesson as well though.
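
(For anyone unfamiliar, that audio front end looks roughly like this, e.g. with librosa; the filename and parameter values below are placeholders. The model then consumes the 2D log-mel array instead of raw samples.)

    import numpy as np
    import librosa

    y, sr = librosa.load("clip.wav", sr=16000)    # raw waveform: ~16k floats per second
    S = librosa.feature.melspectrogram(y=y, sr=sr,
                                       n_fft=400, hop_length=160, n_mels=80)
    log_mel = librosa.power_to_db(S, ref=np.max)  # shape (80, n_frames): the model's actual input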

Also, a key difference is that you are proposing to take methods that already work and inject more classical knowledge into them. It is oftentimes the case that you'll have an intermediate fusion of deep + classical methods, but not when you already have fully deep methods that work.


Heck, why even go that far? Given how much text we have in scanned books, just feed it scans of the books and let it dedicate a bunch of layers to learning OCR.


Or given the number of unscanned books, even just give it the controls for a book scanner, the books and probably some robot arms. Then let it figure out the scanning first in some layers. Shouldn't be that hard.


RGB->YUV is literally an affine transform, of course it falls to the bitter lesson.


Does it? Because I'm not sure the model has any intrinsic incentive to learn to follow how human perception works.


Right... but I don't see how that means that it doesn't fall to the bitter lesson.

The bitter lesson is not saying that the model will always relearn the same representation as the one that has been useful to humans in the past, merely that the model will learn a better representation for the task at hand than the one hand-coded by humans.

If the model could easily learn the representation useful to humans, then it will fall to the bitter lesson because at minimum the model could easily follow our path (it's just an affine transformation to learn) and more probably will learn very different (& better) representations for itself.


This will absolutely be the case N doublings of Moore's law from here. Tokens are information loss.


Information loss, or the result of useful computation? VAEs exist after all.


LLMs can't reason about spelling, e.g. asking for a sentence which contains no letter "a"; and can also struggle with rhyming, etc. The most obvious explanation is that they never 'see' the underlying letters/spelling, only tokens.
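
You can see it directly by poking at a tokenizer (tiktoken here; the exact split depends on the vocabulary, so treat the output as illustrative):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("extraordinarily")
    print([enc.decode([i]) for i in ids])   # a couple of opaque chunks, not 15 letters
    # The model sees those integer ids, never the individual characters,
    # so "does this word contain an 'a'?" is not directly observable from its input.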


Keep in mind Moore's law is coming to its end.


Been hearing that for half my adult life. People were 100% sure that multicore in 2005 meant manufacturers were officially signalling its end and that it was time to invest in auto-parallelizable code.

I don't think it's wrong, but looking at it through a child's eyes, we do keep finding ways to do things we couldn't a couple of years ago: an open mind on hardware and more focus on software are keeping deep innovation cycles going.


There are limits to growth[1]. God-like tech utopia isn't and won't be real.

1: https://www.clubofrome.org/publication/the-limits-to-growth/


Leaving aside that we're still far from hitting the limits to growth outlined in that book, and that we can exceed those limits to growth by expanding outside of Earth, what does a book about physical limitations on agriculture and industry have to do with limitations on computing efficiency? There is of course some fundamental limit to computing efficiency, but for all we know we could be many orders of magnitude away from hitting it.


The original study has been revisited and has held up so far. An analysis: https://medium.com/@CollapseSurvival/overshoot-why-its-alrea... Humanity likely won't ever be able to permanently settle outside Earth.


Do I need to repeat myself? What do limits on agriculture have to do with limits on computing?


^ equivalent of ideological salesman ringing my doorbell. Absolutely nothing to do with anything I said.


We've clearly fallen behind the exponential curve on clock speed. But the great thing is we can parallelize transformers, so it's not as big of a deal.


If you want to play by the bitter lesson, why don't you just feed the raw JPEG bits into your neural network?


It's also probably because a lot of knowledge about how JPEG works is tied up in signal processing books that usually front-load a bunch of mathematics, as opposed to ML, which often needs a bit of mathematical intuition but in practice is usually empirical.


I was trying to learn how the discrete cosine transform works, so I looked up some code in an open source program. The code said it copied it verbatim from a book from the 90s. I looked up the book and the book said it copied it verbatim from a paper from the 1970s.


The code is irrelevant, in the eyes of the theoretician anyway.


I am theoretically challenged.


End-to-end ML models are the goal regardless of efficiency, just like software engineers aim for higher-level interfaces.


That's something that surprises me too, given the preprocessing applied to NLP.


We used to do a lot more preprocessing for NLP, like stemming, removing stop words, or even adding grammar information (NP, VP, etc.). Now we just do basic tokenization. The rest turned out to be irrelevant or even counterproductive.
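
Roughly the contrast, with NLTK standing in for the classical stack and a BPE tokenizer for the modern one:

    # Then: hand-built normalization before the model ever saw the text.
    from nltk.stem import PorterStemmer
    from nltk.corpus import stopwords          # needs nltk.download('stopwords') once

    words = "the runners were running quickly".split()
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in words if w not in stop])   # ['runner', 'run', 'quickli']

    # Now: a learned subword vocabulary, and the model figures out the rest.
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    print(enc.encode("the runners were running quickly"))      # just token ids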


But also, that basic tokenization is essential; training it on a raw ASCII stream would be much less efficient. There is a sweet spot of processing & abstraction that should be aimed for.


Short text representations (via good tokenization) significantly reduce the computational cost of a transformer (need to generate fewer tokens for the same output length, and need fewer tokens to represent the same window size). I think these combine to n^3 scaling (n^2 from window size and n from output size).

For images it's not clear to me if there are any preprocessing methods that do a lot better than resizing the image to a smaller resolution (which is commonly done already).
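
Back-of-envelope version of the scaling point above (the 4 characters per token figure is a rough assumption, not a measurement):

    chars = 4000                    # roughly a page of text
    chars_per_token = 4             # typical-ish average for a BPE vocabulary (assumption)

    n_char = chars                  # character-level sequence length
    n_bpe = chars // chars_per_token

    attn = lambda n: n ** 2         # attention cost per forward pass ~ n^2
    total = lambda n: n ** 3        # ~n forward passes to generate it, each ~n^2

    print(attn(n_char) / attn(n_bpe))    # 16x per-step attention cost at character level
    print(total(n_char) / total(n_bpe))  # 64x to generate the whole page autoregressively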


> rather than doing basic classic transforms like converting to YUV420

What will converting to YUV420 achieve though, except for 4:2:0 chroma subsampling? YUV has little basis in human perception to begin with; it's a legacy color-television model used for compression. There are much better models if you want to extract the perceptual information from the picture.


I would worry that the fixed, non-overlapping block nature of a JPEG would reduce translation invariance - shift an image by 4 pixels and the DCT coefficients may look very different. People have been doing a lot of work to try to reduce the dependence of generated images on the actual pixel coordinates - see for example https://research.nvidia.com/publication/2021-12_alias-free-g...
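
Easy to see with a quick scipy experiment on a toy image: a 4-pixel shift leaves the content almost unchanged but scrambles the coefficients of a fixed 8x8 block.

    import numpy as np
    from scipy.fft import dctn

    img = np.zeros((8, 16))
    img[:, 2:6] = 255.0                            # a vertical bar
    blk = lambda x: dctn(x[:, 0:8], norm='ortho')  # DCT coefficients of the first 8x8 block

    ca = blk(img)
    cb = blk(np.roll(img, 4, axis=1))              # same image, shifted right by 4 pixels
    print(np.abs(ca - cb).max())                   # large: the block now straddles the bar's edge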


On the other hand, ViT uses non-overlapping patches anyway, so the impact may be minor. Example code: https://nn.labml.ai/transformers/vit/index.html


Does JPEG-2000 fix that? From what I gathered, it doesn't use blocks.


JPEG-2000 uses wavelets as a decomposition basis as opposed to DCT, which in theory makes it possible to treat the whole image as a single block while ensuring high compression. In practice, though, tiles are used, I would guess to improve memory usage and compute parallelism.
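
With PyWavelets, the whole-image decomposition looks roughly like this (choice of wavelet and level count is arbitrary):

    import numpy as np
    import pywt

    img = np.random.rand(512, 768)                 # stand-in for a full grayscale image, no 8x8 tiling
    coeffs = pywt.wavedec2(img, wavelet='db2', level=3)

    approx = coeffs[0]                             # coarse approximation of the whole image
    print(approx.shape)                            # roughly (512, 768) / 2^3, plus filter padding
    for detail in coeffs[1:]:
        print([d.shape for d in detail])           # (horizontal, vertical, diagonal) detail bands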


In naive ML scenarios you are right. You can think of JPEG as an input embedding. One of many. The JPEG/spectral embedding is useful because it already provides a miniature variational encoding that "makes sense" in terms of translation, sharpness, color, scale and texture.

But with clever ML you can design better variational characteristics such as rotation or nonlinear things like faces, fingers, projections and abstract objects.

Further, JPEG encoding/decoding will be an obstacle for many architectures that require gradients to flow back and forth between pixel space and JPEG space in order to run evaluation steps and compute losses in pixel space (which would be superior). Not to mention if you need human feedback in generative scenarios to retouch the output and run training steps on the changed pixels.

And finally, there are already picture and video embeddings that are gradient-friendly and reusable.


>And finally, there are already picture and video embeddings that are gradient-friendly and reusable.

I have been thinking about such things for a while and have considered ideas like giving each of the R rows and each of the C columns a vector, and using the inner product of row_i and col_j as pixel (i, j)'s intensity (in the simplest demonstrative case monochromatic, but reordering the floats in each vector before taking the inner product allows many more channels).

But this is just my quick shallow concoction. If I look at the konicq10k dataset, there are 10373 images at 1024 x 768 totaling 5.3GB. That's ~511KB per image. 511KB / (1024 + 768) = 285 bytes for each row or column. Dividing by 4 for standard floats, that gives each column and each row a vector of 71 (32-bit) floats. This would use absolutely no prior knowledge about human visual perception, so fitting these float vectors' inner products (and their permutations for different channels) to the image by the most naive metric (average per-pixel residual error) will probably not result in great images. But I'm curious how bad it performs. Perhaps I will try it out in a few hours.
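
Since pixel (i, j) = <row_i, col_j> makes the image exactly a rank-k factorization img ≈ R C^T, the best possible R and C under plain per-pixel squared error come straight out of a truncated SVD (Eckart-Young), so you can check the ceiling of the idea without a training loop. A quick numpy sketch of the monochromatic case, keeping your 71-float budget:

    import numpy as np

    img = np.random.rand(768, 1024)   # stand-in; load a real grayscale image scaled to [0, 1] instead
    k = 71                            # floats per row vector and per column vector

    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    R = U[:, :k] * s[:k]              # one k-vector per row
    C = Vt[:k].T                      # one k-vector per column

    recon = R @ C.T                   # pixel (i, j) == np.dot(R[i], C[j])
    print(np.sqrt(((recon - img) ** 2).mean()))   # best achievable per-pixel RMS error at this budget

For a natural photo the rank-71 reconstruction is typically recognizable but smeary, since low rank is a poor fit for texture, and a gradient-descent fit of the same parameterization can at best match that number.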

Do you have any references for such or similar simplistic embeddings? I don't want to force you to dig for me, but if you happen to know of a few such papers or perhaps even a review paper that would be welcome!


Not aware of this type of simplistic embedding. I think taking the first few layers of a large pretrained vision model will get you better results. Blindly learning from generic images would probably steer it toward the dominant textures and shapes rather than the linear operations of camera motion.

The simplest embeddings for vision should focus on camera primitives and invariants. Translation, rotation, scale, skew, projections, lighting. It doesn't matter that much what you use in the layers, but you should steer the training with augmented data. Like rotate and skew the objects in the batches to make sure the layers are invariant to these things.

Next are some depth-mapping embeddings which go beyond flat camera awareness.

The best papers I've seen are face embeddings. You can get useful results with smaller models. There are of course deeper embeddings that focus on the whole scene and depth maps but those are huge.


I wondered a while back about using EGA or CGA for image recognition and/or for stable diffusion. Seems like there should be more than enough image data for that resolution and color depth (16 colors).


I remember hearing somewhere that our retina encodes the visuals it receives into a compressed signal before forwarding it to the visual cortex. If true, this may actually be how it's done "for real". ;)


The retina does a lot of processing.


> Why have your first layers of CNNs relearn what's already smartly encoded in the JPEG bits in the first place before it's blown into a bloated height x width x 3 float matrix?

Because data augmentation is much easier in the latter representation.

Also, if you rotate images as part of data augmentation, then that is already so expensive that any speedup from going directly to JPEG becomes negligible in comparison.


Fascinating.

How did this approach handle the same image being encoded in different ways by different JPEG libraries? Or just with different quality settings?


Seems like you are using tricks similar to what compressed sensing people do: work with data in a sparser domain.


This is really interesting. Have you published your research anywhere that I could read?



