It's more likely that such an architecture would be bigger rather than smaller. https://arxiv.org/abs/2412.20292 demonstrated that score-matching diffusion models approximate a process that combines patches from different training images. To build a model that makes use of this fact, all you need to do is look up the right patch in the training data. Of course a model the size of its training data would typically be rather unwieldy to use. If you want something smaller, we're back to approximations created by training the old-fashioned way.
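To make the "look up the right patch" point concrete, here's a toy sketch of my own (not code from the paper): for a Gaussian-smoothed empirical distribution, the *ideal* score has a closed form as a softmax-weighted pull toward the training points, and following it just collapses noise onto a memorized training example.

```python
import numpy as np

# Toy sketch (my own, not from the linked paper): the *ideal* score of a
# Gaussian-smoothed empirical distribution is available in closed form.
# Following it during denoising collapses noise onto the training data,
# i.e. the perfect score function can only "memorize".
rng = np.random.default_rng(0)
train = rng.normal(size=(5, 2))  # 5 "training images", 2 pixels each

def ideal_score(x, sigma):
    # grad log p_sigma(x) for p_sigma = (1/N) sum_i N(x; x_i, sigma^2 I)
    d2 = ((x - train) ** 2).sum(axis=1)            # squared distances to data
    w = np.exp(-(d2 - d2.min()) / (2 * sigma**2))  # stable softmax weights
    w /= w.sum()
    return (w[:, None] * (train - x)).sum(axis=0) / sigma**2

# Deterministic denoising: anneal sigma down while following the score.
x = rng.normal(size=2) * 3.0
for sigma in np.geomspace(3.0, 0.01, 200):
    x = x + sigma**2 * ideal_score(x, sigma)  # step to the weighted data mean

# The "sample" is (numerically) a memorized training point.
print(np.abs(train - x).sum(axis=1).min())  # ~0: we reproduced training data
```

A real network trained by SGD never reaches this optimum, which is exactly where the disagreement below picks up.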
I have mixed feelings about this interpretation: that diffusion models approximately produce mosaics from patches of training data. It does a good job of helping people understand why diffusion models are able to work. I used it myself in a talk almost 3 years ago! And it isn't a lie exactly; the linked paper is totally sound. It's just that the argument only works if you assume your model achieves the absolute optimal minimization of the loss (under some inductive biases). It doesn't. No machine learning method more complicated than OLS holds up to that standard.
_And that's the actual reason they work._ Underfit models don't just approximate: they interpolate, extrapolate, generalize a bit, and ideally smooth out the occasional total garbage mixed in with your data. In fact, diffusion models work so well because they can correct their own garbage! If extra fingers start to show up in step 5, then steps 6 and 7 still have a chance to reinterpret them as noise and correct back into distribution.
And then there's all the stuff you can do with diffusion models. In my research I hack into the model and use it to decompose images into surface material properties and lighting! That doesn't make much sense as an averaging of memorized patches.
Given all that, it is a very useful interpretation. But I wouldn't take it too literally.
> I used it myself in talk almost 3 years ago! And it isn't a lie exactly, the linked paper is totally sound.
The paper was published in December last year and addresses your concerns head-on. For example, from the introduction:
"if the network can learn this ideal score function exactly, then they will implement a perfect reversal of the forward process. This, in turn, will only be able to turn Gaussian noise into memorized training examples. Thus, any originality in the outputs of diffusion models must lie in their failure to achieve the very objective they are trained on: learning the ideal score function. But how can they fail in intelligent ways that lead to many sensible new examples far from the training set?"
Their answers to these questions are very good and also cover things like correcting the output of previous steps. But the proof is in the pudding: the outputs of their alternative procedure match the models they're explaining very well.
I encourage you to read it; maybe you'll even find a new way to decompose images into surface material properties and lighting as a result.
I did read it, all the way through! It's really good. The part you are quoting is setting up the ELS (equivariant local score) machine, which does not memorize entire images thanks to the inductive biases of a CNN (translation symmetry, limited receptive field). But the equivalence to a patch mosaic is still due to the assumption that the loss is perfectly minimized under those restrictions.
And I was impressed by the close fit to real CNNs/ResNets and even to UNets. But what that shows is that the real models are heavily overfit. The datasets they are using for evaluation here are _tiny_.