The convolution kernels in the first layer of AlexNet and all its DL image-processing descendants converge to Gabor filters (or some variation of them), which closely match the response functions of the neurons in the first layer of the visual cortex. About 15 years before AlexNet there were works showing that this type of filter is, in a mathematical sense, a close-to-optimal encoding for feature-based image processing. (So, theoretically, one could have just pre-generated the first layer of the net and kept it fixed, cutting significant time/effort from training. I wanted to do that myself 20 years ago, yet just didn't get to it :)
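To make the "pre-generate the first layer" idea concrete, here is a minimal NumPy sketch of a Gabor filter bank. It is only an illustration, not what AlexNet or any paper actually does: the function names (`gabor_kernel`, `gabor_bank`) and all parameter values (`sigma`, `wavelength`, `gamma`, 8 orientations) are my own assumptions, and only the 11x11 size is taken from AlexNet's first-layer kernels.

```python
import numpy as np

def gabor_kernel(size=11, sigma=3.0, theta=0.0, wavelength=6.0, gamma=0.5, psi=0.0):
    """Real-valued Gabor kernel: Gaussian envelope times a cosine carrier.

    All defaults are illustrative assumptions, not values from any paper.
    `size` is expected to be odd so the kernel has a center pixel.
    """
    half = size // 2
    # Pixel coordinates centered on the middle of the kernel.
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    # Rotate the coordinate frame by theta.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r**2 + (gamma * y_r)**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_r / wavelength + psi)
    kernel = envelope * carrier
    kernel -= kernel.mean()                  # remove the DC component
    return kernel / np.linalg.norm(kernel)   # unit L2 norm

def gabor_bank(n_orientations=8, size=11, **kwargs):
    """Stack of kernels at evenly spaced orientations in [0, pi)."""
    thetas = np.arange(n_orientations) * np.pi / n_orientations
    return np.stack([gabor_kernel(size=size, theta=t, **kwargs) for t in thetas])

bank = gabor_bank()
print(bank.shape)  # (8, 11, 11)
```

In principle an array like `bank` (replicated per color channel, and extended over several wavelengths and phases) could be copied into the first conv layer of a net and frozen. Whether that actually matches learned filters in accuracy is exactly the untested part.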
I'm pretty sure that for the middle layers in image DL, as well as for transformers in language, we have a similar kind of optimality, i.e. something like a maximum-entropy filter (a separator/aggregator at higher levels?) at a given level of granularity/scale, the way Gabors are at the first feature level.
> the first layers of AlexNet and all its DL image processing descendants converge to the Gabor filters (or some variation of)
and
> 15 years before AlexNet there were works showing that such type of filter is kind of mathematically optimally encoding for the feature based image processing
For the first - you can just look at the original AlexNet paper. The kernels are unmistakably, strikingly Gabor-like. Some differences, like the cross-color kernels, raise possibly interesting questions: is that an improvement over biology or a deficiency (i.e. something more training would correct)? Or maybe it is just a real-valued projection of an optimum that is [plausibly] complex-valued?
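For reference on the "real-valued projection" remark, this is the standard textbook form of the complex Gabor function. The symbols are the usual conventions rather than anything from the AlexNet paper: σ is the envelope width, γ the aspect ratio, λ the wavelength, ψ the phase, θ the orientation.

```latex
g(x, y) = \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)
          \exp\!\left(i\left(\frac{2\pi x'}{\lambda} + \psi\right)\right),
\qquad
\begin{aligned}
x' &= x\cos\theta + y\sin\theta,\\
y' &= -x\sin\theta + y\cos\theta.
\end{aligned}
```

Its real part is the cosine-carrier Gabor and its imaginary part the sine-carrier one, so a real-valued kernel can only ever show one phase of such a quadrature pair.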
I don't have the specific reference I had in mind, which was published 15-20 years ago, but you can trace that line of thought through works like these, for example (there were a bunch of them in the 1990s and into the 2000s):