Absolutely, modern NN architectures have been inspired by biological ones, despite their massive differences.
Even in cases like attention, the modern version (the one that actually works in GPT-3, AlphaFold2, etc.) has little in common with either the English word or what we think of as attention. It's a formula with two matmuls and a softmax: softmax(AB)C. In particular, it doesn't necessarily look anywhere at all; it's just a weighted sum of the inputs. Nothing like the hard attention used by the human visual cortex.
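For concreteness, here is the whole mechanism in a few lines of NumPy. This is a minimal sketch following the A/B/C names above (papers usually write Q, K, V); scaling, masking, and multiple heads are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(A, B, C):
    weights = softmax(A @ B)  # matmul 1 + softmax: each row sums to 1
    return weights @ C        # matmul 2: a weighted sum of C's rows

rng = np.random.default_rng(0)
n, d = 4, 8                       # 4 tokens, dimension 8
A = rng.normal(size=(n, d))       # the "queries"
B = rng.normal(size=(d, n))       # the "keys", transposed
C = rng.normal(size=(n, d))       # the "values"
out = attention(A, B, C)          # each output row is a convex combination of inputs
print(out.shape)                  # (4, 8)
```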
It's not even that different from a convolution in which the weights are allowed to be a function of the input.
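A rough 1D sketch of that comparison (illustrative only; real layers add learned projections, padding, nonlinearities, etc.). Both compute weighted sums over the inputs; the difference is where the weights come from:

```python
import numpy as np

x = np.random.default_rng(1).normal(size=(10, 4))  # 10 positions, 4 channels

# Convolution: the weights are fixed and the same at every position.
kernel = np.array([0.25, 0.5, 0.25])               # fixed taps
conv_out = np.stack([
    sum(kernel[k] * x[i + k - 1] for k in range(3))
    for i in range(1, 9)
])

# Attention: the weights are recomputed from the input itself.
scores = x @ x.T                                   # pairwise similarity of positions
scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
attn_out = weights @ x                             # input-dependent weighted sum
```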
So the inspiration might have come from humans, but the actual architectures have largely come from pure trial and error, guided by limited, difficult-to-explain intuition about what tends to work.
"This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis."
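Here's a toy version of that idea (a sketch, not the paper's actual construction, which uses relative positional encodings): if each head's scores are sharp enough that the softmax is effectively one-hot on a fixed offset, each head just shifts the input, and a tap-weighted sum of heads reproduces a convolution:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 12, 3
x = rng.normal(size=(n, d))
offsets = [-1, 0, 1]             # one head per kernel tap
taps = [0.25, 0.5, 0.25]         # the convolution kernel

big = 1e4                        # sharp scores -> (numerically) one-hot softmax
heads = []
for off in offsets:
    scores = np.full((n, n), -big)
    for i in range(n):
        j = min(max(i + off, 0), n - 1)   # clamp at the sequence edges
        scores[i, j] = big
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    heads.append(w @ x)          # this head outputs x shifted by `off`

attn_as_conv = sum(t * h for t, h in zip(taps, heads))

# Direct convolution with the same taps (clamped padding), for comparison:
idx = lambda i: min(max(i, 0), n - 1)
conv = np.stack([sum(t * x[idx(i + o)] for t, o in zip(taps, offsets))
                 for i in range(n)])
print(np.allclose(attn_as_conv, conv, atol=1e-6))  # True
```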
Most definitely. What I meant, though, is: how do we know mental attention is not essentially a big softmax over some context, e.g. short- or long-term memory?
The eye physically focusing on one thing at a time seems like a special case (and it isn't even true of all animals, e.g. many prey species), rather than a part of the brain's attention mechanism.
Actually, retinal anisotropy is a feature distinct from attention. You can fixate on one point in a scene and then attend to any other point in the scene. This is in fact how attention experiments in both animals and humans are set up in the field, precisely to control for eye movements.