> the answers to your question will likely still be dependent on the nature of the process generating the data, and hence would still have to be determined empirically.
And I think that would be perfectly fine; it would be rather weird otherwise. Part(*) of the unpredictability of ML models stems from the fact that the training data is unpredictable.
What is missing for me so far are more detailed explanations of how the training data and task influence specific decisions in model architecture. I wouldn't expect a hard answer in the sense of "always use this architecture or that number of neurons", but rather more insight into what effects a specific architecture has on the model.
E.g. every ML 101 course teaches the difference between single-layer and "multi"-layer (usually 2-layer) perceptrons: linear separability, the XOR problem, etc.
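For what it's worth, that 1-layer vs. 2-layer gap at least has a concrete, checkable demonstration. A minimal numpy sketch (hand-picked weights of my own, not taken from any particular course): a single linear threshold unit cannot produce XOR, but one hidden layer of two units can.

```python
import numpy as np

def step(z):
    # Heaviside threshold activation
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# A single-layer perceptron is one linear threshold on (x1, x2); no choice of
# weights/bias can output [0, 1, 1, 0], because XOR is not linearly separable.

# Two-layer version with hand-constructed weights:
# hidden unit 1 computes OR, hidden unit 2 computes AND,
# and the output fires when OR is true but AND is not, i.e. XOR.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])   # thresholds for OR and AND
W2 = np.array([1.0, -1.0])
b2 = -0.5

hidden = step(X @ W1 + b1)
out = step(hidden @ W2 + b2)
print(out)                    # [0 1 1 0]
```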
But I haven't seen many resources about, e.g., the differences between 2- and 3-layer perceptrons, or 3- and 32-layer ones. Similarly, how are a model's capabilities influenced by the number of neurons in a layer, or, for convolutional layers, by parameters such as kernel size and stride? Same for transformers: what effects do embedding size, number of attention heads, and number of consecutive transformer layers have on the model's abilities? How do I determine good values?
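I don't know of a theory that answers those questions either; in practice the honest answer still seems to be "sweep the knobs and look at the trend". A hedged sketch of that (toy dataset and scikit-learn, my own choice of example, nothing from this thread):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=2000, noise=0.25, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for depth in (1, 2, 3):          # number of hidden layers
    for width in (4, 16, 64):    # neurons per hidden layer
        clf = MLPClassifier(hidden_layer_sizes=(width,) * depth,
                            max_iter=2000, random_state=0)
        clf.fit(X_tr, y_tr)
        print(f"depth={depth} width={width} "
              f"val_acc={clf.score(X_val, y_val):.3f}")
```

The point is not the specific accuracies but that, today, intuition about depth and width mostly comes from looking at curves like these rather than from any closed-form rule.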
I don't want absolute numbers here, but rather any kind of understanding at all of how to choose those numbers.
(There are some great answers in this thread already)
(* Part of it, not all. I'm starting to get annoyed by the "culture" of ML algorithm design that seems to love throwing in additional sources of randomness and nondeterminism whenever there is no better idea of what to do: randomly shuffling/splitting the training data, random initialization of weights, random neuron/layer dropout, random jumps during gradient descent, etc. All fine if you only care about statistics and probability distributions, but horrible if you want to debug a specific training setup or understand why your model learned some specific behavior.)
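(For what it's worth, most of those randomness sources can be pinned down when you need to debug a run. A hedged PyTorch sketch of the standard knobs, assuming torch/numpy and nothing specific to any setup mentioned in this thread:)

```python
import os
import random

import numpy as np
import torch

SEED = 0
random.seed(SEED)        # Python-level shuffles/splits
np.random.seed(SEED)     # numpy-based splits and augmentations
torch.manual_seed(SEED)  # weight init, dropout masks, etc.

# Raise an error instead of silently using a nondeterministic kernel.
# (On CUDA this additionally requires CUBLAS_WORKSPACE_CONFIG to be set.)
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
torch.use_deterministic_algorithms(True)

# Give the DataLoader its own seeded generator so batch order is reproducible.
g = torch.Generator()
g.manual_seed(SEED)
# loader = torch.utils.data.DataLoader(my_dataset, shuffle=True, generator=g)
# (`my_dataset` is a placeholder; left commented out to keep the sketch self-contained.)
```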
> every ML 101 course teaches the difference between single-layer and "multi"-layer (usually 2-layer) perceptrons: linear separability, the XOR problem, etc.
Yeah, that's the point! ML-related material seems to start with simpler problems like linear separation and XOR, then dives into some math, and soon shows magical Python code out of nowhere that solves one problem (e.g. MNIST) and only that problem.