Seems cool, but one of the things that most annoys me about studying machine learning is that I can dive as deep as possible into the theory and still not see how it connects to practice, i.e. how it helps me choose the correct number of neurons in a layer, how many layers, the activation functions, whether I should use a neural network or other techniques, and so on...
If someone has something explaining that, I'll be grateful.
> how it helps me choose the correct number of neurons in a layer, how many layers, the activation functions
Seeing massive ablation studies on each one of those in just about every ML paper should be fairly indicative that nobody knows shit about fuck when it comes to that. Just people trying things out randomly and seeing what works, copying ideas from each other resulting in some vague guidelines. It's the worst field if you want things to be logical and explainable. It's mostly labelling datasets, paying for compute and hoping for the best.
> nobody knows shit about fuck when it comes to that
This is why I've abandoned neural networks as a computational substrate for genetic programming experiments.
Tape-based UTMs may be extremely rigid in how they execute instruction streams, but at least you can eventually understand and describe everything that contributes to their behavior.
Changing the fan-out from 12 to 15 in a NN is like an ancient voodoo ritual compared to realizing a program tape is probably not long enough based on rough entropy measures.
Most things can't be learned via pure theory or pure practice. Almost nothing related to work in the modern day can.
In ML not everything can be derived from theory. If it could, we wouldn't have been so surprised by the performance of really, really large language models. At the same time, if you can't reason about the math involved, you are going to have a difficult time figuring out why something isn't working or what options you have - it could be the architecture, the loss function, the choice of activation function, the optimizer, the hyperparameters, training time/resources, or a dozen other things.
> In ML not everything can be derived from theory.
And not every theory in ML has a lot of applications to practice. For example, statistical learning theory has only limited relevance in practice, and algorithmic learning theory has basically none at all. There are a lot of mathematical theories that are relatively old (often much older than the deep learning boom and definitely older than transformers) and that are more interesting from a conceptual perspective rather than from the point of practical applications.
ML practice has for the moment far outstripped ML theory. But even if ML theory catches up, the answers to your question will likely still depend on the nature of the process generating the data, and hence will still have to be answered empirically. I see the value of theory more in providing a general conceptual framework. Just as the asymptotic theory of algorithms today cannot tell you which algorithm to use, but gives you some broad guidance.
> the answers to your question will likely still depend on the nature of the process generating the data, and hence will still have to be answered empirically.
And I think that would be perfectly fine, or rather it would be weird otherwise. Part(*) of the unpredictability of ML models stems from the fact that the training data is unpredictable.
What is missing for me so far are more detailed explanations of how the training data and task influence specific decisions in model architecture. So I wouldn't expect a hard answer in the sense of "always use this architecture or that number of neurons", but rather more insight into what effects a specific architecture has on the model.
E.g. every ML 101 course teaches the difference between single-layer and "multi"-layer (usually 2-layer) perceptrons: Linear separability, XOR problem etc.
But I haven't seen a lot of resources about e.g. the differences between 2- and 3-layer perceptrons, or 3- and 32-layer, etc. Similarly, how are your model capabilities influenced by the number of neurons inside a layer, or for convolutional layers, by parameters such as kernel dimensions, stride dimensions, etc? Same for transformers: What effects do embedding size, number of attention heads and number of consecutive transformer layers have on the model's abilities? How do I determine good values?
I don't want absolute numbers here, but rather any kind of understanding at all how to choose those numbers.
(There are some great answers in this thread already)
(* part of it, not all. I'm starting to get annoyed by the "culture" of ML algorithm design that seems to love throwing in additional sources of randomness and nondeterminism whenever they don't have a good idea what to do otherwise: Randomly shuffling/splitting the training data, random initialization of weights, random neuron/layer dropouts, random jumps during gradient descent, etc etc. All fine if you only care about statistics and probability distributions, but horrible if you want to debug a specific training setup or understand why your model learned some specific behavior).
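(For what it's worth, when I do need to debug a specific run, the best I've found is to pin every seed I can reach. A minimal sketch, assuming a PyTorch/numpy setup - it won't remove the nondeterminism hiding in some GPU kernels, but it makes reruns comparable:)

    import random
    import numpy as np
    import torch

    def seed_everything(seed: int = 0) -> None:
        # pin the usual suspects: data shuffling, weight init, dropout masks
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # trade kernel-selection speed for reproducibility in cuDNN
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

    seed_everything(42)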
> every ML 101 course teaches the difference between single-layer and "multi"-layer (usually 2-layer) perceptrons: Linear separability, XOR problem etc.
Yeah, that's the point! ML material seems to start with simpler problems like linear separation and XOR, then dives into some math, and soon shows some magical Python code out of nowhere that solves a problem (e.g. MNIST) and only that problem.
Beginner here. A takeaway I got from Andrew Ng's Coursera course (specifically for neural networks) is that adding more neurons and layers than the "minimum needed" is usually okay (that is, little risk of overfitting given reasonable regularization terms). Sadly, there is no rule for that minimum, so you must do trial and error; on the other hand, carelessly extending the network will be inefficient and eventually slow.
For the activation functions, the output layer's is mostly determined by the problem being tackled, and for the inner layers you usually start with ReLU and then try some of the common variants using heuristics (again related to the problem at hand).
Of course you should consider other successful models for similar problems as your starting point.
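To make the activation part concrete, a rough PyTorch sketch (the layer sizes are arbitrary placeholders): the hidden layers use ReLU and the output layer follows from the task.

    import torch.nn as nn

    def mlp(out_dim, out_act=None):
        layers = [nn.Linear(64, 128), nn.ReLU(),
                  nn.Linear(128, 128), nn.ReLU(),
                  nn.Linear(128, out_dim)]
        if out_act is not None:
            layers.append(out_act)
        return nn.Sequential(*layers)

    binary_clf = mlp(1, nn.Sigmoid())  # or keep raw logits and use BCEWithLogitsLoss
    multi_clf  = mlp(10)               # softmax is usually folded into CrossEntropyLoss
    regressor  = mlp(1)                # linear output for regression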
Meat pros told me linear algebra and some differential calculus are the bare minimum. That's because some classes are designed to build on only that. However, I think statistics and probability would be helpful since they keep using techniques from both. Also, you can solve many problems without deep learning, just using older statistical methods.
Essential Math for AI: Next-Level Mathematics for Efficient and Successful AI Systems by Hala Nelson provides a broad and comprehensive overview. The author provides a "big picture" view (important for AI/ML study since it is easy to get lost in specific details) and relates concepts to each other at a high level thus enhancing knowledge comprehension and assimilation.
I guess the real question would be: given a budget of P parameters, how many hidden layers in something like a multi-layer perceptron is a good idea, and what should the sizes of those hidden layers be? As well as questions like: is it ever a good idea to have a hidden layer that is larger than the previous layer (including the input)? Or are you just wasting compute / parameter space?
I'm no expert, but a "rule of thumb" might be the more non-linear the system is, the more hidden layers you would want.
Also let us consider the information in the input vector from the perspective of compression.
How much you can compress without losing information depends on the entropy of the system. Low entropy = high compression ratio, while high entropy = low compression. High entropy is essentially noise (total disorder), on the other hand very low entropy just doesn't have much information (like a very long string of 10101010 ...)
Most "interesting data" (video/audio/images) can be compressed at ratios of about 50% before information loss kicks in. Note: text can be compressed quite heavily, but that is partially because the encoding is extremely inefficient - e.g. 8 bits per char when really only ~5 are needed - and also of much lower entropy (only a few tens of thousands of words in common English usage, for example).
On the other hand, information loss might not be such a bad thing if the input data has extraneous information, which it often does. This is why video, audio and image data can be compressed at ratios 10x-20x before noticeable loss of quality.
So I think the answer would be, you don't want to decrease the size of the previous layer, especially the input layer, by more than about 10x-20x.
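As a rough sketch of the entropy argument (byte-level Shannon entropy only, which ignores longer-range structure and therefore overestimates the true entropy of structured data):

    import math
    from collections import Counter

    def byte_entropy(data: bytes) -> float:
        """Shannon entropy in bits per byte (maximum is 8.0)."""
        counts = Counter(data)
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    print(byte_entropy(b"10" * 1000))           # ~1 bit/byte: highly compressible
    print(byte_entropy(bytes(range(256)) * 8))  # 8 bits/byte: looks like noise at this level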
Just curious, how "deep" have you gone into the theory? What resources have you used? How strong is your math background?
Unfortunately a lot of the theory does require some heavy mathematics, the type you won't see in a typical undergraduate degree even for more math-heavy subjects like physics. Topics such as differential geometry, measure theory, set theory, abstract algebra, and high-dimensional statistics. But I do promise that the theory helps and can build some very strong intuition. It is also extremely important that you have a deep understanding of what these mathematical operations are doing. It does look like this exercise book is trying to build that intuition, but I haven't read it in depth. I can say it is a good start, but only the very beginning of the theory journey. There is a long road ahead beyond this.
> how it helps me choose the correct number of neurons in a layer, how many layers,
Take a look at the Whitney embedding theorem. While it isn't a precise answer, it'll help you gain some intuition about the minimal number of parameters you need (and the VGG paper will help you understand width vs depth). In a transformer, the MLP layer after attention scales the dimension up 4x before coming back down, which allows for untangling any knots in the data. While 2x is the minimum, 4x creates a smoother landscape and so the problem can be solved more easily. Some of this is discussed in the paper by Schaeffer, Miranda, and Koyejo that counters the famous Emergent Abilities paper by Wei et al. This should be discussed early on in ML courses when covering problems like XOR or the concentric circles. These problems are difficult because in their natural dimension you cannot draw a hyperplane discriminating the classes, but by increasing the dimensionality of the problem you can. This fact is usually mentioned in intro ML courses, but I'm not aware of one that goes into more detail, such as a discussion of the Whitney embedding theorem, that would let you generalize the concept.
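A tiny numerical sketch of that dimensionality-lifting idea (the radii and threshold are made up): two concentric circles can't be separated by a line in 2D, but adding x^2 + y^2 as a third coordinate makes a single hyperplane (a threshold on the new axis) sufficient.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    theta = rng.uniform(0, 2 * np.pi, n)
    r = np.where(rng.random(n) < 0.5, 1.0, 3.0)   # inner vs. outer circle
    x, y = r * np.cos(theta), r * np.sin(theta)
    labels = (r > 2.0).astype(int)

    # lift to 3D: the new coordinate is the squared radius
    z = x**2 + y**2

    # in the lifted space, the plane z = 4 separates the classes perfectly
    pred = (z > 4.0).astype(int)
    print("accuracy:", (pred == labels).mean())   # 1.0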
> the activation functions
There's a very short video I like that visualizes Gelu[0], even using the concentric circles! The channel has a lot of other visualizations that will really benefit your intuition. You may see where the differential geometry background can provide benefits. Understanding how to manipulate manifolds is critical to understanding what these networks are doing to the data. Unfortunately these visualizations will not benefit you once you scale beyond 3D as weird things happen in high dimensions, even as low as 10[1]. A lot of visual intuition goes out the window and this often leads people to either completely abandon it or make erroneous assumptions (no, your friend cannot visualize 4D objects[2,3] and that image you see of a tesseract is quite misleading).
The activation functions provide non-linearity to the network - a key ingredient missing from the perceptron model. Remember that with the universal approximation theorem you can approximate any smooth, Lipschitz-continuous function over a compact domain. You can, in simple cases, relate this to Riemann summation, but using smooth "bump functions" instead of rectangles. I'm being fairly hand-wavy here on purpose because this is not precise, but there are relationships to be found here. This is a HN comment, I have to oversimplify. Also remember that a linear layer without an activation can only perform affine transformations. That is, after all, what a matrix multiplication is capable of (another oversimplification).
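A quick numerical sanity check of that last point (sizes are arbitrary): two linear layers with no activation in between collapse to a single affine map, so the extra layer buys nothing; put a nonlinearity between them and that collapse no longer holds.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(32, 16)), rng.normal(size=32)
    W2, b2 = rng.normal(size=(8, 32)),  rng.normal(size=8)
    x = rng.normal(size=16)

    two_linear_layers = W2 @ (W1 @ x + b1) + b2
    single_affine_map = (W2 @ W1) @ x + (W2 @ b1 + b2)
    print(np.allclose(two_linear_layers, single_affine_map))   # True

    # with a ReLU in between, no single affine map reproduces the output in general
    relu = lambda v: np.maximum(v, 0.0)
    with_activation = W2 @ relu(W1 @ x + b1) + b2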
The learning curve is quite steep and there's a big jump from the "it's just GEMMs" or "it's just linear algebra" that is commonly claimed[4]. There is a lot of depth here, and unfortunately due to the hype there is a lot of material labeled "deep" or "advanced mathematics", but it is important to remember that these terms are extremely relative. What is deep to one person is shallow to another. But if it isn't going beyond calculus, you are going to struggle, and I am extremely empathetic to that. Again, though, I do promise that there is a lot of insight to be gained by digging into the mathematics. There is benefit to doing things the hard way. I won't try to convince you that it is easy or that there isn't a lot of noise surrounding the topic, because that'd be a lie. If it were easy, ML systems wouldn't be "black boxes"![5]
I would also encourage you to learn some metaphysics. Something like Ian Hacking's Representing and Intervening is a good start. There are limitations to what can be understood through experimentation alone, famously illustrated in Dyson's recounting of when Fermi rejected his paper[6]. There is a common misunderstanding of the saying "with 4 parameters I can fit an elephant and with 5 I can make it wiggle its trunk." [6] can help provide a better understanding of this, but we truly do need to understand the limitations of empirical studies. Science relies on the combination of empirical studies and theory; neither is any good without the other. This is because science is about creating causal models, so one must be careful and extremely nuanced when doing any form of evaluation. The subtle details can easily trick you.
[5] I actually dislike this term. It is better to say that they are opaque. A black box would imply that we have zero insights. But in reality we can see everything going on inside, it is just extremely difficult to interpret. We also do have some understanding, so the interpretation isn't impenetrable.
I wonder what kind of contributions you can make with a strong math background versus just an undergrad math background (engineer)? I know it's a vague question and it's not so cut and dry, but I've lately been thinking about theory vs practice and feel a bit ambivalent towards theory (even though I started with theory and loved it), and also a bit lost, mostly due to the steep learning curve, i.e. having to go beyond undergrad math (I'm a CS student with an undergrad math background). I guess it depends on what you want to do in your career and what problems you are working on. What changed my view on theory was seeing people with little math background, or with only undergrad math at most, who were still productive in creating useful applications or producing research papers in DL. That showed me that what matters more is having a strong analytical mind, being a good engineer, and being pragmatic. With those qualities it feels like you can take a top-down approach when filling gaps in your knowledge, which I guess is possible because DL is such an empirical field at the moment.
So to me it feels like formally "going beyond undergrad math" matters more if you want to tackle the theoretical problems of DL, in which case you need all the help you can get from theory (perhaps not just math; physics and other fields might help as well, letting you view a problem through more than one lens). IMO it's like casting a wide net: the more you know, the bigger the net, and you hope that something sticks. Going the math education route is a safe way to expand that net.
I also wonder about that. E.g., considering the team behind DeepSeek, was it more important for them to have great engineering skills or strong math backgrounds to achieve this success?
It's the combination that creates the magic. I'm a big believer that you need to spend time learning math as well as programming and computer architecture. The algorithms are affected by all these things (this is why teams work best - but you need the right composition).
I'm a researcher and still early in my career. I'm no rockstar, but I'm definitely above average if you consider things like citations or h-index. Most of my work has been making models more efficient, using fewer resources - mostly because of a lack of GPU access lol. My focus is more on density estimation though (generative modeling).
And to be clear, I'm not saying you need to sit and do calculations all day. But learning this math is necessary for the intuition, and for being able to apply it to real-world problems.
I'll give a real world example though. I was interning at a big company last year, and while learning their framework I was playing around with their smaller model (the big one wasn't released yet). While training it I noticed it was saturating early on, and looking at the data I immediately recognized there were generalization issues. I asked for a week to retrain the model (I only had a single V100 available despite company resources). By the end of the week I had something really promising, but I was still behind on accuracy on the internal test set. I was convinced though, because I understand what causes generalization issues and the baked-in biases of the data acquisition. My boss was not convinced, and I kept asking for other test sets and customer data. Begrudgingly it was given to me. I ran the test and I 3x'd the performance, ending up neck and neck with their giant model that had tons of pretraining (a few percent behind). A dinky little ResNet beating a few-hundred-million-parameter transformer; a few hours to train vs weeks. My boss was shocked. His boss was shocked (and he was very anti-theory). I even got emails from top people asking how I did it. I said that everything I did works even better on transformers and that we should implement it there (I have experience with similar models at similar scales). And that's the end of the story. Nothing happened. My version wasn't released to customers, nor were the additions I made to the training algorithms merged (everything was optional too, so no harm).
That's been pretty representative of my experience so far though. I can smash some metric at a small scale and mostly people say "but does it scale?" and then do not give me the requisite compute to attempt it. I've seen this pattern with a number of people doing things like me; I'm far from alone and I've heard the same story at least a dozen times. The truth is that to compete with these giant models you still need a lot of compute. You can definitely get the same performance with 10x and maybe even 100x fewer parameters or lower cost, but 1000x is a lot harder. I'm more concerned that we aren't really providing good pathways to grow. Science has always worked by starting small and then scaling. Sure, a lot fails along the way, but you have to try. The problem of the GPU-poor not being able to contribute to research is more about gatekeeping than science. But I don't think that should be controversial when you look at other comments in this thread. People say "no one knows" as if the answer is "no one can know, so don't try". That's very short sighted. But hey, it's not like there's another post today with the exact same sentiment (you can find my comment there too) https://news.ycombinator.com/item?id=43447616
Thank you for that insightful comment! "Starting small, research and scale then" is really a pattern often overlooked these days. I wish you all the best for your future endeavours.
Haha well it's pretty hard to start big if you don't have the money lol. And thanks! I just want to see our machines get smarter and to get people to be open to trying more ideas. Until we actually have AGI I think it's too early to say which method is going to definitely lead us there
The fact that it is still as much of an art as it is a science means there is no “correct” values for these things. Only heuristics and guess-and-checks.
This has always been my problem trying to learn it. It seems like throwing stuff at a wall and seeing what sticks.
The maths isn’t hard for me but the explanations of ‘why does this work better than that’ are always super hand wavey. Or actually quite often it’s “it doesn’t but it’s faster to compute”
[Edit] I seem to have turned this into somewhat of an information dump...
Like other commenters said, you typically find those out by just trying them one by one and seeing what works. However, you can prune the search space considerably given that you know a few things, ranging from theory to large experimental results. For example, if Google or someone else widely deploys a certain configuration, other people just use that. If large experiments show that this and that setting for Adam works well for NLP, other people just use that when working on NLP problems. There was a large experiment showing that the best activation functions were of the form alpha * sigmoid(beta * x); sigmoid, tanh and GELU are all of this form. Stuff like this is, unfortunately, the majority of the knowledge. In fact, ReLU is being used without there even being a universal approximation theorem[1] for networks using it! The canonical one only works when sigmoids are used. No one cared, because it worked in practice.
Typically, theoretical results are difficult to come by for a model structure as general as neural networks. Think about it: a theoretical result "for all neural networks" has very few logical statements, i.e. constraints, to work with that can then be combined to produce other statements. So you tend to see theoretical results for a subset of architectures, because the constraints defining that subset give us more to work with and can be combined into a theorem or proof. Then people find out empirically that it works well for more general networks too. An example of this type of result is dropout. The empirical motivation for it was trying to train ensembles of networks cheaply. In an attempt to rest it on some theoretical grounding, it was shown that for linear models it is equivalent to adding noise to the input, which can be shown to be a good regularizer. But there is no proof for more complex architectures. In practice, it works anyway. But you're not sure, so you include it in your hyperparameter search.
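To make the dropout-as-noise connection slightly more concrete, a minimal numpy sketch (inverted dropout, as used in practice): on a linear model the mask is just multiplicative noise on the input, and in expectation it matches the clean forward pass.

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=100)
    x = rng.normal(size=100)
    p_keep = 0.8

    def dropout_forward(x, w, p_keep, rng):
        mask = rng.random(x.shape) < p_keep   # Bernoulli "noise" on the input
        return w @ (x * mask / p_keep)        # inverted-dropout rescaling

    samples = [dropout_forward(x, w, p_keep, rng) for _ in range(100_000)]
    print(np.mean(samples))   # close to the clean output below
    print(w @ x)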
There is good theoretical grounding for many regularization methods. My favorite is the proof that the very straightforward L2 regularization on SGD can be shown to precisely limit the unimportant features while hardly regularizing the important ones. You can also search "Stein's lemma neural networks". I found [2], a talk on this topic by Anima Anandkumar - always a good sign.
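For reference, the L2 penalty shows up in the SGD update as plain weight decay: with loss = data_loss + (lambda/2) * ||w||^2, the step becomes w <- w - lr * (grad + lambda * w). A one-function sketch (lr and lambda are arbitrary defaults):

    import numpy as np

    def sgd_l2_step(w: np.ndarray, grad: np.ndarray, lr: float = 1e-2, lam: float = 1e-3) -> np.ndarray:
        # the gradient of (lam/2) * ||w||^2 is lam * w, so every step shrinks the weights a bit;
        # weights with a strong data gradient resist the shrinkage, unimportant ones decay toward zero
        return w - lr * (grad + lam * w)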
For activation functions, it is mostly that experimental result that everyone relies on.
The universal approximation theorem [1] says that even a single hidden layer is enough to represent any function. However, there is a practical difficulty in training these single-layer networks. Deepening the network provides a lot of efficiency advantages; notably, for certain classes of functions it provides an exponential advantage (Eldan and Shamir 2016). There is a wishy-washy (IMHO) theory called the Information Bottleneck theory, which tries to show that multiple layers stack on top of each other, each uncovering one level of "hierarchy" in the data distribution. This is seen in practice (see StyleNet), but the theory is a little weak, again IMHO.
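A toy sketch of the single-hidden-layer claim (width, learning rate and step count are all arbitrary): one hidden layer can fit a simple 1D function, and the practical price is the width and the optimization effort.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.linspace(-3, 3, 256).unsqueeze(1)
    y = torch.sin(2 * x)

    # a single hidden layer; width is the knob the theorem lets you pay with
    net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)

    for _ in range(2000):
        opt.zero_grad()
        loss = ((net(x) - y) ** 2).mean()
        loss.backward()
        opt.step()

    print(loss.item())   # small; shrink the hidden width and watch the fit degrade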
There are also a lot of tweaks made to the architecture in the name of preventing the "vanishing gradients" problem - a problem that arises because we use backpropagation to train these networks. There is _some_ theory to help understand this, which comes out of random matrix theory, but I don't know much of it.
There is the old VC dimension theory of model complexity, but that doesn't cleanly apply to neural networks as far as I have seen.
[1] in case you are unaware, this is the theorem that makes pursuing neural networks sound in the first place. It says that you can always make a neural network that computes an arbitrary function up to an arbitrary precision threshold.
NFL (no free lunch) says something about it being a wash for arbitrary data. All results are going to be tuned to assumptions we have about our data in particular (not too many discontinuities, sufficiently well sampled, ...).
> neurons in a layer, how many layers, ...
Scaling laws are, currently, empirically derived. From those you can pick your goals (e.g., at most $X and maximize accuracy) and work backward to one or more optimal sets of parameters. Except in very restricted domains or with other strong assumptions I haven't seen anything giving you more than that.
> activation functions
All of the above - about how it can't matter for arbitrary data and how parameters need to be empirically derived - applies. However: an important inductive bias a lot of practitioners use is that every weight in the model should be roughly equally important. There are other ways you choose activation functions, especially in specialized domains, but when designing a deep network one of the most important things you can do is control the magnitude of information at each level of backpropagation. If your activation function (and surrounding infrastructure) approximately handles that problem, then it's probably good enough.
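A rough way to see the "control the magnitude" point (forward pass only; width and depth are arbitrary): stack ReLU layers and watch the activation scale collapse or stay put depending on the init gain - the same bookkeeping applies to gradients on the way back.

    import numpy as np

    rng = np.random.default_rng(0)
    width, depth = 512, 20
    x = rng.normal(size=(1024, width))

    for label, gain in [("naive gain 1.0", 1.0), ("He init gain sqrt(2)", np.sqrt(2.0))]:
        h = x
        for _ in range(depth):
            W = rng.normal(size=(width, width)) * gain / np.sqrt(width)
            h = np.maximum(h @ W, 0.0)   # ReLU layer
        print(label, "- std after", depth, "layers:", h.std())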
> neural network or other techniques
For almost every problem you're better off using something other than a neural network (like catboost). I don't have any good intuition for why that's the case. Test them both. That's what the validation dataset is for.
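A hedged sketch of the "test them both" advice, using scikit-learn's HistGradientBoostingClassifier as a stand-in for catboost and a built-in dataset as a placeholder:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    models = {
        "boosted trees": HistGradientBoostingClassifier(random_state=0),
        "small MLP": make_pipeline(StandardScaler(),
                                   MLPClassifier(hidden_layer_sizes=(64, 64),
                                                 max_iter=2000, random_state=0)),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(name, model.score(X_val, y_val))   # let the validation set decide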
> how it connects to practice
For this article in particular, it doesn't connect to a ton of what I personally do. I'm sure it resonates with someone. As soon as pytorch or jax or whatever isn't good enough though and you have to go implement stuff from scratch, you need a deep dive in the theory you're implementing. To a lesser degree, if you're interfacing with big frameworks nontrivially or working around their limitations, you still need a deep understanding of the things you're implementing.
Imagine, e.g., that you want all the modern ML tools in a world where dynamic allocation, virtual functions, and all that garbage aren't tractable. You can resoundingly beat every human heuristic for phantom touchpad events in your mouse driver with a tiny neural network, but you can't use pytorch to do it without turning your laptop into a space heater.
Embedded devices aren't the only scenario where you might have to venture off the beaten path. Much like the age-old argument of importing a data structure vs writing your own, as soon as you have requirements beyond what the library author provides it's often worth it to do the whole thing on your own, and it takes a firm theoretical foundation to do so swiftly and correctly.
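For a sense of scale, a framework-free forward pass for a network that tiny is only a handful of lines of plain Python (the layer sizes and the "phantom touch" framing below are invented for illustration; a real driver would ship fixed-point weights exported from the training run):

    def relu(v):
        return [max(x, 0.0) for x in v]

    def linear(x, W, b):
        # W is a list of rows, b a list of biases
        return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

    def is_phantom_touch(features, params):
        # e.g. 8 input features -> 6 hidden units -> 1 logit
        h = relu(linear(features, params["W1"], params["b1"]))
        logit = linear(h, params["W2"], params["b2"])[0]
        return logit > 0.0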
> how it connects to practice
That's a criticism I have of a lot of educational materials. Connecting the dots is important in writing (competing with all the advantages of brevity).
Pick on the Model-Based Learning section as an example. We're asked, to start, to MLE a gaussian. (M)aximum (L)ikelihood (E)stimation is an extremely important concept, and a lot of ML practitioners throw it to the side.
Imagine, e.g., a 2-stage process where for each price bracket you have a model reporting the likelihood of conversion and then a second stage where you synthesize those predictions into an optimal strategy. Common failure modes include (a) mishandling variance, (b) assuming that MLE on each of the models allows you to combine the mean/mode/... results into an MLE composite action, (c) really an extension of [b], but if you have the wrong loss function for your model(s) then they aren't meaningfully combinable, ....
Something that should be obvious (predict conversion rates, combine those rates to determine what you should do) has tons of pitfalls if you don't holistically reason about the composite process. That's perhaps a failure in the primitives we use to construct those composite processes, but in today's day and age it's still something you have to consider.
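A minimal sketch of how (a) and (b) bite (every number below is made up: three price brackets, an MLE point estimate and a rough posterior spread per bracket): the price with the best plug-in revenue is not necessarily the price you'd pick once the uncertainty of each conversion model is propagated.

    import numpy as np

    rng = np.random.default_rng(0)
    prices = np.array([10.0, 20.0, 40.0])
    p_hat = np.array([0.50, 0.30, 0.16])   # per-bracket MLE conversion estimates
    p_std = np.array([0.02, 0.03, 0.12])   # made-up uncertainty on each estimate

    # pitfall (b): plug in the point estimates and take the argmax
    plug_in = prices * p_hat
    print("plug-in pick:", prices[np.argmax(plug_in)], plug_in)

    # pitfall (a): propagate the uncertainty (here by naive sampling) and look at the tails
    samples = np.clip(rng.normal(p_hat, p_std, size=(100_000, 3)), 0, 1) * prices
    print("mean revenue:", samples.mean(axis=0))
    print("5th percentile:", np.percentile(samples, 5, axis=0))
    # a risk-sensitive objective can prefer a different bracket than the plug-in argmax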
How does the book connect? I dunno. It looks more like a "kata" (keep your fundamental skills sharp) than anything else. An explicit connection to some real-world problem might make it more tractable.
This is what I was expecting. Very much appreciated. OP's paper is good - but I sort of feel like it's preaching to the choir. It's a great resource if you already know the material.
Looks neat! My only criticism would be that the solutions are given right after the questions, so I couldn't help reading the answer to a question before thinking it through by myself.
This is really neat! I work in machine learning but still feel imposter syndrome about my foundations in math (specifically linear algebra and matrix/tensor operations). Does anyone have any more good resources for problem sets with an emphasis on deep learning foundational skills? I find I learn best if I do a bit of hands-on work every day (and if I can learn things from multiple teachers' perspectives).
Depends on your definition of "ML practitioner", "building" and "ML". Look at the section on optimization - some people have an extremely good grasp of this and it helps them mentally iterate through possible loss functions and possible ways to update parameters and what can go wrong.
Some people see studying as a chore and want to learn the minimum to get the job done. Others find it insightful and fun and enjoy doing problems and reading material.
Both approaches make contributions and can lead to success, but in different ways.
I am curious about the same thing. I worked as a ML engineer for several years and have a couple of degrees in the field. Skimming over the document, I recognized almost everything but I would not be able to recall many of these topics if asked without context, although at one time I might have been able to.
What are others' general level of recall for this stuff? Am I a charlatan who never was very good at math or is it just expected that you will forget these things in time if you're not using them regularly?
Not sure which other topics you mean, but "1000 exercises in probability" should keep you busy for a while (one can find the PDF online). For other math oriented riddles, check out "The colossal book of short puzzles and problems" and "The art and craft of problem solving"
Funny how mathematicians always try to sneak their linear algebra and matrix theory into ML. If you didn't know any better, you'd think academicians had invented LLMs and are the experts to be consulted.
If anything, academicians and theoreticians held ML back and forced generations of grad students into doing symbolic proofs, like in this example, just because computational techniques were too lowbrow for them.
Are math skills really that important? Most aspects of deep learning don't require a deep understanding of mathematics. Backprop, convolution, attention, recurrent networks, skip connections, GANs, RL, GNNs, etc. can all be understood with only simple calculus and linear algebra.
I understand that the theoretical motivation for models is often more math-heavy, but I'm skeptical that motivations need always be mathematical in nature.
I think CNNs follow very naturally from the notion of shift/spatial invariance of visual processing. That doesn't require a mathematical understanding.
Every MLE who didn't study math really likes to downplay its importance. Yeah, you don't need measure-theoretic probability, but you do need a grasp of linear algebra to structure your computations better. Remember the normalization we do in attention? That has a mathematical justification. So I guess, yeah, academics did have a role in building LLMs.
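(Assuming you mean the 1/sqrt(d_k) scaling in scaled dot-product attention: the justification is a one-liner - the dot product of two independent unit-variance d_k-dimensional vectors has variance about d_k, so without the rescaling the softmax saturates as you widen the heads. A quick numpy check, with d_k chosen arbitrarily:)

    import numpy as np

    rng = np.random.default_rng(0)
    d_k = 64
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))

    scores = (q * k).sum(axis=1)
    print(scores.var())                    # ~ d_k: unscaled logits grow with dimension
    print((scores / np.sqrt(d_k)).var())   # ~ 1: keeps the softmax in a sane range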
I mean, computer scientists really do like to pretend they invented the whole field, whereas in reality the average OS, compilers, or networks class has nothing to do with core ML. But of course those are also important, and these barbs don't get us anywhere.
I think you might've taken my point too strongly. Of course math is very useful, and certain contributions are purely mathematical. I just don't think it is as hard of a requirement for innovation as was claimed.
Just curious for you or anyone else, what would make such an app compelling for you to use? And maybe not one that's just aimed at learning the content of this document, but if you'd like to think more broadly, an app aimed at helping you learn and retain things that you're currently interested in, studying, etc.
For these machine learning problems specifically, I feel like there are so many people who would greatly benefit from having some form of spaced-repetition practice (like the adaptive Khan Academy style app you mention), or some other easy-to-use format. I just wonder what other features people would want that would make them choose something like this over learning with other resources (e.g., YouTube videos, reading books, etc.).
For me it's about a sense of progress. In chess you can have an Elo score; in Duolingo there's a roadmap. If there were levels to this, you could get more confident in your abilities.
Right now the levels are basically bachelors, masters, and PhD. Coarse and expensive