
Just curious, how "deep" have you gone into the theory? What resources have you used? How strong is your math background?

Unfortunately a lot of the theory does require some heavy mathematics, the type you won't see in a typical undergraduate degree, even in more math-heavy subjects like physics: topics such as differential geometry, measure theory, set theory, abstract algebra, and high-dimensional statistics. But I do promise that the theory helps and can build some very strong intuition. It is also extremely important that you have a deep understanding of what these mathematical operations are doing. It does look like this exercise book is trying to build that intuition, though I haven't read it in depth. I can say it is a good start, but it is only the very beginning of the theory journey. There is a long road ahead beyond this.

  > how it makes me choose the correct number of neurons in a layer, how many layers,
Take a look at the Whitney embedding theorem. While it doesn't give a precise answer, it'll help you build intuition about the minimal number of dimensions (and hence parameters) you need (and the VGG paper will help you understand width vs. depth). In a transformer, the MLP block after attention scales the hidden dimension up 4x before projecting back down, which gives room for untangling any knots in the data. While 2x is the minimum the theorem guarantees, 4x creates a smoother landscape, so the problem can be solved more easily. Some of this is discussed in the paper by Schaeffer, Miranda, and Koyejo that counters the famous "Emergent Abilities" paper by Wei et al. This should be discussed early in ML courses alongside problems like XOR or the concentric circles. These problems are hard because in their natural dimension you cannot draw a hyperplane separating the classes, but by lifting the data into a higher dimension you can (see the sketch below). That fact is usually mentioned in intro ML courses, but I'm not aware of one that goes into more detail, such as a discussion of the Whitney embedding theorem, which would let you generalize the concept.
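To make the lifting concrete, here's a minimal sketch (my own illustration; all names and thresholds are made up for the demo): two concentric circles can't be split by any line in 2D, but adding the squared radius as a third coordinate makes a single plane suffice.

  # Concentric circles: not linearly separable in 2D, but lifting
  # (x1, x2) -> (x1, x2, x1^2 + x2^2) makes a plane separate them.
  import numpy as np

  rng = np.random.default_rng(0)
  n = 200
  theta = rng.uniform(0, 2 * np.pi, n)
  r = np.where(np.arange(n) < n // 2, 1.0, 3.0)  # inner and outer circle
  X = np.c_[r * np.cos(theta), r * np.sin(theta)]
  y = (r > 2).astype(int)

  Z = np.c_[X, (X ** 2).sum(axis=1)]             # lift to 3D

  # The plane z = 5 (between r^2 = 1 and r^2 = 9) now separates the classes.
  pred = (Z[:, 2] > 5).astype(int)
  print((pred == y).all())                       # True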

  > the activation functions
There's a very short video I like that visualizes GELU [0], even using the concentric circles! The channel has a lot of other visualizations that will really benefit your intuition, and you may see where the differential geometry background provides benefits: understanding how to manipulate manifolds is critical to understanding what these networks are doing to the data. Unfortunately these visualizations stop helping once you scale beyond 3D, as weird things happen in high dimensions, even as low as 10 [1]. A lot of visual intuition goes out the window, and this often leads people either to abandon it completely or to make erroneous assumptions (no, your friend cannot visualize 4D objects [2,3], and that image you see of a tesseract is quite misleading).
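If symbols help alongside the animation: GELU is x * Phi(x), with Phi the standard normal CDF. A quick sketch (function names are mine; the tanh form is the common approximation from the GELU paper):

  import math

  def gelu_exact(x: float) -> float:
      # x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
      return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

  def gelu_tanh(x: float) -> float:
      # widely used tanh approximation
      return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

  print(gelu_exact(1.0), gelu_tanh(1.0))  # both ~0.841

Unlike ReLU, GELU is smooth everywhere, which is exactly what makes the manifold-bending picture in the video so clean.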

The activation functions provide the non-linearity in these networks, a key ingredient missing from the perceptron model. Remember that the universal approximation theorem says you can approximate any continuous function on a compact domain. In simple cases you can relate this to Riemann summation, but with smooth "bump functions" instead of rectangles. I'm being hand-wavy here on purpose because this is not precise, but there are relationships to be found; this is a HN comment, I have to oversimplify. Also remember that a linear layer without an activation can only perform affine transformations. That is, after all, what a matrix multiplication (plus a bias) is capable of (another oversimplification).
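You can convince yourself of that last point in a few lines (a self-contained sketch, not from any particular library):

  # Without an activation, stacked linear layers collapse to one affine map.
  import numpy as np

  rng = np.random.default_rng(0)
  W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
  W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)
  x = rng.normal(size=4)

  two_layers = W2 @ (W1 @ x + b1) + b2        # "deep" linear network
  collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)  # single equivalent affine map
  print(np.allclose(two_layers, collapsed))   # True

  relu = lambda v: np.maximum(v, 0)           # insert a nonlinearity...
  nonlinear = W2 @ relu(W1 @ x + b1) + b2     # ...and the collapse no longer holds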

The learning curve is quite steep, and there's a big jump from the common "it's just GMMs" or "it's just linear algebra" claims [4]. There is a lot of depth here, and unfortunately, due to the hype, there is a lot of material labeled "deep" or "advanced mathematics"; it is important to remember that these terms are extremely relative. What is deep to one person is shallow to another. But if your material isn't going beyond calculus, you are going to struggle, and I am extremely empathetic to that. Again, though, I do promise that there is a lot of insight to be gained by digging into the mathematics. There is benefit to doing things the hard way. I won't try to convince you that it is easy or that there isn't a lot of noise surrounding the topic, because that'd be a lie. If it were easy, ML systems wouldn't be "black boxes"! [5]

I would also encourage you to learn some metaphysics. Something like Ian Hacking's Representing and Intervening is a good start. There are limitations to what can be understood through experimentation alone, famously illustrated in Dyson's recounting of when Fermi rejected his paper [6]. There is a common misunderstanding of the saying "with four parameters I can fit an elephant, and with five I can make him wiggle his trunk." [6] can help provide a better understanding of this, but we truly do need to understand the limitations of empirical studies. Science relies on the combination of empirical studies and theory; neither is any good without the other. Science is about building causal models, so one must be quite careful and extremely nuanced when doing any form of evaluation. The subtle details can easily trick you.
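To see why the elephant quip is really a warning about parameters versus data (a toy sketch of my own, not the actual elephant construction):

  # With as many parameters as data points you can "fit" pure noise
  # exactly; a perfect fit says nothing about the underlying law.
  import numpy as np

  rng = np.random.default_rng(0)
  x = np.linspace(0.0, 1.0, 5)
  y = rng.normal(size=5)                        # noise, no structure at all

  coeffs = np.polyfit(x, y, deg=4)              # 5 coefficients, 5 points
  print(np.allclose(np.polyval(coeffs, x), y))  # True: exact interpolation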

[0] https://www.youtube.com/watch?v=uiB97cPEVxM

[1] https://www.penzba.co.uk/cgi-bin/PvsNP.py?SpikeySpheres

[2] https://www.youtube.com/shorts/_n7TMDnYdVY

[3] https://www.youtube.com/watch?v=FfiQBvcdFG0

[4] https://news.ycombinator.com/item?id=43418334

[5] I actually dislike this term. It is better to say that they are opaque: "black box" would imply that we have zero insight, but in reality we can see everything going on inside; it is just extremely difficult to interpret. We also do have some understanding, so the interpretation isn't impenetrable.

[6] https://www.youtube.com/watch?v=hV41QEKiMlM




I wonder what kind of contributions you can make with a strong math background versus just an undergrad math background (engineer)? I know it's a vague question and it's not so cut and dry, but I've lately been thinking about theory vs. practice, and I feel a bit ambivalent towards theory (even though I started with theory and loved it) and also a bit lost, mostly due to the steep learning curve, i.e., having to go beyond undergrad math (I'm a CS student with an undergrad math background).

I guess it depends on what you want to do in your career and what problems you work on, but what changed my view on theory was seeing people with little math background, or with undergrad math at most, who were still productive in creating useful applications and/or producing research papers in DL. That showed me that what matters more is having a strong analytical mind, being a good engineer, and being pragmatic. With those qualities you can take a top-down approach when filling gaps in your knowledge, which I guess is possible because DL is such an empirical field at the moment.

So to me it feels like formally "going beyond undergrad math" matters more if you want to tackle the theoretical problems of DL, in which case you need all the help you can get from theory (perhaps not just math; physics and other fields might help as well, letting you view a problem through more than one lens). IMO it's like casting a wide net: the more you know, the bigger the net, and you hope that something sticks. Going the math-education route is a safe way to expand this net.


I also wonder about that. E.g., considering the team behind DeepSeek, was it more important for them to have great engineering skills or strong math backgrounds to achieve this success?


It's the combination that creates the magic. I'm a big believer that you need to spend time learning math as well as programming and computer architecture. The algorithms are affected by all of these things (this is why teams work best, but you need the right composition).

I'm a researcher and still early in my career. I'm no rockstar, but I'm definitely above average if you consider things like citations or h-index. Most of my work has been making models more efficient, using fewer resources, mostly because of a lack of GPU access lol. My focus is more on density estimation though (generative modeling).

And to be clear, I'm not saying you need to sit and do calculations all day. But learning this math is necessary for building the intuition and being able to apply it to real-world problems.

I'll give a real-world example though. I was interning at a big company last year, and while learning their framework I was playing around with their smaller model (the big one wasn't released yet). While training, I noticed it was saturating early on, and looking at the data I immediately recognized there were generalization issues. I asked for a week to retrain the model (I only had a single V100 available despite company resources). By the end of the week I had something really promising, but I was still behind on accuracy on the internal test set. I was convinced though, because I understood what causes generalization issues and the baked-in biases of the data acquisition. My boss was not convinced, and I kept asking for other test sets and customer data. Begrudgingly, it was given to me. I ran the tests and I 3x'd the performance, neck and neck with their giant model that had tons of pretraining (a few percent behind). A dinky little ResNet beating a few-hundred-million-parameter transformer; a few hours to train vs. weeks.

My boss was shocked. His boss was shocked (and he was very anti-theory). I even got emails from top people asking how I did it. I said that everything I did would only work better on transformers and that we should implement it there (I have experience with similar models at similar scales). And that's the end of the story. Nothing happened. My version wasn't released to customers, nor were the additions I made to the training algorithms merged (everything was optional too, so no harm).

That's been pretty representative of my experience so far, though. I can smash some metric at a small scale, and mostly people say "but does it scale?" and then do not give me the requisite compute to attempt it. I've seen this pattern with a number of people doing work like mine; I'm far from alone, and I've heard the same story at least a dozen times. The truth is that to compete with these giant models you still need a lot of compute. You can definitely get the same performance with 10x, maybe even 100x, fewer parameters or lower cost, but 1000x is a lot harder. I'm more concerned that we aren't really providing good pathways to grow. Science has always worked by starting small and then scaling. Sure, a lot fails along the way, but you have to try. The problem of the GPU-poor not being able to contribute to research is more gatekeeping than science. I don't think that should be controversial when you look at other comments in this thread: people say "no one knows" as if the answer is "no one can know, so don't try." That's very short-sighted. But hey, it's not like there's another post today with the exact same sentiment (you can find my comment there too): https://news.ycombinator.com/item?id=43447616


Thank you for that insightful comment! "Start small, do the research, then scale" is a pattern that's really overlooked these days. I wish you all the best in your future endeavours.


Haha, well, it's pretty hard to start big if you don't have the money lol. And thanks! I just want to see our machines get smarter and to get people to be open to trying more ideas. Until we actually have AGI, I think it's too early to say which method is definitely going to lead us there.


Thanks for your time! Just added your commentary to my favorites! :-)





