It seems like this can leave the reader with the wrong impression. Calculus really is "the mathematics of Newtonian physics". This is just "some mathematics that might help a bit in your intuitions of deep learning".
IE, deep learning is fundamentally just about getting the mathematically simple but complex, multi-layered "neural networks" to do stuff: training them, testing them, and deploying them. There are many intuitions about these things but no complete theory - some intuitions involve mathematical analogies and simplifications while others involve "folk knowledge" or large-scale experiments. And that's not to say folks doing math about deep learning aren't proving real things. It's just that they aren't characterizing the whole, or even a substantial part, of such systems.
It's not surprising that a construct as complex as a many-layered ReLU network can't be fully characterized or solved mathematically. You'd expect that of any arbitrarily complex algorithmic construct. Differential equations in many variables with arbitrary functions also can't have their solutions fully characterized.
As a PhD student who sort of burned out on this type of research, I agree that the complexity of Neural Networks as a mathematical construct makes them very difficult to analyze. This might also have to do with Deep learning theory being a subset of learning theory which is subject to "No Free Lunch" [1], which means that you always have to be very careful not to try to prove something that turns out to be impossible.
That being said, research on the Kernel regime is one of the very cool ideas, in my opinion, to gain traction in this field in the past few years. To summarize: "If you make a neural network wide enough, it gains the power to control its output on each individual input separately, and will begin to fit its training data perfectly". Of course, the real pleasure is in understanding all the mathematical details of this statement!
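A toy numpy sketch of that interpolation claim (this is random-feature regression, a common stand-in for the kernel regime, not the NTK itself; all numbers are made up): once the hidden width exceeds the number of training points, fitting only the output layer already drives the training error to essentially zero.

```python
import numpy as np

# "Wide enough to control its output on each input separately":
# with frozen random ReLU features and width >> number of training
# points, least squares on the output layer interpolates the data.
rng = np.random.default_rng(0)

n, width = 20, 500                      # 20 training points, 500 hidden units
x = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * x).ravel()               # arbitrary target function

# Random (frozen) first layer, as in random-feature approximations
W = rng.normal(size=(1, width))
b = rng.normal(size=width)
features = np.maximum(x @ W + b, 0.0)   # ReLU features, shape (n, width)

# Train only the linear output layer (least squares)
coef, *_ = np.linalg.lstsq(features, y, rcond=None)
train_error = np.max(np.abs(features @ coef - y))
print(f"max training error: {train_error:.2e}")  # essentially zero
```

Of course this says nothing about generalization, which is where the actual mathematical work in the kernel-regime literature lives.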
I got my master's years ago so now I'm a strict amateur. That said, I don't think the "No Free Lunch" theorem is very "interesting". It's nearly tautological that no approximation method works for "any" function. The set of predictable/interesting/useful/"real-world" functions is going to have measure 0 compared to white noise, so "any function" will basically look like white noise and can't be predicted. Approximating functions/sequences with vanishingly low Kolmogorov complexity is more interesting - impossible in general by Gödel's theorem, but what's the case "on average"? (That depends on the choice process and so is ill-defined, but defining it might be interesting.) The kernel regime stuff looks interesting but I don't know its relation to wide networks.
Neural networks "tend to generalize well in the real world". That's a pretty fuzzy statement imo since "real world" is hardly defined but it's still what people experience and it's more useful to provide a more precise model where this works rather than a model where this doesn't work.
Also, there's good theory on deep networks as universal approximators, as well as theories of wide/shallow networks [1].
> Neural networks "tend to generalize well in the real world".
I've always interpreted that as "we've found an algorithm that could, given a foreseeable amount of computing power and maybe some tweaks, simulate human decision making".
It isn't so much that neural networks can approximate the real world as they can approximate human perception of the real world.
Well, I quote the statement to show how vague it is, among things.
Neural networks are "universal approximators" in that they work as well as virtually any previous approximation method. So given a big snapshot of input data and human judgments on it, they can approximate that. They can also approximate a snapshot of input-output pairs not produced by humans but having patterns (solutions to differential equations, for example).
So, they can approximate what humans do in a given domain. But there's no reason to think they're acting in the same way as humans and I'd say very few people seriously working on ML believe that.
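As a concrete (entirely made-up) instance of approximating a non-human pattern: a one-hidden-layer tanh network, trained by hand-written full-batch gradient descent, fitting samples of x(t) = e^(-t), the solution of x' = -x.

```python
import numpy as np

# Fit a tiny MLP to samples of the ODE solution x(t) = exp(-t).
# No human judgment anywhere -- just a pattern in input-output pairs.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0, 40)[:, None]   # inputs, shape (40, 1)
y = np.exp(-t).ravel()                   # targets

h = 16                                   # hidden width
W1, b1 = rng.normal(size=(1, h)) * 0.5, np.zeros(h)
w2, b2 = rng.normal(size=h) * 0.5, 0.0

lr = 0.05
for _ in range(20000):
    a = np.tanh(t @ W1 + b1)             # hidden activations, (40, h)
    pred = a @ w2 + b2                   # network output, (40,)
    err = pred - y                       # dLoss/dpred for 0.5 * MSE
    # backprop by hand
    gw2 = a.T @ err / len(y)
    gb2 = err.mean()
    da = np.outer(err, w2) * (1 - a ** 2)
    gW1 = t.T @ da / len(y)
    gb1 = da.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    w2 -= lr * gw2; b2 -= lr * gb2

print(f"max fit error: {np.max(np.abs(pred - y)):.3f}")
```

Nothing in the training loop "knows" it is solving a differential equation; the net just matches the snapshot of pairs, which is the point.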
This intuition is very dangerous and leads to huge misconceptions about deep neural nets.
Neural nets don't learn anything like us, and they don't reproduce our functions. We build on massive amounts of general symbolic knowledge, and can zero-shot tasks (without explicit examples) easily.
Neural networks really should be seen as just giant random functions that you progressively modify in tiny ways until they fit your data. As the parent says, we've just been lucky or good at constraining these functions so that they can only learn useful functions (i.e. convnets) or so that they learn them more quickly.
Humans certainly do not build on massive amounts of symbolic knowledge because we are absolutely terrible at symbolic knowledge. Reliably reasoning through a basic logical argument is a specialist skill. Even reviewing evidence before making decisions is uncommon, most humans operate on a look -> assess -> do model where the tricky bit is well approximated by a neural net. Which is why neural nets seem to be so good at real-world tasks.
It is completely plausible that when neural nets get scaled up to something approaching human-brain numbers of connections they will well approximate a human brain or be a few tweaks away. Obviously it won't be knowable until state of the art gets there, but there is no reason to think human intelligence is going to be complicated. It is one evolutionary step up from some pretty basic animals.
Maybe you’re talking about a different kind of symbolic knowledge than the OP. To give an example, humans can instantly tell whether an arbitrary sentence is grammatical or not, which is a deep kind of symbolic reasoning that computers absolutely cannot do right now. And humans can also get the semantic meaning.
Then again math is hard for us. So I think there are nuances.
The fact that computers can't do sentence grammar and meaning right now doesn't tell us anything much about similarities or differences between humans and neural nets. It just tells us that training a neural net purely on a big corpus isn't enough to derive semantic meaning and makes it hard to work out grammatical meaning. No human has ever tried to do that either, everyone comes at text with some real-world experience. So we don't know how well they would do at it. Probably terribly.
It is reasonable to believe that written language would be easier to learn for a neural net trained on both images and words, so it could form visual links between words. Maybe that takes more computational grunt than we have at the moment. The failure so far proves nothing.
> instantly tell whether an arbitrary sentence is grammatical or not
You do realize we can train a neural network to perform this task? It is a binary classification problem. When I look at a grammatically incorrect sentence I don't do much symbolic reasoning - it just feels "wrong" to me. It does not match any patterns I have in my head for grammatically correct sentences. There's a lot of pattern matching in our thinking process.
What's missing in the current generation of neural networks is efficient information storage and ability to recall that information (e.g. lookup) or update it (direct write).
"You do realize we can train a neural network to perform this task"
I'm doing a master's in deep learning for NLP and I'm not sure we can. Language modelling can't do this, because grammatical yet semantically implausible combinations of words yield very high perplexity - the classic example being Noam Chomsky's "Colorless green ideas sleep furiously".
What would be a training set for this? I assume we would first try to do parsing to extract the grammatical role of each word. Then what would be the dataset? A massive attempt at generating the set of all possible trees that are grammatical?
I guess we could use massive textual datasets from reputable sources, extract their grammatical role trees, and learn from those. Generating negative examples with sufficient coverage would be very hard. Strict generative modelling without well-covered negative examples would hit the same problem as language modelling, where acceptable but unlikely examples get high perplexity despite being grammatical.
It would seem to me that, in order to generate negative examples with good coverage, you would need a man-made program with a definition of what grammaticality means - which would make the neural network useless to begin with.
Constructing a training dataset is a separate problem. You could potentially crowdsource enough negative examples. Once you have the dataset, a neural network would most likely be able to learn to classify sentences with a reasonably good accuracy.
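For what it's worth, here is a deliberately tiny sketch of that framing. The four sentences, the three hand-written features, and the least-squares "training" are all made up for illustration; real coverage would need a vastly larger (likely crowdsourced) dataset and learned features rather than hand-coded rules.

```python
import numpy as np

# Toy "grammaticality as binary classification": label 1 = grammatical.
sentences = [
    ("the cat sat on the mat", 1),
    ("she reads a book every night", 1),
    ("cat the mat on sat the", 0),
    ("book a night every reads she", 0),
]

def featurize(s):
    words = s.split()
    return np.array([
        1.0,                                      # bias term
        float(words[0] in {"the", "she", "a"}),   # plausible sentence opener
        float(words[-1] not in {"the", "a"}),     # doesn't end on a determiner
    ])

X = np.array([featurize(s) for s, _ in sentences])
y = np.array([label for _, label in sentences], dtype=float)

# Fit a linear classifier by least squares and threshold at 0.5.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
preds = (X @ w > 0.5).astype(int)
print(preds)  # matches the labels on this tiny set
```

Obviously this "works" only because the dataset is four sentences; the hard part, as the thread says, is negative examples with real coverage.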
Unlike current DL models, humans have a world model (common sense) which is formed through an ability to create/update/lookup explicit rules/facts. Once we figure out how to incorporate that into a learning algorithm and/or a model architecture, AI will become a lot smarter.
If we can train a computer to classify sentences as grammatical or not please let me know where. You’ll save the linguistics department a lot of money as they’ll no longer have to contact native speakers for this research.
Humans also require a lot of examples to learn a language - years of everyday practice for a young human. Learning algorithms are not the same, but you still need to train a large neural network - lots of neurons with lots of connections (weights) - whether it's in your head or in a datacenter.
There’s some evidence that humans have a Universal Grammar and learn through deletion. And humans can not learn any old language — only a restricted class — meanwhile there’s no reason to think that an ML model would have that problem.
I’d encourage you to read a little more about the topic with an open mind. You might learn something.
Neural nets fundamentally cannot operate the same way a brain does, because they cannot create an abstract representation of a problem, and then gradually and deliberately manipulate that mental model until they develop a solution. They just don't work that way, with current structures. They basically apply a single pass of a very complex function to the data, and spit out a result.
That isn't a problem of scale, it's a problem of architecture. This is one of the reasons Deepmind decided to tackle Starcraft. It's very difficult to solve Starcraft without your AI having some ability to develop and then manipulate a mental model of the game, because that's what you need to construct and unfold original, non-linear strategies.
Neural nets generalise because they have to approximate the data at a lower resolution, it's not that they're constrained to only learn what is useful. They're lossy compressors, but they have a unique property that most lossy compressors don't have. They cannot learn all the properties of the input data - partly because they can't hold that much information - but uniquely because neurons cannot be modified in isolation. A change in one neuron changes the influence of every other neuron in that layer, on the next layer. So it's difficult to learn granular properties of specific examples, because the entire net is affected when you do that (and many granular properties that are learned, will be unlearned in subsequent examples). The deeper the net, the less able earlier layers are to extract granular information from the input. They have to extract very abstract information, and they will gradually converge on an abstraction strategy that works.
That's why residual blocks are interesting. They pass that low-level information to later blocks (which have an easier time processing the granular details) while also leveraging the ability of earlier blocks to extract abstract information. It allows you to extract and combine information at multiple levels of granularity (or abstraction).
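To make the shortcut idea concrete, here is a minimal numpy sketch of a residual block's forward pass (generic, not taken from any particular architecture; dimensions and weight scales are made up):

```python
import numpy as np

# A residual block outputs its input plus a learned transformation,
# so low-level (granular) information flows through unchanged even
# when the transformation branch extracts only abstract features.
rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    # identity shortcut + two-layer transformation branch
    return x + W2 @ relu(W1 @ x)

d = 8
x = rng.normal(size=d)
W1 = rng.normal(size=(d, d)) * 0.01
W2 = rng.normal(size=(d, d)) * 0.01

out = residual_block(x, W1, W2)
# With small weights the block starts out close to the identity,
# one reason very deep residual nets remain trainable.
print(np.max(np.abs(out - x)))
```

Stacking such blocks lets later layers see both the raw-ish signal and whatever abstractions earlier branches added, which is the multi-granularity point above.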
Convnets are also invariant to certain transformations (e.g. translation, and to some degree scale), which I think is a better framing than "can only learn something useful." They're forced to learn information that is more general, which increases the usefulness of each bit, which means you get a higher density of usefulness per FLOP. But you also lose specific information in that process. What if location is meaningful? For example, audio spectrogram analysis can suffer from that, because specific location on the Y axis is highly meaningful.
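The translation property can be checked directly with a plain 1-D convolution (a generic numpy sketch, not any framework's conv layer): convolving a shifted signal yields the correspondingly shifted output, so a bare convolution cannot key on absolute position.

```python
import numpy as np

# Translation equivariance of convolution: shift the input by 5,
# and the output shifts by 5 (exactly, as long as the pattern stays
# away from the array boundaries).
rng = np.random.default_rng(0)
pattern = rng.normal(size=32)

base = np.zeros(64)
base[10:42] = pattern
shifted = np.zeros(64)
shifted[15:47] = pattern          # same pattern, 5 positions later

kernel = np.array([1.0, -2.0, 1.0])
out = np.convolve(base, kernel, mode="same")
out_shifted = np.convolve(shifted, kernel, mode="same")

print(np.allclose(out_shifted[5:], out[:-5]))  # True
```

This is exactly why absolute Y-axis position in a spectrogram is invisible to the convolution itself and has to be reintroduced some other way if it matters.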
NFL theorems aren’t an argument about noise, they’re an argument about the uncountability of real numbers. NFL states that, averaged over all problems, every optimization method performs equally well (or poorly) - or equivalently, _that if an optimization method does well on some problems, it must do equally poorly on some other problems_ - and those other problems aren’t necessarily noise, they could be anything. The problem is you don’t know in advance which problems it is going to do poorly on. You hope it does poorly on noise or on problems you don’t care about, but you can’t tell. That is a very different statement than what you’re saying, and it’s as non-trivial as Gödel’s and Turing’s results in decidability.
It seems like it aims at giving somebody who would like to get started doing theoretical research in the field some pointers and basic insights.
I don't think it does a particularly bad job at this, in particular given that it will be a book chapter? The target audience are probably people who have had some exposure to Functional Analysis and the likes before.
What are some subfields of mathematics that you would say are crucial for gaining a proper understanding of all the things related to deep learning (e.g. let's say the paper you linked)? Even though the theory isn't complete, I'm sure a grounding in certain fields of mathematics will be helpful.
This is always difficult to answer, and it will probably be a mixture of many, however I am currently following categorical approaches to machine learning. Category Theory is the area of mathematics that studies composable structures, i.e. like layers in a deep network. It is very abstract and was invented to solve problems in algebraic topology, but has been fruitful in other areas as well.
What you say illustrates that the situation today is "take whatever math-stuff you have, throw it against neural networks and see what you get". IE, I'm pretty sure not much progress has been made with category theory and neural networks - but you might be the first.
I've seen differential equations, Markov chains, differential geometry and other stuff. We might be in heady days before the "big breakthrough" is made. But these constructs might be inherently pathological (even then, non-pathological variants might be possible).
Dynamical systems and chaos theory (especially for neural networks), information theory (especially for the paper linked), probability theory (especially the more foundational and axiomatic work)
Agreed. Until we get to the point where there are theorems of the form, for example, "Given a problem satisfying conditions X, the optimal number of layers to minimize expected training time for data satisfying Y is Z", it is just stamp collecting.
Tangent, but has anyone taken Fast.ai or similar courses and transitioned into the Deep Learning/ML field without a MS/PhD? To be honest, I don't even know what 'doing ML/DL' looks like in practice, but I'm just curious if a lot of folks get in to the field without graduate degrees.
You can learn all you need to know in 2 to 3 university level courses. So we are talking less than a year of university courses.
Fast.ai is too high level. I don't like it. You would be better served taking actual university courses. A few days ago people linked to LeCun's university class[1]. This is a solid introduction. Does not cover everything but that is OK. Seems like it is missing Bayesian approaches.
Then if you want to specialize in vision or speech or robotics or whatever, you take special classes on that topic and learn all the SOTA techniques.
Then you are ready to do research already, or apply your knowledge to build stuff. Of course you still have to learn how to do real machine learning, which involves all the data manipulation stuff, but that is learned by doing.
Prof. Sergey Levine is REALLY good at explaining the intuitions of DL algorithms. This class also includes lectures on ML basics and very approachable assignments.
Many classes/blog posts start with describing what a neuron is - that IMHO is a super terrible way to teach a beginner.
To understand DL, one should know why we need activations (because linear models are not enough) and why we need back-propagation (because we are optimizing a loss using SGD). This class is very good at explaining those things in an intuitive way. Following along, I felt I built a pretty solid ML/DL foundation for myself.
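The "linear models are not enough" point can be checked in a few lines of numpy (a generic sketch, not tied to any course material): composing linear layers without an activation collapses to a single linear layer, so depth alone buys nothing.

```python
import numpy as np

# Two stacked linear layers are exactly equivalent to one linear layer
# with weight matrix W2 @ W1 -- this is why nonlinear activations are
# needed between layers.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(5, 4))
x = rng.normal(size=3)

two_linear_layers = W2 @ (W1 @ x)
one_linear_layer = (W2 @ W1) @ x      # the collapsed single layer

print(np.allclose(two_linear_layers, one_linear_layer))  # True
```

Insert a ReLU or tanh between the two matmuls and the equivalence breaks, which is what makes depth expressive.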
I took the fast.ai course and now I am doing a Ph.D. in Biomedical Engineering focused on applying deep learning to microscopy.
I don't think fast.ai is enough if you want to do theoretical research in deep learning, but it certainly provides enough to work on practical problems with deep learning. That said, many of us in the fastai community are able to delve deep into, understand, and implement recent deep learning papers and even develop novel techniques. So I think with a little extra studying, one could easily transition to core deep learning research.
I'm not "in the field" yet, and I didn't take any courses; I just kinda "dove in" to contributing to some open source repos, because I've been a Python dev for like 4 years now.
The pytorch codebase for, say, a transformer (a deep learning architecture which makes use of "attention") - is still not something I've yet grokked. I have however been able to pitch in with bug fixes as I continue learning and getting to that point.
This is how I would hope an entry-level position would work. At some point companies have to realize education is just a part of it and that it takes time, particularly when things change this fast. I have no real-world clue though, unfortunately.
Anyway, working on machine learning with vision is the first time I've actually felt like my work was exciting. The "result" you get is so much fun and working together with people given the proper culture is presumably a fantastic experience. I just (personally) can't get excited about using my code to write CRUD/frontend anymore. Not to imply those are the only two options; but that's been the case for me until recently.
The problem with this view is that once one gets stuck, which is very quick when one is doing the work for real, one doesn’t have any tools to debug anything except at the most basic level and most probably doesn’t understand anything intuitively enough to even reason about what the underlying problem could be.
I don’t do this work myself, but we’ve hired many interns from bootcamps to do ML, and ones from college with ML projects. The bootcamp grads with no additional background have almost universally hit hard walls once anything gets more complex than using Keras to glue together layers. It’s given me the impression, anecdotally, that bootcamps are largely predatory: they take one’s money and provide only a veneer of knowledge in the area. This doesn’t seem to apply to people with a CS or math background that took an ML bootcamp to add that dimension to their already-mathematical skillset. But people who have, again only anecdotally in my experience with an n of perhaps only 20, taken a bootcamp to reskill from a totally unrelated and perhaps qualitative field have not had success with a bootcamp alone, but have had success in doing what the above poster recommended in taking university courses in the area.
Very respectfully, if you’re in a boot camp right now, you’re unlikely deep enough into the day to day work of ML to make the assertion you’re making.
I think it depends! If you want to zoom out and take the "systems view" using standard components, then you probably don't need much math. If you want to develop new architectures or algorithms, then you definitely will. The well-trodden paths of ML might have most of their math abstracted away, but in my experience every time you get close to the frontiers, people are using math to understand what's going on or develop new approaches.
It also doesn't really work if you have to tackle a new problem.
I stopped studying maths well before university. I am not some kind of math super genius. But working on my own stuff, which did involve new problems, I was up the creek fairly quickly without a solid mathematical understanding of the techniques I was trying to use.
I don't think the bar is particularly high here. Solid understanding of stats, ESL...but I have seen people shotgunning models (I did this years ago too), and that isn't going to work very long.
Also, I don't really understand why you wouldn't study some of this stuff. Maths as taught in schools treats you like a meat calculator...that isn't fun. But if you are interested in ML, going through Stats, Linear Algebra...it is pretty interesting because there are so many clear connections with your work.
One example I can come up with now - image classification / segmentation / regression problems.
Unfortunately, not all data is available or provided in a data "friendly" format - sometimes all you get are image files, and similar. Maybe you want to read some value off these images, count objects, or whatever - which traditionally has been done by trained/skilled workers.
With CNNs, it _can_ be a trivial task to implement models that solve the above problems. That's time and money saved for a business.
ML PhD because it does honestly open more doors to top research groups
Correction: not ML PhD by itself - publications in top conferences open doors. Looking at the acceptance rates, I'm guessing most people with ML PhDs don't have such publications.
While I can't give the exact prerequisites, I know that all of the things that appear in the paper relate to:
(1) Linear Algebra
(2) Optimization Theory (Convex Analysis, non-convex optimization) [0], [2]
(3) Probability Theory and Statistics (Measure Theory, Multivariate Statistics) [1], [3], [4], [5]
(4) Analysis, to a lesser extent. (2) and (3) are the most important.
I would give more references, but my background is too theoretical (my field is Numerical Analysis of PDE). Having taken three or four college classes on each of (1)-(4), a person with a similar background can recognize the tools without much digging. Maybe some folks here can provide insights into books that center on applications; I'm trying not to diverge into too much theory (i.e. for measures, [4] instead of Folland). There also seems to be good use of Analysis techniques in the paper, see Theorem 2.1.
I love that the paper references the Moore-Penrose pseudo-inverse, an object of study in both statistics and optimization for which I had to give a lecture for a course.
The term measurable is referring to "measurable functions" in measure theory, which correspond to functions verifying that the pre-image of any measurable set belonging to the sigma-algebra of the codomain belongs to the sigma-algebra of the domain (https://en.wikipedia.org/wiki/Measurable_function). I do not know how to state it in simpler terms, sorry. When the measure of the domain is 1 (as in a probability space), we call measurable functions random variables, hence their relevance to this topic.
Now, tempered distributions are continuous linear functionals that assign a complex number to each very rapidly decaying smooth function (a Schwartz-space function). So a tempered distribution is a map that takes functions to complex numbers. https://secure.math.ubc.ca/~feldman/m321/distributions.pdf
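For reference, the standard definition being paraphrased here, in LaTeX (nothing specific to the paper under discussion):

```latex
% The Schwartz space of rapidly decaying smooth functions:
\[
  \mathcal{S}(\mathbb{R}^n)
    = \Bigl\{ f \in C^\infty(\mathbb{R}^n) \;:\;
      \sup_{x \in \mathbb{R}^n} \bigl| x^\alpha \, \partial^\beta f(x) \bigr|
      < \infty
      \ \text{for all multi-indices } \alpha, \beta \Bigr\}.
\]
% A tempered distribution is a continuous linear functional on it:
\[
  T \in \mathcal{S}'(\mathbb{R}^n)
  \quad\iff\quad
  T : \mathcal{S}(\mathbb{R}^n) \to \mathbb{C}
  \ \text{is linear and continuous.}
\]
% Example: the Dirac delta, \( \delta(f) = f(0) \), is a tempered
% distribution that is not a classical function.
```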
Are they talking about the Borel sigma-algebra generated by the open sets of a topological space? What topology is in their mind?
Are tempered distributions functions? How does one compose two tempered distributions? (Hint: you can't, and they never actually use tempered distributions.)
This is just mathematical masturbation.
Everything is finite when implemented on a computer, so there is no need for such dainty mathematical niceties unless you are trying to lend credence to pedestrian observations about Chebyshev's inequality.
If not further specified, the topology is induced by the metric or norm of the space to be considered.
Tempered distributions are used in Subsection 3.1, resulting from the observation that the Fourier transform of a shallow neural network involves a Dirac delta.
Some mathematical concepts are needed in order to present rigorous results. While one can argue about the necessity and relevance of these results for real-world applications, they at least explain various aspects of deep learning in restricted settings, leading to a better general understanding and intuition.
Nyet, and nyet. This is why conscientious authors define the terms they use. A tempered distribution is a linear functional on a space of differentiable functions, for example, D_x(f) = f'(x), the derivative of f at x. This is why tempered distributions cannot be composed.
Mostly analysis. If you understand section 1 notations, you are obviously set. But even if you don't you should still be able to get the ideas with a bit of mental translation. In a word the notation seemed unnecessarily heavy for the level of discussion.
Deep learning papers often use math in a way that obscures rather than enlightens. And when you finally understand what they are saying, you realize it's not interesting at all, or they made a mistake in the math.
I would recommend a solid background in linear algebra, probability theory, and analysis. Moreover, for some sections, it is helpful to have experience with functional analysis, optimization, and statistical learning theory.
Looks like a little bit of everything except the likes of abstract algebra, logic, category theory.
These include linear algebra, graph theory, probability, algorithms, mathematical analysis, topology, differential geometry. But the most important prereqs are math maturity and mental toughness/endurance.
After skimming through the paper it's clear that the title should be read as "The Modern (Mathematics of Deep Learning)" and not my original parse which was "The (Modern Mathematics) of Deep Learning." Very different things.
I believe that the curse of dimensionality doesn't apply here, as we are optimizing the "universal approximator" of the "surface" of the possible real-world function.
> Kernel methods owe their name to the use of kernel functions, which enable them to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space.
- https://en.wikipedia.org/wiki/Kernel_method
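The quoted idea is easy to verify in a toy case (a generic sketch, not from the linked article). For the polynomial kernel k(x, y) = (x·y)^2 on 2-D inputs, the implicit feature map is phi(x) = (x1^2, sqrt(2)·x1·x2, x2^2), and the kernel value equals the inner product of the mapped points:

```python
import numpy as np

# Kernel trick in miniature: the implicit inner product in the
# 3-D feature space is computed from the 2-D inputs alone.
def phi(v):
    # explicit feature map for k(x, y) = (x . y)^2 in 2-D
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

kernel_value = (x @ y) ** 2          # implicit: never builds phi
explicit_value = phi(x) @ phi(y)     # explicit: maps, then inner product

print(kernel_value, explicit_value)  # both 1.0
```

A real kernel method (SVM, kernel ridge regression) only ever evaluates `kernel_value`-style expressions between pairs of data points, which is why the feature space can be high- or even infinite-dimensional.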