The author probably shouldn't have listed those as requirements; they're not really needed to understand the article.
Classification (in the context of the tutorial) is determining whether the data corresponds to 'male' or 'female' (where 'male' and 'female' are the 'class labels').
The purpose of 'Regularization' is to coax the network away from simply fitting an exact match to the training data (which wouldn't be useful, because it would do poorly on any new data). In the context of the tutorial, they add the L1 norm of the weights to the 'penalty function', so the network is penalized for larger weights.
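If it helps to see that concretely, here's a minimal sketch in plain numpy (the variable names are mine, not the tutorial's) of how an L1 penalty gets folded into the loss:

    import numpy as np

    def loss_with_l1(predictions, targets, weights, lam=0.01):
        # Ordinary mean squared error on the training data...
        mse = np.mean((predictions - targets) ** 2)
        # ...plus a term that grows with the magnitude of the weights,
        # nudging the fit away from memorizing the training set exactly.
        l1_penalty = lam * np.sum(np.abs(weights))
        return mse + l1_penalty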
The pictures do a pretty good job of explaining everything else.
I really enjoyed the books "Learning From Data"[1] and "Programming Collective Intelligence"[2].
Both are accessible to beginners.
Learning From Data gives a more theoretical introduction to machine learning. One of the central ideas from the book that I still think about often is that machine learning is merely function approximation. There exists a function which will drive a car perfectly, but we don't know what that function is, so we try to approximate that function with machine learning.
Programming Collective Intelligence is a more hands-on introduction to machine learning. The book has examples in Python, but I believe the Python code is low quality. Ignoring the example code (and I did ignore it), the book is a very enjoyable introduction to many different machine learning algorithms. If you don't know the difference between linear regression, nearest-neighbors clustering, support vector machines, and neural networks, this book will explain how each of these works and give you a good intuition about when to use each.
I am with you, but then I thought, "why should I expect to learn something about this advanced topic when my skills are not at that level?" So I don't blame the author; this post is meant for people who have some pre-existing expertise.
I have some experience in ML and I'm still with you in the "where can I find training wheels?" department.
Sort of a high-level overview (some of this is particular to neural nets as they're implemented in JMP, a product made by SAS, so I don't know how well these details generalize):
• Neural nets consist of an input layer, an output layer, and some number of hidden layers between the input and output layers
• The hidden layers consist of a number of nodes, which can be thought of as “on-off” switches (called a “step function” if we’re using the proper term), although in practice they’re represented by a smooth sigmoid curve with an upper and lower bound that represent on and off. In JMP, this takes the form of a hyperbolic tangent function, which is bounded between -1 and 1.
• Each node has a value called the bias, which, if you’re thinking of the node as an on-off switch, is the threshold a linear combination of the inputs needs to reach to turn the node on. In the sigmoid curves output by JMP, the bias is the point at which the hyperbolic tangent function returns 0, given a linear combination of the inputs.
• A “deep learning” model is one that has more than one hidden layer; the output of the lowest hidden layer serves as the input to the next hidden layer, and so on.
• The output layer takes the nodes of the highest hidden layer and produces an output. When dealing with probability, the output layer can be thought of as behaving just like a hidden layer: a number of nodes that takes its inputs and produces a sigmoid curve. In JMP, the output layer is represented by the logistic sigmoid function, which is bounded between 0 and 1. There is an equivalent logistic sigmoid function for every hyperbolic tangent function, so despite using different formulas, they behave the same way in practice.
• Rather than fitting to an entire data set, neural nets are “trained” one data point at a time: an initial set of weights is chosen (randomly, I believe), and then for each data point in the training data, the error is “backpropagated” through the weights on the various inputs, adjusting them. There’s also a “training weight” (essentially a learning rate), so the weights only move about 10% of the amount needed to reduce the error to zero on that point. A rough sketch of one such update is below.
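To make the bullets above concrete, here's a bare-bones numpy sketch of the same pieces: tanh hidden nodes with biases, a logistic output, random initial weights, and a single-point backpropagation step scaled by a learning rate playing the role of the “training weight.” This is just the generic textbook version, not how JMP implements it internally (I haven't seen their code), and all the names are made up:

    import numpy as np

    rng = np.random.default_rng(0)

    n_inputs, n_hidden = 3, 4
    W_hidden = rng.normal(size=(n_hidden, n_inputs))  # initial weights are random
    b_hidden = np.zeros(n_hidden)                     # one bias per hidden node
    W_out = rng.normal(size=n_hidden)
    b_out = 0.0

    def logistic(z):
        # Bounded between 0 and 1; note logistic(z) == (np.tanh(z / 2) + 1) / 2,
        # which is why tanh hidden nodes and a logistic output "behave the same
        # way in practice".
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x):
        h = np.tanh(W_hidden @ x + b_hidden)  # hidden nodes: smooth "on-off" switches, -1 to 1
        y = logistic(W_out @ h + b_out)       # output: a probability between 0 and 1
        return h, y

    def train_step(x, target, lr=0.1):
        # One backpropagation update for a single data point. lr plays the role
        # of the "training weight": the weights move only a fraction of the way
        # toward zero error on this one example.
        global W_hidden, b_hidden, W_out, b_out
        h, y = forward(x)
        err = y - target
        grad_out = err * y * (1 - y)                   # error pushed back through the output node
        grad_hidden = grad_out * W_out * (1 - h ** 2)  # ...and back through the tanh nodes
        W_out -= lr * grad_out * h
        b_out -= lr * grad_out
        W_hidden -= lr * np.outer(grad_hidden, x)
        b_hidden -= lr * grad_hidden

    # e.g. train_step(np.array([0.2, -1.0, 0.5]), target=1.0)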