The swiss roll problem also illustrates nicely the idea behind deep learning.
Before deep learning people would manually design all these extra features sin(x_1), x_1^2, etc. because they thought it was necessary to fit this swiss roll dataset.
So they would use a shallow network with all these features like this: http://imgur.com/H1cvt8d
Then the deep learning guys realized that you don't have to engineer all these extra features, you can just use basic features x_1, x_2 and let the network learn more complicated transformations in subsequent layers.
So they would use a deep network with only x_1, x_2 as inputs:
http://imgur.com/XBRjROP
Both these approaches work here (loss < 0.01). The difference is that for the first one you have to manually choose the extra features sin(x_1), x_1^2, ... for each problem. And the more complicated the problem the harder it is to design good features. People in the computer vision community spent years and years trying to design good features for e.g. object recognition. But finally some people realized that deep networks could learn these features themselves. And that's the main idea in deep learning.
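To make the contrast concrete, here is a minimal Python sketch of the two kinds of input. The function names are made up for illustration, and the seven-feature list is an assumption modeled on the features mentioned above (sin(x_1), x_1^2, etc.), not a definitive list.

```python
import math

def raw_features(x1, x2):
    """Deep-network style input: only the basic coordinates."""
    return [x1, x2]

def engineered_features(x1, x2):
    """Shallow-network style input: hand-designed transformations.

    The exact list is an illustrative assumption.
    """
    return [x1, x2, x1 ** 2, x2 ** 2, x1 * x2, math.sin(x1), math.sin(x2)]

point = (0.5, -1.0)
print(len(raw_features(*point)))         # number of raw inputs
print(len(engineered_features(*point)))  # number of hand-built inputs
```

With the first function the network has to discover any useful transformation itself; with the second, a person has to guess the right list for every new problem.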
I think I learned more from your post and your two imgur links than from poking at the site for an hour. Thanks.
Would it make sense for them to add a gallery of good solutions for each problem, or would they all basically be your second example network (no time to play and see for myself right now)?
>Before deep learning people would manually design all these extra features sin(x_1), x_1^2, etc.
It's probably worth pointing out that this is true for ANNs, but there were (and are) other "shallow" classifiers that can handle the swiss roll problem without manual feature encoding. SVMs, for example.
Just so I understand correctly: your network has 100000 iterations, while the parent's has 1000, but they both only use x / y positions?
It feels like neurons in the first layer are weaker, because all they can do is a linear separation. Given deep networks, I was wondering if adding neurons to the first layer was better than adding them to the last one, and empirically, it feels like it is quite worse. I wonder if there is a theorem around that.
> your network has 100000 iterations, while the parent's has 1000, but they both only use x / y positions
Correct, but keep in mind that their method appears to use batch descent while mine does not. Batch descent often converges more quickly. There are other differences between my net and the GP's that I can spot as well (e.g., the activation function, the learning rate, and regularization).
Also keep in mind that I threw this together over breakfast, and did not spend much time tweaking parameters :)
What I found interesting is that I couldn't get a proper fit with the same parameters you showed... however, I could 'speed up' the learning by regenerating the data during the learning process.
It may just be that 'batched cumulative learning' (I don't know if there is already a term for this) gets a better fit than just learning from a smaller set of data.
Edit: Did a quick test, regenerating about every 50 and 100 iterations, and convergence does seem faster (at least, when a clear spiral is formed). https://imgur.com/a/OPjXb
Regenerating the data is kind of cheating; it is as if you were given twice the amount of data.
In a normal situation, you obtain a list of input / output (say, images as input, a digit as output, for learning handwritten digits). You separate it between training data (which actually improves the net) and testing data (to detect overfitting), and you don't get more data than that.
Here, you can generate more data for free, as we have the function we want to approximate. Having more data will often result in a better result and faster convergence.
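A rough sketch of the difference between a fixed dataset and "regenerating" it, in Python. The spiral generator below is a hypothetical stand-in (the constants 5.0 and 1.75 turns are assumptions, and this is not claimed to be the site's own generator); the point is only that every call with a new random seed hands you fresh samples.

```python
import math
import random

def make_spiral(n_per_class, noise, rng):
    """Hypothetical two-arm spiral: returns a list of (x1, x2, label)."""
    points = []
    for label, offset in ((1, 0.0), (-1, math.pi)):
        for i in range(n_per_class):
            t = i / n_per_class                      # position along the arm
            r = 5.0 * t                              # radius grows along the arm
            angle = 1.75 * t * 2 * math.pi + offset  # second arm is rotated half a turn
            x1 = r * math.sin(angle) + rng.uniform(-1, 1) * noise
            x2 = r * math.cos(angle) + rng.uniform(-1, 1) * noise
            points.append((x1, x2, label))
    return points

def train_test_split(points, train_fraction, rng):
    """Shuffle a copy, then cut it into training and testing parts."""
    shuffled = points[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# The normal situation: one fixed dataset, split once.
fixed = make_spiral(100, noise=0.1, rng=random.Random(1))
train, test = train_test_split(fixed, 0.5, random.Random(2))

# "Regenerating": a fresh noisy sample of the same function, i.e. extra data for free.
fresh = make_spiral(100, noise=0.1, rng=random.Random(3))
print(len(train), len(test), len(fresh))
```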
I started reading about ANNs in the 1980s, and had similar confusion to those here, since it was just for fun. I suggest reading a basic book or online information that goes over the basics [1]. I struggled through $200 text books, and jumped from one to the other as an autodidact. I am now studying TWEANNs (Topology and Weight Evolving Artificial Neural Networks), which basically are what you see here with the exception that they are able to not only change their weights, but also their topology, that is how many and where the neurons and layers are. ANNs (Artificial Neural Networks - as opposed to biological ones) can be a lot of fun, and are very relevant to machine learning and big data nowadays. It was exploratory for me. I used them for generative art and music programs. Be careful: soon you'll be reading about genetic algorithms, genetic programming [2], and artificial life ;) Genetic Programming can be used to evolve neural networks as well as generate computer programs to solve a problem in a specified domain. Hint: You'll probably want to use Lisp/Scheme for genetic programming!
As far as the recent deep learning boom is concerned, genetic programming is really out of favor. I don't really see it in any of the deep learning (or even machine learning, for that matter) literature/successes/research groups.
"Neural networks" are a really really overloaded term. A ton of stuff referred to as "neural networks" has little to do with the "neural networks" that are used in the machine learning community.
You're spot on about genetic programming. I am a self-taught person who plays with anything that strikes my fancy; I learn by playing. I read all three volumes of the Artificial Life series from the Santa Fe Institute at the time (now there are more), and went in many directions in the 1990s: Fuzzy Logic, Expert Systems, ANNs, Evolutionary Computation (GA (Genetic Algorithms) and GP (Genetic Programming)), and AL (Artificial Life), all fascinating. I found, and still find, genetic programming attractive even if it has not found its niche in the ML community. I think the CI (Computational Intelligence) community at large will eventually develop well-fitted uses for it. I was trying to use an FPGA and Koza's modified GP code to have the FPGA re-program itself as a GP evolved a better program than I originally wrote to kickstart it. I didn't get too far. This was 1996-97 though. I was pretty much on my own then, with not really much of an Internet to find information, especially esoteric information, or cheap many-gated FPGAs.
Outside of ML, GP has found moderate success. One example is this paper (sorry behind paywall, so only the paper title here), that started with using expert data, tried ANNs, then ANNs and statistics, until it used a GP approach:
"A Computational Intelligence-Based Genetic Programming Approach for the Simulation of Soil Water Retention Curves"
I also use the term ANNs over just NNs to keep it to the silicon, and not wetware ;) Although, they did hook up a small ANN to a cockroach once, IIRC...
It has its niche applications. The only non machine vision application that comes to mind is one[1] that takes a pile of data, and evolves a model that fits it.
Generally, where it's actually being used, they are a bit quiet on how they go about getting the results they do. While the genetic bit is easy, the secret sauce is in guiding the learning/evolution so that it works for the particular problem domain.
Yes, it is a shame people don't share their advances in science and technology for fear of losing market share usually. Sharing grows the market, and then there's more pie for everyone, and more work gets done to advance the field.
Still, neither that point nor how successful GP is in the ML community measures its current or future potential. The book I am working my way through now, in LFE (Lisp Flavored Erlang vs. Erlang, or Elixir), is "The Handbook of Neuroevolution Through Erlang" by Gene Sher [1]
Gene covers a lot of ground. Somebody has done some transliteration to Elixir too; I use LFE, since staying with Lisp bridges the gap between my GP work and what Gene has done with Erlang, ANNs, and EC. For GP, you really need to be able to create new forms with macros, which is more in line with how GP works. To quote an excerpt from Robert Virding, co-designer of Erlang and creator of LFE, addressing Elixir's macros or messing with Erlang's modules vs. LFE's or Lisp's macros on HN before:
"There is syntactic support for making the function calls look less like function calls but the macros you define are basically function calls.
In Lisp you are free to create completely new syntactic forms. Whether this is a feature of the homoiconicity of Lisp or of Lisp itself is another question as the Lisp syntax is very simple and everything basically has the same structure anyway. Some people say Lisp has no syntax." [2]
GSD, also known in the literature as "Graduate Student Descent."
I'm not even joking. Trial and error. Having good "intuition" about past ideas and the basic building blocks to guide that trial and error. Reading research papers, seeing what other people did well with, and using that.
As an aside, this is the principal reason I am skeptical of grandiose claims about deep learning.
Regularisation methods like dropout are often good enough that you can build a network with too many parameters (for the amount of data you have) and rely upon the regularisation to find the subset of that network that is actually useful. People have recently got good results from also randomly dropping weights, or even whole layers.
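For anyone who hasn't seen dropout written out, a minimal Python sketch follows. It uses the common "inverted dropout" variant, which is an assumption on my part since the comment above doesn't say which form it means; the function name and arguments are made up for illustration.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob and
    rescale the survivors so the expected activation stays the same."""
    if not training or drop_prob == 0.0:
        return list(activations)        # no units are dropped at test time
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10
print(dropout(hidden, 0.5, rng))                  # a random mix of 0.0 and 2.0
print(dropout(hidden, 0.5, rng, training=False))  # unchanged
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single unit, which is what lets an over-sized network get away with too many parameters.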
Probably also through some grid search.
I've read (but don't remember where) that random search gives very good results, even better than grid search (and in less time).
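The two search procedures can be sketched side by side in Python. Everything here is hypothetical: `validation_loss` is a made-up stand-in for "train a network and measure its loss", and the hyperparameter ranges are assumptions chosen only for the demo.

```python
import itertools
import random

def validation_loss(learning_rate, reg):
    """Made-up stand-in for training a network and measuring its loss."""
    return (learning_rate - 0.03) ** 2 + 0.1 * (reg - 0.001) ** 2

def grid_search(learning_rates, regs):
    """Try every combination of the listed values."""
    return min(itertools.product(learning_rates, regs),
               key=lambda p: validation_loss(*p))

def random_search(n_trials, rng):
    """Sample each hyperparameter independently on a log scale (1e-4 to 1)."""
    trials = [(10 ** rng.uniform(-4, 0), 10 ** rng.uniform(-4, 0))
              for _ in range(n_trials)]
    return min(trials, key=lambda p: validation_loss(*p))

grid_best = grid_search([0.001, 0.01, 0.1], [0.001, 0.01, 0.1])   # 9 trials
random_best = random_search(9, random.Random(0))                  # 9 trials
print(grid_best, random_best)
```

With the same budget of 9 trials, the grid only ever tries 3 distinct learning rates, while random search tries 9. That is the usual intuition for why random search can do well when only one or two hyperparameters really matter; this toy objective is not evidence either way.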
Any thoughts on why genetic programming is not 'in fashion'? Does it have anything to do with complexity of the calculations?
I can imagine that the advanced models use many, many machines and only deliver results after a large training time. Genetic programming is not feasible then, if you cannot get a quick grasp of the potential results of a model.
At least for deep learning, most deep learning models take more than a week to train, often on multiple GPUs. Some of the extremely deep, huge dataset models can take multiple weeks on multiple GPUs. Google trained AlphaGo's nets for months (on god knows how many GPU/CPUs). Suffice to say, people don't even bother touching most hyperparameters, let alone trying to do something more exhaustive.
If your program is a neural network with N parameters, or a program tree with N nodes, then testing against data takes O(N) time. With evolutionary computation, what you get for your trouble is a single real number -- the loss: how bad it did. With neural networks, backpropagation gives you N real numbers: the gradient of loss with respect to each parameter.
Put another way: with evolution you have to stumble around blindly in parameter space and rely on selection to keep you moving in the right direction. With the gradient descent that neural networks use, you get, essentially for free, knowledge of the (locally) best direction to move in parameter space.
The bigger the models, the more this matters. Modern neural networks have millions or even billions of parameters, and that's been crucial to their expressive power. Good luck learning a program tree with a billion nodes using evolution. It might take 4.54 billion years.
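Here is a toy Python sketch of that difference. It assumes a simple quadratic loss and a bare-bones mutate-and-select loop standing in for evolutionary computation in general; real GP or GA systems are more elaborate, and all names and constants are made up for illustration.

```python
import random

def loss(params, target):
    """Evaluation gives back a single real number."""
    return sum((p - t) ** 2 for p, t in zip(params, target))

def gradient(params, target):
    """Differentiation gives back one number per parameter."""
    return [2 * (p - t) for p, t in zip(params, target)]

def gradient_step(params, target, lr=0.1):
    """Move every parameter in its locally best direction."""
    return [p - lr * g for p, g in zip(params, gradient(params, target))]

def evolution_step(params, target, rng, sigma=0.1):
    """Mutate blindly; keep the child only if its loss is lower."""
    child = [p + rng.gauss(0, sigma) for p in params]
    return child if loss(child, target) < loss(params, target) else params

rng = random.Random(0)
n = 100
target = [1.0] * n
by_gradient = [0.0] * n
by_evolution = [0.0] * n
for _ in range(50):
    by_gradient = gradient_step(by_gradient, target)
    by_evolution = evolution_step(by_evolution, target, rng)
print(loss(by_gradient, target), loss(by_evolution, target))
```

Both loops spend the same number of evaluations, but the gradient loop is told how to change all 100 parameters at once, while the mutation loop only learns whether one random guess was better or worse.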
And then only if you have a system powerful enough to accurately simulate a planet full of molecules.
Although I do think there is a balance between GA and structured NN which will lead to faster and better results than the deep NN alone. We already see some of the best deep NNs incorporating specific structures.
I think neural networks and other forms of evolutionary computation will merge as I have been writing in my other replies in this thread. TWEANNs incorporate EC into evolving ANNs. The other article I cited above on soil mechanics, beat out expert systems, ANNs, statistics, and used GP. MEP, or Multi-Expression Programming for GP incorporates being able to put more than one solution into a gene without increasing the processing times thereby overcoming the inefficiencies of 1990s-era GP. Here is a recent article using it that is not behind a paywall or via sci-hub.io [1]. It needs better editing, but there are other references if you search for Multi-expression Genetic Programming.
First, the right tool for the job. ANNs are able to be a general function approximator with sufficient training to be a cost-effective choice to implement. Second, ANNs have been around about 35 years longer than GP.
The TWEANNs I am studying, and that I already mentioned in a previous reply in this thread, hybridize ANNs and EC (GAs and GP), so if you include Neural Networks that utilize Evolutionary Computation techniques to modify weights or topology, then GP is being used to an extent.
Replication as a variable in EC is the key force in biology, and I only see more use of EC techniques to enhance the general function approximators that are ANNs. Further, there are also hybridized computing machines that have been made, and are being made with FPGAs and GPUs. Finance and supercomputing are just two areas that are looking to utilize them. In some, the FPGAs are simply there for updating special computation programs that feed the GPUs. There is some research with a GP optimizer updating the FPGAs and then using the GPUs for the massive parallelization of the computations.
Evolutionary algorithms and genetic programming are global optimization techniques, basically random search with some memory. They're not "out of fashion" any more than simulated annealing or Monte Carlo methods are. They have limited usability, that's all.
>Topology and Weight Evolving Artificial Neural Networks
I brainstormed for a while about using genetic algorithms to decide the network topology. I'm glad someone else invented that already! Less work for me to do now.
Okay, that is straight up awesome. I've been toying with neural networks just enough to get a basic understanding of what they are and how they work, and it occurred to me that something like this might be possible.
Of course, I wasn't up-to-speed enough to know the right terms to look for, so thanks for sharing. :)
I am curious though... it seems like it would take orders of magnitude more computing power to not only train but evolve and re-train the networks. Is this practical with today's hardware?
When it says "right here in your browser," it's not joking. On my desktop (Safari), the window becomes unresponsive after a few iterations. Does not happen in Chrome.
On my phone (Safari/iOS 9.3), the default neural network doesn't converge at all even after 300 iterations, while it does on the desktop, which is legit weird: https://i.imgur.com/KNaXeHH.png
I'm sorry you're having problems with Safari. I can't reproduce on my end, but if you're still having problems you can raise an issue on github with some information about your system.
This demonstration goes really well with Michael Nielsen's http://neuralnetworksanddeeplearning.com/. At the bottom of the page the author gives a shout out to Nielsen, Bengio, and others.
For someone (like me) who's done a bit of reading but not much implementation, this playground is fantastic!
Neat stuff, fun to play with. I wasn't able to get a net to classify the swiss roll. Last time I was playing around with this stuff I found the single biggest factor in the success was the optimizer used. Is this just using a simple gradient descent? I would like to see a drop down for different optimizers.
This actually makes the dataset harder to fit to. It is not the same thing here as the "training with noise" method where random noise would be added to each batch, as an alternative means of Tikhonov regularization.
With that particular data set, it looks like it really just adds more data and, more importantly, fills in the gaps along the spirals, which is where my setup was having trouble.
The noise doesn't go far enough to start confusing points between different clusters, but it adds more points.
That said, my knowledge of neural nets is fairly limited.
Using the defaults, I had success at about 300 iterations with all the inputs and 5 hidden layers, each with a decreasing number of neurons (i.e. 6,5,4,3,2).
I don't know if that's a general feature to need fewer neurons with each layer, but that seems to work here.
What were the optimization algorithms you had most success with? Were they more successful in the sense of better out-of-sample error rate or in the sense of quicker convergence (or something else)?
Hopefully this helps (correct me if I'm wrong, I'm still learning about neural nets):
Think of the whole neural net as a function:
input * weight = output
At each iteration, we feed in the input to the neural net. Then the neural net compares what output it gets to the correct output.
For example, input1 is 5, and the correct output for input1 should have been 2. But the neural net got 3 as the output. So it then decreases the weights slightly so it would get 2.75 next time it has input of 5. Repeat thousands of times. That's the basic idea for machine learning and neural networks.
The algorithm it uses to figure out how much to decrease the weights is called "backpropagation", which uses gradient descent. To explain gradient descent, think of it as a roller coaster track. Imagine the roller coaster starts off at a random location on the track. Then gravity takes the roller coaster down the track until it ends up at a low point between two hills and stays there. This is the new location of the roller coaster. This new location is nice because it has the lowest energy the roller coaster could find, so it stays there. (We use derivatives to figure out the slope of a curve, which then gives us the direction in which the curve goes downhill.)
In neural networks, the roller coaster curve is the "cost function", which basically calculates the amount of difference between the neural net's output and the actual correct output it should have got. The initial weight is the roller coaster's initial position. The new weight is the roller coaster's final position, at the bottom of the cost function curve. This new position thus gives us the lowest cost.
Note that there may be even lower valleys, but when we roll the rollercoaster it stops at its nearest low valley. This is why we randomize the weights at the beginning - to put the roller coaster near possibly even lower valleys.
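The single-weight example above can be played out in a few lines of Python. This is only a sketch: the squared-error cost and the learning rate of 0.005 are my assumptions (the rate was picked so that one step lands on the 2.75 mentioned above), and a real network has many weights rather than one.

```python
def predict(weight, x):
    return weight * x                            # "input * weight = output"

def cost(weight, x, target):
    return (predict(weight, x) - target) ** 2    # squared error (an assumption)

def cost_slope(weight, x, target):
    """Derivative of the cost with respect to the weight: the track's slope."""
    return 2 * (predict(weight, x) - target) * x

x, target = 5.0, 2.0
weight = 0.6               # gives an output of 3, as in the example above
learning_rate = 0.005      # hypothetical value, chosen to reproduce 2.75

# One step downhill: the output moves from 3 to about 2.75.
weight -= learning_rate * cost_slope(weight, x, target)
print(predict(weight, x))

# "Repeat thousands of times": the output approaches the correct value of 2.
for _ in range(1000):
    weight -= learning_rate * cost_slope(weight, x, target)
print(predict(weight, x))
```

This particular cost curve has a single valley, so the starting weight doesn't matter; the point about random starting positions only shows up once the curve has several valleys.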
Okay, so it works by minimizing (equiv. maximizing) some function. But that doesn't say much about how it "learns" the gradient. What function does it care about? Average squared error (predict_prob-Z_i)^2 ? Average absolute error? The likelihood function of some assumed distribution? Maximum distance between the classification border and closest observed points? If I saw someone carrying a bag full of blueberries and some bread home from the grocery store and asked to know how they chose to buy that, to which they replied "I had a list of characteristics which I thought were important for groceries to have in this trip to the store. For each grocery item, I recorded a vector of degrees to which the item possesses each of those characteristics. Finally, I chose the group of groceries that had the best combination of degree vectors", I still wouldn't really know anything about why they bought the blueberries and bread.
The function it minimizes is called the "loss function", and its value for the training and test sets are shown in the upper right area. AFAICT the site doesn't say how it's computed, but I think it's average squared error. The gradient is not learned; if you think of the loss function as a real-valued function of the weights, the gradient is just the partial derivatives with respect to the weights.
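Written out, and assuming the loss really is average squared error (the site doesn't confirm this), it would look something like the following. The symbols are my own notation: f is the network's output, y_i the correct label for point x_i, w the vector of m weights, and \eta the learning rate.

```latex
L(w) = \frac{1}{n} \sum_{i=1}^{n} \bigl( f(x_i; w) - y_i \bigr)^2,
\qquad
\nabla L(w) = \left( \frac{\partial L}{\partial w_1}, \dots, \frac{\partial L}{\partial w_m} \right),
\qquad
w \leftarrow w - \eta \, \nabla L(w)
```

So nothing about the gradient is learned: it is computed from the current weights, and only the weights change from one iteration to the next.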
It's training a neural network to classify a data set with two classes (orange or blue) and the data has two features (x1 or x2). All the orange and blue dots are the training data. So if you take a dot on the graph with coordinates (-2, 4) and it's blue, that would mean that a data point with x1 = -2 and x2 = 4 has the class blue.
You can think of a neural network as a function that can take in arbitrary features (in this case x1 and x2) and tries to output the correct class. That's what the orange and blue colors in the background are, the neural network's guess at the correct classification for any given point (x1, x2).
When you hit play, it iterates through the training data making adjustments to each neuron in the network so that it gets closer to predicting the right class.
If you want to see how well the neural network performs on data it wasn't trained on, you can click "show test data".
Yeah, I feel like we need some decent understanding of neural networks to have more context on this. It's kind of like being given a specialized shovel but not knowing why you need it or why you should dig holes.
I think it's a playground in the best sense of the term. Take some time and actually play with it, and a lot of fun stuff happens, lightbulbs go off, etc.
If you're expecting a lesson, you'll likely be disappointed, but I think there's real value in a true playground.
I think the biggest improvement would be if, when hovering over a 'neuron', you get a visual representation of what feeds into it.
For me in Chrome on OSX you do get a visual representation of the neuron's input when hovering. It shows up behind the data points in place of the neurons' output when hovering.
It begins training the network using the backpropagation algorithm.
> Next, the network is asked to solve a problem, which it attempts to do over and over, each time strengthening the connections that lead to success and diminishing those that lead to failure.
On each iteration, it calculates how bad the predicted output is, then adjusts the weights between neurons to lessen that value. Google backpropagation for more info
Or perhaps explain how all the different inputs influence the result? I more or less get that it's just iterating over the data to approximate the given data set when you press play but I have no idea how giving it more or less neurons changes that, to name an example.
Basically each input gets multiplied by some weight that gets adjusted through each iteration. The product of the input and weight gets put through an activation function, and the outcome of that can be interpreted as the network's prediction of the class.
So you see the first neuron's input is just x1. You can see in the little graph at x1 that it's split down the middle with orange on one side and blue on the other. You can think of adjusting the weight on that neuron as adjusting where along the x axis the split occurs. All points on the orange side are classified orange and all on the blue side are classified blue. If you picked a data set like the spiral one or whatever, that neuron alone isn't going to make very many correct classifications. That's because it only gets the x1 value as input and can only affect the output by multiplying x1 by some weight, which would only have the effect of shifting the classification boundary left or right. You can see the same thing happening for the second neuron with input x2 except that now it splits along the y axis. Again that alone isn't going to match the data very well.
But then you get to the second layer, and the input of each neuron in the second layer is the output of each neuron in the first layer. So these neurons are able to take into consideration both x1 and x2 and are able to divide the data in more complex ways. So you can think of the neurons in each layer of the neural network as being able to consider more and more complex properties of the data in forming its output.
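The flow described above (multiply by weights, apply an activation, feed the result to the next layer) can be sketched in Python like this. The tanh activation and all the weights are hand-picked assumptions, purely to show the data flow; a trained network would have learned its own values.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs, passed through a tanh activation."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return math.tanh(total)

def layer(inputs, weight_rows, biases):
    """Every neuron in a layer sees all outputs of the previous layer."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

def forward(x1, x2, network):
    activations = [x1, x2]
    for weight_rows, biases in network:
        activations = layer(activations, weight_rows, biases)
    return activations[0]   # above 0 reads as one class, below 0 as the other

# Hypothetical weights: layer 1 has one neuron looking only at x1 and one
# looking only at x2; the single layer-2 neuron combines both of them.
network = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.5, -1.5]], [0.0]),
]
print(forward(2.0, -1.0, network))
print(forward(-1.0, 2.0, network))
```

Each first-layer neuron here can only split the plane along one axis, but the second-layer neuron splits it along a diagonal, which neither input could do alone.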
The chart at the right is the output/result of the neural network's training. In the foreground you see actual data points that are used to train the neural net: to "teach" it how to classify orange or blue (unless you choose "regression" in which case it computes a numeric value). In the background you see the gradient that is formed by the network. The goal is to make the gradient form around the data points by color as closely as possible.
The neural network is essentially the nodes in the middle, linked together by various weights. During training, the training data points are fed forward into the network, creating an output. The error in that output is then fed backward using something called "back propagation", which is used to adjust the weights.
Typically, the more hidden layers or nodes per layer, the more difficult gradients that can be learned. Zero hidden layers essentially forms a linear gradient that can only be used to split very basic, linearly-separable data (drawing a straight line to separate the different types)
Neural networks have lots of little knobs and levers you can adjust. That's what all these inputs are that you see.
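The point about zero hidden layers can be illustrated with the classic XOR pattern, which no straight line can separate. A small Python sketch, with hand-picked (not learned) weights and hypothetical function names:

```python
def relu(z):
    return max(0.0, z)

def linear_model(x1, x2, w1, w2, b):
    """Zero hidden layers: the boundary w1*x1 + w2*x2 + b = 0.5 is a straight line."""
    return w1 * x1 + w2 * x2 + b

def one_hidden_layer(x1, x2):
    """Two ReLU hidden neurons with hand-picked weights."""
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    return h1 - 2.0 * h2

xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
for (x1, x2), label in xor_data:
    print((x1, x2), label, one_hidden_layer(x1, x2))
```

The hidden-layer version reproduces all four labels. For `linear_model`, no choice of w1, w2, and b gets all four points on the correct side of the line, because the two diagonals of the square would need to sum to different totals while a linear function gives them the same total.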
this is very nice! I think that the reason swiss roll doesn't work as easily might be because of initialization. In 2 dimensions you have to be very careful with initializing the weights or biases because small networks get more easily stuck in bad local minima.
I'm pretty sure he wasn't talking about the swiss roll specifically. Big gains in neural net performance have been made through better initialization schemes (not dataset specific, just in general, e.g. an initialization scheme might adapt the initial weight distribution depending on the number of hidden units in the next layer), and smaller models are in general more sensitive to initialization.
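One well-known example of such a scheme is Glorot (Xavier) uniform initialization, where the range of the initial weights shrinks as the neighbouring layers get wider. The comment above doesn't name a specific scheme, so treat this Python sketch as one illustrative choice rather than what anyone in the thread had in mind.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in weight matrix from U(-limit, limit),
    with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)
small = glorot_uniform(2, 4, rng)     # 2 inputs, 4 hidden units: limit is 1.0
wide = glorot_uniform(2, 598, rng)    # 2 inputs, 598 hidden units: limit is about 0.1
print(max(abs(w) for row in small for w in row))
print(max(abs(w) for row in wide for w in row))
```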
Update: after playing with this for way too long, I've found that it can converge to a spiral with 3 or 2 or even just 1 node in the 2nd hidden layer.
The 1 node case is especially interesting, because when it converges the single node must learn the whole spiral pattern. Although with noise it can be less reliable with more jagged edges, as well as take longer to converge (also bumped the learning rate down), seeing the spiral encoded directly in the 2nd hidden layer is more interesting to me.
For this simple example just choosing the largest possible fully-connected network with ReLU and L2 regularization to prevent overfit quickly converges to a nice spiral (test loss of 0.001 for me):
You could totally optimise network architecture by crowdsourcing topology discovery for different problems into a multiplayer game with loss as a score.
Eventually it will have to be recognized as a new species of life, so I hope programmers, tinkerers and everyone else keeps that in mind because all life must be respected
And this particular form will be our responsibility, we can either embrace it as we continue to merge with our technology, or we can allow ourselves to go extinct like so many other species already have
For the naysayers - ever notice how attached we are to our phones? Many behave as if they are missing a limb without it - it's because they are, the brain adapts rapidly and for many, the brain has adapted to outsourcing our cognition. It used to be books, day runners, journals, diaries - now we have devices and soon they'll be implants or prosthetics
The writers at Marvel who came up with the idea of calling Iron Man's suit a prosthetic were definitely onto something, and suits like that are probably our best chance of successful colonization of other planets. We'll need AI to be our friend out there, working with us
Using sin(x) or the other input features like x^2 goes back to making it too easy, though. So far the best I can do is 7 layers of 7 which gets a loss of 0.02. 3x7 is almost cracking the Swiss Roll but can't quite finish it off and gets stuck at 0.05: https://imgur.com/Z3f2ECc ... Surprisingly, 2x8 can do it, as long as I have noise or regularization on, but 8/7 then seriously struggles. Is 16 neurons a critical limit here?
I managed to get to 0.01 loss from only x1/x2, using 3 hidden layers, L1 regularization, a bit of added noise, and some patience: http://i.imgur.com/Y3zKpJF.png
Yes, noise & regularization seem to be key here. I've gotten a 2-layer with 7/8 neurons down to 0.06 and dropping but only with noise & L1: http://playground.tensorflow.org/#activation=relu&regulariza... Final loss of 0.051. Interestingly, increasing noise from 10 to 15 destroys performance, loss of 0.47.
Is it really "making it too easy" if you're applying your knowledge of the structure of the problem space to make it easier for the computer to solve? Certainly this isn't easy to do with every problem, but it seems like a better idea in general to start with parameters you suspect to be correct.
In the "swiss cake roll" the circular nature of the classes suggests using a sin or cos function, and the fact that they spiral out suggests also inputting magnitude information. Sure, you can just add more neurons that will end up computing the same thing, but we might as well give the computer a head start when we can.
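As a sketch of how far that head start can go, here is a Python example on an idealized spiral. The generator is an assumed toy (radius equal to the angle, no noise, second arm rotated by half a turn), not the playground's dataset, and the function names are made up for illustration.

```python
import math

def spiral_point(t, arm):
    """Idealized noise-free spiral: radius equals the angle t; arm 1 is
    arm 0 rotated by half a turn. An assumed toy generator."""
    angle = t + (math.pi if arm == 1 else 0.0)
    return t * math.cos(angle), t * math.sin(angle)

def spiral_feature(x1, x2):
    """One hand-built input that compares a point's angle to its radius."""
    radius = math.hypot(x1, x2)        # magnitude information
    angle = math.atan2(x2, x1)         # circular information
    return math.cos(angle - radius)    # near +1 on arm 0, near -1 on arm 1

for arm in (0, 1):
    values = [spiral_feature(*spiral_point(0.5 + 0.1 * k, arm))
              for k in range(100)]
    print(arm, min(values), max(values))
```

On this toy data a single engineered feature separates the two arms by its sign, so even a network with no hidden layers could classify it. With noise or a different spiral shape the separation would be less clean, but the idea is the same.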
If you like visual demonstrations of ML topics, you may be interested in http://ponder.hepburnave.com. It is an interactive demonstration of a self-organizing map, generating a 2D-map from a spreadsheet with multivariate data. It's an unsupervised learning approach, good for data exploration tasks, less so for classification tasks (/shamelessPlug).
I'm not well versed in neural networks but a lot of the new neural network software stacks coming out seem to be quite plug and play. What kind of expertise would engineers need to have a few years from now when the technology is well developed and it doesn't need to be rewritten from scratch every time?
I'm not qualified to answer this, but I will anyway.
To "operate" neural networks (as opposed to writing a framework for them), you need to know the building blocks. There are basic blocks like fully connected layers, convolutions, and nonlinear activations. Beyond those, there are higher level building blocks like LSTMs[1], gated recurrent units[2], highway layers[3], batch normalization[4], and residual blocks[5] that are made up of simpler blocks. Learning what these do and when it's appropriate to use them requires following current literature.
Operating neural networks requires some systems engineering skill. It takes a long time to train a single network and you'll find yourself trying many different architectures and hyperparameters along the way. Because of this, you'll want to distribute the training across many different systems and be able to easily monitor and deploy jobs on those systems.
A solid grasp of mathematics is useful to effectively debug your networks. You'll frequently find your network doesn't converge or gives totally garbage results, so you need to know how to dig into the network internals and understand how everything works. This is especially true if you're implementing a new building block from a paper.
Finally, know your machine learning and statistics fundamentals. Understand overfitting, model capacity, cross validation, probability, model ensembles, information theory, and so on. Know when a simpler model is more appropriate.
So you don't think some of these details will be automated away in the near future, so that it doesn't require a specialist to operate a neural network?
Already, it's not nearly as hard as this demo makes it look. There's one recent advance in particular that isn't in this demo, and that is Batch Normalization.
If you've played around with it a bit, I'm sure you have seen that deeper layers are hard to train... You see the dashed lines representing signal in the network become weaker and weaker as the network gets deeper. BatchNorm works wonders with this. It takes statistics from the minibatch of training examples, and tries to normalize it so that the next layer gets input more similar to what it expects, even if the previous layer has changed. In practice you get a much better signal, so the network can learn a lot more efficiently.
Without BatchNorm, more than two hidden layers is tedious and error-prone to train. With it, you can train 10-12 layers easily. (With another recent advance, residual nets, you can train hundreds!)
Such advances push the limit between what you can train easily and what still requires GSD ("graduate student descent": figuring out just the right parameters to get something to work through intuition, trial and error). You still have to watch out for overfitting, but the nice thing about that is that more training data helps.
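The normalization step described above can be sketched in Python for a single feature. This covers only the training-time computation on one minibatch; real implementations also track running statistics for use at test time and learn gamma and beta, which are simply fixed arguments in this simplified version.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a minibatch to roughly zero mean and
    unit variance, then apply a scale (gamma) and a shift (beta)."""
    mean = sum(batch) / len(batch)
    variance = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta
            for v in batch]

activations = [1.0, 2.0, 3.0, 4.0]   # one feature, four training examples
normalized = batch_norm(activations)
print(normalized)
```

Whatever scale the previous layer happens to produce, the next layer sees inputs centred on zero with a spread close to one, which is the sense in which it gets input more similar to what it expects.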
I think most of those things will remain important:
+ Designing the network architecture is a means to instill your knowledge of the problem into the network. For example, using convolutions over images encodes some translational invariance into the network. It makes up for lack of data. I don't think data augmentation alone is enough, either: if you use a "stupid" architecture with heaps of data, the computation will become too expensive or slow.
- The systems engineering part will probably get automated. I bet there are Amazon engineers crying at their desks while working on AWS Elastic Tensorshift right now. So unless you're specifically interested in that side of things, maybe this isn't the best area to focus on.
+ There are always going to be problems, so knowing how to debug is a useful skill.
+ ML/stats fundamentals aren't going away. You need to know what you're trying to do before you can do it.