Tinker with a Neural Network in Your Browser (tensorflow.org)
855 points by shancarter on April 12, 2016 | 116 comments



The swiss roll problem also illustrates nicely the idea behind deep learning.

Before deep learning, people would manually design all these extra features (sin(x_1), x_1^2, etc.) because they thought it was necessary to fit this swiss roll dataset. So they would use a shallow network with all these features, like this: http://imgur.com/H1cvt8d

Then the deep learning guys realized that you don't have to engineer all these extra features, you can just use basic features x_1, x_2 and let the network learn more complicated transformations in subsequent layers. So they would use a deep network with only x_1, x_2 as inputs: http://imgur.com/XBRjROP

Both these approaches work here (loss < 0.01). The difference is that for the first one you have to manually choose the extra features sin(x_1), x_1^2, ... for each problem. And the more complicated the problem the harder it is to design good features. People in the computer vision community spent years and years trying to design good features for e.g. object recognition. But finally some people realized that deep networks could learn these features themselves. And that's the main idea in deep learning.
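
A rough sketch of the two feature sets in code (illustrative only, not taken from the linked screenshots):

  import numpy as np

  # Hand-engineered features for a shallow net, mirroring the playground's
  # extra input options, versus the raw coordinates a deep net starts from.
  def engineered_features(x1, x2):
      return np.stack([x1, x2, x1**2, x2**2, x1*x2, np.sin(x1), np.sin(x2)], axis=1)

  def raw_features(x1, x2):
      return np.stack([x1, x2], axis=1)  # later layers learn the transformations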


I think I learned more from your post and your two imgur links than from poking at the site for an hour. Thanks.

Would it make sense for them to add a gallery of good solutions for each problem, or would they all basically be your second example network (no time to play and see for myself right now)?


>Before deep learning people would manually design all these extra features sin(x_1), x_1^2, etc.

It's probably worth pointing out that this is true for ANNs, but there were (and are) other "shallow" classifiers that can handle the swiss roll problem without manual feature engineering. SVMs, for example.

http://cs.stanford.edu/people/karpathy/svmjs/demo/


needs another image link for visualization


But how will the number of neurons N grow with the number of turns in the spiral?

If N levels off, then the network has grasped the concept of a spiral and can generalize to arbitrary size.

If N doesn't level off, then the network isn't really learning the general case.


I know this is going to sound cheesy but that's an amazing way to put it. It blew my mind.


Using their network, you are limited to 8 units per layer it seems.

So, I ported their swiss roll dataset to Python and threw together a shallow network trainer with Theano:

https://gist.github.com/notmatthancock/68d52af2e8cde7fbff1c9...

Then, I trained a shallow network with 36 hidden units (your deep net has 6 units and 6 layers):

http://i.imgur.com/I0pXaTK.png

edit: I forgot to mention that the shallow network above takes only the two coordinates (x1 and x2) as input features.
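
For anyone curious, here is a minimal sketch of that kind of single-hidden-layer Theano trainer (my own rough reconstruction, not the actual gist):

  import numpy as np
  import theano
  import theano.tensor as T

  # Single-hidden-layer classifier for 2-D points, plain gradient descent.
  rng = np.random.RandomState(0)
  n_hidden = 36

  X = T.matrix('X')   # N x 2 swiss roll coordinates
  y = T.ivector('y')  # N labels in {0, 1}

  W1 = theano.shared(0.1 * rng.randn(2, n_hidden), name='W1')
  b1 = theano.shared(np.zeros(n_hidden), name='b1')
  W2 = theano.shared(0.1 * rng.randn(n_hidden, 2), name='W2')
  b2 = theano.shared(np.zeros(2), name='b2')

  h = T.tanh(T.dot(X, W1) + b1)          # hidden layer
  p = T.nnet.softmax(T.dot(h, W2) + b2)  # class probabilities
  loss = T.mean(T.nnet.categorical_crossentropy(p, y))

  params = [W1, b1, W2, b2]
  grads = T.grad(loss, params)
  lr = 0.1
  train = theano.function([X, y], loss,
                          updates=[(w, w - lr * g) for w, g in zip(params, grads)])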


Just so I understand correctly: your network has 100000 iterations, while the parent's has 1000, but they both only use x / y positions?

It feels like neurons in the first layer are weaker, because all they can do is a linear separation. Given deep networks, I was wondering if adding neurons to the first layer was better than adding them to the last one, and empirically, it feels like it is quite worse. I wonder if there is a theorem around that.


> your network has 100000 iterations, while the parent's has 1000, but they both only use x / y positions

Correct, but keep in mind that their method appears to use batch descent while mine does not. Batch descent often converges more quickly. There are other differences between my net and the GP's that I can spot as well (e.g., the activation function, the learning rate, and regularization).

Also keep in mind that I threw this together over breakfast, and did not spend much time tweaking parameters :)
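
To illustrate the difference, a toy sketch (the grad function here is a stand-in for a real network's gradient; nothing below is from the playground's code):

  import numpy as np

  # Toy gradient for a linear least-squares model, standing in for a real net.
  def grad(w, X, y):
      return 2 * X.T @ (X @ w - y) / len(X)

  def batch_step(w, X, y, lr=0.1):
      return w - lr * grad(w, X, y)            # one update from the whole dataset

  def stochastic_steps(w, X, y, lr=0.1):
      for i in np.random.permutation(len(X)):  # one update per training example
          w = w - lr * grad(w, X[i:i+1], y[i:i+1])
      return w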


How do you know to choose 6 hidden layers with 6 neurons each though? Why not 'x' hidden layers with 'j' neurons each? or some other random number?

Also, how do you know to choose a ReLU instead of a tanh activation?


ReLU gives good results for deep learning: http://jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.p....

6 layers is the maximum that this demonstration allows, and they kept j small-ish to show that you don't need that many to have good results.


What I found interesting is that I couldn't get a proper fit with the same parameters you showed... however, I could 'speed up' the learning by regenerating the data during the learning process.

It may just be that 'batched cumulative learning' (I don't know if there is already a term for this) gets a better fit than just learning from a smaller set of data.

Edit: Did a quick test, regenerating about every 50 and 100 iterations, and convergence does seem faster (at least, once a clear spiral is formed). https://imgur.com/a/OPjXb


Regenerating the data is kind of cheating; it is as if you were given twice the amount of data.

In a normal situation, you obtain a list of input / output (say, images as input, a digit as output, for learning handwritten digits). You separate it between training data (which actually improves the net) and testing data (to detect overfitting), and you don't get more data than that.

Here, you can generate more data for free, as we have the function we want to approximate. Having more data will often result in a better result and faster convergence.
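
A minimal sketch of that split, with made-up data:

  import numpy as np

  # Made-up data: 500 points with two features and a binary label.
  X = np.random.randn(500, 2)
  y = (X[:, 0] * X[:, 1] > 0).astype(int)

  idx = np.random.permutation(len(X))
  split = int(0.8 * len(X))
  X_train, y_train = X[idx[:split]], y[idx[:split]]  # used to fit the model
  X_test, y_test = X[idx[split:]], y[idx[split:]]    # held out to detect overfitting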


This is a very good explanation, thanks (even though I knew some of it already)

I tried the swiss roll with a shallow network on the demo (and the results are not excellent, but it matches)


I can reproduce your deep example just fine, but the shallow result needs some luck. At the same time, the shallow result runs faster.


Along with the images that is a very awesome explanation.


I started reading about ANNs in the 1980s, just for fun, and had similar confusion to those here. I suggest reading a book or online resource that goes over the basics [1]. I struggled through $200 textbooks, and jumped from one to the other as an autodidact. I am now studying TWEANNs (Topology and Weight Evolving Artificial Neural Networks), which are basically what you see here except that they are able to change not only their weights but also their topology, that is, how many neurons and layers there are and where they sit. ANNs (Artificial Neural Networks, as opposed to biological ones) can be a lot of fun, and are very relevant to machine learning and big data nowadays. It was exploratory for me; I used them for generative art and music programs. Be careful: soon you'll be reading about genetic algorithms, genetic programming [2], and artificial life ;) Genetic programming can be used to evolve neural networks as well as to generate computer programs that solve a problem in a specified domain. Hint: you'll probably want to use Lisp/Scheme for genetic programming!

  [1] http://natureofcode.com/book/chapter-10-neural-networks/
  [2] http://www.genetic-programming.com


As far as the recent deep learning boom is concerned, genetic programming is really out of favor. I don't really see it in any of the deep learning (or even machine learning, for that matter) literature/successes/research groups.

"Neural networks" are a really really overloaded term. A ton of stuff referred to as "neural networks" has little to do with the "neural networks" that are used in the machine learning community.


You're spot on about genetic programming. I am self-taught and play with anything that strikes my fancy; I learn by playing. I read all three volumes of the Artificial Life series from the Santa Fe Institute at the time (now there are more), and went in many directions in the 1990s: Fuzzy Logic, Expert Systems, ANNs, Evolutionary Computation (GA (Genetic Algorithms) and GP (Genetic Programming)), and AL (Artificial Life), all fascinating. I found, and still find, genetic programming attractive even if it has not found its niche in the ML community. I think the CI (Computational Intelligence) community at large will eventually develop well-fitted uses for it. I was trying to use an FPGA and Koza's modified GP code to have the FPGA re-program itself as the GP evolved a better program than the one I originally wrote to kickstart it. I didn't get too far. This was 1996-97, though; I was pretty much on my own then, with not much of an Internet to find information (especially esoteric information) or cheap many-gated FPGAs. Outside of ML, GP has found moderate success. One example is this paper (sorry, behind a paywall, so only the title here), which started with expert data, tried ANNs, then ANNs plus statistics, and finally used a GP approach:

"A Computational Intelligence-Based Genetic Programming Approach for the Simulation of Soil Water Retention Curves"

I also use the term ANNs over just NNs to keep it to the silicon, and not wetware ;) Although, they did hook up a small ANN to a cockroach once, IIRC...


It has its niche applications. The only non machine vision application that comes to mind is one[1] that takes a pile of data, and evolves a model that fits it.

Generally, where it's actually being used, people are a bit quiet about how they get the results they do. While the genetic bit is easy, the secret sauce is in guiding the learning/evolution in ways that work for the particular problem domain.

[1]: http://www.nutonian.com/products/eureqa/


Yes, but all of the algorithmic advances in academia, and most of the advances at Google/Facebook, have been out in the open.


Yes, it is a shame people don't share their advances in science and technology, usually for fear of losing market share. Sharing grows the market, so there's more pie for everyone and more work gets done to advance the field. Still, neither that point nor how successful GP currently is in the ML community measures its current or future potential. The book I am working my way through now, in LFE (Lisp Flavored Erlang, as opposed to Erlang or Elixir), is "The Handbook of Neuroevolution Through Erlang" by Gene Sher [1].

Gene covers a lot of ground. Somebody has done some transliteration to Elixir too; I use LFE, since staying with Lisp bridges the gap between my GP work and what Gene has done with Erlang, ANNs, and EC. For GP, you really need to be able to create new forms with macros, so Lisp is more in line with GP. To quote an excerpt from Robert Virding, co-designer of Erlang and creator of LFE, addressing Elixir's macros (or messing with Erlang's modules) vs. LFE's or Lisp's macros on HN:

  "There is syntactic support for making the function calls look less like function calls but the macros you define are basically function calls.
In Lisp you are free to create completely new syntactic forms. Whether this is a feature of the homoiconicity of Lisp or of Lisp itself is another question as the Lisp syntax is very simple and everything basically has the same structure anyway. Some people say Lisp has no syntax." [2]

  [1] http://www.erlang-factory.com/upload/presentations/536/ErlangConferencePresentation_2012.pdf

  [2] https://news.ycombinator.com/item?id=7623991


Just curious then, how are people optimizing network topology?


GSD, also known in the literature as "Graduate Student Descent."

I'm not even joking. Trial and error. Having good "intuition" about past ideas and the basic building blocks to guide that trial and error. Reading research papers, seeing what other people did well with, and using that.

As an aside, this is the principal reason I am skeptical of grandiose claims about deep learning.


Regularisation methods like dropout are often good enough that you can build a network with too many parameters (for the amount of data you have) and rely upon the regularisation to find the subset of that network that is actually useful. People have recently got good results from also randomly dropping weights, or even whole layers.
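
A rough sketch of (inverted) dropout at training time, assuming activations h and a keep probability:

  import numpy as np

  # Inverted dropout: zero out units at random, rescale so the expected
  # activation is unchanged; at test time the layer is used as-is.
  def dropout(h, keep_prob=0.5, rng=np.random):
      mask = (rng.rand(*h.shape) < keep_prob) / keep_prob
      return h * mask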


Probably also through some grid search. I've read (but don't remember where) that random search gives very good results, even better than grid search (in less time).
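
A toy version of random search (the ranges are illustrative, and train_and_evaluate is a stand-in for actually training a net):

  import random

  def train_and_evaluate(cfg):
      # Stand-in for training a network with these hyperparameters
      # and returning its test loss.
      return random.random()

  def random_config():
      return {
          'learning_rate': 10 ** random.uniform(-4, -1),
          'n_layers': random.randint(1, 6),
          'n_hidden': random.choice([4, 8, 16, 32]),
      }

  best = None
  for _ in range(50):
      cfg = random_config()
      score = train_and_evaluate(cfg)
      if best is None or score < best[0]:
          best = (score, cfg)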


Any thoughts on why genetic programming is not 'in fashion'? Does it have anything to do with complexity of the calculations?

I can imagine that the advanced models use many, many machines and only deliver results after a large training time. Genetic programming is not feasible then, if you cannot get a quick grasp of the potential results of a model.


At least for deep learning, most deep learning models take more than a week to train, often on multiple GPUs. Some of the extremely deep, huge dataset models can take multiple weeks on multiple GPUs. Google trained AlphaGo's nets for months (on god knows how many GPU/CPUs). Suffice to say, people don't even bother touching most hyperparameters, let alone trying to do something more exhaustive.


If your program is a neural network with N parameters, or a program tree with N nodes, then testing against data takes O(N) time. With evolutionary computation, what you get for your trouble is a single real number -- the loss: how bad it did. With neural networks, backpropagation gives you N real numbers: the gradient of loss with respect to each parameter.

Put another way: with evolution you have to stumble around blindly in parameter space and rely on selection to keep you moving in the right direction. With the gradient descent that neural networks use, you get, essentially for free, knowledge of the (locally) best direction to move in parameter space.

The bigger the models, the more this matters. Modern neural networks have millions or even billions of parameters, and that's been crucial to their expressive power. Good luck learning a program tree with a billion nodes using evolution. It might take 4.54 billion years.
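
A toy illustration of that information gap, on a made-up quadratic loss:

  import numpy as np

  N = 1000
  def loss(w):
      return np.sum((w - 3.0) ** 2)

  w = np.random.randn(N)

  # Evolution-style: mutate, evaluate, and all you learn is one scalar per candidate.
  candidate = w + 0.01 * np.random.randn(N)
  fitness = loss(candidate)      # a single number: "how bad it did"

  # Gradient-style: one differentiation yields N numbers, the locally best direction.
  grad = 2.0 * (w - 3.0)         # d(loss)/dw for every parameter at once
  w = w - 0.01 * grad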


> It might take 4.54 billion years.

And then only if you have a system powerful enough to accurately simulate a planet full of molecules.

Although I do think there is a balance between GA and structured NN which will lead to faster and better results than the deep NN alone. We already see some of the best deep NNs incorporating specific structures.


I think neural networks and the various forms of evolutionary computation will merge, as I have been writing in my other replies in this thread. TWEANNs incorporate EC into evolving ANNs. The other article I cited above, on soil mechanics, used GP and beat out expert systems, ANNs, and statistics. MEP, or Multi-Expression Programming, for GP incorporates being able to put more than one solution into a gene without increasing the processing time, thereby overcoming the inefficiencies of 1990s-era GP. Here is a recent article using it that is not behind a paywall or via sci-hub.io [1]. It needs better editing, but there are other references if you search for Multi-expression Genetic Programming.

  [1] http://benthamopen.com/ABSTRACT/TOPEJ-9-21


First, the right tool for the job: ANNs are general function approximators that, with sufficient training, are a cost-effective choice to implement. Second, ANNs have been around about 35 years longer than GP. The TWEANNs I am studying, and that I already mentioned in a previous reply in this thread, hybridize ANNs and EC (GAs and GP), so if you include neural networks that use Evolutionary Computation techniques to modify weights or topology, then GP is being used to an extent. Replication with variation is the key force in biology and in EC, and I only see more use of EC techniques to enhance the general function approximators that are ANNs. Further, there are also hybridized computing machines that have been made, and are being made, with FPGAs and GPUs. Finance and supercomputing are just two areas looking to utilize them. In some, the FPGAs are simply there for updating special computation programs that feed the GPUs. There is some research with a GP optimizer updating the FPGAs and then using the GPUs for the massive parallelization of the computations.


Thanks eggy, awesome replies. You should write some of your experiences down if you find the time.


Evolutionary algorithms and genetic programming are global optimization techniques, basically random search with some memory. They're not "out of fashion" any more than simulated annealing or Monte Carlo methods are. They have limited usability, that's all.


>Topology and Weight Evolving Artificial Neural Networks

I brainstormed for a while about using genetic algorithms to decide the network topology. I'm glad someone else invented that already! Less work for me to do now.


Okay, that is straight up awesome. I've been toying with neural networks just enough to get a basic understanding of what they are and how they work, and it occurred to me that something like this might be possible.

Of course, I wasn't up-to-speed enough to know the right terms to look for, so thanks for sharing. :)

I am curious though... it seems like it would take orders of magnitude more computing power to not only train but evolve and re-train the networks. Is this practical with today's hardware?


This is great, but I think they should make it clear that this isn't using TensorFlow.

From the title and domain I thought they had either ported TF to JavaScript(!) or were connecting to a server.


Wait - what it is using, then? I had assumed it was TF under Emscripten or similar.


It appears to be a custom NN implementation[1] in Javascript, somewhat similar to convnet.js[2]

As far as I can see the API[3] isn't much like TensorFlow.

[1] https://github.com/tensorflow/playground

[2] http://cs.stanford.edu/people/karpathy/convnetjs/

[3] https://github.com/tensorflow/playground/blob/master/nn.ts


When it says "right here in your browser," it's not joking. On my desktop (Safari), the window becomes unresponsive after a few iterations. Does not happen in Chrome.

On my phone (Safari/iOS 9.3), the default neural network doesn't converge at all even after 300 iterations, while it does on the desktop, which is legit weird: https://i.imgur.com/KNaXeHH.png


I'm sorry you're having problems with Safari. I can't reproduce on my end, but if you're still having problems you can raise an issue on github with some information about your system.


Works perfectly for me on Safari 9.1 with no extensions


Working splendidly on ChromeOS, FWIW.


Yeah, it's working great in Chrome on my Galaxy Tab 3!


To be honest, if it works in Chrome then it covers > 90% of people who would possibly be interested.


In case you are an idiot like me, you have to train your neural network by pressing "play".


"Okay, I don't understand. Why is my output so terrible?"

I saw the play button very clearly when the page loaded, then promptly got distracted by all the dials and knobs. :-P


We would've liked to have it constantly training, but didn't want to abuse your CPU :)


It pauses when I switch tabs :(


While it doesn't involve training, these 'confusion matrix' animations of NNs classifying images or digits are fun, too:

http://ml4a.github.io/dev/demos/cifar_confusion.html http://ml4a.github.io/dev/demos/mnist_confusion.html

Something about the high-speed updating makes me think of WOPR, in 'War Games', scoring nuclear-war scenarios.


This demonstration goes really well with Michael Nielsen's http://neuralnetworksanddeeplearning.com/. At the bottom of the page the author gives a shout out to Nielsen, Bengio, and others.

For someone (like me) who's done a bit of reading but not much implementation, this playground is fantastic!


Really awesome article!


Neat stuff, fun to play with. I wasn't able to get a net to classify the swiss roll. Last time I was playing around with this stuff I found the single biggest factor in the success was the optimizer used. Is this just using a simple gradient descent? I would like to see a drop down for different optimizers.


http://imgur.com/ypBQEWx

Add some noise, use all the inputs, and one 8-wide hidden layer.

edit: works better with a sigmoid activation curve, but it converges more slowly


Yeah, you're on the right track. A nice pattern emerges on this after 160 iterations.

http://playground.tensorflow.org/#activation=tanh&batchSize=...


Using sin, cos, x1, x2 with 1 six-neuron hidden layer does the trick quickly: http://imgur.com/UMv5gsH

No need to mess with noise or regularization :)


> Add some noise

This actually makes the dataset harder to fit to. It is not the same thing here as the "training with noise" method where random noise would be added to each batch, as an alternative means of Tikhonov regularization.
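
A sketch of that "training with noise" idea for contrast (sigma is illustrative):

  import numpy as np

  # "Training with noise": perturb each batch before the usual update,
  # rather than baking noise into the dataset itself.
  def noisy_batch(X_batch, sigma=0.1):
      return X_batch + sigma * np.random.randn(*X_batch.shape)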


With that particular data set, it looks like it really just adds more data and, more importantly, fills in the gaps along the spirals, which is where my setup was having trouble.

The noise doesn't go far enough to start confusing points between different clusters, but it adds more points.

That said, my knowledge of neural nets is fairly limited.


Using all inputs and 6 layers of varying sizes. After about 500 iterations. http://i.imgur.com/x1MOpvl.jpg


Just 100 iterations, learning rate 0.03, activation tanh, regularization L2, rate 0.01. The network is 8,8,8 neurons per layer.


Using the defaults, I had success at about 300 iterations with all the inputs and 5 hidden layers, each with a decreasing number of neurons (i.e. 6,5,4,3,2).

I don't know if that's a general feature to need fewer neurons with each layer, but that seems to work here.


What were the optimization algorithms you had most success with? Were they more successful in the sense of better out-of-sample error rate or in the sense of quicker convergence (or something else)?


Can somebody explain what I'm watching when I press play?


Hopefully this helps (correct me if I'm wrong, I'm still learning about neural nets):

Think of the whole neural net as a function:

input * weight = output

At each iteration, we feed in the input to the neural net. Then the neural net compares what output it gets to the correct output.

For example, input1 is 5, and the correct output for input1 should have been 2. But the neural net got 3 as the output. So it then decreases the weights slightly so it would get 2.75 next time it has input of 5. Repeat thousands of times. That's the basic idea for machine learning and neural networks.

The algorithm it uses to figure out how much to adjust the weights is called "backpropagation", which uses gradient descent. To explain gradient descent, think of it as a roller coaster track. Imagine the roller coaster starts off at a random location on the track. Then gravity takes the roller coaster down the track until it ends up at a low point between two hills and stays there. This is the new location of the roller coaster. This new location is nice because it has the lowest energy the roller coaster could find, so it stays there. (We use derivatives to figure out the slope of the curve, which gives us the direction in which the curve goes downhill.)

In neural networks, the roller coaster curve is the "cost function", which basically calculates the amount of difference between the neural net's output and the actual correct output it should have got. The initial weight is the roller coaster's initial position. The new weight is the roller coaster's final position, at the bottom of the cost function curve. This new position thus gives us the lowest cost.

Note that there may be even lower valleys, but when we roll the rollercoaster it stops at its nearest low valley. This is why we randomize the weights at the beginning - to put the roller coaster near possibly even lower valleys.
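
The roller coaster picture in a few lines of toy code (a one-parameter cost, not the playground's actual loss):

  # A one-parameter "roller coaster": cost is lowest at w = 2.
  def cost(w):
      return (w - 2.0) ** 2

  def slope(w):
      return 2.0 * (w - 2.0)  # derivative: which way is downhill

  w = 5.0                     # random starting position on the track
  for _ in range(100):
      w -= 0.1 * slope(w)     # roll a little way downhill each iteration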


Okay, so it works by minimizing (equiv. maximizing) some function. But that doesn't say much about how it "learns" the gradient. What function does it care about? Average squared error (predict_prob-Z_i)^2 ? Average absolute error? The likelihood function of some assumed distribution? Maximum distance between the classification border and the closest observed points?

If I saw someone carrying a bag full of blueberries and some bread home from the grocery store and asked to know how they chose to buy that, to which they replied "I had a list of characteristics which I thought were important for groceries to have in this trip to the store. For each grocery item, I recorded a vector of degrees to which the item possesses each of those characteristics. Finally, I chose the group of groceries that had the best combination of degree vectors", I still wouldn't really know anything about why they bought the blueberries and bread.


The function it minimizes is called the "loss function", and its value for the training and test sets are shown in the upper right area. AFAICT the site doesn't say how it's computed, but I think it's average squared error. The gradient is not learned; if you think of the loss function as a real-valued function of the weights, the gradient is just the partial derivatives with respect to the weights.
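
If it is indeed average squared error, the loss is just (a sketch, not the site's actual code):

  import numpy as np

  # Average squared error over the predictions; backpropagation then gives
  # its partial derivatives with respect to every weight.
  def mse(predictions, targets):
      return np.mean((predictions - targets) ** 2)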


It depends on what you are trying to achieve. There are many; for example, see: http://cs231n.github.io/neural-networks-2/#losses


If I hadn't taken this course[1], I wouldn't understand what you are talking about.

[1] - https://www.coursera.org/learn/machine-learning


Super basic explanation:

It's training a neural network to classify a data set with two classes (orange or blue) and the data has two features (x1 or x2). All the orange and blue dots are the training data. So if you take a dot on the graph with coordinates (-2, 4) and it's blue, that would mean that a data point with x1 = -2 and x2 = 4 has the class blue.

You can think of a neural network as a function that can take in arbitrary features (in this case x1 and x2) and tries to output the correct class. That's what the orange and blue colors in the background are, the neural network's guess at the correct classification for any given point (x1, x2).

When you hit play, it iterates through the training data making adjustments to each neuron in the network so that it gets closer to predicting the right class.

If you want to see how well the neural network performs on data it wasn't trained on, you can click "show test data".


Yeah, I feel like we need some decent understanding of neural networks to have more context on this. It's kind of like being given a specialized shovel but not knowing why you need it or why you should dig holes.


I think it's a playground in the best sense of the term. Take some time and actually play with it, and a lot of fun stuff happens, lightbulbs go off, etc.

If you're expecting a lesson, you'll likely be disappointed, but I think there's real value in a true playground.

I think the biggest improvement would be if, when hovering over a 'neuron', you get a visual representation of what feeds into it.


For me, in Chrome on OS X, you do get a visual representation of the neuron's input when hovering: it shows up behind the data points, in place of the neurons' output.


The dots are the training data. The orange and blue background are the estimation for how the neural net will classify a new orange or blue dot.


It begins training the network using the backpropagation algorithm.

> Next, the network is asked to solve a problem, which it attempts to do over and over, each time strengthening the connections that lead to success and diminishing those that lead to failure.

On each iteration, it calculates how bad the predicted output is, then adjusts the weights between neurons to lessen that value. Google backpropagation for more info


Or perhaps explain how all the different inputs influence the result? I more or less get that it's just iterating over the data to approximate the given data set when you press play, but I have no idea how giving it more or fewer neurons changes that, to name an example.


Basically each input gets multiplied by some weight that gets adjusted through each iteration. The product of the input and weight gets put through an activation function, and the outcome of that can be interpreted as the network's prediction of the class.

So you see the first neuron's input is just x1. You can see in the little graph at x1 that it's split down the middle with orange on one side and blue on the other. You can think of adjusting the weight on that neuron as adjusting where along the x axis the split occurs. All points on the orange side are classified orange and all on the blue side are classified blue. If you picked a data set like the spiral one or whatever, that neuron alone isn't going to make very many correct classifications. That's because it only gets the x1 value as input and can only affect the output by multiplying x1 by some weight, which would only have the effect of shifting the classification boundary left or right. You can see the same thing happening for the second neuron with input x2, except that now it splits along the y axis. Again, that alone isn't going to match the data very well.

But then you get to the second layer, and the input of each neuron in the second layer is the output of each neuron in the first layer. So these neurons are able to take into consideration both x1 and x2 and are able to divide the data in more complex ways. So you can think of the neurons in each layer of the neural network as being able to consider more and more complex properties of the data in forming its output.
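
A minimal forward pass showing how the second layer mixes the first layer's outputs (illustrative weights, not the playground's):

  import numpy as np

  # Two inputs -> 4 hidden units -> 1 output, with random (untrained) weights.
  def forward(x, W1, b1, W2, b2):
      h = np.tanh(x @ W1 + b1)     # each hidden unit mixes x1 and x2, then squashes
      return np.tanh(h @ W2 + b2)  # the next layer combines the hidden units

  x = np.array([[-2.0, 4.0]])      # a point with x1 = -2, x2 = 4
  W1, b1 = np.random.randn(2, 4), np.zeros(4)
  W2, b2 = np.random.randn(4, 1), np.zeros(1)
  print(forward(x, W1, b1, W2, b2))  # sign of the output gives the predicted class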


The chart at the right is the output/result of the neural network's training. In the foreground you see actual data points that are used to train the neural net: to "teach" it how to classify orange or blue (unless you choose "regression" in which case it computes a numeric value). In the background you see the gradient that is formed by the network. The goal is to make the gradient form around the data points by color as closely as possible.

The neural network is essentially the nodes in the middle, linked together by various weights. During training, the training data points are fed forward through the network, producing an output. The error in that output is then fed backward using something called "back propagation", which is used to adjust the weights.

Typically, the more hidden layers or nodes per layer, the more complex the gradients that can be learned. Zero hidden layers essentially forms a linear gradient that can only be used to split very basic, linearly separable data (drawing a straight line to separate the different types).

Neural networks have lots of little knobs and levers you can adjust. That's what all these inputs are that you see.


This is very nice! I think the reason the swiss roll doesn't work as easily might be initialization. In 2 dimensions you have to be very careful with initializing the weights or biases, because small networks get stuck in bad local minima more easily.


In this case you can see that it is the swiss roll, so you could, say, pick a "proper initialization".

But that technique would not work when you cannot see that it is a "swiss roll" or in multiple dimensions.


I'm pretty sure he wasn't talking about the swiss roll specifically. Big gains in neural net performance have been made through better initialization schemes (not dataset specific, just in general, e.g. an initialization scheme might adapt the initial weight distribution depending on the number of hidden units in the next layer), and smaller models are in general more sensitive to initialization.


Has anyone been able to learn a function for the spiral (Swiss roll) data that's as good as a human-designed function would be?



Update: after playing with this for way too long, I've found that it can converge to a spiral with 3 or 2 or even just 1 node in the 2nd hidden layer.

The 1-node case is especially interesting, because when it converges the single node must learn the whole spiral pattern. Although with noise it can be less reliable, with more jagged edges, and take longer to converge (I also bumped the learning rate down), seeing the spiral encoded directly in the 2nd hidden layer is more interesting to me.

http://playground.tensorflow.org/#activation=tanh&batchSize=...



For this simple example, just choosing the largest possible fully-connected network with ReLU and L2 regularization to prevent overfitting quickly converges to a nice spiral (test loss of 0.001 for me):

http://playground.tensorflow.org/#activation=relu&regulariza...


I wouldn't call it quick... spiral in 150 iterations, with sigmoid magic: http://playground.tensorflow.org/#activation=sigmoid&regular...

I find the pulsating unsightly.


Neat. How would the number of neurons N scale with the size of the spiral? (Size=number of turns)

Will N level off, meaning that it will really understand the structure of the spiral?


For me the key was to use x, y and either sin(x), sin(y) or x^2, y^2 as inputs, and 5 or 6 neurons in the hidden layer.


Add both sin(X1) and sin(X2) as inputs.


Do you consider test loss around 0.04 good?


You could totally optimise network architecture by crowdsourcing topology discovery for different problems into a multiplayer game with loss as a score.


So glad ANNs are becoming mainstream.

Eventually it will have to be recognized as a new species of life, so I hope programmers, tinkerers and everyone else keeps that in mind because all life must be respected

And this particular form will be our responsibility, we can either embrace it as we continue to merge with our technology, or we can allow ourselves to go extinct like so many other species already have

For the naysayers - ever notice how attached we are to our phones? Many behave as if they are missing a limb without it - it's because they are, the brain adapts rapidly and for many, the brain has adapted to outsourcing our cognition. It used to be books, day runners, journals, diaries - now we have devices and soon they'll be implants or prosthetics

The writers at Marvel who came up with the idea of calling Iron Man's suit a prosthetic were definitely onto something, and suits like that are probably our best chance of successfully colonizing other planets. We'll need AI to be our friend out there, working with us.


This is a lot of fun. The default dataset is too easy, though, try out the Swiss Roll one!


There is a reason why sin(X) is an input property. :p


Using sin(x) or the other input features like x^2 goes back to making it too easy, though. So far the best I can do is 7 layers of 7 which gets a loss of 0.02. 3x7 is almost cracking the Swiss Roll but can't quite finish it off and gets stuck at 0.05: https://imgur.com/Z3f2ECc ... Surprisingly, 2x8 can do it, as long as I have noise or regularization on, but 8/7 then seriously struggles. Is 16 neurons a critical limit here?


I managed to get to 0.01 loss from only x1/x2, using 3 hidden layers, L1 regularization, a bit of added noise, and some patience: http://i.imgur.com/Y3zKpJF.png


Yes, noise & regularization seem to be key here. I've gotten a 2-layer with 7/8 neurons down to 0.06 and dropping but only with noise & l1: http://playground.tensorflow.org/#activation=relu&regulariza... Final loss of 0.051. Interestingly, increasing noise from 10 to 15 destroys performance, loss of 0.47.


Is it really "making it too easy" if you're applying your knowledge of the structure of the problem space to make it easier for the computer to solve? Certainly this isn't easy to do with every problem, but it seems like a better idea in general to start with parameters you suspect to be correct.

In the "swiss cake roll" the circular nature of the classes suggests using a sin or cos function, and the fact that they spiral out suggests also inputting magnitude information. Sure, you can just add more neurons that will end up computing the same thing, but we might as well give the computer a head start when we can.


I look at things like this as not "making it too easy", but rather, "time for a more difficult problem".

I'd quite like if you could define your own input patterns and data sets.


All the patterns except the swiss roll work best without any hidden layers at all


This is a very cool toy. As someone with no experience in ML, this is an interesting visual approach to the absolute basics.

And great for challenging your friends in an epic battle of convergence!


If you like visual demonstrations of ML topics, you may be interested in http://ponder.hepburnave.com. It is an interactive demonstration of a self-organizing map, generating a 2D-map from a spreadsheet with multivariate data. It's an unsupervised learning approach, good for data exploration tasks, less so for classification tasks (/shamelessPlug).


It's the classical exploration vs exploitation tradeoff. What do you do, try a radical new variation or fine tune this one?


I'm not well versed in neural networks, but a lot of the new neural network software stacks coming out seem to be quite plug and play. What kind of expertise would engineers need a few years from now, when the technology is well developed and it doesn't need to be rewritten from scratch every time?


I'm not qualified to answer this, but I will anyway.

To "operate" neural networks (as opposed to writing a framework for them), you need to know the building blocks. There are basic blocks like fully connected layers, convolutions, and nonlinear activations. Beyond those, there are higher level building blocks like LSTMs[1], gated recurrent units[2], highway layers[3], batch normalization[4], and residual blocks[5] that are made up of simpler blocks. Learning what these do and when it's appropriate to use them requires following current literature.

Operating neural networks requires some systems engineering skill. It takes a long time to train a single network and you'll find yourself trying many different architectures and hyperparameters along the way. Because of this, you'll want to distribute the training across many different systems and be able to easily monitor and deploy jobs on those systems.

A solid grasp of mathematics is useful to effectively debug your networks. You'll frequently find your network doesn't converge or gives totally garbage results, so you need to know how to dig into the network internals and understand how everything works. This is especially true if you're implementing a new building block from a paper.

Finally, know your machine learning and statistics fundamentals. Understand overfitting, model capacity, cross validation, probability, model ensembles, information theory, and so on. Know when a simpler model is more appropriate.

[1] ftp://ftp.idsia.ch/pub/juergen/fki-207-95.ps.gz

[2] http://arxiv.org/abs/1409.1259

[3] http://arxiv.org/abs/1505.00387

[4] http://arxiv.org/abs/1502.03167

[5] http://arxiv.org/abs/1512.03385


So you don't think some of these details will be automated away in the near future, so that it doesn't require a specialist to operate a neural network?


Already, it's not nearly as hard as this demo makes it look. There's one recent advance in particular that isn't in this demo, and that is Batch Normalization.

If you've played around with it a bit, I'm sure you have seen that deeper layers are hard to train... You see the dashed lines representing signal in the network become weaker and weaker as the network gets deeper. BatchNorm works wonders with this. It takes statistics from the minibatch of training examples, and tries to normalize it so that the next layer gets input more similar to what it expects, even if the previous layer has changed. In practice you get a much better signal, so the network can learn a lot more efficiently.

Without BatchNorm, more than two hidden layers is tedious and error-prone to train. With it, you can train 10-12 layers easily. (With another recent advance, residual nets, you can train hundreds!)

Such advances push the limit of what you can train easily versus what still requires GSD ("graduate student descent": figuring out just the right parameters to get something to work through intuition, trial and error). You still have to watch out for overfitting, but the nice thing about that is that more training data helps.
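
A rough sketch of what batch normalization does to a minibatch of activations (gamma and beta are learned scale/shift parameters; this ignores the running averages used at test time):

  import numpy as np

  # Normalize each unit over the minibatch, then scale and shift.
  def batch_norm(h, gamma, beta, eps=1e-5):
      mean = h.mean(axis=0)
      var = h.var(axis=0)
      h_hat = (h - mean) / np.sqrt(var + eps)
      return gamma * h_hat + beta  # the network can learn to undo the normalization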


I think most of those things will remain important:

+ Designing the network architecture is a means to instill your knowledge of the problem into the network. For example, using convolutions over images encodes some translational invariance into the network. It makes up for lack of data. I don't think data augmentation alone is enough, either: if you use a "stupid" architecture with heaps of data, the computation will become too expensive or slow.

- The systems engineering part will probably get automated. I bet there are Amazon engineers crying at their desks while working on AWS Elastic Tensorshift right now. So unless you're specifically interested in that side of things, maybe this isn't the best area to focus on.

+ There are always going to be problems, so knowing how to debug is a useful skill.

+ ML/stats fundamentals aren't going away. You need to know what you're trying to do before you can do it.


Beautiful. The next time someone asks what machine learning is about, I'm going to send them a link to this page.


"Don’t Worry, You Can’t Break It. We Promise."

(Nice, but it's completely unclear what's going on.)


This froze my mouse and keyboard on my MacBook Pro running El Capitan. I had to do a hard reset.


Is a 50/50 training:test split a normal default ratio for an ANN? I expected to see a higher proportion of training data as the initial setting.


One of the finest data visualizations I've seen.


This is so great. An easy way to show my friends WTF I do sometimes for math/CS work. Thank you so much.


I wish they had more interesting data sets.


Submit a pull request.



