This is a very interesting perspective on neural networks; is it novel? I've never seen a geometric interpretation of NNs.
In particular this:
> (Apparently determining if knots are trivial is NP. This doesn’t bode well for neural networks.)
Is there any theoretical research on obstructions to NN learning? Not that it would change the practice much (for instance, maximum-likelihood learning of Gaussian mixtures is NP-hard, but everybody does it anyway), but it could shed some light on the geometry of the "hard" instances.
EDIT: for example, to get a sense of how ill-conditioned deep NN learning is, a recent paper [1] shows that if we feed an object-classifying NN an image with crafted but imperceptible perturbations, the predictions change completely.
[1] http://arxiv.org/abs/1312.6199
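To illustrate the kind of effect [1] describes, here is a toy sketch. This is not the optimization procedure from the paper, just a gradient-sign heuristic on a made-up linear model, but it shows why a tiny per-pixel change can move the output a lot:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a classifier: logistic regression on 1000 "pixels".
# (Real adversarial examples are found on deep nets; this only shows the mechanism.)
w = rng.normal(size=1000)
b = 0.0

def score(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))   # probability of class 1

x = rng.normal(size=1000)

# Nudge every coordinate by at most eps, in the direction that lowers the logit.
# For a linear model the input gradient is just w, so the logit shifts by
# eps * sum(|w|), which grows with the input dimension, while no single
# "pixel" changes by more than eps.
eps = 0.05
x_adv = x - eps * np.sign(w)

print("original score:      ", score(x))
print("perturbed score:     ", score(x_adv))
print("largest pixel change:", np.max(np.abs(x_adv - x)))
```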
There is a large body of work from the mid-1980s and early 1990s that addresses questions of hardness, sensitivity, and robustness, coming out of various statistical physics/computer science collaborations. Of course, it is still ongoing, but that was a big period. The basic thrust of this program is to take methods from statistical mechanics and use them to calculate not just worst-case complexity, but also average case, best case, and everything in between. It is hard to think of a reference to end all references, but as far as books go you might want to check out: http://www.amazon.com/The-Nature-Computation-Cristopher-Moor.... For an article that briefly addresses some results related to k-SAT there is: http://www.cs.cornell.edu/selman/papers/pdf/99.nature.phase.....
I have seen Riemannian geometry applied to neural networks. Basically the state space is curved, so you can speed up learning if you take the curvature into account when sampling in the space of possible neural nets (you want to explore the hypothesis space evenly).
Information Geometry for Neural Networks, Daniel Wagenaar, 1998.
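To make "taking the curvature into account" concrete, here is a minimal natural-gradient sketch for a toy model where the Fisher information matrix is known in closed form. This is the generic information-geometry update, not Wagenaar's specific construction, and all numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=3.0, size=1000)

mu, log_sigma = 0.0, 0.0          # parameters: mean and log std-dev
lr = 0.1

for _ in range(200):
    sigma = np.exp(log_sigma)
    z = (data - mu) / sigma
    # Gradient of the average log-likelihood w.r.t. (mu, log_sigma).
    grad = np.array([np.mean(z) / sigma, np.mean(z**2) - 1.0])
    # Fisher information for (mu, log_sigma) is diag(1/sigma^2, 2):
    # this is the "curvature" of the statistical manifold at the current point.
    fisher = np.diag([1.0 / sigma**2, 2.0])
    # Natural gradient: precondition the gradient by the inverse Fisher matrix.
    mu, log_sigma = np.array([mu, log_sigma]) + lr * np.linalg.solve(fisher, grad)

print(mu, np.exp(log_sigma))      # should end up close to 2.0 and 3.0
```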
I think the author has confused NP with NP-hard (or NP-complete). Indeed, it is extremely unlikely that the unknottedness problem is NP-complete, as it is already in NP ∩ co-NP (assuming the generalized Riemann hypothesis). Furthermore, we know that the problem of computing many knot invariants is "only" BQP-complete, so it's possible that unknottedness is in BQP (i.e., efficiently solvable by quantum computers), though I don't know the expert opinion on this.
Indeed, I read that as NP-complete. That's very interesting; do you have any references for it?
Even if knots are ruled out, there might be hard topological decision problems that can be reduced to linear separability under homeomorphism, maybe in higher dimension.
I think machine learning boils down to 2 fundamental issues:
1) Having a model class that is powerful enough to represent the data.
and
2) Being able to find a "good enough" optimal solution over that model class.
In principle, over-fitting can be addressed through the choice of the particular optimization criterion in (2).
Neural networks are used because they are powerful enough to represent many data sets of interest and because there exist good algorithms for finding LOCAL optima.
Unfortunately, they are complex, non-linear, non-convex functions, and hence finding a global optimum is most likely NP-hard.
We are left then with the heuristic of hoping that the local optima we are able to find are "close enough" to the global optimum to meet our needs.
This seems to work reasonably well for certain types of problems, like image classification, where the underlying data possess a certain simplicity or "smoothness", but less well for other problems, like logic problems, where good and bad solutions may not be "nearby" in any sense that the local optimization algorithms are able to discover.
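To make the local-vs-global point concrete, here is a toy sketch: even in one dimension, gradient descent on a non-convex function lands in different local minima depending on where it starts, and nothing tells you how far the one you found is from the global optimum. The loss here is made up purely for intuition:

```python
import numpy as np

def loss(x):
    # A simple non-convex function with several local minima.
    return np.sin(3 * x) + 0.1 * x**2

def grad(x):
    return 3 * np.cos(3 * x) + 0.2 * x

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Different starting points converge to different minima with different losses.
for x0 in (-3.0, 0.0, 3.0):
    xf = descend(x0)
    print(f"start {x0:+.1f} -> minimum at {xf:+.3f}, loss {loss(xf):.3f}")
```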
Bloody brilliant. I have longed for an ML methodology that breaks out of the float-vector representation, because I don't think it describes naturally occurring problems that well; most ontologies are semi-structured. However, this really gets to the nub of the problem: real, but solvable, problems are manifold complexes. I don't think anyone has really built a learner for complexes, though; persistent homology is a stab in the right direction, but it's not there yet IMHO.
I don't like his proof that a system with one or two hidden units can't handle the red circle inside the blue torus. It feels too handwavy.
My proof: suppose the matrix W is singular. Then there exists a nonzero vector v for which Wv = 0. Take a point x on the red circle, then find a different point y = x + av (for some scalar a) in the blue torus. Then Wy + b = W(x + av) + b = Wx + aWv + b = Wx + a·0 + b = Wx + b, so the layer maps the two points to the same place and nothing downstream can separate them. Fin.
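A quick numerical check of that argument, with toy numbers just to watch the collapse happen:

```python
import numpy as np

# A singular 2x2 weight matrix: its rows are linearly dependent.
W = np.array([[1.0, 2.0],
              [2.0, 4.0]])
b = np.array([0.5, -0.5])

# v spans the null space of W, so W @ v = 0.
v = np.array([2.0, -1.0])
print(W @ v)                      # [0. 0.]

x = np.array([1.0, 0.0])          # stand-in for a point of one class
a = 3.0
y = x + a * v                     # a different point along the null direction

# Both points map to exactly the same pre-activation, so no later layer
# (and no monotone activation) can tell them apart.
print(W @ x + b, W @ y + b)
```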
Wouldn't this only apply to neural networks as used for classification? I mean the general paradigm of deforming curves until they're separated by a hyperplane seems pretty obvious now that I see it in front of me, but what about neural networks used to approximate continuous functions?
I'll take a stab at this (I'm a decade out from my last machine learning class, so no guarantees on correctness): the only reason it's fitting a hyperplane is that one class is being mapped to the continuous value -1.0 and the other class to the continuous value 1.0, with a thresholding step at the end (the hyperplane perpendicular to the line onto which the continuous values are projected) to determine the class. If you're doing regression instead of classification, your training data will be fed in with output values other than just 1.0 and -1.0, and you'll omit the thresholding at the end, but otherwise the behavior and intuition should be the same as in the article.
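A rough sketch of that relationship, using a hand-rolled tanh net with made-up weights, just to show that the only difference is the final thresholding step:

```python
import numpy as np

def hidden(x, W, b):
    return np.tanh(x @ W + b)                 # the "deforming" layers from the article

def net(x, W, b, w_out, b_out):
    return hidden(x, W, b) @ w_out + b_out    # a single continuous output

# Classification: targets are -1.0 / +1.0, and the output is thresholded.
def classify(x, params):
    return np.sign(net(x, *params))

# Regression: targets are arbitrary reals, and the raw output is kept.
def regress(x, params):
    return net(x, *params)

rng = np.random.default_rng(0)
params = (rng.normal(size=(2, 3)), rng.normal(size=3),
          rng.normal(size=3), 0.0)
x = rng.normal(size=(5, 2))
print(classify(x, params))        # values in {-1, 0, +1}
print(regress(x, params))         # arbitrary real values
```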
The article is exceptionally clear and well presented. We need more of this level of writing about machine learning.
For me, the most important point is the one the article only touches on at the end. If you have good features and a good representation, a linear model may be enough. Real-world data and tasks usually don't require knot untying. In my opinion, the model (e.g., neural net) should be kept simple, and the challenge is in finding the right features and representation.
Coates & Ng (2012) wrote:
> Many algorithms are available to learn deep hierarchies of features from unlabeled data, especially images. In many cases, these algorithms involve multi-layered networks of features (e.g., neural networks) that are sometimes tricky to train and tune and are difficult to scale up to many machines effectively. Recently, it has been found that K-means clustering can be used as a fast alternative training method. The main advantage of this approach is that it is very fast and easily implemented at large scale. On the other hand, employing this method in practice is not completely trivial: K-means has several limitations, and care must be taken to combine the right ingredients to get the system to work well.
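For anyone curious what that recipe looks like, here is a rough sketch of the patch-based K-means pipeline. This is my own paraphrase with made-up sizes, not the paper's code; the paper adds whitening and other ingredients that matter a lot in practice:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.image import extract_patches_2d

PATCH = (6, 6)

def normalize(patches):
    # Per-patch brightness/contrast normalization.
    patches = patches - patches.mean(axis=1, keepdims=True)
    return patches / (patches.std(axis=1, keepdims=True) + 1e-8)

def learn_dictionary(images, n_centroids=100):
    # Sample small patches from the images; the K-means centroids become the "features".
    patches = np.vstack([
        extract_patches_2d(img, PATCH, max_patches=200, random_state=0)
          .reshape(-1, PATCH[0] * PATCH[1])
        for img in images
    ])
    return KMeans(n_clusters=n_centroids, n_init=10, random_state=0).fit(normalize(patches))

def encode(img, km):
    # Encode an image as pooled soft assignments of its patches to the centroids.
    patches = normalize(extract_patches_2d(img, PATCH).reshape(-1, PATCH[0] * PATCH[1]))
    d = km.transform(patches)                                  # distances to each centroid
    f = np.maximum(0.0, d.mean(axis=1, keepdims=True) - d)     # "triangle" activation
    return f.mean(axis=0)                                      # pool over the whole image

# `images` would be a list of 2-D grayscale arrays, e.g. 32x32 crops;
# the encoded vectors then go to an ordinary linear classifier.
```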
This is very interesting. I wonder if anyone knows what can be gained, in terms of separating out different topologies, by allowing a restricted family of conformal transformations on top of the tanh functions, of the sort that allows turning the insides out and so forth. Comments?
To get the effect I was referring to, yes, it is too strict a class. I wasn't thinking very hard about it, but my idea was that conformal transformations have nice compositional properties, which could be exploited in developing the method and its analysis.
A bit of an aside: are there any methods for compressing/collapsing/simplifying a neural network?
What I mean is: imagine you've built and trained a neural network; as per the article, it is hard to ascertain exactly what it is doing. I was wondering whether there is work in this area, and it occurred to me that a possible first step would be to collapse the neural network into a simpler but functionally equivalent structure.
I imagine this is far more difficult than it sounds, but I just wondered.
The Universal Approximation Theorem[1] asserts that you only ever need one hidden layer, which at least means that "an (approximate) simplification exists".
But I can't say off the top of my head how you'd collapse an ANN with just two hidden layers into one. It's not obvious how sigmoid functions compose, but I suppose I should give it more thought...
Although one hidden layer suffices theoretically, most of the cutting-edge work involves multilayer networks. Perhaps paradoxically, those become more "auditable" from what I've heard. (I've spoken to people with real-world experience of applying ANNs to unsolved problems, but I am not one.) What I've heard is that, with one layer, you get too much overloading (i.e. different stuff ending up at the same hidden node) to understand what it is doing, but that deeper networks are more legible (with appropriate visualization tools).
UAT says that a solution exists, but it doesn't put a limit on the number of nodes required, so it would have you doing an optimization in a space that is not just large, but of arbitrary finite dimension. It can be pretty nonconstructive (in the sense of proving "there exists" without showing how to find something) insofar as it's often non-trivial to get convergence to a working solution in reasonable time.
As for how sigmoids compose, imagine how bell-shaped curves would compose, just as you can make a painting out of bell-shaped "points" if allowed arbitrary precision/steepness. Now, the difference of two sigmoids can be bell-shaped, e.g.
http://www.wolframalpha.com/input/?i=plot+y+%3D+1%2F%281%2Be... . I don't know how much this means in practice, but it establishes the possibility.
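The same construction in code, with arbitrary shifts and steepness, just to show the bump:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-6, 6, 13)
# Two shifted copies of the same sigmoid; their difference is a bump
# centered between the two shifts.
bump = sigmoid(4 * (x + 1)) - sigmoid(4 * (x - 1))
print(np.round(bump, 3))   # near 0 in the tails, near 1 around x = 0
```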
With respect to deep learning networks (as well as more traditional ones with just a weight matrix and bias), we can look to matrices for this. I will offer a compressed representation, not so much a way of pruning; I explain why below.
Each single-layer neural network is made up of three parameter arrays: a weight matrix (the connections), a visible bias, and a hidden bias.
In theory, this can be represented as a flattened array.
This is what I do in deeplearning4j[1] for optimization (note: I'm the author)
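As a plain numpy illustration of that flattened representation (not the dl4j code itself, just the idea): the weight matrix and both bias vectors get packed into one flat parameter vector, which is the form generic optimizers expect.

```python
import numpy as np

def flatten_params(W, b_visible, b_hidden):
    # Pack all parameters of a single-layer net into one 1-D array.
    return np.concatenate([W.ravel(), b_visible, b_hidden])

def unflatten_params(theta, n_visible, n_hidden):
    # Recover the three pieces from the flat array.
    k = n_visible * n_hidden
    W = theta[:k].reshape(n_visible, n_hidden)
    b_visible = theta[k:k + n_visible]
    b_hidden = theta[k + n_visible:]
    return W, b_visible, b_hidden

W = np.zeros((4, 3)); bv = np.zeros(4); bh = np.zeros(3)
theta = flatten_params(W, bv, bh)
assert theta.shape == (4 * 3 + 4 + 3,)
```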
The problem with pruning neural nets, so to speak, is that this isn't really a search problem we're solving, like A*. Both are graphs in a sense, but with backpropagation each neuron in a neural net gets updated to correct for the error it caused, rather than being pruned away as in alpha-beta pruning for game-playing AI.
I will offer one last thing: because of the way neural nets learn (especially if you stream data in via online/mini-batch learning rather than training on everything at once), each neuron tends to learn a different component of the overall solution, and some will activate more on certain feature vectors when you train on the full data set.
A solution to this is to set the number of neurons relative to the input size[2].
There are pruning methods which try to find and remove connections and neurons that don't have much effect on the network, essentially simplifying it.
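The simplest version of that idea is magnitude pruning: zero out the weakest connections and then check how much the network's behaviour changes. A generic sketch, not any particular paper's method:

```python
import numpy as np

def prune_by_magnitude(W, keep_fraction=0.5):
    # Zero out the smallest-magnitude weights, keeping only the strongest connections.
    threshold = np.quantile(np.abs(W), 1.0 - keep_fraction)
    mask = np.abs(W) >= threshold
    return W * mask, mask

W = np.random.default_rng(0).normal(size=(8, 8))
W_pruned, mask = prune_by_magnitude(W, keep_fraction=0.25)
print(mask.mean())   # roughly 0.25 of the connections survive
```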
There are also methods such as HyperNEAT which try to find simple structures that can be expanded into a much larger neural network, by sharing weights and having sections of the network that are repeated many times.
SVM is great; it's often stacked on top of neural networks to do classification.
But it can't really deal with extracting features, especially from things like images and sounds. If you fed images (pixel by pixel) straight to the SVM and told it to classify them, it would do extremely poorly, especially with big images.
A better approach is to have a deep (layered) network that takes pixel data as input and passes it through layers that extract useful features from it, for example learning to recognize lines.
Then, after this information has passed through the whole network, you end up with a smaller number of features that are more useful for classification. You can send this information to the SVM, and it will handle it much more easily than if you tried to feed it raw pixels.
The thing is, we don't really have problems with classification algorithms; SVM does extremely well. The biggest hurdle is simply extracting features that can then be fed to well-performing classification algorithms.
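A rough sketch of that pipeline: the "network" here is just random tanh layers standing in for a trained feature extractor, and the data is fake; the point is only the division of labour between the feature layers and the SVM.

```python
import numpy as np
from sklearn.svm import SVC

def deep_features(pixels, layers):
    # Stand-in for a trained network: push the input through each layer
    # and return the last hidden representation instead of a class score.
    h = pixels
    for W, b in layers:
        h = np.tanh(h @ W + b)
    return h

rng = np.random.default_rng(0)
# Pretend these weights came from a trained net (random here, scaled down
# so tanh doesn't saturate; they only fix the shapes for the example).
layers = [(rng.normal(size=(64, 32)) * 0.1, rng.normal(size=32) * 0.1),
          (rng.normal(size=(32, 16)) * 0.1, rng.normal(size=16) * 0.1)]

X_pixels = rng.normal(size=(200, 64))          # fake 8x8 "images"
y = (X_pixels.sum(axis=1) > 0).astype(int)     # fake labels

# The SVM never sees raw pixels, only the 16-dimensional learned features.
svm = SVC(kernel="rbf").fit(deep_features(X_pixels, layers), y)
print(svm.score(deep_features(X_pixels, layers), y))
```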
> One could learn the vector field at fixed points (just take some fixed points from the training set to use as anchors) and interpolate in some manner.
One class of interpolations with nice properties (smooth everywhere, invertible, constructible for any set of pointwise correspondences) is the diffeomorphisms.
Unfortunately, the mathematical objects are infinite-dimensional, which makes computation with them somewhat involved, even when they are derived from a low-dimensional set of point correspondences. See, e.g., http://www.cs.unc.edu/Research/MIDAG/pubs/papers/Joshi_TIP_2... for one method of constructing them.
Off topic: I wish there were a practical way to program neuromorphic chips, which are inherently analog. Does anybody have ideas or links regarding this field?