
Feature Learning.

Deep Neural Networks can learn features from essentially raw data. Usual machine learning starts with features engineered manually.

DNNs also learn to predict from the features they learn, so you could say (very roughly) "DNN = usual machine learning + feature learning".

In practice manually engineering features is a time-consuming "guess-and-check" process which benefits from domain expertise. Feature Learning, otoh, is more automatic and benefits from data, computing resources, and optimization algorithms.
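A minimal sketch of the contrast, assuming scikit-learn and its small digits dataset (the hand-crafted features here are just an illustrative placeholder, not anything anyone would ship):

    # Illustrative only: hand-engineered features vs. letting a network learn
    # its own representation, on the scikit-learn digits dataset.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)          # 8x8 pixel images, flattened
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # "Usual ML": engineer features by hand (here, crude row/column ink sums),
    # then fit a simple classifier on top of them.
    def hand_features(X):
        imgs = X.reshape(-1, 8, 8)
        return np.hstack([imgs.sum(axis=1), imgs.sum(axis=2)])  # 16 features

    clf = LogisticRegression(max_iter=1000).fit(hand_features(X_tr), y_tr)
    print("hand-engineered features:", clf.score(hand_features(X_te), y_te))

    # "Feature learning": feed raw pixels to a small neural net and let the
    # hidden layer build its own intermediate representation.
    net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
    net.fit(X_tr, y_tr)
    print("learned features from raw pixels:", net.score(X_te, y_te))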




No. In all supervised learning, the algorithm learns a model from the data and generalises it to unseen data. Better generalisation means a better model. In unsupervised learning (mostly clustering, density estimation, etc.) the same thing happens, but we don't tell the algorithm what to learn.

Deep learning is not machine learning plus something else. It is a collection of techniques that overcomes the scalability problem of feed-forward neural networks. NNs are very difficult to scale over the number of layers. The standard training method, backpropagation, can't handle many layers because of vanishing gradients and the computational infeasibility brought on by the explosive growth of connections.

NNs are also very difficult to scale with additional classification targets (for example, you have a classifier for categorising 10 classes, but scaling it up to 20 requires a lot of topological changes and qualitative analysis).

Deep learning addresses the scaling over layers with various techniques coupled with hardware acceleration (GPUs). Currently this stands at about 150 layers.
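A rough numerical illustration of the vanishing-gradient point, assuming sigmoid activations (the numbers are made up, not from any particular network): the backprop signal through a stack of sigmoid layers is roughly a product of per-layer derivative factors, each at most 0.25, so it shrinks geometrically with depth.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid_grad(x):
        s = 1.0 / (1.0 + np.exp(-x))
        return s * (1.0 - s)          # maximum value is 0.25, at x = 0

    for depth in (3, 10, 50, 150):
        # product of per-layer derivative factors at random pre-activations
        factors = sigmoid_grad(rng.normal(size=depth))
        print(depth, "layers -> gradient factor ~", np.prod(factors))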


Even experts often use the "feature learning" analogy. I don't think it's wrong, or at least not a bad way of explaining it.

The difference between (deep) neural networks and shallow machine learning is that NNs can learn arbitrary features. Yes, clustering doesn't require feature learning. But it is also super limited in the kinds of features it can learn. Neural nets can learn arbitrary circuits, and other types of functions.


Gaussian Processes can learn any arbitrary function. Is it "shallow" machine learning?

I think the point that the parent makes is valid: most of the advantages of deep learning when using a "simple" feed-forward topology are advances related to scaling learning and solving problems encountered with difficult tasks like image recognition, etc.

I do not know enough about neural nets to say if that is all there is to it, but one thing is sure: it's not just about "learning features", although it was shown that the output at every layer abstracts some sort of higher-level features (in the case of image recognition).


>> it's not just about "learning features"

So, I'm in no position to prove this, but my intuition is that any machine learning algorithm can be configured in a semi-supervised learning set-up, like deep nets have. You could train a decision forest classifier for instance to learn in an unsupervised manner. An algorithm I'm developing for my MSc dissertation is essentially unsupervised recursive partitioning, a.k.a. decision trees (only, first-order rather than propositional).

Well, possibly not _any_ algorithm. But I get the feeling that many classifiers in particular could be adapted to unsupervised learning with a bit of elbow grease, at which point you could connect them to their own input and, voila, semi-supervised learning.

But like I say, I don't reckon I'll be in a position to prove this any time soon.


GPs require exponentially many parameters though. They can't learn arbitrary functions, they just stupidly memorize a lookup table.


The way I know it goes along the lines of: "multilayer perceptrons with no more than three layers can learn any function to arbitrary precision given a large enough number of hidden units".

But like halflings say, neural nets are not alone in this. Decision Trees can learn any binary decision diagram I guess (they can encode arbitrary disjunctions of conjunctions). I'm pretty sure there are similar results for other algorithms also.
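For instance, here's a tiny sketch (scikit-learn, purely illustrative) of a decision tree exactly encoding a boolean function written as a disjunction of conjunctions:

    # A decision tree fit on the full truth table of f = (a & b) | (~a & c).
    # With enough depth a tree can represent any such function exactly.
    import itertools
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    X = np.array(list(itertools.product([0, 1], repeat=3)))   # all 8 inputs
    y = (X[:, 0] & X[:, 1]) | ((1 - X[:, 0]) & X[:, 2])       # target DNF

    tree = DecisionTreeClassifier().fit(X, y)
    assert (tree.predict(X) == y).all()   # exact fit on the truth table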

In any case, you can represent a function as a set-theoretical relation and enumerate its elements, and there you go: learning done to arbitrary precision. That's not what makes neural nets impressive. So what is it?

"Shallow machine learning" is a worrying neologism. "Shallow" and "deep" only apply to neural networks, really. You couldn't very well distinguish between shallow and deep K-NN classifiers, say. Or shallow and deep k-means clustering. I mean, what the hell?


Clustering also isn't a supervised learning technique. Even though you might say DNNs can be unsupervised (autoencoders), it generally is not the case in practical systems. So it's not a good comparison at all.


I don't care about the supervised/unsupervised distinction. I'm saying they can learn features automatically (with or without supervision.)


Well, feature learning is unsupervised by necessity, otherwise you're not learning features, you're learning a mapping between features and labels.

Deep nets used in the way you say are first trained unsupervised to extract features, then the features are used in supervised learning, to learn a mapping from those new features to labels.

You can also do this "by hand" using unsupervised learning techniques like clustering, Principal Component Analysis etc: you make your own features then, and train a classifier afterwards, on the features you extracted in that way.

Deep nets just sort of automate the process.
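A minimal "by hand" version of that pipeline, assuming scikit-learn (the dataset and component count are just placeholders):

    # Unsupervised feature extraction (PCA) first, then a supervised
    # classifier on top of the extracted features.
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)

    pipeline = make_pipeline(
        PCA(n_components=20),               # unsupervised: features from X only
        LogisticRegression(max_iter=1000),  # supervised: features -> labels
    )
    print(cross_val_score(pipeline, X, y, cv=5).mean())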


Holy cr*p on a cracker!

150 layers? It boggles the mind.

How do you even start propagating over 150 layers? Do you assign specific functions / targets to some of the inner layers?


Deep Residual Learning for Image Recognition https://arxiv.org/abs/1512.03385

Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

And some good answers here: https://www.quora.com/How-does-deep-residual-learning-work
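A minimal sketch of the residual idea from that paper, in PyTorch (simplified: fixed channel count, no downsampling, depth chosen arbitrarily). The block learns a residual F(x) and outputs x + F(x), so the identity path gives gradients a short route back through very deep stacks.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            self.f = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(x + self.f(x))   # identity shortcut + residual

    x = torch.randn(1, 64, 32, 32)
    deep_stack = nn.Sequential(*[ResidualBlock(64) for _ in range(50)])
    print(deep_stack(x).shape)   # torch.Size([1, 64, 32, 32])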


Also, very deep NN without residuals: https://arxiv.org/abs/1605.07648


Highway layers http://arxiv.org/abs/1505.00387 help with this propagation.
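Roughly, a highway layer computes y = T(x) * H(x) + (1 - T(x)) * x, where T is a learned "transform gate". A minimal PyTorch sketch (dimensions are arbitrary):

    import torch
    import torch.nn as nn

    class HighwayLayer(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.h = nn.Linear(dim, dim)   # candidate transformation H(x)
            self.t = nn.Linear(dim, dim)   # transform gate T(x)

        def forward(self, x):
            h = torch.relu(self.h(x))
            t = torch.sigmoid(self.t(x))
            return t * h + (1.0 - t) * x   # gated mix of transform and carry

    layer = HighwayLayer(16)
    print(layer(torch.randn(4, 16)).shape)   # torch.Size([4, 16])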


I agree (except for the first word), however I read the question with emphasis on "usual", as in, "What makes DNNs special?"

There's pure performance (ex., in a Kaggle competition [http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it...] or on a standard data set [http://yann.lecun.com/exdb/mnist/], [http://blogs.microsoft.com/next/2015/12/10/microsoft-researc...] ), but that's what makes any ML method better than another.

I think the deeper awesomeness is that DNNs are so good at Feature Learning from raw data. On vision, NLP, and speech problems [nice overview by Andrew Ng: https://m.youtube.com/watch?v=W15K9PegQt0], DNNs have achieved superior performance to the combination of expertly engineered features + some usual ML algorithm.

Where a "usual ML" pipeline might look like (1) engineer features through manual effort by studying raw data and the problem domain, (2) apply ML to those features, a new DNN pipeline might look like (1) Apply DNN to raw data.

First off, removing the feature engineering step could be a huge savings in human time spent. Second, there's the potential to get a better answer (!) when you're done.

But more than that, the DNN pipeline holds the promise of more regular, systematic improvement. We (as engineers) don't have to wait for a bright idea about how to construct a feature from the data. Instead, we can focus on (1) collecting more and better data, (2) improving the optimization algorithms, and (3) acquiring more computing resources.

These latter tasks, I suspect, are easier to define and evaluate than the task "discover a new feature".


You might not call it feature engineering, but let's face it - most DNN models vary dramatically in structure based on the problem at hand.


Yep. Have a look at DNNs for image recognition, or LSTM RNNs. They're the results of some furious architectural work by researchers and not at all simple to come up with (though they may be simple enough to understand now that someone's created them).


> Deep Neural Networks can learn features from essentially raw data. Usual machine learning starts with features engineered manually.

What does this mean? What is "raw data" and what is a "feature engineered manually"?


Tens of thousands of engineers (audio, vision, linguists etc.) spent millions of hours for billions of dollars in the past 30 years to invent algorithms that reliably tell us something about a bunch of data. For example, a corner feature algorithm (such as SIFT) can extract the locations of corners in an image and characterize them. This is essential to many kinds of information processing tasks because we want to apply the same algorithm to different data (generalization), so we kind of need an interface to the data. This interface is called a feature (or feature algorithm, feature extractor or feature descriptor).

All of this work (some of these papers have on the order of ten thousand citations) is now obsolete because you can start with a random initialization of the weights of a neural network and iteratively improve the weights using backprop for any kind of task. All you need is a measure of improvement that is relatively smooth and differentiable with respect to the network weights. What is surprising is that the circuits and programs within reach of backprop training of fully connected neural networks are actually astonishingly good at what they do. But ultimately, this is maybe not so surprising given that our brains do something similar all the time.
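To make the "interface to the data" point concrete, a tiny sketch using OpenCV's SIFT implementation (this assumes an OpenCV build that ships SIFT; the image path is just a placeholder):

    import cv2

    img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path

    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)

    # descriptors is an (n_keypoints, 128) array: the fixed, human-designed
    # representation a downstream classifier would consume instead of pixels.
    print(len(keypoints), None if descriptors is None else descriptors.shape)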


>All of this work (some of these papers have on the order of ten thousands of citations) is now obsolete because you can start with a random initialization of the weights of a neural network and iteratively improve the weights using backprop for any kind of task

Hardly correct. You can't magically learn any kind of task. You can't add an arbitrary number of layers and hope for backprop to do its magic. It is difficult. Deep learning techniques are what make it somewhat feasible.

SIFT is not obsolete because of NNs. They all have their pros and cons. You have to select the right tool for the job. BTW, SIFT is not an edge detector (that's the Canny detector). It describes images using salient features in a scale-invariant manner.


Typos fixed. "All" was hyperbole, of course, but I think it definitely does not look good for the majority of the work done on features. SIFT was recently outperformed by PN-Net, for example.


SIFT is also quite old. It's amazing a single technique has retained so much value. Isn't it curious that modern convnets use convolution? On top of that, they do convolutions at multiple scales (pooling). Starting to sound very familiar...


Actually the neural net approaches are older than SIFT.

Neural nets learn the distribution and even causal factors in the data. To me it seems that this distribution is often just too complex to be robustly captured by something that doesn't learn. Learning causal factors critically depends on learning along the depth of a network of latent variables, which is a particularly opaque process, but this is what MLPs seem to do quite canonically (a convnet being just a restricted special case of an MLP). I mean, discerning causal factors is pretty much canonically the act of accumulating evidence with priors (weighted summation), then deciding whether it is sufficient evidence and signaling how much it is (non-linearity).
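That "accumulate weighted evidence, then signal" picture is literally what a single unit computes; a two-line numpy sketch (the numbers are arbitrary):

    import numpy as np

    def unit(x, w, b):
        # weighted sum of inputs plus a bias (the prior), squashed by a
        # non-linearity that signals how much evidence there is
        return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))   # sigmoid(w.x + b)

    print(unit(np.array([1.0, 0.2, -0.5]), np.array([2.0, 1.0, 3.0]), -1.0))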


Some of the approaches are, some aren't. SIFT itself builds upon knowledge that is much older than it. Either way, it doesn't matter. The OP was arguing that the many years of effort put into SIFT were a complete waste. I am saying that this is very shortsighted, as non-machine-learning vision techniques have heavily influenced how we approach and think about vision problems even when using ML.


>> SIFT is not obsolete because of NNs. They all have their pros and cons.

Case in point: DNNs for image recognition use Sobel edge detectors and other "obsolete" filters to do their magickal magic.
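For what it's worth, a hand-crafted Sobel kernel is just a fixed 3x3 convolution, the same shape of filter a convnet's first layer learns for itself; a small illustrative sketch (scipy, random placeholder image):

    import numpy as np
    from scipy.signal import convolve2d

    sobel_x = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]], dtype=float)   # hand-designed kernel

    image = np.random.rand(32, 32)                   # placeholder image
    edges = convolve2d(image, sobel_x, mode="same")  # horizontal-gradient map
    print(edges.shape)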


>> we kind of need an interface to the data. This interface is called a feature (or feature algorithm, feature extractor or feature-descriptor).

Excellent in-a-nutshell explanation of features and thank you for a definition I really hadn't thought of.

This though:

>> this is maybe not so surprising given that our brains do something similar all the time.

Is just so much fantasy, sorry to say. Neural nets (and machine learning in general) learn in ways that are completely unlike humans. They need huge, dense datasets; we can make do with scraps of sparse data. They need huge amounts of computational power, and time; we learn in the blink of an eye. They learn one thing at a time and can't generalise knowledge to even neighbouring domains; we can, oh yes indeed. An infant that can recognise images at the level of AlexNet can at the same time tie its own shoelaces, speak rudimentary language, protect itself from danger, etc. AlexNet can only map images to labels. It does that very well, but it's a one-trick pony, and so are all machine learning algorithms: fearsomely effective but heart-breakingly limited. Human minds are generalisation machines of the highest order and we are nowhere near figuring out how they (we) do it.

Think of it this way: it took a few dozen researchers a few decades to come up with backprop. It took evolution billions of years to come up with a human mind. Which one do you think is the more optimised, and how much hubris does it take to convince oneself that they are pretty much the same in capabilities?


One example is this recent paper on learning (somewhat) high-level attributes of text from character streams alone (i.e., without telling the convolutional networks that things like words and punctuation exist).

https://arxiv.org/abs/1502.01710


>> We show that temporal ConvNets can achieve astonishing performance

Yay, astonishing performance! I'm totally gonna waste half an hour of my life to read about what awesome badassery convnets are! Because that sounds so objective!

/snark


The Neural Network Playground is great for understanding this[0]!

The default example is classification of a circle of one class surrounded by a donut of another. There are two features x_1 and x_2 (this is the "raw data").

One solution to this problem is to use a single layer and a single neuron but engineer features manually. These manually engineered features are x_1*x_2, x_1^2, x_2^2, sin(x_1) and sin(x_2). Here's a link to this model (long url)[1].

This model performs very well at learning to classify the data just by combining these manual features with a single neuron. The problem is a human needs to figure out these features. Try removing some and observe the different performance given different manual features. You'll see how important it is to engineer the correct ones.

Alternatively you can have 2 layers of 4 neurons [2]. In nearly the same number of iterations this network also learns to classify the data correctly. This is because the non-linear interactions between neurons are actually transforming the inputs in the appropriate ways. That is to say, the network is learning to engineer the features itself. Try removing layers/nodes and you'll find that a simpler network will have a harder and harder time at this.

I recommend playing around with the various tradeoffs between manually engineered features and network complexity. The interesting thing you will observe is that in some cases the manual features let a much simpler model learn much faster than the network does. The big issue comes up when we can't simply "see" the problem in 2D, so we have no idea which features may or may not be useful. (A rough scikit-learn analogue of this comparison follows the links below.)

[0] http://playground.tensorflow.org/

[1] http://playground.tensorflow.org/#activation=tanh&batchSize=...

[2]. http://playground.tensorflow.org/#activation=tanh&batchSize=...
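A rough scikit-learn analogue of the playground comparison above (the settings are illustrative, and the small MLP may need more iterations or a different seed to converge): the circle-in-a-ring data, solved either with hand-built features and a linear model, or with a small 2-hidden-layer net on the raw (x_1, x_2) inputs.

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    X, y = make_circles(n_samples=1000, noise=0.1, factor=0.4, random_state=0)

    # Manual feature engineering: a single linear unit on x_1^2, x_2^2
    # separates the classes, because the true boundary is a circle.
    manual = np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2])
    print("manual features:",
          LogisticRegression().fit(manual, y).score(manual, y))

    # Feature learning: the same problem from raw coordinates, with two small
    # hidden layers doing the non-linear transformation themselves.
    net = MLPClassifier(hidden_layer_sizes=(4, 4), activation="tanh",
                        max_iter=5000, random_state=0)
    print("learned features:", net.fit(X, y).score(X, y))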


Some of the prominent achievements in deep learning, like AlphaGo, used manually specified features.


Have you got a reference?


How about the AlphaGo paper itself:

https://www.google.com/url?sa=t&source=web&rct=j&url=https:/...

Bottom of page 23 and appendix tables around page 31 and 32.

Edit: The stated big "next attempts" for them will be to learn these features via an algorithm rather than hand-coding them, and to learn based on self-play rather than a database of master games.


There are lots of other unsupervised learning methods in machine learning.


[flagged]


I think the article does a good job of explaining the major points. But what you have described here can also be said about MLPs; there's nothing deep about them on their own. For example, an MLP learning the XOR function combines input features to come up with more complex features.
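To make the XOR example concrete, a tiny numpy sketch with hand-set weights, showing the intermediate "features" (OR and NAND) the hidden layer has to build before the output unit can combine them into XOR; a trained MLP has to discover equivalents of these on its own:

    import numpy as np

    def step(z):
        return (z > 0).astype(int)

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

    W1 = np.array([[1.0, 1.0],     # hidden unit 1: OR(a, b)
                   [-1.0, -1.0]])  # hidden unit 2: NAND(a, b)
    b1 = np.array([-0.5, 1.5])

    W2 = np.array([1.0, 1.0])      # output: AND of the two hidden features
    b2 = -1.5

    hidden = step(X @ W1.T + b1)
    print(step(hidden @ W2 + b2))  # [0 1 1 0] == XOR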


A ray of light. MLPs can approximate any nonlinear function in the domain they have been trained on. What is it about the depth that makes DNNs more tractable to train than shallow networks? Is it that the particular tricks that have been developed for DNNs haven't been generalized to work at arbitrary depths? Is it that it is easier for humans to design the abstractions that are used when they are layered? Are you aware of any theoretical work in this direction?


>MLPs can approximate any nonlinear function..

Theoretically, yes. But the drama is when you have to actually do it. DNNs are not more tractable on their own; they are made feasible by the current set of techniques.

>Is it that it is easier for humans to design the abstractions..

You could argue that activation maps generated in convolutional layers by the filters are feature engineering, as those filters are manually created. These are problem dependent, and we know more about the problem than the algos. That's why feature engineering hasn't gone away completely.


Oops, accidentally flagged GP. Sorry for that, hope the unflag works :/.



