Why is "deep learning" so hyped nowadays? They seem like just another AI tool that will be prone to over-fitting datasets and provide an analysis that is difficult to mathematically characterize and understand in a reliable way.
Not here to be cynical/negative -- they might be of great value, this is not my expertise. Can someone explain why deep learning articles are receiving attention rather than, say, Support Vector Machines / kernel-based methods of pattern analysis? Or other nonlinear analysis? Are they related?
I think you are right in your sentiment that deep learning is becoming over-hyped. I've written before that it's mostly a method of brute forcing the problem of AI that happens to work very efficiently on modern GPUs and other parallel processing units. That being said, for the tasks it does well on (mostly facial and other types of simple image recognition), it does extremely well compared to everything else out there. (And it may very well end up as a part of a future architecture that better simulates general intelligence.)
However, I don't think it is reasonable to assume that AI tools can only be valuable if they can be rigorously mathematically characterized. The "holy grail" of intelligence -- that is, actual human intelligence -- certainly can't or at least hasn't been mathematically characterized, and I don't think anyone will ever offer any "proof" (mathematical or otherwise) that the biological brain is inherently wired to arrive at or at least tend to converge to correct solutions to intelligence problems. And of course no one will argue that human intelligence isn't valuable!
>The "holy grail" of intelligence -- that is, actual human intelligence...
When people talk about human-level AI, are they referring to specific beneficial subsets of human intelligence, like holding a conversation or interacting with the world around them, or does it literally include all of human cognition? In other words, would the "holy grail" include AI that acts extremely irrational and makes poor decisions based on temporary chemical imbalances in the body? Does the perfect human-level AI get depressed and commit suicide some of the time? Are we trying to replicate all of the parts, or only the good parts?
Good point. I would compare it with human speech. Human speech is a powerful means of communication, so it is very tempting to develop artificial systems that can communicate in the same way. But human language is far from optimal. In the same way, I believe that the human brain is powerful but far from optimal.
The strength of the human brain is its ability to adapt to a fast-changing context. The solution (function) it finds, however, is usually far from optimal.
I also think that the future of computing is the development of systems that can efficiently adapt their rules of action as the execution context evolves. The programmers of today will then become trainers.
If my machine learning algorithms went wrong as regularly as the term "human error" denotes, I wouldn't be able to use them to run much of anything. Identifying faces, for instance, is just a matter of statistical accuracy, of pattern-matching, but a more complex task like automated asset-trading or world domination requires a program that will make fewer mistakes than even many trained humans.
Why? Even in your hypothetical world domination scenario a program can make many mistakes but as long as it wins more territory than it loses it is still successful.
It is one thing to make a mistake because a choice based on erroneous or insufficient data has gone wrong (no matter how great the AI, this will always be an issue), and a very different thing to make a mistake even with complete and correct data, something that humans are prone to.
Deep learning has pushed up the state of the art exponentially in the last few years. On some benchmarks like speech recognition, deep learning methods advanced the state of the art by what would have taken decades at the previous rate of progress. This year's ImageNet winner achieved 5% error. Last year's was 11%, and the year before 15%. (Note that percentage error isn't linear; each percentage point is harder than the last.)
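To make the "percentage error isn't linear" remark concrete, here is a small sketch that looks at the relative reduction in error from one year to the next, using only the figures quoted above (they are taken from this comment as-is, not independently checked). The helper name `relative_reduction` is just for illustration.

```python
# Error rates as quoted in the comment above: two years ago, last year, this year.
errors = [0.15, 0.11, 0.05]

def relative_reduction(prev, curr):
    """Fraction of the previous year's errors that were eliminated."""
    return (prev - curr) / prev

reductions = [relative_reduction(p, c) for p, c in zip(errors, errors[1:])]

for (p, c), r in zip(zip(errors, errors[1:]), reductions):
    print(f"{p:.0%} -> {c:.0%}: {r:.1%} of the remaining errors removed")
```

Seen this way, the second step (11% to 5%) removes a larger share of the remaining mistakes than the first step (15% to 11%), even though the absolute drops look similar.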
Everything from Machine vision to natural language processing to speech recognition, is benefiting from this. We live in exciting times for AI, and everyone wants to get in on it.
Simple machine learning algorithms are interesting and even superior on simpler problems. But they don't scale up to complicated AI tasks like this. They are just doing simple correlations between variables, whereas NNs can theoretically learn any arbitrary function (theoretically Turing complete). Usually this does make them overfit, but on more complicated problems underfitting is the bigger problem.
> Everything from Machine vision to natural language processing to speech recognition, is benefiting from this. We live in exciting times for AI, and everyone wants to get in on it.
Not "everything" is benefiting from multilayer CNNs; in fact their use is limited to a tiny fraction of machine learning problems. The majority of data problems in the industry involve small datasets with a limited number of dimensions; picture classification and speech recognition are outliers in their scale, although they are extremely important problems.
Also, the fact is that multilayer CNNs are not a recent development, even though the catchy term "deep learning" is. The recent development that has brought us significant performance improvements in image classification and speech recognition is mainly to put CNNs on GPUs, allowing us to scale the networks considerably. So you could say that what is going on here is really hardware progress, not any kind of theoretical breakthrough (there is surprisingly little theory behind multilayer CNNs).
Finally, improvements over the state of the art on a number of large-scale ML problems are not the reason there is so much hype about "deep learning" (which is really the same kind of shallow learning that we've been doing for a while), both in the mainstream media and in the tech community. We have entered a new AI summer, where algorithms that have been around for decades are being hailed as being the "real deal" that works just like the brain (nope). A company like Vicarious has raised 50M not long ago on a promise to create "human-level vision" by 2015 and "strong AI" by 2018. Somehow I don't think investors will see much of a return on that money.
This is all nice and good, but after summer comes winter. The hotter the summer the darker the winter. It has happened before, and it has hurt AI progress very much.
>The majority of data problems in the industry involve small datasets with a limited number of dimensions
This really depends on your industry. And you're forgetting many of the improvements deep learning has brought to natural language processing.
>A company like Vicarious has raised 50M not long ago on a promise to create "human-level vision" by 2015 and "strong AI" by 2018. Somehow I don't think investors will see much of a return on that money.
What evidence do you have that Vicarious is even using deep learning?
To be honest, I have never understood all the hoopla concerning the universal approximation theorem. Least squares in your favorite orthonormal basis for L2[0,1] will do the same thing and it can be easier to match the correct basis with your problem domain. If there was some result showing that NNs converge at a faster rate (as a function of the number of parameters being fitted) that would be nice, but I am not aware of any. Also, universal approximation has nothing to do with Turing completeness.
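A minimal sketch of the least-squares point, under my own assumptions: the basis is the cosine basis 1, sqrt(2)*cos(k*pi*x), which is orthonormal on L2[0,1]; the target is exp(x), picked arbitrarily; and the integrals are approximated with a midpoint rule. With an orthonormal basis, the least-squares coefficients are just the inner products of the target with each basis function, so no linear solve is needed. The function names are hypothetical.

```python
import math

def cosine_basis(k, x):
    """k-th element of an orthonormal cosine basis for L2[0,1]."""
    return 1.0 if k == 0 else math.sqrt(2.0) * math.cos(k * math.pi * x)

def midpoints(n_grid):
    """Midpoint-rule sample points on [0, 1]."""
    return [(i + 0.5) / n_grid for i in range(n_grid)]

def project(f, num_terms, n_grid=2000):
    """Least-squares coefficients, computed as inner products <f, phi_k>."""
    xs = midpoints(n_grid)
    return [sum(f(x) * cosine_basis(k, x) for x in xs) / n_grid
            for k in range(num_terms)]

def l2_error(f, coeffs, n_grid=2000):
    """Approximate L2[0,1] norm of f minus its truncated expansion."""
    xs = midpoints(n_grid)
    sq = sum((f(x) - sum(c * cosine_basis(k, x) for k, c in enumerate(coeffs))) ** 2
             for x in xs) / n_grid
    return math.sqrt(sq)

target = math.exp  # arbitrary smooth target, chosen only for illustration
errors = {m: l2_error(target, project(target, m)) for m in (1, 4, 16, 64)}

for m, err in errors.items():
    print(f"{m:3d} basis functions: L2 error {err:.5f}")
```

The error shrinks as basis functions are added, which is the "same thing" the universal approximation theorem promises. How fast it shrinks depends on how well the basis matches the target, which is the point about matching the basis to the problem domain.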
With the right basis functions, yes, you would end up approximating the target function well. The problem is how do you determine the right basis functions? In SVMs, for example, there is no one good way to pick the right kernel today. This is a benefit with ANNs - the universal approximation theorem tells you that there is nothing else you need to decide on once you're using ANNs. Sure there are parameters like the number of layers etc., but those parameters are something you need to account for in other techniques too.
Not saying ANNs give the best solution always. Or even the best first model to try out on all problems; only pointing out that there is a convincing case to be made for them, and the universal approximation theorem has a lot to do with it.
Wrt your other point, Turing completeness was proved separately - look for the paper "Turing computability with neural nets" by Siegelmann and Sontag. Not sure your parent comment intended the association.
A link to a textbook draft from prominent ML researchers specializing in deep learning that is shockingly also about deep learning is not an example of hype. Yoshua Bengio has been working on neural nets for years and years and he leads the only one of the big three deep learning labs to not have a lot of commercial entanglements. I agree there seems to be some extra attention for deep learning recently. So who is hyping this stuff? Not the people in the big three (Toronto, NYU, Montreal) deep learning labs. They are just doing what they have always done, work on neural nets and solve problems and write papers (and in this case a textbook).
Google, Microsoft, Facebook, Baidu, and to a lesser extent IBM, have all made large commitments to using deep learning methods. Why? Because they solve some really hard problems particularly well. Are they hyping deep learning? Not as such. They are trying to bring attention to their products. Rick Rashid had a presentation that mentioned deep learning and some of the other firms have mentioned that they use deep learning in certain products (e.g. Google voice search) but they don't really seem to be hyping the methods or techniques so much as their own particular successes with them. So who is hyping deep learning?
Deep learning techniques solve a lot of important problems and can almost certainly be applied to many more. But "deep learning" isn't a single technique, it is an attitude and approach to machine learning. So the reason you are hearing more about it is because big industry players are using it and some people in academia are getting a bit more traction in the broader ML community with the ideas they have been pushing, refining, and improving for years.
Shitty tech journalists and overoptimistic amateur "data science" clowns in the startup community that think these ideas are some sort of panacea are the people overhyping it.
I don't have a ton of expertise in the area, but I chat sometimes with people who do, so take this as a reasonably informed but lay opinion.
By and large, most uses of machine learning don't get much value from better machine learning in the sense that, given finite time and resources, you'll probably get better results on your problem by spending time on feature engineering instead of on machine learning improvements.
Some of the new deep learning techniques though have achieved good results on a large variety of problems in more of a "just plug it in" sense. My favorite example of this is this paper: http://machinelearning.wustl.edu/mlpapers/paper_files/ICML20... , which handles both sentence parsing and scene decomposition. They get competitive results on wide-ranging problem domains which have traditionally required lots of fine-tuning to do well on.
So the hope is that deep neural networks will provide improvements in the same way that SVMs did 20 years ago; that people will be able to just dump numbers into a more or less black box algorithm and get better results than they currently do. And indeed, I'm guessing people were as excited about SVMs in the 90s as they are about deep neural networks today.
The advantage of deep learning is that you don't have to engineer features. It does the feature extraction for you, which saves engineers months of work.
Avoiding the need for feature engineering is one purported advantage of deep learning. The reality is that shallow methods also work on raw data. All you need is a large data set and a model whose capacity can grow with the size of the dataset (kernel methods, random features, boosting, even polynomials).
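For readers unfamiliar with "random features", here is a rough sketch of one common variant, random Fourier features for the RBF kernel. This is my choice of illustration, not necessarily the variant meant above. The idea is that a fixed random cosine map followed by a plain dot product approximates the kernel, so a linear (shallow) model on top of the map behaves roughly like a kernel method, and adding more random features adds capacity. The names `make_random_features` and `rbf_kernel`, and the values of `gamma`, `x` and `y`, are assumptions for the example.

```python
import math
import random

def make_random_features(dim, num_features, gamma, seed=0):
    """Return a map x -> z(x) with z(x).z(y) ~ exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    scale = math.sqrt(2.0 * gamma)  # frequencies are drawn from N(0, 2*gamma*I)
    freqs = [[rng.gauss(0.0, scale) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    norm = math.sqrt(2.0 / num_features)

    def features(x):
        return [norm * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return features

def rbf_kernel(x, y, gamma):
    """Exact RBF kernel value, for comparison."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

gamma = 0.5
x, y = [0.2, -0.4, 1.0], [0.5, 0.1, 0.3]  # two arbitrary raw input vectors

phi = make_random_features(dim=3, num_features=5000, gamma=gamma)
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = rbf_kernel(x, y, gamma)
print(f"exact kernel = {exact:.3f}, random-feature approximation = {approx:.3f}")
```

The approximation error shrinks roughly like one over the square root of the number of random features, so capacity can be scaled up with the dataset simply by drawing more of them.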
Another purported advantage of deep nets is that they require less data to train. The hope is that hierarchical representations encode knowledge more efficiently than shallow ones. This is motivated by biology and by theoretical bounds on the size of circuits for representing boolean functions. But empirical evidence on real world data does not support this hope. One can obtain just as good a predictor as any deep net by training a shallow network with a comparable number of parameters (one recent example is "Kernel Methods Match Deep Neural Networks on TIMIT").
The third purported advantage is that the mid layers of deep networks can be repurposed from one application to another. There are a few convincing examples of this ("Analyzing the Performance of Multilayer Neural Networks for Object Recognition" is one example). That's likely a win over shallow networks because shallow networks have no intermediate representations to speak of.
My figure of merit for any supervised learning technique is a monotonic function of
1) test error of the final model
2) time to evaluate on a test sample
3) time to train
4) human effort to train
For deep nets, every improvement in 1-2 has come with terrible deterioration in 3-4. Furthermore, shallow nets still do better than deep nets in all four respects. Which is exactly why deep networks interest me. I'm a sucker for the underdog.
This is also one of its major drawbacks, though, because the features it "discovers" are often not easily interpretable. Sure, you can get good classification performance, which is sometimes good enough.
I work with a lot of scientists, though, who don't just want performance, but insight. This is where neural networks and related techniques still fall short compared to more "traditional" methods.
Sorry for being unclear; this is the comparison I was trying to draw. The new techniques look like "magic" the way SVMs did when they first became popular. It triggers excitement the way it did when people realized that they could get near state of the art performance on the MNIST digits just by dumping raw pixel values into an SVM.
I disagree somewhat; in my opinion it is rather more difficult to use. However, deep learning can solve problems that were previously unsolvable. It's a step toward "real" artificial intelligence.
The hype is ultimately due to the very good recent performance (i.e., much better than other methods, on learning tasks that people had been struggling to do better on for a long time) in a lot of practical applications like speech and image recognition.
For instance: speech recognition in Android with an unprecedented word error rate, Facebook face recognition with human-like performance, and other language processing applications such as sentiment analysis.
Support vector machines are not particularly interesting. They are just perceptrons with a lot of math. So ultimately souped-up linear classifiers. If you start learning the kernel in an SVM you get a deeper model. A neural net using the hinge loss and an L2 penalty on the weights is basically a primal SVM.
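A rough sketch of the last claim, under stated assumptions: a single linear unit (the simplest possible "net") trained by full-batch subgradient descent on the hinge loss plus an L2 penalty on the weights, which is the primal soft-margin linear SVM objective. The toy data, the hyperparameters `lam`, `lr` and `epochs`, and the function names are made up for illustration.

```python
# Toy, linearly separable 2-D data (invented for this example).
X = [(2.0, 1.5), (1.0, 2.5), (2.5, 3.0), (3.0, 1.0),
     (-2.0, -1.0), (-1.5, -2.5), (-3.0, -2.0), (-1.0, -3.0)]
y = [1, 1, 1, 1, -1, -1, -1, -1]

def train_linear_hinge(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize (lam/2)*||w||^2 + mean(max(0, 1 - y*(w.x + b)))."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [lam * w_j for w_j in w]  # gradient of the L2 penalty
        grad_b = 0.0                       # the bias is not regularized
        for x_i, y_i in zip(X, y):
            margin = y_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
            if margin < 1.0:  # hinge loss is active for this example
                for j, x_j in enumerate(x_i):
                    grad_w[j] -= y_i * x_j / n
                grad_b -= y_i / n
        w = [w_j - lr * g_j for w_j, g_j in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Sign of the linear score, with ties going to the positive class."""
    return 1 if sum(w_j * x_j for w_j, x_j in zip(w, x)) + b >= 0 else -1

w, b = train_linear_hinge(X, y)
predictions = [predict(w, b, x_i) for x_i in X]
print("weights:", w, "bias:", b)
print("predictions:", predictions)
```

Nothing here is specific to SVM solvers; it is the same loop one would write for a one-layer network with a different loss. Stacking learned layers in front of this unit is one way to read "learning the kernel gives you a deeper model".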
The math in "a lot of math" and the soup in "souped-up linear classifiers" are really not stuff to be casually swept under the carpet :) They still remain one of the best off-the-shelf learners one can use (look at "An Empirical Comparison of Supervised Learning Algorithms" by Rich Caruana et al).
This looks like it has the potential to be a great resource! Not to mention it's coming out of one of the big name schools of deep learning (along with UToronto and Stanford).
I imagine, though, that anyone not well versed in college mathematics may have issues with the explanations. If you want a good introductory resource, but either haven't covered or have forgotten some of the math in this book, I would recommend one of two resources:
The first will take you through all the math first through some online courses and textbooks, and the second is a good general purpose introduction that I recommend to anybody interested in neural nets.
It's been quite a while and even Ng has demonstrated that a billion parameter setup could be built for $20k using commodity hardware (https://news.ycombinator.com/item?id=5896684).
I wonder what's happening at Google labs as of August 2014.
The latter is a subset of the former and refers to particular architectures/strategies for doing machine learning/data mining. Think deep as in hierarchies not deep as in "whoa, man". Some ideas in deep learning are new; some are old. It's not entirely a rebranding.