When not to use deep learning (hyperparameter.space)
254 points by jpn on July 10, 2017 | 53 comments



My naïve understanding of deep learning is that it works by finding patterns in the answers, instead of actually solving problems.

If I take a multiple-choice exam and always answer "C", then I have a good chance at getting more than 25%.

For image recognition, I think the classifier is doing the real work (trying to actually answer the question), and the deep learning is just seeing if the answer matches the pattern of expected answers.

Somehow, this actually works. I think that it's because true randomness is hard to find.

The problem that I've found is that it's really difficult to teach deep learning. I'm making a Chinese-English teaching tool ( http://pingtype.github.io ) and sourcing my translations from Google Translate. I find a lot of mistakes in my dictionary that obviously came from Google's model getting the word spacing wrong. I can fix it in my own dictionary immediately. If I submit the correction to Google, it just changes some weightings, and hundreds of people will have to submit the same correction before their deep learning will finally catch on that it needs to change something.


Your naive understanding is supported by at least one deep learning authority:

> I haven’t found a way to properly articulate this yet but somehow everything we do in deep learning is memorization (interpolation, pattern recognition, etc) instead of thinking (extrapolation, induction, etc). I haven’t seen a single compelling example of a neural network that I would say “thinks”, in a very abstract and hard-to-define feeling of what properties that would have and what that would look like.

> All the while I'm thinking: this thinking process this person goes through as he analyzes this data: THAT is what Machine Learning SHOULD do

-- Andrej Karpathy

Deep learning for image recognition works because our visual world is made up of structured hierarchical features: Dark/Light, Texture, Edge, Part of Object, Object, Scene. Deep learning layers create increasingly higher-level features in a computationally feasible way.
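As a rough, hypothetical sketch of that feature ladder (layer counts and sizes are made up, not taken from any particular paper), a small Keras convnet stacks exactly this kind of hierarchy:

    # Hypothetical sketch: features build up layer by layer.
    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    model = Sequential([
        # early layers: dark/light blobs, edges, simple textures
        Conv2D(16, (3, 3), activation='relu', input_shape=(64, 64, 3)),
        MaxPooling2D((2, 2)),
        # middle layers: edges combine into object parts
        Conv2D(32, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        # late layers: parts combine into objects
        Conv2D(64, (3, 3), activation='relu'),
        Flatten(),
        # head: object/scene-level decision
        Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')

Each convolution/pooling stage widens the receptive field, which is what lets later layers respond to parts and whole objects rather than raw pixels.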


So a better name for "deep learning" would be "shallow understanding"?


I personally prefer 'generic hashing/parsing'; deep learning excels at the automatic creation of a mapping of unstructured information to structured information, after a sufficient period of training.


Hmm... but isn't that what our brains do as well? Unstructured intensities of light bouncing off our retinas which becomes a structured recognized object.


It definitely seems to be part of what our brain does. The visual cortex is an apt comparison since that's where a lot of the structural inspiration for modern ANNs comes from. But, there does seem to be a little more than that too; it's not clear whether all the brain does is reducible to a hash function (reducible in any useful sense, at least; a very very very big, very very very sparse hash function, perhaps).


Our brain can understand that a cartoon picture of a cat is a cat. Also, our brain can understand that a picture of a cat taken from a hugely different angle than seen before is a cat. Deep learning cannot do those kinds of tricks.

A related problem is "one shot learning", [1].

[1] https://en.wikipedia.org/wiki/One-shot_learning


Maybe something like subtle memorization?


It's obfuscated memorization though. Otherwise wouldn't youtube search on my "smart" TV yield videos that contain the search term?

I searched for "face detection" and got "face recognition" videos. I felt like a linear model would have been more useful.


What if humans learn through memorisation and pattern recognition instead of thinking?


There's quite probably some of that. A quote from J.S. Mill on the distinction between science and technology strikes me as useful:

"One of the strongest reasons for drawing the line of separation clearly and broadly between science and art is the following:—That the principle of classification in science most conveniently follows the classification of causes, while arts must necessarily be classified according to the classification of the effects, the production of which is their appropriate end."

Essays on Some Unsettled Questions of Political Economy

http://www.gutenberg.org/ebooks/12004?msg=welcome_stranger#E...

Deep Learning is finding associated effects. It does not find the underlying causes. It is a mode of technical rather than scientific advance.


What are your thoughts on newer recurrent architectures like the DNC (or its predecessor, the neural Turing machine)? While the demonstrated results with DNCs so far are pretty limited, it seems that they embody a push towards allowing a neural network to actually "think" over multiple steps: storing complex information, formulating a plan, and acting on that plan.


Yes. I think these architectures are very exciting and a step in the "right" direction. Eventually we will want to move from rote memorization and pattern matching to more challenging aspects of intelligence.

https://arxiv.org/abs/1601.01705v4 (Learning to Compose Neural Networks for Question Answering) comes close to breaking this barrier.


As much as I dislike calling on the neural net / biological net metaphor, I do think computer science has made some headway in how "useful codes" (in the sense of semantically meaningful interpolation) can be derived from natural scene stimuli. The onus that "we do something different" is therefore, to some extent, now on the neuroscientists: to think about, and try to prove, that "reasoning" in the human sense is anything other than an algebra of latent codes, i.e. linear or non-linear combinations of codified summaries of sensory input.


What do you mean by an "algebra of latent codes"?


I mean being able to combine latent codes through some form of algebra (e.g. linear combinations) and have it retain coherent semantics:

https://github.com/Newmu/dcgan_code/raw/master/images/faces_...


Geoff Hinton refers to thought vectors performing reasoning by analogy using algebra [1] in his Royal Society Lecture.

The other widely reported vector algebras in a semantic space were discovered by Mikolov et al. when producing ~300-dimensional vectors for a billion-word Wikipedia corpus.

If one performs vector algebra, with ~= meaning "nearest by cosine distance", then using Mikolov's vectors [3]:

  King - Man + Woman ~= Queen

  Paris - France + Germany ~= Berlin
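For anyone who wants to poke at these analogies, here is a minimal sketch using gensim and a pretrained word2vec file (the file path and the lowercased vocabulary are my assumptions, not part of Mikolov's release):

    # Sketch: word-vector analogies via cosine similarity (gensim).
    from gensim.models import KeyedVectors

    # Placeholder path: any pretrained word2vec-format file will do.
    vectors = KeyedVectors.load_word2vec_format('word2vec.bin', binary=True)

    # king - man + woman ~= queen
    print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))

    # paris - france + germany ~= berlin
    print(vectors.most_similar(positive=['paris', 'germany'], negative=['france'], topn=1))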
Surprisingly, this works for other modalities. Chintala, Radford & Metz found a latent semantic space in images that adds vectors for glasses or smiles to people's faces [4]. With a generative model, new images can be created, as outlined in this blog post by Soumith [5].

Karpathy shows trained nets can be assembled like Lego across modalities: slice off the classifier to reveal the rich semantic 'thought vector' layer of an ImageNet-trained AlexNet, plug in an RNN sentence generator using word2vec, and (with some oversimplification...) you get a convincing image captioner [6].

The thought vectors are akin to high-level representations of the world and can cross modalities. Text to images using thought vectors (from the HN discussion [7]).

So the vectors of thought are in some way an AI mentalese, or an encoding of a symbolic representation of the world derived from the data, and can (again a drastic oversimplification) transfer across modalities and even between previously unlinked languages [8].

Also see Anything2Vec https://gab41.lab41.org/anything2vec-e99ec0dc186

[1] https://youtu.be/izrG86jycck?t=25m58s

[2] The paper Geoff Hinton is referring to: Sequence to Sequence Learning with Neural Networks by Ilya Sutskever, Oriol Vinyals, Quoc V. Le https://arxiv.org/abs/1409.3215

[3] Efficient Estimation of Word Representations in Vector Space by Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean https://arxiv.org/abs/1301.3781

[4] Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks Alec Radford, Luke Metz, Soumith Chintala https://arxiv.org/abs/1511.06434

[5] https://code.facebook.com/posts/1587249151575490/a-path-to-u...

[6] SF Machine Learning: Automated Image Captioning with ConvNets and Recurrent Nets by Karpathy https://youtu.be/ZkY7fAoaNcg?t=38m31s

[7] https://news.ycombinator.com/item?id=12366684

[8] https://github.com/Babylonpartners/fastText_multilingual


Don't just downvote this, guys; that doesn't help.

No, your naive understanding is not correct.

'Deep learning', and by that I mean a neural network, works at a super high level by generalising some input (say X) into some related output (let's say Y).

It's not just random choice; it's like defining a programming function:

    foo(x) -> y { ... }
Where the '...' is implemented by a set of statistical weights and training data and so on.
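A minimal sketch of that "learned function" idea, using scikit-learn and toy data (nothing here reflects the parent's actual setup):

    # Sketch: a neural net as a learned mapping foo(x) -> y, fit to toy data.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Toy "training data": y = 2x plus noise stands in for observed answers.
    X = np.linspace(0, 1, 200).reshape(-1, 1)
    y = 2 * X.ravel() + np.random.normal(scale=0.05, size=200)

    # The '...' body of foo is filled in by the fitted weights.
    foo = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
    foo.fit(X, y)

    print(foo.predict([[0.25]]))  # roughly 0.5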

...but ultimately, you raise a valid point.

Training models is incredibly time-consuming, and incorporating 'corrections' as new training data is extremely non-trivial.

It's one of the issues with deep learning.

I would even go as far as to list 'you need to regularly update your training data with new examples and counter-examples' as a 'when not to use deep learning'.


There was no mention of random choice (was there?) and the naive description seems to line up with your description.


> I can fix it in my own dictionary immediately. If I submit the correction to Google, it just changes some weightings, and hundreds of people will have to submit the same correction before their deep learning will finally catch on that it needs to change something.

This isn't a problem with deep learning, but a problem with verifying a data source. You are trusting yourself to submit a proper correction, but Google doesn't know you from John.

To use another example, imagine that someone says 他的 is "they're", another person says it's "their", and still another person says it's "there". Your approach would just accept all three in a row, wouldn't it?

Google's approach presumably trusts that the crowd + what it sees on the web is correct and thus attempts to verify that a submission is typical language use before actually putting it into play.


I just tried your tool, and I must say it generated some impressive nonsense.

Putting in 这是什么东西 ("What is this thing") in the Chinese box yielded the translation "this is Why tender east west", which is a _character for character_ translation of the phrase. That's not especially meaningful or useful given that Chinese words are polysyllabic - I hope this is a temporary bug.

By contrast, Google Translate gives me the much more sensible "What is this" as a translation.


Thank you for trying it - the reason is that you entered simplified characters, but the tool is for traditional.

You can click Advanced -> Regional -> Simplified to Traditional.

Entering 這是什麼東西 gives "this is what thing", which has correct word spacing.


> My naïve understanding of deep learning is that it works by finding patterns in the answers, instead of actually solving problems.

Wow, I love this sentence!

Plato: A man is a featherless biped

Diogenes (plucking a chicken): Here's your man, Plato!

Plato (deep learning): A man is a flat-nailed featherless biped


Is it a bird? Is it a plane? No, it's Superman!

Superman is an alien.


> My naïve understanding of deep learning is that it works by finding patterns in the answers, instead of actually solving problems.

I think this is quite profound and inspiring.

Although perhaps it is only half of it. Concretely, deep learning finds patterns; the best patterns are derived from the highest-bandwidth signal, and often this is the input.

Geoff Hinton has argued that the task, solving the problem, is a low bandwidth signal.

Hinton aphorises [approximately]: 'If you want to learn computer vision, first learn to do computer graphics, i.e. a generative model.' This is about the bandwidth of the data signal.

Hinton: [1] "Each image has much more information in it than a typical label... Each image puts a lot of constraint on the identity function. Whereas if I give you an image and a label and I try and get the right answer I don't get much constraint on the mapping from image to label. The bits of constraint on that mapping imposed by training example are just the number of bits to say what the answer is which is not very many."

[1] @5:28 http://videolectures.net/mlss09uk_hinton_dbn/
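A rough back-of-the-envelope version of that bandwidth gap (my illustrative numbers, not Hinton's):

    # Illustrative bit counts: a raw image vs. a single class label.
    import math

    image_bits = 224 * 224 * 3 * 8   # ~1.2 million bits for an 8-bit RGB image
    label_bits = math.log2(1000)     # ~10 bits to pick one of 1,000 classes

    print(image_bits, round(label_bits, 1))   # 1204224 10.0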


Your brain also works by finding patterns. The only difference is that your brain works across a lot of domains and can translate information from one domain to another. We have not found a good way to do this in deep learning yet. But once we do, it will know that answering C all the time is probably not correct.


Finding patterns in answers is solving problems.


No, it's not.

Humans seem to solve problems with a combination of learned stock knowledge, induction, and constrained improvisation.

Constrained improvisation is the most interesting part of that process, and the one we know least about.

It's one thing building a system that asymptotically improves over millions of trials, and then sending out a press release claiming your system is as smart as a human.

It's another building a system that learns a domain as efficiently as a human.

Compare the relatively small number of games played/analysed by a Go master on their way to master status with the number of simulated games played/analysed by AlphaGo.

ML is still a rather naive form of constrained brute forcing. It's a long way short of efficient learning.


> ML is still a rather naive form of constrained brute forcing. It's a long way short of efficient learning.

Uh... Bayesian methods can do the above in terms of expert domain knowledge. They call it elicitation in the Bayesian world.

I think you're over-generalizing across all learning techniques. And it's not like we actually know how the human brain learns. Psychology is a field with huge uncertainties, and you can see that in the correlation values in their research papers. So the concept of learning may be outdated, and/or we are still learning about what makes us learn.


You're describing a technique for finding patterns. There are many possible techniques and machines can (probably) use any of them. The constraint would be if the technique involved kinetic or quantum mechanics.


Maybe more precisely finding patterns in answers in relation to the inputs.


But it doesn't show its work.


I have the same question about Google translate. I'm not convinced that its errors can simply be chalked up to neural networks. As you said, they ought to be constantly trying to improve their word segmentation and datasets.

I'm not convinced that Google Translate is as major a focus at Google as other things.

Perhaps I'm wrong, and it's just a hard problem, but the translations I've seen haven't improved as much over the years as I expected given progress in other areas of AI.

They may be more focused on adding new languages than improving existing ones. Don't know.

Japanese translation is pretty good. Chinese is bad. I guess Chinese is harder because nobody really uses any phonetic writing for reading beyond grade schools. And then there is traditional/simplified, and, I imagine, other differences between how different regions use characters. Baidu's translate is better than Facebook/Google in some cases.

I wonder if translation just isn't as sexy as image recognition and self driving cars, therefore research dollars aren't as focused on it.


You're wrong. It gets a ton of focus. Here's the largest update in years (from 10 months ago): https://research.googleblog.com/2016/09/a-neural-network-for...

It's a very hard problem. There are a ton of people at Google, baidu, and elsewhere working on it. (Source: I'm a part timer on Google Brain)


Do you think with more research dollars it could be improved sooner?


Machine translation has a lot of research funding, and has had it for decades.

I'm not sure about the USA, but it's been a major focus of very large EU grant programs due to the obvious multilinguality of the EU and the explicit goal of moving towards a single European market by reducing barriers to trade, including the language barrier.

It also has many commercial use cases and thus has always had quite a lot of people and teams working on it compared to other fields of ML or NLP.

The problem is that it's hard. Every 0.1% of progress has historically required a lot of work.


Now that the general public understands a bit more about NNs, AI and machine learning, maybe there will be a renewed push to invest in even better machine translation.

More research into theory, more / better data from professional translators, and more efficient implementations from engineers.


Finding patterns in answers isn't unique to deep learning. All machine learning training methods rely on this at some level.


I was expecting more discussion of alternatives.

For instance, in cases where deep neural networks aren't desirable or don't outperform classical approaches, I'm a big fan of boosted decision trees, due to their accuracy on many real-world datasets, their ease-of-use, and the existence of great open source implementations. xgboost (which routinely wins Kaggle competitions) and Spark MLLib both have high-performance distributed training algorithms for gradient boosted trees. And as far as hyperparameter searches go, there just aren't as many parameters to optimize. (And frameworks like Spark are already fantastic for embarrassingly parallel tasks like hyperparameter searches.)
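A minimal sketch of that boosted-trees workflow with xgboost (synthetic data, arbitrary hyperparameters):

    # Sketch: gradient boosted trees with xgboost on synthetic data.
    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Only a handful of knobs usually matter: depth, learning rate, tree count.
    model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    model.fit(X_train, y_train)

    print(model.score(X_test, y_test))   # held-out accuracy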


> Spark MLLib both have high-performance distributed training algorithms for gradient boosted trees.

Well it exists, but I wouldn't describe it as high performance in either accuracy or speed.

I'm a big fan of Spark, but Spark ML needs some love from people who actually use it.

Until that happens, just use XGB (which now has Spark integration[1])

[1] http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgb...


The author discusses how linear models are generally more interpretable than deep learning methods, but I'd argue that's actually changing pretty quickly. Especially for large image/sequence inputs (which covers most of the applications that are getting hyped up), linear regressions don't perform very well, and often that performance difference prevents them from picking out important features. Given that fast, scalable methods for feature importance are on the rise (e.g. https://arxiv.org/abs/1704.02685, which the author mentions), you often get equally interpretable feature scores from deep models that are more accurate than analogous ones from linear models.

Basically, my point is that model interpretation strongly depends on how accurate your model is, and because deep learning models are so much better than linear models for some tasks, it makes sense to use them - even if your primary goal is interpretability.

That said, I do believe that if you ever care at all about interpretation, you should almost never be using multilayer perceptrons (which have recently become part of the widening umbrella term "deep learning"), because they rarely work better than decision tree models or basic linear models (and MLPs are generally less interpretable than, or at best as interpretable as, traditional methods).


Feature importance is not quite the same as interpretability.

Random Forests can give feature importance, but that does not account for interactions between features. So, in the end, you don't know how a model made a decision (it could be because there is a feature with high importance, but it could also be because there is an informative interaction between lower importance features).

If you want to compare deep learning with linear models, you should leave image data out of it. Compare them on structured data and bag of words.

MLPs and boosted decision trees, in my experience, definitely beat decision trees and linear models on structured data. But they lack long-term robustness (complex forecasting models need constant retraining, which can hamper their adoption by business units) and don't pass regulation (it is not enough to say "has_asthma" is a high-importance feature).

In finance and health care, interpretability is enormously valued. It is a constant trade-off between accuracy and interpretability.

A long time ago, Caruana made hospital triage models, with neural networks being the clear winner in generalization performance. Instead, they opted for a simple logistic regression when productionizing. Why?

> [...] patients with pneumonia who have a history of asthma have lower risk of dying from pneumonia than the general population. Needless to say, this rule is counterintuitive. But it reflected a true pattern in the training data: patients with a history of asthma who presented with pneumonia usually were admitted not only to the hospital but directly to the ICU (Intensive Care Unit). The good news is that the aggressive care received by asthmatic pneumonia patients was so effective that it lowered their risk of dying from pneumonia compared to the general population. The bad news is that because the prognosis for these patients is better than average, models trained on the data incorrectly learn that asthma lowers risk, when in fact asthmatics have much higher risk (if not hospitalized).

http://people.dbmi.columbia.edu/noemie/papers/15kdd.pdf

Though there is nothing holding you back from using both a simple linear and a complex non-linear model at the same time: only when the models severely disagree do you pick the interpretable model. Or use the linear model to find data issues, like those mentioned above, that are tremendously obscured (if not impossible to identify) when only using deep learning in a train-test framework.
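A minimal sketch of that pairing idea, assuming scikit-learn and synthetic data (the 0.3 disagreement threshold is arbitrary):

    # Sketch: flag rows where an interpretable and a complex model disagree.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_classification(n_samples=2000, n_features=15, random_state=0)

    simple = LogisticRegression().fit(X, y)
    complex_model = GradientBoostingClassifier().fit(X, y)

    p_simple = simple.predict_proba(X)[:, 1]
    p_complex = complex_model.predict_proba(X)[:, 1]

    # Strong disagreement is a prompt for manual review; it can surface
    # data issues like the asthma/ICU confounder described above.
    suspicious = np.where(np.abs(p_simple - p_complex) > 0.3)[0]
    print(len(suspicious), "rows to review")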


>"In the study, the goal was to predict the probability of death (POD) for patients with pneumonia so that high-risk patients could be admitted to the hospital while low-risk patients were treated as outpatients."

So what they wanted to know is POD|"no hospital", but they clearly collected data about POD|"hospital" (since it included ICU admission, etc.).

The problem is they measured the wrong thing and then misinterpreted their results. Worse, it looks like the study was designed to be this way!


>"The bad news is that because the prognosis for these patients is better than average, models trained on the data incorrectly learn that asthma lowers risk, when in fact asthmatics have much higher risk (if not hospitalized)."

The model learned correctly in this scenario. If you go to the hospital for pneumonia it is apparently in your best interest to claim a history of asthma.


The anecdote about pneumonia and the ICU is pretty puzzling. Why wasn't admission to the ICU one of the classification "labels"?


Here is a talk about that paper: https://www.youtube.com/watch?v=UqPcq0n59rQ

I see that it has also gotten mainstream news coverage as some kind of lesson about the dangers of machine learning. The real problem is they didn't have data that could answer the question they had, P(Death|No hospitalization), so instead they fit models to answer a different question, P(Death|Hospitalization).

Then they didn't like that the complex models answered the second question too well, so they used simpler ones that made it easier to manually filter out any results that didn't make sense as answers to the first question (which isn't one they could answer to begin with).

No model they fit is safe. You could only use one limited to domains where P(Death|No hospitalization) ~ P(Death|Hospitalization), which isn't something they assessed.


But what is the situation in real life? Can I get feature importance scores from, say, a TensorFlow model?


I'll agree with you that it's much harder than it should be (thankfully, finding the implementations is the hard part, not using them), but yes, these methods do exist.

DeepLIFT (the method I linked in my original comment: https://github.com/kundajelab/deeplift), takes a Keras model (with Theano or TensorFlow backend) as input and provides feature importance scores for any desired layer of the network (raw data inputs, inputs to dense layers following convolution, etc.). Keras-Vis (https://github.com/raghakot/keras-vis) is another nice package that allows for easy visualization of saliency maps and convolutional filters. Perturbing inputs and looking at the effect on the output of the network is another technique people use pretty often.
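The perturbation approach mentioned above can be sketched in a few lines, independent of framework; note this is a generic illustration, not DeepLIFT's or Keras-Vis's API:

    # Generic sketch: perturbation-based importance for any model with predict().
    import numpy as np

    def perturbation_importance(predict, x, noise=0.1, n_repeats=20):
        """Score each feature of a single example x by how much
        zero-mean noise on that feature moves the model output."""
        baseline = predict(x[np.newaxis, :])[0]
        scores = np.zeros(len(x))
        for i in range(len(x)):
            deltas = []
            for _ in range(n_repeats):
                x_pert = x.copy()
                x_pert[i] += np.random.normal(scale=noise)
                deltas.append(np.abs(predict(x_pert[np.newaxis, :])[0] - baseline))
            scores[i] = np.mean(deltas)
        return scores

For example, perturbation_importance(model.predict, x_sample) would work for a compiled Keras model and a single flattened input (both hypothetical names here).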

I think there's a lot of room for this space to become easier to use, especially for newer deep learning practitioners. To that point, I definitely agree with the author of this blog post.


Thanks for the links - a friend of mine is working on something like DeepLIFT but I hadn't heard of it...


"The point is that training deep nets carries a big cost, in both computational and debugging time. Such expense doesn’t make sense for lots of day-to-day prediction problems and the ROI of tweaking a deep net to them, even when tweaking small networks, might be too low. "

As a Masters student who has been training deep models for a little while now, I think this point is underemphasized. Doing something novel (so, not just image classification) requires a TON of engineering, not to mention the research considerations. And there are so many tiny decisions and hyperparameters that even when I thought I had considerable domain knowledge, I found it very lacking. I guess it should not be surprising given that 'Deep Learning' refers to a very broad set of models only related by having a learned hierarchical representation. There are a few problems where you can use existing deep learning almost off the shelf (most notably image classification and segmentation), but for most applications I think we're not there yet. As long as this remains true (which I suspect will be for a long time), SVMs and decision trees and linear models are still definitely worth knowing and understanding.


If NNs were able to do small data, would they be better than their counterparts?

I mean, if you could do it for small data and it was good, then we would be seeing NNs dominate Kaggle in all problem domains. Maybe the small-data problems belong to other algorithms (such as tree-based methods, random forests, SVMs).

Disclaimer: I'm biased towards tree-based algorithms on medium and small data, since that is my thesis topic.


I think no - SVMs are explicitly optimized to generalize the best from small data (https://en.wikipedia.org/wiki/Hinge_loss - 'The hinge loss is used for "maximum-margin" classification'), whereas NNs have more hacky regularization methods. I am not sure if the same is true for tree-based methods, but of course those are lovely due to how interpretable they are when you have few features.
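A tiny illustration of a max-margin classifier fit on a handful of points (scikit-learn, made-up data):

    # Sketch: a soft-margin linear SVM (hinge loss) fit on eight points.
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [3, 3], [3, 4], [4, 3], [4, 4]])
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    clf = SVC(kernel='linear', C=1.0)   # maximum-margin separator
    clf.fit(X, y)

    print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))   # expect [0 1]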


Pretty much agree, and particularly on the budget/time aspect.





