Large Scale Distributed Deep Networks (by Jeff Dean et al) (research.google.com)
80 points by michael_nielsen on Nov 13, 2012 | 39 comments



Also submitted yesterday, by someone else, but it got buried: http://news.ycombinator.com/item?id=4775644

I resubmitted since I'm pretty sure this is of interest to many HN readers. Examples of why it's interesting include:

1. Google's deep learning work is now being used to power Android voice search (http://googleresearch.blogspot.ca/2012/08/speech-recognition... ); and

2. Dean claims that "We are seeing better than human-level performance in some visual tasks," in particular, for the problem of extracting house numbers in photos taken by Google's Street View cars, a job that used to be done by a large team of people (http://www.technologyreview.com/news/429442/google-puts-its-... ).


It's not obvious to me what it means to have "better than human-level performance" since most of the time the ground-truth itself is defined by humans :)


One example I can think of is a computer could read a house number or a street sign from 100 feet away, where a human with good vision might be able to make out the same text at 20 feet.


You could also ask many humans and average their response to obtain a golden label, and see how well any particular human agrees with the average. If there is a lot of variance in the human answers, then it's possible for a machine to have better than (individual) human performance, even on a human labelled data set.
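
A minimal sketch of that setup, assuming numpy and some made-up annotator data (nothing from a real labelling pipeline):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical data: 5 annotators guess the last digit of 8 house numbers, with occasional mistakes.
    true_digits = rng.integers(0, 10, size=8)
    annotators = np.clip(true_digits + rng.integers(-1, 2, size=(5, 8)), 0, 9)

    # Golden label: majority vote across annotators, per item.
    def majority_vote(column):
        values, counts = np.unique(column, return_counts=True)
        return values[np.argmax(counts)]

    consensus = np.array([majority_vote(annotators[:, j]) for j in range(annotators.shape[1])])

    # How often does each individual annotator agree with the consensus?
    print("individual agreement:", (annotators == consensus).mean(axis=1))

    # A model "beats human-level" here if its agreement with the consensus
    # exceeds a typical individual annotator's agreement.
    model_predictions = consensus.copy()
    model_predictions[0] = (model_predictions[0] + 1) % 10   # give the model one deliberate mistake
    print("model agreement:", (model_predictions == consensus).mean())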


In production you may use a process that's more economical than what you'd use to establish ground truth in research, so the human team in production will have some tolerated error rate, or won't include as much redundancy. In this sense it's easy to imagine better-than-human performance from the software: the human performance you get isn't a single maximum value, but rather a function of budget.


CAPTCHAs, ironically, often seem to require "better than human-level performance" for recognition.


> "We are seeing better than human-level performance in some visual tasks," in particular, for the problem of extracting house numbers in photos taken by Google's Street View cars

Can someone explain to me how this is news, given that the handwritten addresses on snail mail envelopes in the US have been OCR'd by neural networks for more than twenty years now?


The USPS address recognition technology (at least as of 2 years ago when I was working on it) is not human-level. A fraction of the images cannot be resolved and are still sent to human keyers at rec sites. This fraction has been decreasing steadily over the years but it has not yet fallen to zero.

It's important to remember that performance is critical when talking about machine perception. OCR, handwriting recognition, face recognition, etc. can all be done, but at what level of accuracy? At least until very recently, machine performance on these tasks has fallen well short of human-level abilities.


Handwritten addresses aren't fully OCRed for the system to work. I worked on the first systems that were released (in the 90s), and the basic algorithm was as follows: first, try to read all the numbers in the address, and identify the ZIP and the street number. Now, given the ZIP and the street number, the number of possible street names is very small (on average, 4 or 5); this is done via the USPS's address database. Now the problem becomes one of matching the handwritten street name with one of these 4-5 names. (Of course, there's more to it, but this is the gist of it).
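
A minimal sketch of that last matching step (the database entries here are made up, and difflib is just a crude stand-in for a real handwriting-aware matcher):

    import difflib

    # Hypothetical slice of an address database: (ZIP, street number) -> candidate street names.
    ADDRESS_DB = {
        ("94043", "1600"): ["Amphitheatre Pkwy", "Charleston Rd"],
        ("14623", "350"): ["Lomb Memorial Dr", "Jefferson Rd", "East River Rd"],
    }

    def match_street(zip_code, street_number, ocr_street_guess):
        """Pick the candidate street name closest to the (noisy) recognizer output."""
        candidates = ADDRESS_DB.get((zip_code, street_number), [])
        if not candidates:
            return None
        # difflib's ratio stands in for a handwriting-aware similarity score.
        return max(candidates, key=lambda name: difflib.SequenceMatcher(
            None, name.lower(), ocr_street_guess.lower()).ratio())

    print(match_street("94043", "1600", "amphitheatr pky"))   # -> 'Amphitheatre Pkwy'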

Last I heard, the percentage of handwritten mail successfully sorted by a machine had reached the low 80s. The Russian company ParaGraph (who also worked on the Newton's online handwriting recognition) has been the leader in this field.


The difference in complexity between recognising figures on plain paper, perpendicular to a scanning device, in a controlled environment, and doing the same thing on huge amounts of non-standard, chaotic data is why.


Virtually all house numbers are either painted from a stencil or composed of mass-produced shapes on a background of uniform color, whereas addresses on envelopes are handwritten by doctors, six-year-olds and people with Parkinson's disease. I'm not convinced it's a harder problem.


What don't you get? One is on a white background. The other is in random orientations, placed in complex scenes, with random fonts, sizes, shapes, and positions, and you don't even know where the numbers are.

It's like a game of "Where's Waldo" on freaking crack.

You have literally no idea how complex this stuff is, do you?


Compare the addresses at http://www.realsimple.com/home-organizing/decorating/eye-cat... and http://mandydouglass.blogspot.com/2010/10/addressing-envelop... . Those are two representative images I picked from the first google hits for "house number" and "handwritten address" respectively - all the others were comparable. Are you seriously going to claim that the house number is harder to recognize than the handwriting?

I do have an idea (literally, even) that there are additional problems having to do with extracting the house number images themselves from full-motion video, but that's an image registration problem and not an object recognition problem.


Those aren't the pictures Google works off - cf. maps.


I did my postdoctoral research on deep learning, and got into it back when Geoff Hinton's work was just an unpublished tech report.

So if anyone has any questions on it, I will try to answer.


Is there any open source ML framework that includes support for DNN?

Do you know of any tutorial that might guide a beginner using DNNs? I have no idea how to choose the number of hidden layers and activation functions.

Thanks!


http://deeplearning.net/software/theano/ is a good place to start. It's open source, in Python, has a few tutorials that lead you towards some rather state-of-the-art methods.

There's no secret sauce for how to choose the number of layers and the activation functions (and anyone that tells you otherwise is lying). It's all application-dependent and the best way to do it is by cross-validation.
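
For the cross-validation part, a minimal sketch, using scikit-learn's MLPClassifier as a stand-in for a real deep net (the parameter grid here is arbitrary):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)

    # Try a handful of depths and activation functions, scored by 3-fold cross-validation.
    param_grid = {
        "hidden_layer_sizes": [(64,), (64, 64), (64, 64, 64)],
        "activation": ["logistic", "tanh", "relu"],
    }
    search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0), param_grid, cv=3)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)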

Can answer questions about this topic in conjunction with mr. bravura (I'm working with the code referenced in that paper by Dean et al., here at Google, and did grad school at the same place as bravura did his postdoc).


I'm a xoogler, stuff like this makes me wish I was back there :) I want to play a bit with this stuff and have some fun, so thanks for the Theano reference, seems cool!

WRT the DNN parameters, is it possible to try all the possibilities (within reason) and find the best one using only cross-validation, or are there just too many choices so you have to use intuition? (From your comment I can't tell whether cross-validation is enough to get the optimal number of layers etc., or whether you have to be "smart".)

Thanks for replying!


For things like the number of layers, the total number of options is relatively small -- usually people try between 1 and 6-7. For most of the other parameters you have to be smarter than that, especially since a lot of them are real-valued, so you can't really explore them all.
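
For the real-valued ones, a common fallback is to just sample them at random and cross-validate. A minimal sketch, again with scikit-learn as a stand-in and an arbitrarily chosen range for the learning rate:

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)
    rng = np.random.default_rng(0)

    best_lr, best_score = None, -np.inf
    for _ in range(10):
        lr = 10 ** rng.uniform(-4, -1)   # sample a learning rate log-uniformly in [1e-4, 1e-1]
        score = cross_val_score(
            MLPClassifier(hidden_layer_sizes=(64,), learning_rate_init=lr,
                          max_iter=300, random_state=0),
            X, y, cv=3).mean()
        if score > best_score:
            best_lr, best_score = lr, score

    print("best learning rate:", best_lr, "cv accuracy:", best_score)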

One of the trends these days is to perform automatic hyper-parameter tuning, especially for cases where a full exploration of hyper-parameters via grid search would mean a combinatorial explosion of possibilities (and for neural networks you can conceivably explore dozens of hyper-parameters). A friend of mine just got a paper published at NIPS (same conference) on using Bayesian optimization/Gaussian processes for optimizing the hyper-parameters of a model -- http://www.dmi.usherb.ca/~larocheh/publications/gpopt_nips.p.... They get better-than-state-of-the-art results on a couple of benchmarks, which is neat.

The code is public -- http://www.cs.toronto.edu/~jasper/software.html -- and in python, so you could potentially try it out (runs on EC2, too).

Btw, Geoff Hinton is teaching an introductory neural nets class on Coursera these days, you should check it out, he's a great teacher. Also, you can always come back to Google, we're doing cool stuff with this :)


Ask me again in two weeks when my online course in advanced machine learning for extreme beginners kicks off. We won't start with deep learning immediately, but it's in the pipeline for not too long after launch.


I hope I'm not getting too off-topic, but will this course be taught on Coursera or somewhere else? (I want to put a reminder to check this out.)

Thanks!


It'll be on a new site for teaching complex things in non-complicated ways. The goal is to allow everyone from clever middle school students through retired people to understand the coming changes to the world. There's a huge on-site community focus too. We don't want there to be 100,000 anonymous people just going through the motions. There will be plenty of interaction between course material and community feedback. It's kinda awesome.

Topics will be presented in multiple ways (simple and intermediate) so you can have plenty of different views on the same material. The material works as both zero-knowledge intro to the topics as well as quick refreshers if you haven't seen the material in a while (quick -- what's an eigenvector?!).

The launch courses will be 1.) real-world applications of probability and statistics (signal extractions), 2.) linear algebra for computer science, and 3.) wildcard (a random assortment of whatever the heck we think is important or entertaining to know). Future courses are: introduction to neural networks, introduction to computational neuroscience, introduction to deep learning, advanced deep learning, how to take over the world with a few dozen GPUs, avenues by which google will become irrelevant, and robotics for fun and evil.

This is phase zero of a four phase plan. I'll get some pre-launch material together to shove down HN shortly, then it'll launch a few days later. Hopefully you'll hear about the project again.


Great work. It's so cool, I can't wait to take part.


Is there a link we can keep track of? I am "deeply" (pun intended) interested in this.


Something I have been curious about for a bit; how do you include textual data as input into a NN?


Here's a paper describing the main idea behind doing this: http://ronan.collobert.com/pub/matos/2011_nlp_jmlr.pdf

In a nutshell, you learn a vector of real-valued parameters for each word in your vocabulary. To train a network on sequences of words, you represent said sequence as a concatenation of the vectors of these words, and feed it as an input to the network.

To learn these vectors, you define the problem of "language modeling" as that of discriminating between two sets of sequences: S1 and S2. S1 is the set of sequences that occur in Wikipedia (of which there are many) and S2 is the same as S1, but where you replace a word in each sequence with a randomly chosen word from your vocabulary (which makes it, with very high probability, an invalid sequence of words).

Basically, by learning to discriminate between "good" and "bad" English word sequences, you can learn a language model of sorts. The model is represented by those vectors for each word.
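
A minimal sketch of that training signal, in plain numpy with a made-up six-word vocabulary and a linear scorer standing in for the full network: concatenate the window's word vectors, score it, corrupt the middle word, and push the real window's score above the corrupted one's.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["the", "cat", "sat", "on", "mat", "banana"]
    dim, window = 8, 3
    embeddings = rng.normal(scale=0.1, size=(len(vocab), dim))  # one vector per word
    scorer = rng.normal(scale=0.1, size=window * dim)           # a linear "network", for brevity

    def score(word_ids):
        x = embeddings[word_ids].ravel()   # concatenation of the window's word vectors
        return scorer @ x, x

    good = [0, 1, 2]                       # "the cat sat" -- a sequence that actually occurs (S1)
    lr = 0.1
    for step in range(500):
        bad = [good[0], int(rng.integers(len(vocab))), good[2]]  # corrupted sequence (S2)
        if bad[1] in good:
            continue
        s_good, x_good = score(good)
        s_bad, x_bad = score(bad)
        if s_good < s_bad + 1:             # hinge-style ranking loss: want s_good > s_bad + 1
            grad_mid = scorer.reshape(window, dim)[1].copy()     # d(score)/d(middle word's vector)
            scorer += lr * (x_good - x_bad)                      # raise the real score, lower the fake one
            embeddings[good[1]] += lr * grad_mid
            embeddings[bad[1]] -= lr * grad_mid

    print("real:", score(good)[0], " corrupted:", score([0, 5, 2])[0])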

You can then project those vectors into 2D, as bravura did a while ago, and look at what is close to each other: http://www.cs.toronto.edu/~hinton/turian.png


The new DistBelief paper doesn't explain some things. Why "receptive fields"? An earlier paper claimed better than 15% results on ImageNet and so does this one -- what changed? Why have you claimed SGD works well on nonconvex problems without comparing against actual nonconvex optimization algorithms? You also didn't cite my comments on MetaOptimize, where some of your ideas seem to come from.


With regards to distributed processing of neural networks, how do you split up the neural network onto the various nodes? Do you cut it up into sections and then compute the combined weights from each nodes outputs?


My NN knowledge is minimal, but from what I recall, the claim always was that you don't need more than 1 hidden layer, and that too many hidden neurons result in over-fitting.

What changed?


The claim that you don't need more than 1 hidden layer is based on a mathematical theorem that says roughly that a sufficiently large 3 layer net exists that can fit any sufficiently smooth function. One has to look in detail at things like "sufficiently smooth" and "sufficiently large" when applying mathematical theorems to real problems - and that's a step practitioners often seem to neglect when looking for rules of thumb. Also just because a net exists doesn't mean that any given training algorithm is likely to find it.

As for overfitting the best way to reduce it is to use more training data and I believe the nets discussed in these papers have been trained on some of the largest training sets ever used.

The other problem with deep networks is that they have been considered very difficult to train with backpropagation due to vanishing or exploding gradients. I think the recent major algorithmic developments have mainly involved methods to mitigate these problems.
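
A minimal numerical illustration of the vanishing-gradient point (plain numpy, random weights, nothing to do with the paper's setup): the backpropagated signal picks up one sigmoid derivative per layer, and since that derivative is at most 0.25, the signal shrinks roughly geometrically with depth.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    width, depth = 50, 20
    weights = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width)) for _ in range(depth)]

    # Forward pass, remembering each layer's pre-activation.
    a, pre = rng.normal(size=width), []
    for W in weights:
        z = W @ a
        pre.append(z)
        a = sigmoid(z)

    # Backward pass: start with a unit error signal at the top and push it down.
    grad = np.ones(width)
    for i, (W, z) in enumerate(zip(reversed(weights), reversed(pre)), start=1):
        grad = W.T @ (grad * sigmoid(z) * (1 - sigmoid(z)))  # chain rule through sigmoid, then weights
        if i % 5 == 0:
            print(f"gradient norm {i} layers down: {np.linalg.norm(grad):.2e}")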


Please tell me whether the term "deep learning" has any meaning at all, and if so, what is it? It sounds like a buzzword to me, like "data scientist."


To some extent it's true that "deep learning" is a buzzword, though it is less arbitrary than "data scientist" (which makes no sense whatsoever). In a nutshell, what people mean by "deep learning" is the collection of tips and tricks that make it possible to efficiently train multiple layers of non-linear hidden units (or feature extractors).

Sometimes these are multi-layer neural nets, but they don't have to be.


I'm in the middle of a machine learning class and it's all very interesting, but I've got one question for you:

How difficult is it to make a neural net "forget"? If you've already trained over a large set of data, can you un-train it for some of the inputs?


That's an interesting (and relevant) question. In fact, if you're using something like stochastic gradient descent to optimize the weights of your network, it might be very hard for the network to escape the general local minimum (or basin of attraction) in which it ended after having trained over a large dataset, even if you present it with the same examples but with the labels flipped (which would be the easiest way to "unlearn").

In theory stochastic gradient descent allows you to escape local minima: the noise in the stochastic error surface will likely be enough for the network to escape whatever minimum it is in. Practically, because weights will tend to have a large magnitude and because most of the time you'll be using a saturating non-linearity (such as a sigmoid), the number of steps required to escape that local minimum might be too big.

Presumably, you could use second-order optimization methods to escape from minima -- because they allow you to make "bigger" steps -- but that comes with its own set of problems (negative curvature being one of them).

I encourage you to actually test these hypotheses: train a simple network on something stupid like MNIST, and make it achieve a reasonable error with many passes through the data. Then change the labels of 10, 20, or 50% of your inputs and continue training (with the same learning rate... or not!) to see how long it takes for the network to get to another minimum.
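
A minimal sketch of that experiment, using scikit-learn's small digits set as a stand-in for MNIST and MLPClassifier instead of a hand-rolled net (the 20% flip rate is arbitrary):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X, y = load_digits(return_X_y=True)
    net = MLPClassifier(hidden_layer_sizes=(64,), solver="sgd",
                        learning_rate_init=0.01, random_state=0)

    # Phase 1: many passes over the clean labels.
    net.partial_fit(X, y, classes=np.unique(y))   # first call needs the full class list
    for epoch in range(29):
        net.partial_fit(X, y)
    print("accuracy on clean labels:", net.score(X, y))

    # Phase 2: flip 20% of the labels and keep training with the same learning rate.
    y_flipped = y.copy()
    flip = rng.random(len(y)) < 0.2
    y_flipped[flip] = rng.integers(0, 10, size=flip.sum())

    for epoch in range(30):
        net.partial_fit(X, y_flipped)
        if (epoch + 1) % 10 == 0:
            print(f"epoch {epoch + 1}: fit to flipped labels = {net.score(X, y_flipped):.3f}, "
                  f"agreement with original labels = {net.score(X, y):.3f}")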


A year ago, I would have no idea what this was about. I'm very thankful that I live in a world where I can take high quality classes from Coursera that have given me the foundation to at least understand the abstract. :)


Which courses did you take out of interest?

I struggle with the mathematics used in neural networks. I can understand code but as soon as I start to see calculus my brain freezes over. Does anyone know of a good online course that can give me a crash course in the mathematics required for neural networks?

My CS bachelors covered this, but I was lazy and drank too much beer. Now 20 years later I want to understand it properly.


I've taken Andrew Ng's Machine Learning class, Daphne Koller's Probabilistic Graphical Models class, Dan Jurafsky and Christopher Manning's Natural Language Processing class, and currently Geoff Hinton's Neural Networks class.

I have spent a lot of time on Khan Academy to learn the calculus. In my experience you can get by with a surprisingly small amount of calculus, but it happens to be a small amount from a high level.

For example, backpropagation is just repeated application of the chain rule. It did take a while to get a handle on the derivatives, but it's worth it.
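
For instance, a minimal worked example with one input, one sigmoid hidden unit and a squared error, checked against a finite difference:

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    # Tiny network: y_hat = w2 * sigmoid(w1 * x), loss = 0.5 * (y_hat - y)^2
    x, y, w1, w2 = 1.5, 0.3, 0.8, -0.4

    def loss(w1, w2):
        return 0.5 * (w2 * sigmoid(w1 * x) - y) ** 2

    # Backpropagation is just the chain rule, written out factor by factor:
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    dloss_dyhat = y_hat - y            # d(0.5 * e^2) / d(y_hat)
    dh_dw1 = h * (1 - h) * x           # d(sigmoid(w1 * x)) / d(w1)
    grad_w1 = dloss_dyhat * w2 * dh_dw1
    grad_w2 = dloss_dyhat * h

    # Sanity check against finite differences.
    eps = 1e-6
    print(grad_w1, (loss(w1 + eps, w2) - loss(w1 - eps, w2)) / (2 * eps))
    print(grad_w2, (loss(w1, w2 + eps) - loss(w1, w2 - eps)) / (2 * eps))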


Do the theano deep learning tutorials. And keep hacking away at the math--you need it, but it eventually sinks in and becomes reasonably intuitive. Starting with code helped me grasp the math (I'm also much more comfortable reading code than math).


This is written by THE Jeff Dean, who is essentially the programming equivalent of Chuck Norris at Google ( http://www.quora.com/Jeff-Dean/What-are-all-the-Jeff-Dean-fa... )



