The Neural Networks class on Coursera covered a lot of the same topics, with both heavy mathematical theory and a hefty amount of practical application. https://www.coursera.org/course/neuralnets
>You have to realize that our theoretical tools are very weak. Sometimes, we have good mathematical intuitions for why a particular technique should work. Sometimes our intuition ends up being wrong [...] The questions become: how well does my method work on this particular problem, and how large is the set of problems on which it works well.
I'm not very familiar with this field. Has anyone made any progress on formalizing ways to measure the capabilities of intelligent systems? If the theory is weak, there must be someone working on improving it, right?
But since that's a $55M black hole with no published results other than a mostly meaningless claim to have solved CAPTCHA (which wasn't all that tough a task to begin with), there's no way to tell; it doesn't seem like practitioners of the art are the ones evaluating his prospects for further funding. But don't believe some random dude on HN, here's Yann LeCun saying pretty much the same thing:
Hey Michael, I loved your book on Quantum Computing, but don't get me started on D-Wave, or as I see it: $15M for a huge magic box that might be faster than a $15,000 GPU cluster on some problems.
But seriously, the book rocked, and this one's coming along nicely.
This paper demonstrates something called "zero-shot learning", where you can infer the correct label of an unseen image based on similarity among representations learned in a separate NLP task.
For instance, it can label an image "tiger" even if it has never seen tigers but has only learned about the word (and inferred its relation to "cat", an image class it has seen) from reading text.
It's not intelligent, not even close. But it's an awfully strange emergent phenomenon these systems are demonstrating. Exciting stuff, I think.
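A toy sketch of the idea (mine, not the paper's code; the 3-d word vectors are made up, whereas in the paper they're learned from text alone):

    # Zero-shot labeling: map an image into a word-embedding space and
    # pick the label whose word vector is nearest.
    import numpy as np

    word_vecs = {
        "cat":   np.array([0.9, 0.1, 0.0]),
        "tiger": np.array([0.8, 0.2, 0.1]),  # no tiger images ever seen
        "truck": np.array([0.0, 0.1, 0.9]),
    }

    def zero_shot_label(image_embedding):
        # Choose the label whose word vector is most cosine-similar to
        # the image's predicted embedding.
        def cos(u, v):
            return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        return max(word_vecs, key=lambda w: cos(word_vecs[w], image_embedding))

    # An image of a tiger, mapped into word space by a model trained only
    # on seen classes (cats, trucks, ...), still lands nearest "tiger".
    print(zero_shot_label(np.array([0.82, 0.18, 0.08])))  # -> tiger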
It seems like the implicit target in the document is to achieve a critically damped system with no ringdown on learning. However, if they're trying to go for speed, then it seems like they should accept possible overshoot, and use non-linear control theory for their weights so that they're underdamped during the initial descent, and then transition into critically damped gradient descent as they move into the flat zone. Something like a variable "damper" or weights/springs based on current error. Perhaps that is done elsewhere though, and just not described as a technique here.
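To make the idea concrete, here's a toy sketch (mine, not from the document; the damping schedule is an arbitrary choice):

    # Momentum gradient descent on a 1-d quadratic, with a "damper" that
    # stiffens as the error shrinks: underdamped (fast, possibly
    # overshooting) early on, closer to critically damped near the minimum.
    def loss(w):
        return 0.5 * (w - 3.0) ** 2

    def grad(w):
        return w - 3.0

    w, v, lr = -5.0, 0.0, 0.1   # weight, velocity, learning rate
    for step in range(200):
        err = loss(w)
        # High momentum (low friction) while the error is large; more
        # friction as we enter the flat zone: mu slides from ~0.95 to 0.5.
        mu = 0.5 + 0.45 * err / (err + 1.0)
        v = mu * v - lr * grad(w)
        w += v
    print(w)  # converges to ~3.0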
I found the statement about the cross-entropy to be untrue. When y == a, the function is non-monotonic, with 0 at the extremes but not in the middle. So the "proof" shown is confusing to me.
The statement "if the neuron's actual output is close to the desired output, i.e., y=y(x) for all training inputs x, then the cross-entropy will be close to zero"
is not true. The function peaks in the middle (~0.7).
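To spell out that case (my own quick check, using the chapter's cross-entropy for a single neuron): setting y = a gives the binary entropy function

    C(a) = -[ a \ln a + (1 - a) \ln(1 - a) ],

which is 0 at a = 0 and a = 1 and peaks at a = 1/2 with value \ln 2 \approx 0.693, i.e. the ~0.7 above.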
This is addressed in the marginal note attached to the sentence you quoted.
The essential point is that we're considering classification problems, for which the output is intended to be 0 or 1. I address the more general case of regression problems (where y may take any value) in a later exercise.
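In symbols: for a classification target y \in \{0, 1\}, the cross-entropy reduces to

    C = -\ln(1 - a)  when y = 0,    C = -\ln a  when y = 1,

and both go to 0 as the output a approaches the desired value.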
The sections on Regularization and Dropout have some amazing prose. I haven't read any of the other chapters, but just skimming through those sections has helped enlighten me on quite a few things that have confused me for years in completely different fields... such as why a random forest made up of randomly selected simple CARTs generally predicts better than a single complex CART, or why fitting a distribution to empirical data can benefit from using AIC or BIC methods.
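A quick sketch of the random-forest point (mine, on arbitrary synthetic data, not from the book):

    # A single fully grown CART versus a forest of shallow CARTs,
    # showing the variance-averaging effect alluded to above.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20,
                               n_informative=8, random_state=0)

    # One deep tree: low bias, high variance.
    single_cart = DecisionTreeClassifier(max_depth=None, random_state=0)

    # Many shallow trees on bootstrap samples with random feature subsets:
    # each is weak, but averaging cancels much of the variance, the same
    # intuition behind dropout averaging over many sub-networks.
    forest = RandomForestClassifier(n_estimators=200, max_depth=4,
                                    random_state=0)

    print("single CART:", cross_val_score(single_cart, X, y, cv=5).mean())
    print("forest     :", cross_val_score(forest, X, y, cv=5).mean())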