LSTMs are both amazing and not quite good enough. They seem to be too complicated for what they do well, and not quite complex enough for what they can't do so well. The main limitation is that they mix structure with style, or type with value. For example, if you teach an LSTM addition on 6-digit numbers, it won't be able to generalize to 20-digit numbers.
That's because it doesn't factorize the input into separate meaningful parts. The next step in LSTMs will be to operate over relational graphs so they only have to learn function and not structure at the same time. That way they will be able to generalize more between different situations and be much more useful.
Graphs can be represented as adjacency matrices and data as vectors. By multiplying a vector with the matrix, you can do graph computation. Recurrent graph computations are a lot like LSTMs. That's why I think LSTMs are going to become more invariant to permutation and object composition in the future, by using graph data representations instead of flat Euclidean vectors, and typed data instead of untyped data. So they are going to become strongly typed, graph RNNs. With such toys we can do visual and text-based reasoning, and physical simulation.
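To make the adjacency-matrix point concrete, here's a minimal NumPy sketch (my own toy example, not from the comment): repeated matrix-vector multiplication propagates node values along the graph's edges, which is the kind of recurrent graph computation being described.

```python
import numpy as np

# 4-node directed chain: 0 -> 1 -> 2 -> 3 (structure lives in A, values in h)
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)

h = np.array([1.0, 0.0, 0.0, 0.0])  # a signal starting at node 0

for step in range(3):
    h = A.T @ h          # each node receives the sum of its predecessors' values
    print(step, h)       # the signal walks one hop along the edges per step
```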
What you describe is very similar to the Differentiable Neural Computer from DeepMind [1]. While still experimental, it already shows nice results.
I'd put a little more faith in LSTMs. There's a lot less evidence for what they can't do than for what they can. With enough fiddling, you can get LSTMs to work for most tasks.
I'm not sure they're quite as complicated as you're making them out to be. If they are, then try a GRU instead ;)
I personally find recurrent highway networks (RHNs), as described in [1], easier to understand, and their formulas easier to remember, than the original LSTM. Since they are generalizations of the LSTM, once you understand RHNs you can understand the LSTM as just a particular case of an RHN.
Instead of handwaving about "forgetting", it is IMO better to understand the problem of vanishing gradients and how forget gates actually help with it.
And Jürgen Schmidhuber, the inventor of LSTM, is a co-author of the RHN paper.
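To make that concrete, here is the standard cell-state update and its gradient (the usual textbook form, not taken from the RHN paper):

```latex
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
\quad\Rightarrow\quad
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)
```

When the forget gate f_t stays close to 1, the factor multiplying the backpropagated gradient at each step is close to 1, so the gradient flowing through the cell state neither vanishes nor explodes over long spans.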
In the experiment on teaching an LSTM to count, it's useful to note that the examples it's trained on are derivations [1] from the grammar a^nb^n (with n > 0), a classic example of a Context-Free Grammar (CFG).
It's well understood that CFGs cannot be induced from examples [2], which accounts for the fact that LSTMs cannot learn "counting" in this manner, nor indeed can any other learning method that learns from examples.
_______________
[1] "Strings generated from"
[2] The same goes for any formal grammar other than finite ones (finite grammars being simpler than regular).
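For concreteness, here is a small sketch of the kind of setup being discussed (my own illustration, hypothetical names): the model is trained on a^nb^n strings for small n and then tested on larger n, which is where the inductive leap is required.

```python
# Toy illustration of the a^n b^n setup (not the article's code):
# train on small n, test on larger n to probe generalization.
def anbn(n):
    """Return the string a^n b^n, e.g. anbn(3) == 'aaabbb'."""
    return "a" * n + "b" * n

train_strings = [anbn(n) for n in range(1, 11)]    # n = 1..10 seen in training
test_strings  = [anbn(n) for n in range(11, 31)]   # n = 11..30 only at test time

def is_anbn(s):
    """Exact membership check for the language {a^n b^n : n > 0}."""
    n = len(s) // 2
    return len(s) % 2 == 0 and n > 0 and s == anbn(n)

print(train_strings[:3])                    # ['ab', 'aabb', 'aaabbb']
print(is_anbn("aaabbb"), is_anbn("aaabb"))  # True False
```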
Aw man! That 1994 paper by Sakakibara is a piece of history! It concludes by saying that it's "part of the work in the major R&D of the Fifth Generation Computer Project conducted under the program set up by MITI" [1]. Plus, Sakakibara is one of my grammar induction heroes :0
However, his algorithm learns CFGs from structural data, which is to say, derivation trees (think parse trees). So it's completely irrelevant to the example in the article, which attempts to learn a^nb^n from examples of its derivations; that remains impossible.
As to the other paper, by Chen, Tseng and Chen, that's about learning a CFG that reproduces the strings in a corpus, so learning a CFG of a corpus as opposed to the grammar of a context-free language (therefore, a context-free grammar), which, again, remains impossible.
From a quick glance, the example doesn't quite learn regular grammars - rather, it reduces them to finite grammars (e.g. the finite grammar of all US presidential winners and losers) and learns (or, more accurately, invents) a regular expression for them.
Finite grammar induction from positive examples only is feasible in polynomial time, so Peter Norvig's notebook will not cause the fabric of the space-time continuum to be torn asunder, I am sure.
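As a trivial illustration of why the finite case is easy (my own sketch, made-up example data): a finite language can be "learned" from positive examples simply by memorizing them, after which membership queries are exact.

```python
# "Learning" a finite language from positive examples only = memorizing the sample.
examples = {"Obama", "Romney", "McCain", "Kerry", "Bush", "Gore"}  # hypothetical data

def accepts(word, learned=frozenset(examples)):
    """Membership test for the memorized finite language."""
    return word in learned

print(accepts("Obama"), accepts("Lincoln"))  # True False
```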
It's not clear that the learnability results about formal grammars are useful for describing what can be learned by practical algorithms, where the grammar does not have to be identified with perfect confidence.
Counting is one thing and I'm not sure whether this is possible to do or not (my understanding is that it's "not"), but the example in the original article is specifically trying to learn to reproduce strings of exactly n a's followed by exactly n b's, for arbitrary positive n.
That's learning the CFG a^nb^n for n > 0 in a nutshell, and is therefore impossible; indeed the LSTM in the article falters once test-time n exceeds training-time n by a few units.
The paper you link to discusses subject matter I'm not familiar with (bootstrapping as in psychology, say) so I'm not really qualified to opine on it. However, from a very quick look it seems like they're learning to associate counting words with numbers (sorry if I'm wrong about that). Still, that's not "learning to count", where you have to be able to make the inductive leap that n + 1 > n for all n in your target set of numbers, in addition to learning the recursive structure of number strings.
Finally, from the little I've seen of attempts to model counting with machine learning, there hasn't yet been much success, probably because it's impossible to learn a language with both structure and meaning from examples only of its structure and no access to their meaning.
LSTMs are on their way out, in my opinion. They are a hack to make memory in recurrent networks more persistent. In practice they overfit too easily. They are being replaced with convolutional networks. Have a look at the latest paper from Facebook about translation for more details.
The way I see it, the difference is that with a CNN you have a fixed maximum timeframe in which knowledge about the world is preserved, while LSTMs and RNNs in general do not impose such a restriction. This makes them better suited for some applications.
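A rough sketch of what that "fixed maximum timeframe" means for a 1-D convolutional stack (my own arithmetic, not from the Facebook paper): the receptive field grows with depth, kernel size, and dilation, but is a fixed constant once the architecture is chosen.

```python
# Receptive field of a stack of stride-1 1-D convolutions:
# each layer with kernel size k and dilation d adds (k - 1) * d time steps.
def receptive_field(layers):
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf

# e.g. four layers, kernel 3, dilations 1, 2, 4, 8 -> a fixed window of 31 steps
print(receptive_field([(3, 1), (3, 2), (3, 4), (3, 8)]))  # 31
```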
Yes, the bulk of our business is time series. This includes everything from hardware breakdowns to fraud detection.
I think Jeremy has some good points in general, but I wouldn't assume that everything is binary. (By this, I mean: look at these kinds of terse statements with a bit of nuance.)
Usually, as long as you have a high amount of regularization and use truncated backprop through time in training, you can solve some fairly useful classification and forecasting problems.
Beyond that, standard neural net tuning applies, e.g.: normalize your data, pay attention to your weight initialization, understand what loss function you're using, etc.
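A minimal PyTorch-style sketch of what truncated backprop through time looks like in practice (hypothetical model and data; dropout stands in for "a high amount of regularization"):

```python
import torch
import torch.nn as nn

# Hypothetical setup: batches of univariate series, one prediction per time step.
model = nn.LSTM(input_size=1, hidden_size=64, num_layers=2,
                dropout=0.5, batch_first=True)      # dropout = regularization
head = nn.Linear(64, 1)
opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()))
loss_fn = nn.MSELoss()

x = torch.randn(32, 1000, 1)   # fake data: 32 series, 1000 steps each
y = torch.randn(32, 1000, 1)
tbptt = 100                    # truncation length

hidden = None
for t in range(0, x.size(1), tbptt):
    chunk, target = x[:, t:t + tbptt], y[:, t:t + tbptt]
    out, hidden = model(chunk, hidden)
    loss = loss_fn(head(out), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Detach so gradients don't flow back past the truncation boundary.
    hidden = tuple(h.detach() for h in hidden)
```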
LSTMs don't so much "forget" as "remember the things that matter". They don't necessarily need less data. They do have a limit on the "length" of time steps they can handle, though.
E.g.: you can't go thousands of steps into the future (maybe a few hundred or so).
The "long" part of "LSTM" means it is good at remembering long-range dependencies.
I said, IIRC, that they're very similar in terms of results and GRUs are a little simpler. I've seen some papers show better results for GRUs, and vice versa.
Oh also I did say that CNNs often beat RNNs, even for time series, language, and audio. So I generally try them first. You can always stick an RNN on top of the CNN and get the best of both worlds.
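A rough sketch of the "RNN on top of a CNN" idea (my own hypothetical layer sizes, in PyTorch): the convolution extracts local features, and the recurrent layer models longer-range context over them.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    """Conv front-end for local features; GRU on top for longer context."""
    def __init__(self, in_channels=1, conv_channels=32, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, conv_channels, kernel_size=5, padding=2)
        self.rnn = nn.GRU(conv_channels, hidden, batch_first=True)

    def forward(self, x):                              # x: (batch, time, channels)
        z = torch.relu(self.conv(x.transpose(1, 2)))   # conv expects (batch, channels, time)
        out, _ = self.rnn(z.transpose(1, 2))           # back to (batch, time, features)
        return out

print(ConvGRU()(torch.randn(8, 200, 1)).shape)  # torch.Size([8, 200, 64])
```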
I just wanted to take the chance to say thank you for the course. As someone who is interested in the topic but isn't a practitioner, just watching it all work in practice has been fascinating and highly enlightening.
Is there code for the coloring of neurons per-character as in the post? I've seen that type of visualization on similar posts and am curious if there is a library for it. (the original char-rnn post [http://karpathy.github.io/2015/05/21/rnn-effectiveness/] indicates that it is custom Code/CSS/HTML)
Is the code for generating the reactions from the LSTM hidden units posted anywhere? That was the best part for me and I'd love to use it in my own projects.
In Andrej Karpathy's excellent blogpost on RNNs [1], he links to some of the code he used for visualisation [2]. I have done something similar. In general: put the activations of each memory cell over time through tanh and color each character based on that.
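A minimal sketch of that kind of visualisation (my own code, with made-up activation values): squash one cell's activation per character through tanh and map it to a background colour in HTML.

```python
import math

def color_chars(text, activations):
    """Render each character with a background colour from its activation.

    `activations` is one number per character (e.g. one cell's memory state
    over time); tanh squashes it into [-1, 1], mapped here to red..green.
    """
    spans = []
    for ch, a in zip(text, activations):
        v = math.tanh(a)                       # in [-1, 1]
        red   = int(255 * max(0.0, -v))        # negative activation -> red
        green = int(255 * max(0.0, v))         # positive activation -> green
        spans.append(f'<span style="background: rgba({red},{green},0,0.4)">{ch}</span>')
    return "".join(spans)

# Toy usage with made-up activations:
print(color_chars("hello", [0.1, -2.0, 0.5, 1.5, -0.3]))
```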
Neural nets that update over time need a way to know what to keep and what to forget from last time. So let's learn what to keep and forget... by using more neural nets.
Disclaimer: I just learned what an LSTM was. But it's a good article.
Knowing what the actual units do, I prefer spelling out the acronym as Long Short-Term Memory. They implement short-term memory, like all RNNs do, with a slight improvement to make it long.
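For reference, spelling the units out (the standard LSTM equations; notation varies slightly between papers):

```latex
\begin{aligned}
  f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
  i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
  o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
  \tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
  c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
  h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```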