LSTMs are both amazing and not quite good enough. They seem to be too complicated for what they do well, and not quite complex enough for what they can't do so well. The main limitation is that they mix structure with style, or type with value. For example, if you teach an LSTM addition on 6-digit numbers, it won't be able to generalize to 20-digit numbers.
That's because it doesn't factorize the input into separate meaningful parts. The next step in LSTMs will be to operate over relational graphs so they only have to learn function and not structure at the same time. That way they will be able to generalize more between different situations and be much more useful.
Graphs can be represented as adjacency matrices and data as vectors. By multiplying a vector with the matrix, you can do graph computation. Recurrent graph computations are a lot like LSTMs. That's why I think LSTMs are going to become more invariant to permutation and object composition in the future, by using graph data representations instead of flat Euclidean vectors, and typed data instead of untyped data. So they are going to become strongly typed, graph RNNs. With such toys we can do visual and text-based reasoning, and physical simulation.
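To make the adjacency-matrix point concrete, here's a minimal NumPy sketch (my own toy example, not from the comment): repeated matrix-vector multiplication propagates node values along the graph's edges, which is the kind of recurrent graph computation being described.

```python
import numpy as np

# 4-node directed chain: 0 -> 1 -> 2 -> 3 (structure lives in A, values in h)
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)

h = np.array([1.0, 0.0, 0.0, 0.0])  # a signal starting at node 0

for step in range(3):
    h = A.T @ h          # each node receives the sum of its predecessors' values
    print(step, h)       # the signal walks one hop along the edges per step
```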
What you describe is very similar to the Differentiable Neural Computer from DeepMind [1]. While still experimental, it already shows nice results.
I'd put a little more faith in LSTMs. There's a lot less evidence for what they can't do than for what they can. With enough fiddling, you can get LSTMs to work for most tasks.
I'm not sure they're quite as complicated as you're making them out to be. If they are, then try a GRU instead ;)
I personally find recurrent highway networks (RHNs), as described in [1], easier to understand, and their formulas easier to remember, than the original LSTM. Since they are generalizations of the LSTM, once you understand RHNs you can understand the LSTM as just a particular case of an RHN.
Instead of handwaving about "forgetting", it is IMO better to understand the problem of vanishing gradients and how forget gates actually help with it.
And Jürgen Schmidhuber, the inventor of LSTM, is a co-author of the RHN paper.
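To make that concrete, here is the standard cell-state update and its gradient (the usual textbook form, not taken from the RHN paper):

```latex
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
\quad\Rightarrow\quad
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)
```

When the forget gate f_t stays close to 1, the factor multiplying the backpropagated gradient at each step is close to 1, so the gradient flowing through the cell state neither vanishes nor explodes over long spans.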
In the experiment on teaching an LSTM to count, it's useful to note that the examples it's trained on are derivations [1] from the grammar a^nb^n (with n > 0), a classic example of a Context-Free Grammar (CFG).
It's well understood that CFGs cannot be induced from examples [2], which accounts for the fact that LSTMs cannot learn "counting" in this manner, nor indeed can any other learning method that learns from examples.
_______________
[1] "Strings generated from"
[2] The same goes for any formal grammar other than finite ones (finite grammars being simpler than regular).
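For concreteness, here is a small sketch of the kind of setup being discussed (my own illustration, hypothetical names): the model is trained on a^nb^n strings for small n and then tested on larger n, which is where the inductive leap is required.

```python
# Toy illustration of the a^n b^n setup (not the article's code):
# train on small n, test on larger n to probe generalization.
def anbn(n):
    """Return the string a^n b^n, e.g. anbn(3) == 'aaabbb'."""
    return "a" * n + "b" * n

train_strings = [anbn(n) for n in range(1, 11)]    # n = 1..10 seen in training
test_strings  = [anbn(n) for n in range(11, 31)]   # n = 11..30 only at test time

def is_anbn(s):
    """Exact membership check for the language {a^n b^n : n > 0}."""
    n = len(s) // 2
    return len(s) % 2 == 0 and n > 0 and s == anbn(n)

print(train_strings[:3])                    # ['ab', 'aabb', 'aaabbb']
print(is_anbn("aaabbb"), is_anbn("aaabb"))  # True False
```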
Aw man! That 1994 paper by Sakakibara is a piece of history! It concludes by saying that it's "part of the work in the major R&D of the Fifth Generation Computer Project conducted under the program set up by MITI" [1]. Plus, Sakakibara is one of my grammar induction heroes :0
However, his algorithm learns CFGs from structural data, which is to say, derivation trees (think parse trees). So it's completely irrelevant to the example in the article, which attempts to learn a^nb^n from examples of its derivations; that remains impossible.
As to the other paper, by Chen, Tseng and Chen, that's about learning a CFG that reproduces the strings in a corpus, so learning a CFG of a corpus as opposed to the grammar of a context-free language (therefore, a context-free grammar), which, again, remains impossible.
From a quick glance, the example doesn't quite learn regular grammars - rather, it reduces them to finite grammars (e.g. the finite grammar of all US presidential winners and losers) and learns (or, more accurately, invents) a regular expression for them.
Finite grammar induction from positive examples only is feasible in polynomial time, so Peter Norvig's notebook will not cause the fabric of the space-time continuum to be torn asunder, I am sure.
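As a trivial illustration of why the finite case is easy (my own sketch, made-up example data): a finite language can be "learned" from positive examples simply by memorizing them, after which membership queries are exact.

```python
# "Learning" a finite language from positive examples only = memorizing the sample.
examples = {"Obama", "Romney", "McCain", "Kerry", "Bush", "Gore"}  # hypothetical data

def accepts(word, learned=frozenset(examples)):
    """Membership test for the memorized finite language."""
    return word in learned

print(accepts("Obama"), accepts("Lincoln"))  # True False
```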
It's not clear that the learnability results about formal grammars are useful for describing what can be learned by practical algorithms, where the grammar does not have to be identified with perfect confidence.
Counting is one thing and I'm not sure whether this is possible to do or not (my understanding is that it's "not"), but the example in the original article is specifically trying to learn to reproduce strings of exactly n a's followed by exactly n b's, for arbitrary positive n.
That's learning the CFG a^nb^n for n > 0 in a nutshell, and is therefore impossible; indeed the LSTM in the article falters once test-time n exceeds training-time n by a few units.
The paper you link to discusses subject matter I'm not familiar with (bootstrapping as in psychology, say) so I'm not really qualified to opine on it. However, from a very quick look it seems like they're learning to associate counting words with numbers (sorry if I'm wrong about that). Still, that's not "learning to count", where you have to be able to make the inductive leap that n + 1 > n for all n in your target set of numbers, in addition to learning the recursive structure of number strings.
Finally, from the little I've seen of attempts to model counting with machine learning, there hasn't yet been much success, probably because it's impossible to learn a language with both structure and meaning from examples only of its structure and no access to their meaning.
LSTMs are on their way out, in my opinion. They are a hack to make memory in recurrent networks more persistent. In practice they overfit too easily. They are being replaced with convolutional networks. Have a look at the latest paper from Facebook about translation for more details.
The way I see it, the difference is that with a CNN you have a fixed maximum timeframe in which knowledge about the world is preserved, while LSTMs and RNNs in general do not impose such a restriction. This makes them better suited for some applications.
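A rough sketch of what that "fixed maximum timeframe" means for a 1-D convolutional stack (my own arithmetic, not from the Facebook paper): the receptive field grows with depth, kernel size, and dilation, but is a fixed constant once the architecture is chosen.

```python
# Receptive field of a stack of stride-1 1-D convolutions:
# each layer with kernel size k and dilation d adds (k - 1) * d time steps.
def receptive_field(layers):
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf

# e.g. four layers, kernel 3, dilations 1, 2, 4, 8 -> a fixed window of 31 steps
print(receptive_field([(3, 1), (3, 2), (3, 4), (3, 8)]))  # 31
```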
Yes, the bulk of our business is time series. This includes everything from hardware breakdowns to fraud detection.
I think Jeremy has some good points in general, but I wouldn't assume that everything is binary. (By this, I mean: look at these kinds of terse statements with a bit of nuance.)
Usually, as long as you have a high amount of regularization and use truncated backprop through time in training, you can solve some fairly useful classification and forecasting problems.
Beyond that, standard neural net tuning applies, e.g.: normalize your data, pay attention to your weight initialization, understand what loss function you're using, etc.
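A minimal PyTorch-style sketch of what truncated backprop through time looks like in practice (hypothetical model and data; dropout stands in for "a high amount of regularization"):

```python
import torch
import torch.nn as nn

# Hypothetical setup: batches of univariate series, one prediction per time step.
model = nn.LSTM(input_size=1, hidden_size=64, num_layers=2,
                dropout=0.5, batch_first=True)      # dropout = regularization
head = nn.Linear(64, 1)
opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()))
loss_fn = nn.MSELoss()

x = torch.randn(32, 1000, 1)   # fake data: 32 series, 1000 steps each
y = torch.randn(32, 1000, 1)
tbptt = 100                    # truncation length

hidden = None
for t in range(0, x.size(1), tbptt):
    chunk, target = x[:, t:t + tbptt], y[:, t:t + tbptt]
    out, hidden = model(chunk, hidden)
    loss = loss_fn(head(out), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Detach so gradients don't flow back past the truncation boundary.
    hidden = tuple(h.detach() for h in hidden)
```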
LSTMs don't so much "forget" as "remember the things that matter". They don't necessarily need less data. They do have a limit on the "length" of time steps they can handle, though.
E.g.: you can't go thousands of steps into the future (maybe a few hundred or so).
The "long" part of "LSTM" means it is good at remembering long-range dependencies.
I said, IIRC, that they're very similar in terms of results and GRUs are a little simpler. I've seen some papers show better results for GRUs, and vice versa.
Oh also I did say that CNNs often beat RNNs, even for time series, language, and audio. So I generally try them first. You can always stick an RNN on top of the CNN and get the best of both worlds.
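A rough sketch of the "RNN on top of a CNN" idea (my own hypothetical layer sizes, in PyTorch): the convolution extracts local features, and the recurrent layer models longer-range context over them.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    """Conv front-end for local features; GRU on top for longer context."""
    def __init__(self, in_channels=1, conv_channels=32, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, conv_channels, kernel_size=5, padding=2)
        self.rnn = nn.GRU(conv_channels, hidden, batch_first=True)

    def forward(self, x):                              # x: (batch, time, channels)
        z = torch.relu(self.conv(x.transpose(1, 2)))   # conv expects (batch, channels, time)
        out, _ = self.rnn(z.transpose(1, 2))           # back to (batch, time, features)
        return out

print(ConvGRU()(torch.randn(8, 200, 1)).shape)  # torch.Size([8, 200, 64])
```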
I just wanted to take the chance to say thank you for the course. As someone who is interested in the topic but isn't a practitioner, just watching it all work in practice has been fascinating and highly enlightening.
Is there code for the coloring of neurons per-character as in the post? I've seen that type of visualization on similar posts and am curious if there is a library for it. (the original char-rnn post [http://karpathy.github.io/2015/05/21/rnn-effectiveness/] indicates that it is custom Code/CSS/HTML)
Is the code for generating the reactions from the LSTM hidden units posted anywhere? That was the best part for me and I'd love to use it in my own projects.
In Andrej Karpathy's excellent blogpost on RNNs [1], he links to some of the code he used for visualisation [2]. I have done something similar. In general: put the activations of each memory cell over time through tanh and color each character based on that.
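A minimal sketch of that kind of visualisation (my own code, with made-up activation values): squash one cell's activation per character through tanh and map it to a background colour in HTML.

```python
import math

def color_chars(text, activations):
    """Render each character with a background colour from its activation.

    `activations` is one number per character (e.g. one cell's memory state
    over time); tanh squashes it into [-1, 1], mapped here to red..green.
    """
    spans = []
    for ch, a in zip(text, activations):
        v = math.tanh(a)                       # in [-1, 1]
        red   = int(255 * max(0.0, -v))        # negative activation -> red
        green = int(255 * max(0.0, v))         # positive activation -> green
        spans.append(f'<span style="background: rgba({red},{green},0,0.4)">{ch}</span>')
    return "".join(spans)

# Toy usage with made-up activations:
print(color_chars("hello", [0.1, -2.0, 0.5, 1.5, -0.3]))
```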
Neural nets that update over time need a way to know what to keep and what to forget from last time. So let's learn what to keep and forget... by using more neural nets.
Disclaimer: I just learned what an LSTM was. But it's a good article.
Knowing what the actual units do, I prefer spelling out the acronym as Long Short-Term Memory. They implement short-term memory, like all RNNs do, with a slight improvement to make it long.
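For reference, spelling the units out (the standard LSTM equations; notation varies slightly between papers):

```latex
\begin{aligned}
  f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
  i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
  o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
  \tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
  c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
  h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```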