
It is a dense read, but you might have a look at [1]. This is how attention is implemented in Theano. Basically the key is going "3D" per timestep (where 2D per timestep is the norm when doing minibatch training), then taking a weighted sum over the correct axis to get the right size to combine with the RNN state.

Short summary:

I: input length

M: minibatch size (same for input and output)

H_in: input hidden size (arbitrary/user selected)

H_out: output hidden size (arbitrary/user selected)

C: attention feature size (arbitrary/user selected)
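
To make those shapes concrete, here is a rough numpy sketch of the tensors involved (the names and toy sizes are mine, not the variables used in [1]):

    import numpy as np

    # Toy sizes just to pin down the shapes; the values are arbitrary.
    I, M, H_in, H_out, C = 7, 4, 16, 32, 24

    # All hidden states of the input BiRNN, one per input timestep.
    h_enc = np.random.randn(I, M, H_in)

    # Decoder/generator hidden state at the current output timestep.
    h_dec = np.random.randn(M, H_out)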

Looking at the decode/generator RNN, the "context" that comes in at every timestep is the full set of hidden states from the BiRNN (I, M, H_in), projected to (I, M, C). We do the normal RNN thing (tanh, LSTM, GRU) for the generator first to get a decode hidden state (M, H_out).
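
Continuing the sketch above, that projection might look like this (W_ctx is a made-up weight name, not the one in [1]):

    # Project the BiRNN hiddens from (I, M, H_in) to the attention size C.
    W_ctx = np.random.randn(H_in, C)
    ctx = np.dot(h_enc, W_ctx)            # (I, M, C)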

Next the "first" output hidden state gets projected to the attention size C (so now (M, C)), and using numpy broadcasting this (1, M, C) size thing gets summed with the (I, M, C) size "context" which is a projection of the input RNN hiddens. Now we have something that is looking at both what the output RNN has previously done (and seen), and some context information from the input RNN.
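
In the same sketch, the broadcasted sum would be roughly (W_dec is again a made-up name):

    # Project the decoder hidden state to the attention size and add a
    # leading axis so broadcasting sums it against every input timestep.
    W_dec = np.random.randn(H_out, C)
    dec_proj = np.dot(h_dec, W_dec)           # (M, C)
    combined = ctx + dec_proj[None, :, :]     # (I, M, C) + (1, M, C) -> (I, M, C)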

A nonlinearity (tanh) is applied to the (I, M, C) sized piece, partly to bound the activation to (-1, 1) and partly just because we like nonlinearity. This (I, M, C) sized thing then gets projected to (I, M, 1), or "alpha", the useless dimension is dropped via reshape so we now have (I, M), and a softmax is applied over the first dimension I. This means that the network has effectively "weighted" each of the I timesteps of the input; at the start of training it can't know what is relevant, but at the end of training this weighting is the thing you visualize.
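
In numpy terms, continuing the sketch (v is a hypothetical (C, 1) projection; the actual code in [1] will differ in details):

    energy = np.tanh(combined)                        # (I, M, C), bounded to (-1, 1)
    v = np.random.randn(C, 1)
    alpha = np.dot(energy, v).reshape(I, M)           # (I, M, 1) -> drop the useless dim
    # Softmax over the input-length axis I, so the weights per example sum to 1.
    alpha = np.exp(alpha - alpha.max(axis=0, keepdims=True))
    alpha = alpha / alpha.sum(axis=0, keepdims=True)  # (I, M)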

This (I, M) is the actual "attention" - to apply it you simply take the original hidden states from the input BiRNN and use broadcasting to multiply again. (I, M, 1) * (I, M, H_in) gives a weighting over every timestep of the input, and finally we sum over the 0 axis (I) to get the final "context" of just (M, H_in) size. This can then be projected to a new size (H_out) and combined with the output hiddens to get an RNN that has "seen" a weighted sum of the input, so it can generate conditioned on the whole input sequence.
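
Finishing the sketch (W_out is a made-up projection back to the output hidden size):

    # Weight every input timestep by its attention and sum over the I axis.
    weighted = alpha[:, :, None] * h_enc      # (I, M, 1) * (I, M, H_in) -> (I, M, H_in)
    context = weighted.sum(axis=0)            # (M, H_in)

    # Optionally project to H_out before combining with the decoder hiddens.
    W_out = np.random.randn(H_in, H_out)
    context_out = np.dot(context, W_out)      # (M, H_out)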

Note that this whole messy procedure I described happens per timestep - so the neural network (given enough data) can learn what to look at in the whole input in order to generate the best output. Other attention mechanisms such as [2], [3] are constrained either to only move forward or to be a hybrid of forward movement and (pooled) global lookup. [4] is a good summary of the different possibilities in use today.
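
If it helps, here is the same sketch rolled into a per-timestep loop. attend() just repackages the steps above, and rnn_step() is a plain tanh recurrence standing in for whatever GRU/LSTM (and previous output token) you would actually feed in, so this only shows the structure, not the real code from [1]:

    # Attention is recomputed from scratch at every output timestep.
    def attend(h_enc, ctx, h_dec, W_dec, v):
        energy = np.tanh(ctx + np.dot(h_dec, W_dec)[None, :, :])   # (I, M, C)
        alpha = np.dot(energy, v).reshape(h_enc.shape[0], h_enc.shape[1])
        alpha = np.exp(alpha - alpha.max(axis=0, keepdims=True))
        alpha /= alpha.sum(axis=0, keepdims=True)                  # (I, M)
        return (alpha[:, :, None] * h_enc).sum(axis=0)             # (M, H_in)

    W_rec = np.random.randn(H_out, H_out)
    W_in = np.random.randn(H_in, H_out)

    def rnn_step(h_dec, context):
        # Stand-in recurrence; a real model would also condition on the
        # previously generated output.
        return np.tanh(np.dot(h_dec, W_rec) + np.dot(context, W_in))

    context = h_enc.mean(axis=0)              # crude initial context, (M, H_in)
    for t in range(5):                        # e.g. 5 output timesteps
        h_dec = rnn_step(h_dec, context)
        context = attend(h_enc, ctx, h_dec, W_dec, v)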

[1] https://github.com/kyunghyuncho/dl4mt-material/blob/master/s...

[2] http://arxiv.org/abs/1308.0850

[3] http://arxiv.org/abs/1508.04395

[4] PPT, http://www.thespermwhale.com/jaseweston/ram/slides/session2/...




Thanks. That was the info I was looking for. Still a lot to digest but it is going to be really helpful.



