
Hi, thanks for a great paper. I have read it, as well as the NVIDIA articles, many times, but I am failing to grasp some important details. If you could shed some light, that would be great.

The attention mechanism is a single-layer feed-forward neural network. The problem is that the input sequence is variable length. How are the outputs of the BiRNN fed to the attention mechanism? What happens if I have an 80-word sentence as input, and what happens if I have a 10-word sentence as input?




It is a dense read, but you might have a look at [1]. This is how attention is implemented in Theano. Basically the key is going "3D" per timestep (where 2D per timestep is the norm when doing minibatch training), then taking a weighted sum over the correct axis to get the right size to combine with the RNN state.

Short summary:

I: input length

M: minibatch size (same for input and output)

H_in: input hidden size (arbitrary/user selected)

H_out: output hidden size (arbitrary/user selected)

C: attention feature size (arbitrary/user selected)

Looking at the decode/generator RNN, "context" comes in at every timestep as every hidden state from the BiRNN (I, M, H_in) projected to (I, M, C). We do the normal RNN thing (tanh, LSTM, GRU) for the generator at first to get a decode hidden state (M, H_out).
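
Roughly, in numpy (all sizes and weights here are made up for illustration, and a plain tanh step stands in for the real GRU/LSTM decoder):

    import numpy as np

    I, M, H_in, H_out, C = 80, 16, 256, 256, 128   # example sizes, arbitrary
    rng = np.random.RandomState(0)

    h_enc = rng.randn(I, M, H_in)                  # all hidden states from the BiRNN
    W_ctx = 0.01 * rng.randn(H_in, C)
    context = np.dot(h_enc, W_ctx)                 # (I, M, H_in) -> (I, M, C)

    # one "normal RNN thing" step for the generator (toy tanh RNN instead of GRU/LSTM)
    x_t = rng.randn(M, H_out)                      # embedding of the previous output token
    s_prev = np.zeros((M, H_out))                  # previous decoder hidden state
    W_x = 0.01 * rng.randn(H_out, H_out)
    W_s = 0.01 * rng.randn(H_out, H_out)
    s_t = np.tanh(np.dot(x_t, W_x) + np.dot(s_prev, W_s))   # (M, H_out)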

Next the "first" output hidden state gets projected to the attention size C (so now (M, C)), and using numpy broadcasting this (1, M, C) size thing gets summed with the (I, M, C) size "context" which is a projection of the input RNN hiddens. Now we have something that is looking at both what the output RNN has previously done (and seen), and some context information from the input RNN.

A nonlinearity (tanh) is applied to the (I, M, C) sized piece, partly in order to bound the activation to (-1, 1) and partly just because we like nonlinearity. This (I, M, C) size thing then gets projected to (I, M, 1), or "alpha", then the useless dimension is dropped via reshape so we now have (I, M), then a softmax is applied over the first dimension I. This means that the network has effectively "weighted" each of the I timesteps of the input; at the start of training it can't know what is relevant, but by the end of training this is the thing you visualize.
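
In the same made-up numpy notation, that step is just:

    import numpy as np

    I, M, C = 80, 16, 128
    rng = np.random.RandomState(0)

    pre_act = rng.randn(I, M, C)              # broadcast sum from the previous step
    v = 0.01 * rng.randn(C, 1)                # projection down to a single score

    energies = np.dot(np.tanh(pre_act), v)    # (I, M, 1)
    energies = energies.reshape(I, M)         # drop the useless dimension

    # softmax over the input-length axis I (axis 0)
    e = np.exp(energies - energies.max(axis=0, keepdims=True))
    alpha = e / e.sum(axis=0, keepdims=True)  # (I, M), each column sums to 1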

This (I, M) is the actual "attention" - to apply it you simply take the original hidden states from the input BiRNN and use broadcasting to multiply again. (I, M, 1) * (I, M, H_in) gives a weighting over every timestep of the input, and finally we sum over the 0 axis (I) to get the final "context" of just (M, H_in) size. This can then be projected to a new size (H_out) and combined with the output hiddens to get an RNN that has "seen" a weighted sum of the input, so it can generate conditioned on the whole input sequence.
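
And the final weighting and summing, again as a rough sketch with made-up weights:

    import numpy as np

    I, M, H_in, H_out = 80, 16, 256, 256
    rng = np.random.RandomState(0)

    alpha = rng.rand(I, M)
    alpha /= alpha.sum(axis=0, keepdims=True)   # attention weights, (I, M)
    h_enc = rng.randn(I, M, H_in)               # original BiRNN hidden states

    weighted = alpha[:, :, None] * h_enc        # (I, M, 1) * (I, M, H_in)
    ctx_t = weighted.sum(axis=0)                # (M, H_in), the per-timestep context

    W_proj = 0.01 * rng.randn(H_in, H_out)
    ctx_proj = np.dot(ctx_t, W_proj)            # (M, H_out), combined with the decoder hiddens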

Note that this whole messy procedure I described happens per timestep - so the neural network (given enough data) can learn what to look at in the whole input in order to generate the best output. Other attentions such as [2], [3] are constrained either to only move forward or to be a hybrid of forward movement and (pooled) global lookup. [4] is a good summary of the different possibilities in use today.

[1] https://github.com/kyunghyuncho/dl4mt-material/blob/master/s...

[2] http://arxiv.org/abs/1308.0850

[3] http://arxiv.org/abs/1508.04395

[4] PPT, http://www.thespermwhale.com/jaseweston/ram/slides/session2/...


Thanks. That was the info I was looking for. Still a lot to digest but it is going to be really helpful.


Usually zero padding is used; a max_input_length is set somewhere in the code and a number of zeros equal to (max_input_length - actual_number_of_words_in_input) is appended to the array of input word_ids so that all input sentences have the same length.
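
Something like this sketch (toy word ids, with 0 reserved as the padding id):

    # toy minibatch of word-id sequences
    batch = [[12, 7, 431, 9], [88, 3], [5, 5, 19, 200, 41, 2]]
    max_input_length = 10                  # fixed ahead of time in this scheme

    padded = [seq + [0] * (max_input_length - len(seq)) for seq in batch]
    # every row now has length max_input_length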


This depends on implementation (unrolled RNNs vs true recurrence). Every sequence within a minibatch needs to be the same length, but that is it. And even that is implementation dependent - if your core RNN had a special symbol for "EOS" it could always handle it in another way.

Normally you pad each minibatch to the same length (length of the longest sequence in that minibatch), then carry around an additional mask to zero out any "unnecessary" results from padding for the shorter sequences.
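
A rough sketch of that per-minibatch padding plus mask (toy word ids, 0 as the pad id):

    import numpy as np

    batch = [[12, 7, 431, 9], [88, 3], [5, 5, 19, 200, 41, 2]]
    max_len = max(len(seq) for seq in batch)      # longest sequence in THIS minibatch

    ids = np.zeros((max_len, len(batch)), dtype=np.int64)     # (time, minibatch)
    mask = np.zeros((max_len, len(batch)), dtype=np.float32)
    for j, seq in enumerate(batch):
        ids[:len(seq), j] = seq
        mask[:len(seq), j] = 1.0

    # at timestep t, something like mask[t][:, None] * h_t zeroes out the padded entries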

The BiRNN (using all hidden states) + attention mechanism is the thing that allows variable length context to be fed to the generative decode RNN. A regular RNN (using all hidden states) + attention, or even just the last hidden state of an RNN, can also be used to map a variable length input to a fixed size representation in order to condition the output generator.

You will note that padding to the length of the longest sequence in a minibatch wastes computation - people often sort and shuffle the input so that sequences of approximately the same length end up in the same minibatch, to minimize that waste. If you padded to the overall longest sequence (rather than per minibatch), you would pay a massive computational overhead.
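
A rough sketch of that sort-and-shuffle trick (toy corpus, arbitrary sizes):

    import random

    random.seed(0)
    # toy corpus: word-id sequences of varying length
    corpus = [[random.randrange(1, 1000) for _ in range(random.randrange(5, 80))]
              for _ in range(1000)]
    minibatch_size = 16

    # sort by length, cut into minibatches, then shuffle the order of the minibatches
    corpus_sorted = sorted(corpus, key=len)
    minibatches = [corpus_sorted[i:i + minibatch_size]
                   for i in range(0, len(corpus_sorted), minibatch_size)]
    random.shuffle(minibatches)
    # lengths within each minibatch are now close, so padding overhead stays small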


Thanks to both of you for the replies. I still do not understand the inputs and outputs of the attention mechanism neural network.

The paper says

e_ij = a(s_{i-1}, h_j) is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i-1} (just before emitting y_i) and the j-th annotation h_j of the input sentence.

We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system.

So a is a feed-forward NN. What is the input to this NN? Do we input each h_j corresponding to each word of the input sentence separately and get one number for each word? If the input sentence has 200 words, do I run it 200 times, once for each input word?
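
To make the question concrete: is it something like this toy numpy sketch (sizes and weights made up, loosely following the parametrization in the paper's appendix), where the same small network is applied to every h_j at once via broadcasting rather than in 200 separate runs?

    import numpy as np

    I, H_enc, H_dec, C = 200, 256, 256, 128   # 200-word input sentence, made-up sizes
    rng = np.random.RandomState(0)

    h = rng.randn(I, H_enc)                   # annotations h_1 .. h_I from the BiRNN
    s_prev = rng.randn(H_dec)                 # decoder state s_{i-1}

    W_a = 0.01 * rng.randn(H_dec, C)
    U_a = 0.01 * rng.randn(H_enc, C)
    v_a = 0.01 * rng.randn(C)

    # e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j), computed for all j at once
    e = np.tanh(np.dot(s_prev, W_a)[None, :] + np.dot(h, U_a)).dot(v_a)   # (I,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                      # one weight per input word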




