Usually zero padding is used: a max_input_length is set somewhere in the code, and (max_input_length - actual_number_of_words_in_input) zeros are appended to the array of input word_ids so that all input sentences have the same length.
This depends on implementation (unrolled RNNs vs true recurrence). Only the sequences within a single minibatch need to have the same length; different minibatches can have different lengths. And even that is implementation dependent: if your core RNN had a special symbol for "EOS", it could always handle it in another way.
Normally you pad each minibatch to the same length (length of the longest sequence in that minibatch), then carry around an additional mask to zero out any "unnecessary" results from padding for the shorter sequences.
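As a rough illustration of the per-minibatch padding and mask described above, here is a minimal pure-Python sketch. The helper names pad_minibatch and masked_mean, the pad id of 0, and the toy word ids are all assumptions made for this example, not taken from any particular library.

```python
def pad_minibatch(sequences, pad_id=0):
    """Pad every sequence of word ids to the longest length in this minibatch.

    Returns (padded, mask), where mask holds 1 at real tokens and 0 at padding.
    pad_id=0 is an assumption; use whatever id your vocabulary reserves.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


def masked_mean(values, mask):
    """Average per-timestep values of one sequence, ignoring padded positions."""
    total = sum(v * m for v, m in zip(values, mask))
    return total / sum(mask)


# Toy minibatch of three sentences with different lengths (made-up word ids).
batch = [[4, 9, 2], [7, 1], [3, 5, 8, 6]]
padded, mask = pad_minibatch(batch)
for row, m in zip(padded, mask):
    print(row, m)
```

The same mask is what you would multiply into per-timestep losses or hidden states so that the padded positions contribute nothing.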
The BiRNN (using all hidden states) + attention mechanism is the thing that allows variable length context to be fed to the generative decoder RNN. A regular RNN (using all hidden states) + attention, or even just the last hidden state of an RNN, can all be used to map a variable-length sequence to a fixed-length representation in order to condition the output generator.
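To make the variable-length to fixed-length point concrete, here is a small NumPy sketch of one common form of attention (additive scoring followed by a softmax). The weight names W, U, v, the dimensions, and the random values are assumptions for illustration only; a real model would learn these jointly with the rest of the network.

```python
import numpy as np


def attention_context(s_prev, H, W, U, v):
    """Collapse T encoder hidden states into one fixed-size context vector.

    s_prev: decoder state, shape (d_s,)
    H:      encoder hidden states, shape (T, d_h), T may vary per sentence
    W, U, v: scoring parameters (hypothetical names), shapes
             (d_s, d_a), (d_h, d_a), (d_a,)
    """
    # One scalar score per input position, computed for all T positions at once.
    scores = np.tanh(s_prev @ W + H @ U) @ v        # shape (T,)
    # Softmax over positions (shifted by the max for numerical stability).
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()         # shape (T,)
    # Weighted sum of hidden states: the size no longer depends on T.
    context = weights @ H                           # shape (d_h,)
    return context, weights


rng = np.random.default_rng(0)
d_s, d_h, d_a = 4, 6, 5
W = rng.normal(size=(d_s, d_a))
U = rng.normal(size=(d_h, d_a))
v = rng.normal(size=d_a)
s_prev = rng.normal(size=d_s)

# Inputs of very different lengths yield a context vector of the same size.
for T in (3, 200):
    H = rng.normal(size=(T, d_h))
    context, weights = attention_context(s_prev, H, W, U, v)
    print(T, context.shape, weights.shape)
```

Whatever the input length T, the decoder is always conditioned on a vector of size d_h.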
You will note that padding to the length of the longest sequence in a minibatch wastes computation. People often sort the input by length and then shuffle, so that sequences of approximately the same length end up in each minibatch and less computation is wasted on padding. If you padded to the overall longest sequence (rather than per minibatch), you would pay a massive overhead computationally.
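A small sketch of that idea, assuming the simplest possible scheme: sort by length, cut into minibatches, then shuffle the minibatches. The function names length_bucketed_batches and padded_cells and the toy lengths are made up for this example; real pipelines often bucket more loosely to keep some randomness within batches.

```python
import random


def length_bucketed_batches(sequences, batch_size, seed=0):
    """Group sequences of similar length into minibatches, then shuffle the batches."""
    rng = random.Random(seed)
    # Indices ordered by sequence length, so neighbours have similar lengths.
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    # Shuffle the order of minibatches so training does not see lengths in order.
    rng.shuffle(batches)
    return [[sequences[i] for i in batch] for batch in batches]


def padded_cells(batches):
    """Total timesteps computed when each minibatch is padded to its own longest sequence."""
    return sum(len(batch) * max(len(seq) for seq in batch) for batch in batches)


# Toy data: only the lengths matter here, the word ids are placeholders.
lengths = [2, 9, 3, 10, 2, 9, 3, 10]
sequences = [[1] * n for n in lengths]

naive = [sequences[i:i + 2] for i in range(0, len(sequences), 2)]
bucketed = length_bucketed_batches(sequences, batch_size=2)
padded_to_global_max = len(sequences) * max(lengths)

print("real tokens:         ", sum(lengths))
print("naive per-batch pad: ", padded_cells(naive))
print("bucketed per-batch:  ", padded_cells(bucketed))
print("pad to global max:   ", padded_to_global_max)
```

In this toy case the bucketed batches need no padding at all, while the other two schemes compute many timesteps that are only padding.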
Thanks to both of you for the replies. I still do not understand the inputs and outputs of the attention mechanism's neural network.
The paper says:
"e_ij = a(s_{i-1}, h_j) is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i-1} (just before emitting y_i) and the j-th annotation h_j of the input sentence. We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system."
So a is a feedforward NN. What is the input to this NN? Do we input each h_j, corresponding to each word of the input sentence, separately and get one number for each word? If the input sentence has 200 words, do I run this 200 times, once for each input word?