The model was trained on video classification, image QA, and image captioning. Video captioning and video QA were not trained, yet the model shows results on those tasks.
As an author of this paper, I feel such neural networks are indeed a small step towards what we call AGI. Learning shared representations across a variety of tasks makes an AI system more robust to real-world datasets and makes it easier to adapt to new tasks without having to learn everything from scratch (something we humans are naturally capable of doing).
That's a really interesting idea! Giving such neural networks access to more and more modalities helps transfer useful information across the various parts of the network, making it capable of zero-shot learning on tasks it was never trained on. It also makes it easier to add new tasks to the AI system even when data availability is a problem.
Thank you for the review! We will correct these mistakes before submitting the final version.
The memory requirements of the DNC are quite high. We used a GTX 1060 for training. Increasing the context window beyond 3 increases the sequence length substantially, causing memory problems. However, we also found that the DNC works quite well even with a small batch size. We used a batch size of 16 for all our experiments. The training time for a batch size of 16, a context window of size 3, and 200k steps is 48h on a GTX 1060 system.
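To make the sequence-length growth concrete, here is a minimal sketch (not our actual preprocessing code) of a character-level encoding in which each example consists of the token to be normalized plus k context words on either side; the function name and the marker symbols are assumptions for illustration only.

```python
def encode_with_context(tokens, idx, k=3):
    """Hypothetical character-level encoding: the token at `idx` plus
    k context words on each side, joined with marker symbols.
    Widening k lengthens the encoder input roughly linearly in the
    number (and length) of the extra context words."""
    left = tokens[max(0, idx - k):idx]
    right = tokens[idx + 1:idx + 1 + k]
    # '<norm>' ... '</norm>' mark the token to be normalized (illustrative only)
    window = left + ['<norm>', tokens[idx], '</norm>'] + right
    return list(' '.join(window))  # one character per encoder time step

sentence = "The meeting is on 3 Jan 2018 at 10:30 am".split()
print(len(encode_with_context(sentence, 5, k=2)))  # smaller window, shorter sequence
print(len(encode_with_context(sentence, 5, k=3)))  # larger window, longer sequence
```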
The demo code used to produce the test results in the paper is available at https://github.com/cognibit/Text-Normalization-Demo. The model implementation is located in src/lib/seq2seq.py. We did not code the DNC cell from scratch; the official implementation was used.
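For readers curious how the official DeepMind DNC core can be dropped into a recurrent pipeline, here is a minimal sketch assuming the TensorFlow 1.x API of github.com/deepmind/dnc; the configuration values below are placeholders for illustration, not the settings used in the paper.

```python
import tensorflow as tf
import dnc  # official implementation: github.com/deepmind/dnc

# Placeholder hyper-parameters for illustration; not the paper's settings.
access_config = {"memory_size": 16, "word_size": 64,
                 "num_reads": 4, "num_writes": 1}
controller_config = {"hidden_size": 256}
output_size = 128
batch_size = 16

# The DNC core is an RNN cell, so it can stand in for an LSTM cell directly.
dnc_core = dnc.DNC(access_config, controller_config, output_size, clip_value=20)
initial_state = dnc_core.initial_state(batch_size)

# inputs: [time, batch, features] character embeddings of the input window
inputs = tf.placeholder(tf.float32, [None, batch_size, 128])
outputs, final_state = tf.nn.dynamic_rnn(
    cell=dnc_core, inputs=inputs, time_major=True,
    initial_state=initial_state)
```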
The model specification used for the Kaggle competition was quite different from the one described in the paper. The paper compares on the same test set used by https://arxiv.org/abs/1611.00068. The DNC showed significant improvement over the LSTM as the recurrent unit of a seq-to-seq model, with almost zero unacceptable mistakes in certain semiotic classes. The LSTM, on the other hand, is susceptible to these kinds of mistakes even when a lot of data is available.
I'm sorry for the misunderstanding. The reason we added the sentence is that the model used in the competition was also based on the DNC. However, changes were made for the paper; for instance, we did not use any attention mechanism at the seq-to-seq level in the competition. Moreover, the paper concentrates on comparing the kinds of errors made by the DNC network (avoiding unacceptable mistakes, not the overall accuracy), where it shows an improvement over the LSTM model of https://arxiv.org/abs/1611.00068. Overall accuracy, on the other hand, was more important for the Kaggle competition.
We modified the sentence to say, "An earlier version of the approach used here has secured the 6th position in the Kaggle Russian Text Normalization Challenge by Google's Text Normalization Research Group".
Hi, we tried using a single model for the entire seq-to-seq task, but the number of examples in the PLAIN class is huge, which causes the model to perform worse on the other classes. The reason we used XGBoost was to separate two very different tasks (predicting whether a word needs to be normalized; predicting the sequence of normalized tokens).
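As a rough illustration of this two-stage split (the feature extraction and helper names here are assumptions, not our exact pipeline), an XGBoost classifier first flags tokens that need normalization, and only the flagged tokens are handed to the seq-to-seq model:

```python
import numpy as np
from xgboost import XGBClassifier

def char_features(token, max_len=20):
    """Hypothetical features: character codes of the token, padded/truncated."""
    codes = [ord(c) for c in token[:max_len]]
    return codes + [0] * (max_len - len(codes))

# Stage 1: binary classifier, "does this token need normalization?"
tokens = ["on", "3", "Jan", "2018", "at", "10:30"]
labels = [0, 1, 1, 1, 0, 1]  # toy labels for illustration
X = np.array([char_features(t) for t in tokens])
clf = XGBClassifier(n_estimators=50, max_depth=4)
clf.fit(X, np.array(labels))

# Stage 2: flagged tokens would go through the seq-to-seq normalizer;
# everything else is copied through unchanged ("<self>").
for tok in tokens:
    needs_norm = clf.predict(np.array([char_features(tok)]))[0]
    print(tok, "-> seq2seq" if needs_norm else "-> <self>")
```

Separating the stages this way keeps the abundant PLAIN tokens out of the seq-to-seq training data, so the recurrent model only ever sees tokens that actually require rewriting.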
On the other hand, as mentioned, when comparing text normalization systems it is more important to look at the exact kinds of errors made by the system (not only the overall accuracy). Our model showed improvement over the baseline model in https://arxiv.org/abs/1611.00068. The DNC showed improvement in certain semiotic classes such as DATE, CARDINAL, and TIME, making zero unacceptable predictions in these classes, whereas the LSTM was susceptible to these kinds of mistakes even when a lot of training data was available. Yes, we do not use internal computation steps; the model simply replaces the standard LSTM in a seq-to-seq model with a DNC. However, thanks for the suggestion; it would be interesting to see whether increasing the number of internal computation steps improves performance.