Hi, we did try using a single model for the entire seq-to-seq task, but the number of PLAIN examples is so large that it causes the model to perform worse on the other classes. We used XGBoost precisely to separate two very different sub-tasks: predicting whether a word needs to be normalized at all, and predicting the sequence of normalized tokens.
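To make the two-stage split concrete, here is a minimal sketch (not our actual pipeline or features): an XGBoost classifier decides per token whether normalization is needed, and only flagged tokens are routed to the heavier seq-to-seq stage. The feature extraction and the `normalize_token` stub are purely illustrative placeholders.

```python
# Sketch of a two-stage normalization pipeline: stage 1 (XGBoost) gates stage 2 (seq2seq).
import numpy as np
from xgboost import XGBClassifier

def token_features(tok):
    # Toy per-token features: length, digit count, uppercase count, non-alphanumeric count.
    return [len(tok),
            sum(c.isdigit() for c in tok),
            sum(c.isupper() for c in tok),
            sum(not c.isalnum() for c in tok)]

# Toy training data: label 1 = token must be verbalized, 0 = copy through unchanged (PLAIN-like).
train_tokens = ["2016", "Hello", "12:30", "world", "£5", "the"]
train_labels = [1, 0, 1, 0, 1, 0]

clf = XGBClassifier(n_estimators=50, max_depth=3)
clf.fit(np.array([token_features(t) for t in train_tokens]), np.array(train_labels))

def normalize_token(tok):
    # Placeholder for the seq2seq model that emits the spoken form of a token.
    return f"<seq2seq({tok})>"

def normalize(tokens):
    feats = np.array([token_features(t) for t in tokens])
    needs_norm = clf.predict(feats)
    # Only non-trivial tokens reach the seq2seq stage; everything else is copied as-is.
    return [normalize_token(t) if y else t for t, y in zip(tokens, needs_norm)]

print(normalize(["On", "3/4/2017", "we", "paid", "£5"]))
```

The point of the gate is exactly the class-imbalance issue above: the seq2seq model never has to compete against the overwhelming mass of PLAIN tokens.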
On the other hand, as mentioned, when comparing text normalization systems it is more important to look at the exact kinds of errors a system makes, not only its overall accuracy. Our model improves over the baseline in https://arxiv.org/abs/1611.00068: the DNC makes zero unacceptable predictions in semiotic classes such as DATE, CARDINAL and TIME, whereas the LSTM remained susceptible to these kinds of mistakes even with a lot of training data. You are right that we do not use internal computation steps; the model simply replaces the standard LSTM in a seq-to-seq model with a DNC. Thanks for the suggestion, though, it would be interesting to see how much performance improves if the number of internal computation steps is increased.
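For clarity, the "drop-in replacement" amounts to swapping the recurrent core of the seq-to-seq model while keeping the rest of the architecture fixed. Below is a minimal PyTorch-style sketch of that design choice, not our implementation: the encoder is written against a generic core with a `step(input, state)` interface, so the LSTM cell can be exchanged for a DNC core; `DNCCore` is a hypothetical name and the DNC itself is not implemented here.

```python
import torch
import torch.nn as nn

class LSTMCore(nn.Module):
    """Baseline recurrent core: a single LSTM cell behind a generic step interface."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.hidden_size = hidden_size

    def initial_state(self, batch_size):
        h = torch.zeros(batch_size, self.hidden_size)
        c = torch.zeros(batch_size, self.hidden_size)
        return (h, c)

    def step(self, x, state):
        h, c = self.cell(x, state)
        return h, (h, c)

class Encoder(nn.Module):
    """Seq2seq encoder that unrolls whichever recurrent core it is given."""
    def __init__(self, core):
        super().__init__()
        self.core = core

    def forward(self, inputs):  # inputs: (seq_len, batch, input_size)
        state = self.core.initial_state(inputs.size(1))
        outputs = []
        for x_t in inputs:
            out, state = self.core.step(x_t, state)
            outputs.append(out)
        return torch.stack(outputs), state

# Swapping the core is a one-line change:
#   encoder = Encoder(LSTMCore(input_size=32, hidden_size=64))  # LSTM baseline
#   encoder = Encoder(DNCCore(...))                             # hypothetical DNC core
encoder = Encoder(LSTMCore(input_size=32, hidden_size=64))
outs, final_state = encoder(torch.randn(10, 4, 32))
print(outs.shape)  # torch.Size([10, 4, 64])
```

Increasing the number of internal computation steps, as suggested, would correspond to letting the core take several `step` calls per input token; we have not experimented with this.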