I don't know whether they have a specific French <--> Chinese model. They might,...

ogrisel · on Jan 15, 2015

New neural machine translation architectures are experimenting with pairs of neural encoders / decoders: one pair for each language, plus a shared, language-independent vector space for the meaning of all words:

http://arxiv.org/abs/1406.1078

http://arxiv.org/abs/1409.3215

http://arxiv.org/abs/1410.8206

So the total number of models is still linear in the number of languages.
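A back-of-the-envelope sketch of that scaling argument, assuming one dedicated model per ordered (source, target) pair for the direct approach, and one encoder plus one decoder per language for the shared-space approach. The function names are made up for illustration and do not come from any of the papers above.

```python
def pairwise_models(n_languages: int) -> int:
    """Direct approach: one dedicated model per ordered (source, target) pair."""
    return n_languages * (n_languages - 1)


def shared_space_models(n_languages: int) -> int:
    """Shared-space approach: one encoder and one decoder per language."""
    return 2 * n_languages


# Compare how the two counts grow with the number of languages.
for n in (2, 10, 100):
    print(n, pairwise_models(n), shared_space_models(n))
```

The first count grows quadratically and the second linearly, which is the whole appeal of a shared representation once the number of languages gets large.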

I do not know whether this new generation of translation models is used by the Google translation app, though.

Also, pairs of languages for which there are large amounts of parallel training data will still be favored.

rancur · on Jan 15, 2015

> Also, pairs of languages for which there are large amounts of parallel training data will still be favored.

Wouldn't Bible translations help?

ogrisel · on Jan 15, 2015

It might but:

- the vocabulary and topics covered in the Bible are quite different from today's written and spoken text, especially phone discussions or social network messages.

- other aligned corpora such as http://www.statmt.org/europarl/ are much larger than the Bible (several million tokens for most pairs vs fewer than 1 million for the Bible)

Agreed that http://www.statmt.org/europarl/ does not cover non-European languages.

dragonwriter · on Jan 15, 2015

> so MT systems usually back off to English as a pivot language

That's an interesting choice, because English lacks features that some other languages have, and thus you end up distorting through English. I remember considerable work from different sources a ways back toward constructing artificial languages for this purpose, so as to mitigate the ambiguity introduced by using an existing natural language as a pivot. I'm surprised that a natural language as the pivot is the state of the art (though, given that, I'm not surprised that English is the pivot language).
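A toy sketch of that distortion, assuming word-level lookup tables invented purely for illustration (real MT systems do not work this way). French distinguishes informal "tu" from formal "vous", and German has the matching "du" / "Sie", but English has only "you", so the distinction is gone by the time the second step runs.

```python
# Hypothetical toy lexicons, for illustration only.
FR_TO_EN = {"tu": "you", "vous": "you"}  # both forms collapse in English
EN_TO_DE = {"you": "du"}                 # the second step must pick one form


def translate(word: str, lexicon: dict) -> str:
    """Look a single word up in a toy lexicon."""
    return lexicon[word]


def pivot_translate(word: str, src_to_pivot: dict, pivot_to_tgt: dict) -> str:
    """Translate source -> pivot -> target by composing two lookups."""
    return translate(translate(word, src_to_pivot), pivot_to_tgt)


print(pivot_translate("tu", FR_TO_EN, EN_TO_DE))    # du
print(pivot_translate("vous", FR_TO_EN, EN_TO_DE))  # du (formality lost; "Sie" was wanted)
```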

syllogism · on Jan 15, 2015

It's just weight of research hours, and weight of data. Our English numbers are almost always better in NLP.