
The thing is this: the first generation of machine translation systems were hard-coded systems that translated a string in one language into a string in another language using a set of hard-coded rules. These systems were bad at the sort of semantic ambiguities you describe, and they tended to give overly literal translations.
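
To make the "hard-coded rules" point concrete, here is a minimal sketch (in Python, with an invented toy English-to-German vocabulary) of what a first-generation, word-for-word rule system amounts to; it has no notion of context, so it cannot tell which sense of an ambiguous word is meant:

    # Toy first-generation "translator": a hard-coded dictionary applied word by
    # word. The vocabulary here is invented purely for illustration.
    RULES = {
        "the": "das",
        "knife": "Messer",
        "was": "war",
        "good": "gut",  # always "gut", whichever sense of "good" was intended
    }

    def translate_word_for_word(sentence):
        # Substitute each word via the rule table; no syntax, no context.
        return " ".join(RULES.get(w, w) for w in sentence.lower().split())

    print(translate_word_for_word("The knife was good"))
    # -> "das Messer war gut": literal, and blind to semantic ambiguity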

The new generation, like Google Translate, are statistical systems. Semantic ambiguities are actually fairly easy to resolve using statistics. The basic idea is this: Google can use the entire Internet to check whether "Knives were ethical" or "Knives were efficient" is the more common thing to say. Google also tries to translate the largest possible phrase, so if there is already a translation of "Knives were good" in their corpus, the problem never even arises.
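
As a rough sketch of that frequency idea (the phrases and counts below are invented; in a real system they would come from n-gram counts over a web-scale corpus, not a hand-written table):

    # Prefer the candidate translation that the corpus says is most common.
    # These counts are made up purely for illustration.
    CORPUS_COUNTS = {
        "knives were efficient": 9200,
        "knives were ethical": 130,
    }

    def pick_translation(candidates):
        # Return the candidate phrase with the highest corpus count.
        return max(candidates, key=lambda phrase: CORPUS_COUNTS.get(phrase, 0))

    print(pick_translation(["knives were ethical", "knives were efficient"]))
    # -> "knives were efficient"

In a real system the score would come from a language model over the target language rather than a literal phrase lookup, but the disambiguation principle is the same.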

Of course there will always be cases where this fails, but you'd be surprised how well it works. In any case, that is not the main concern at the moment. Maybe ultimately you could have a UI where a human selects the best translation. The actual problem with these systems is that they don't even have an understanding of syntax. With Google Translate, entire parts of a sentence often get lost.



I agree that syntax rules are themselves also a huge obstacle, and mostly for the same reason: we often teach languages by saying "here are some valid grammars for you to use in sentence construction." That is not hard to specify declaratively, but computers don't have the universal translation I described from "what the solution (may) look like" to "how to find the solution."

But still, the reason I chose this particular example is that even if you used the entire Internet, and even with the new translators that are sensitive to semantic nuance, you still get the problem of "which do I trust: the statistics of knives-are-amoral, or the statistics of helping-humans-is-moral?" We can see that the amoral aspect of knifehood "wins out" in almost every circumstance, but I don't expect even the entire Internet to contain the relevant sentence often enough for the statistics to know that, unless the machine does something special to approximate the underlying understanding of the semantics.


Machine Translation is an application of Machine Learning, which is a subfield of Artificial Intelligence, which is a subfield of Computer Science. However, what often separates AI from the rest of CS is that AI frequently tackles NP-complete problems, which are not feasible to solve exactly at realistic scale. In AI it's enough to solve the problem in 99.9999% of cases (and sometimes even in only 80% of cases).

In other CS problems it is often a good approach to come up with an algorithm and then think of a counterexample that shows why your algorithm will not work. You can keep doing this until you arrive at an algorithm for which you cannot come up with a counterexample.

In AI this approach does not work, because there simply is no exact solution. What you do instead is build the simplest solution that works for your problem (for some measure of quality that a machine can compute), find the biggest source of errors, improve your solution, and keep iterating.
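
A toy sketch of that loop follows; the "model", the data, and the improvement step are all made-up stand-ins, and only the shape of the loop matters:

    # Evaluate on held-out data, find the errors, patch the biggest source, repeat.
    def evaluate(model, held_out):
        # Collect the cases the current model gets wrong.
        return [(src, ref) for src, ref in held_out if model.get(src) != ref]

    model = {"guten tag": "good day"}              # toy "model": a phrase table
    held_out = [("guten tag", "good day"),
                ("gute nacht", "good night")]
    for _ in range(3):
        errors = evaluate(model, held_out)
        if not errors:
            break
        src, ref = errors[0]                       # biggest (here: only) error source
        model = dict(model, **{src: ref})          # toy "improvement" step, then iterate
    print(model)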

So after this general introduction, my points:

1. Yes, there are semantic ambiguities that can only be resolved with real understanding, but the real question should be how common they are.

2. The idea of statistical machine translation is not to find specific rules or heuristics, but to find mathematical models that trade off computational complexity, the effects they can model, and the amount of data needed to get good results (this is known as the bias/variance trade-off in machine learning and statistics). You also want models that generalize to different languages and different domains.

3. It will probably surprise you that there are already algorithms that can understand the semantics of language to some degree [1][2]; a rough sketch follows the footnotes. So you should not see it as "either you understand the text or you don't"; it's more like a scale. However, these methods are not yet applied in the context of machine translation, simply because it's pointless before you get the syntax part right.

[1] for some loose interpretation of "understanding"

[2] e.g. http://en.wikipedia.org/wiki/Latent_semantic_analysis
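
For [2], here is a very small scikit-learn sketch of latent semantic analysis: build a TF-IDF term-document matrix and take a truncated SVD, so documents land in a low-dimensional "semantic" space. The toy corpus is invented for illustration; the two knife/bread sentences should come out as the most similar pair.

    # Tiny LSA sketch: TF-IDF term-document matrix + truncated SVD.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the knife cut the bread cleanly",
        "a sharp blade slices bread well",
        "the committee debated the ethics of the decision",
    ]

    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)   # low-rank semantic space
    print(cosine_similarity(lsa))                              # pairwise document similarities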



