
The thing is this: the first generation of machine translation systems were hard-coded systems that translated a string in one language into a string in another language using a set of hard-coded rules. These systems were bad at the sort of semantic ambiguities you describe, and they tended to give overly literal translations.
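
To make the "hard-coded rules" point concrete, here is a minimal sketch (in Python, with an invented toy English-to-German vocabulary) of what a first-generation, word-for-word rule system amounts to; it has no notion of context, so it cannot tell which sense of an ambiguous word is meant:

    # Toy first-generation "translator": a hard-coded dictionary applied word by
    # word. The vocabulary here is invented purely for illustration.
    RULES = {
        "the": "das",
        "knife": "Messer",
        "was": "war",
        "good": "gut",  # always "gut", whichever sense of "good" was intended
    }

    def translate_word_for_word(sentence):
        # Substitute each word via the rule table; no syntax, no context.
        return " ".join(RULES.get(w, w) for w in sentence.lower().split())

    print(translate_word_for_word("The knife was good"))
    # -> "das Messer war gut": literal, and blind to semantic ambiguity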

The new generation, like Google Translate, are statistical systems. Semantic ambiguities are actually fairly easy to resolve using statistics. The basic idea is this: Google can use the entire Internet to check whether "Knives were ethical" or "Knives were efficient" is the more common thing to say. Google also tries to translate the largest possible phrase, so if there is already a translation of "Knives were good" in their corpus, the problem never even arises.
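
As a rough sketch of that frequency idea (the phrases and counts below are invented; in a real system they would come from n-gram counts over a web-scale corpus, not a hand-written table):

    # Prefer the candidate translation that the corpus says is most common.
    # These counts are made up purely for illustration.
    CORPUS_COUNTS = {
        "knives were efficient": 9200,
        "knives were ethical": 130,
    }

    def pick_translation(candidates):
        # Return the candidate phrase with the highest corpus count.
        return max(candidates, key=lambda phrase: CORPUS_COUNTS.get(phrase, 0))

    print(pick_translation(["knives were ethical", "knives were efficient"]))
    # -> "knives were efficient"

In a real system the score would come from a language model over the target language rather than a literal phrase lookup, but the disambiguation principle is the same.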

Of course there will always be cases where this fails, but you'd be surprised how well it works. In any case, that is not the main concern at the moment. Maybe ultimately you could have a UI where a human selects the best translation. The actual problem with these systems is that they don't even have an understanding of syntax. With Google Translate, entire parts of a sentence often get lost.



I agree that syntax rules are themselves also a huge obstacle, and mostly for the same reason: we often teach languages by saying "here are some valid grammars for you to use in sentence construction." That is not hard to specify declaratively, but computers don't have the universal translation I described from "what the solution (may) look like" to "how to find the solution."

But still, the reason I chose this particular example is that even if you used the entire Internet, and even with the new translators that are sensitive to semantic nuance, you still get the problem of "which do I trust: the statistics of knives-are-amoral, or the statistics of helping-humans-is-moral?" We can see that the amoral aspect of knifehood "wins out" in almost every circumstance, but I don't expect even the entire Internet to contain the relevant sentence often enough for the statistics to know that, unless the machine does something special to approximate the underlying understanding of the semantics.


Machine Translation is an application of Machine Learning, which is a subfield of Artificial Intelligence, which is a subfield of Computer Science. However, what often separates AI from the rest of CS is that AI frequently tackles NP-complete problems, which are not feasible to solve exactly at realistic scale. In AI it's enough to solve the problem in 99.9999% of cases (and sometimes even in only 80% of cases).

In other CS problems it is often a good approach to come up with an algorithm and then think of a counterexample that shows why your algorithm will not work. You can keep doing this until you arrive at an algorithm for which you cannot come up with a counterexample.

In AI this approach does not work, because there simply is no exact solution. What you do instead is build the simplest solution that works for your problem (for some measure of quality that a machine can compute), find the biggest source of errors, improve your solution, and keep iterating.
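
A toy sketch of that loop follows; the "model", the data, and the improvement step are all made-up stand-ins, and only the shape of the loop matters:

    # Evaluate on held-out data, find the errors, patch the biggest source, repeat.
    def evaluate(model, held_out):
        # Collect the cases the current model gets wrong.
        return [(src, ref) for src, ref in held_out if model.get(src) != ref]

    model = {"guten tag": "good day"}              # toy "model": a phrase table
    held_out = [("guten tag", "good day"),
                ("gute nacht", "good night")]
    for _ in range(3):
        errors = evaluate(model, held_out)
        if not errors:
            break
        src, ref = errors[0]                       # biggest (here: only) error source
        model = dict(model, **{src: ref})          # toy "improvement" step, then iterate
    print(model)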

So after this general introduction, my points:

1. Yes, there are semantic ambiguities that can only be resolved with real understanding, but the real question should be how common they are.

2. The idea of statistical machine translation is not to find specific rules or heuristics, but to find mathematical models that trade off computational complexity, the effects they can model, and the amount of data needed to get good results (this is known as the bias/variance trade-off in machine learning and statistics). You also want models that generalize to different languages and different domains.

3. It will probably surprise you that there are already algorithms that can understand the semantics of language to some degree [1][2]; a rough sketch follows the footnotes. So you should not see it as "either you understand the text or you don't"; it's more like a scale. However, these methods are not yet applied in the context of machine translation, simply because it's pointless before you get the syntax part right.

[1] for some loose interpretation of "understanding"

[2] e.g. http://en.wikipedia.org/wiki/Latent_semantic_analysis
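
For [2], here is a very small scikit-learn sketch of latent semantic analysis: build a TF-IDF term-document matrix and take a truncated SVD, so documents land in a low-dimensional "semantic" space. The toy corpus is invented for illustration; the two knife/bread sentences should come out as the most similar pair.

    # Tiny LSA sketch: TF-IDF term-document matrix + truncated SVD.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the knife cut the bread cleanly",
        "a sharp blade slices bread well",
        "the committee debated the ethics of the decision",
    ]

    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)   # low-rank semantic space
    print(cosine_similarity(lsa))                              # pairwise document similarities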



