
Just the other day, I participated in a discussion about how "language identification" is a solved problem -- in fact, hasn't it been solved for a decade?

As anyone who's had to use langid in practice will testify, it's solved only as long as:

A) you want to identify 1 out of N languages (reality: a text can be in any language, including ones outside your predefined set)

B) you assume the text is in exactly one language (reality: can be 0, can be multiple, as is common with globalized English phrases)

C) you don't need a measure of confidence (most algos give an all-or-nothing confidence score [0])

D) the text isn't too short (tweets), too noisy (repeated sections à la web pages, boilerplate), too regional/dialectal, etc.

In other words, not solved at all.
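To make points A and C concrete, here is a minimal sketch of a decision rule that can refuse to answer. It assumes you already have one log-likelihood per candidate language from some model. The function name `decide` and both thresholds are made up for illustration and would need tuning on real data. Dividing by the number of n-grams keeps the score from saturating as the text gets longer, which is one reason raw posteriors tend to look all-or-nothing.

```python
def decide(log_likelihoods, n_grams, min_avg_logp=-6.0, min_margin=0.2):
    """Pick a language, or refuse.

    log_likelihoods: dict lang -> total log-likelihood of the text (at least 2 langs).
    n_grams: number of n-grams scored, used to length-normalize.
    Thresholds are arbitrary placeholders, not recommended values.
    """
    ranked = sorted(log_likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_ll), (_, second_ll) = ranked[0], ranked[1]

    avg_best = best_ll / n_grams               # average log-prob per n-gram
    margin = (best_ll - second_ll) / n_grams   # per-n-gram gap to the runner-up

    if avg_best < min_avg_logp:
        # Even the best model finds the text unlikely: probably outside the N languages.
        return "unknown", margin
    if margin < min_margin:
        # Top two models are nearly tied: too short, mixed, or closely related languages.
        return "ambiguous", margin
    return best, margin


if __name__ == "__main__":
    print(decide({"en": -30.0, "de": -60.0}, n_grams=10))
    print(decide({"en": -80.0, "de": -85.0}, n_grams=10))
    print(decide({"en": -30.0, "de": -31.0}, n_grams=10))
```

This does nothing for point B (multiple languages in one text); that would need scoring per segment rather than per document.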

In my experience, the same is true for any other ML task, once you want to use it in practice (as opposed to "write an article about it").

The amount of work to get something actually working robustly is still non-trivial. In some respects, things have gotten worse over the past years due to a focus on cranking up the number of model parameters, at the expense of a decent error analysis and model interpretability.

[0] https://twitter.com/RadimRehurek/status/872280794152054784




I'm reminded of how often I see Twitter offer to translate, from absurdly unconnected languages, English-language tweets that contain a proper noun or two. And that's with text containing mostly common and distinctive English words.


I'm pretty sure Twitter's langid uses character n-grams. You'll see a tweet that's plain English that happens to match n-gram statistics unusually well with some other language, which pushes the likelihood score to just above English. (I checked this by running an example or two through my own langid code.)
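For readers who haven't seen one, a character n-gram classifier can be sketched in a few lines. This is a toy naive Bayes over character trigrams with add-one smoothing, not Twitter's actual implementation (which I can only guess at). The training sentences, function names, and the choice of n=3 are all assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercase, pad with spaces, and return overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """samples: dict lang -> list of strings. Returns dict lang -> n-gram Counter."""
    return {
        lang: Counter(g for s in texts for g in char_ngrams(s, n))
        for lang, texts in samples.items()
    }


def log_likelihoods(text, models, n=3):
    """Total log-likelihood of the text under each language, add-one smoothed."""
    vocab = set().union(*models.values())
    v = len(vocab) + 1  # one extra slot for n-grams never seen in training
    scores = {}
    for lang, counts in models.items():
        total = sum(counts.values())
        scores[lang] = sum(
            math.log((counts[g] + 1) / (total + v)) for g in char_ngrams(text, n)
        )
    return scores


# Made-up toy training data; a real system would use far more text per language.
SAMPLES = {
    "en": ["the quick brown fox jumps over the lazy dog",
           "this is the house and that is the garden"],
    "de": ["der hund und die katze schlafen im garten",
           "das ist das haus und das ist der garten"],
}
MODELS = train(SAMPLES)

if __name__ == "__main__":
    print(log_likelihoods("the dog", MODELS))
```

With a short input, only a handful of trigrams contribute to each sum, so a single proper noun whose trigrams happen to be common in another language can move the totals enough to flip the ranking.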

It shouldn't be hard to improve on this by treating the score on a tweet as Bayesian evidence and combining it with a prior from the same user's preceding tweets.
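A rough sketch of that idea, assuming the classifier exposes a log-likelihood per language and that you keep the labels of the user's earlier tweets. The function names, the pseudo-count, and the example numbers are all hypothetical.

```python
import math


def prior_from_history(history, languages, pseudo=1.0):
    """Smoothed language prior from the labels of a user's earlier tweets.

    pseudo is an arbitrary pseudo-count so no language gets probability zero.
    """
    counts = {lang: pseudo for lang in languages}
    for lang in history:
        if lang in counts:
            counts[lang] += 1
    total = sum(counts.values())
    return {lang: c / total for lang, c in counts.items()}


def posterior(log_likelihoods, prior):
    """Bayes rule in log space: posterior is proportional to likelihood * prior."""
    log_post = {
        lang: ll + math.log(prior[lang]) for lang, ll in log_likelihoods.items()
    }
    m = max(log_post.values())  # subtract the max for numerical stability
    unnorm = {lang: math.exp(v - m) for lang, v in log_post.items()}
    z = sum(unnorm.values())
    return {lang: p / z for lang, p in unnorm.items()}


if __name__ == "__main__":
    # Hypothetical scores: the n-gram model slightly prefers Indonesian ("id").
    ll = {"en": -41.0, "id": -40.5}
    prior = prior_from_history(["en"] * 20, languages=["en", "id"])
    print(posterior(ll, prior))
```

In this made-up case the likelihood alone narrowly favors "id", but twenty earlier English tweets are enough for the posterior to favor "en". The obvious cost is that a user who really does switch languages gets misjudged for a while, so the history window and pseudo-count matter.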


In other words, we have some systems that are great at processing spherical cows in a vacuum...



