
I'm torn.

On the one hand, the tech is impressive, and the demo is nicely done.

On the other, I think the demo completely misses the point. There's a disconnect between what learners need to learn and what this model optimises for, which is probably largely explained by how difficult (maybe even impossible) it is to get the right training datasets. That, and marketing.

I believe most learners optimise for two things: being understood [1] and not being grating to the ear [2]. Both goals hinge on acquiring the right set of phonemes and phonetic "tools", because the sets of meaningfully distinct sounds (phonemes) and "tools" rarely match between languages.

For example, most (all?) Slavic languages have way fewer meaningfully distinct vowels than English. Meaningfully distinct is the crucial part. The Russian word "молоко", as it's most often pronounced, has three different vowel sounds, at least two of which would be distinct to an English speaker, but Russian speakers hear them as one-ish vowel. And I mean "hear": it's not a conscious phenomenon! Phoneme recognition is completely subconscious, so unless specifically trained, people often don't hear the difference between sounds that are obviously different to native speakers of that language [3].
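
To make the many-to-one mapping concrete, here is a toy Python sketch; the IPA symbols and the exact reduction pattern are simplified for illustration, not a careful transcription:

    # Toy illustration: Russian vowel reduction maps several surface
    # sounds (allophones) onto one phoneme. Symbols are simplified.
    RU_ALLOPHONE_TO_PHONEME = {
        "ə": "o",   # first syllable of "молоко": fully reduced
        "ɐ": "o",   # second syllable: partially reduced
        "o": "o",   # stressed final syllable: the full vowel
    }

    surface = ["ə", "ɐ", "o"]   # roughly [məlɐko]
    phonemes = [RU_ALLOPHONE_TO_PHONEME[s] for s in surface]
    print(phonemes)             # ['o', 'o', 'o'] -- "one-ish vowel"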

The same goes for phonetic "tools". English speakers shorten vowels before voiceless consonants, which keeps "heart" and "hard" distinguishable even when t/d are both reduced to the same sound (a glottal stop or a tap). This "tool" is not available in many languages, so learners use it incorrectly and the result is confusing.
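
A crude sketch of how that duration cue alone could separate the pair once the final consonant is neutralised; the 120 ms threshold is invented purely for illustration:

    # Sketch: pre-fortis clipping as a disambiguation cue. When "t" and
    # "d" both surface as a glottal stop, vowel duration still tells
    # "heart" from "hard". The 120 ms threshold is made up for the demo.
    def guess_word(vowel_ms: float) -> str:
        return "heart" if vowel_ms < 120 else "hard"

    print(guess_word(90))    # short vowel before voiceless /t/ -> "heart"
    print(guess_word(160))   # long vowel before voiced /d/    -> "hard"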

So, how would ML models learn this mapping between sounds and phonemes, especially when it's non-local (as with the preceding vowel's length)? It's relatively straightforward to find large sets of speech samples labelled with their speakers' backgrounds, but those are just sounds, not phonemes! There is very little signal showing which sound structures matter to humans listening and which don't [4].
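
The closest thing to such a signal is minimal-pair perception testing, which speech researchers usually approximate with ABX discrimination over learned representations. A rough sketch of the idea, with random vectors standing in for real model embeddings:

    import numpy as np

    # ABX test: does a representation put X closer to the category it
    # belongs to (A) than to the contrasting one (B)? The vectors here
    # are stand-ins for pooled model embeddings of short audio clips.
    def abx_correct(a: np.ndarray, b: np.ndarray, x: np.ndarray) -> bool:
        cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        return cos(a, x) > cos(b, x)

    rng = np.random.default_rng(0)
    a = rng.normal(size=16)             # e.g. an instance of "cup"
    b = rng.normal(size=16)             # e.g. an instance of "cap"
    x = a + 0.1 * rng.normal(size=16)   # another "cup" token
    print(abx_correct(a, b, x))         # True if the contrast is encoded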

There's also a set of moral issues with the "target accent" approach. Teaching learners to acquire an accent that superficially sounds like whatever they chose as a "target" devalues all other accents, which are just as valid and just as English, because they share the same phonetic system (phonemes + "tools"). It can also make people sound a bit cringe, which I've seen first-hand.

Ideally, learners should learn phonetic systems, not superficial accents. That's what makes speech intelligible and natural, even if it has an exotic flavour [5][6]. Systems like the one this company is building do the opposite. I guess they are easier to build and easier to sell.

[1]: On that path lies a nice surprise: being understood and understanding are two sides of the same coin, so in learning how to be understood, a language learner inevitably starts to understand better. Being able to hear the full set of phonemes is the key to both.

[2]: There's a vast, VAST difference between people not paying attention to how someone speaks and them not being able to tell that something's off when prompted.

[3]: A nice demonstration for non-Hindi speakers: https://www.youtube.com/watch?v=-I7iUUp-cX8 When isolated and spoken slowly, the consonants might sound different, but in normal speech they sound practically indistinguishable to English speakers with no prior exposure. Native speakers hear the difference as clearly as you hear cap/cup!

[4]: Take their viral accent-recognition demo. Anecdotally, among the three non-native speakers with different backgrounds I talked to, the demo guessed the mother tongue much better than native listeners did, and its errors were different from theirs. That's a sign of the model learning to recognise the wrong things.

[5]: Ever noticed how films almost always cast native English speakers imitating non-English accents rather than actual native speakers of those languages? That's why: an English phonetic system with a sprinkling of foreign flavour is much more understandable.

[6]: By the way, Arnold Schwarzenegger understands this very well.



I don't believe that phonemes are real. They are a product of centuries of literacy, and the stereotypically "foreign" accent results from people being taught to construct words from phonemes.

Not all languages can be neatly split into a nice set of phonemes - Danish phonology in particular seems mostly imaginary, and the "insane grammar" of Old Irish appears to result from the fact that word/morpheme boundaries can occur within the "phonemes".


> I don't believe that phonemes are real

I used to think the same until I came across the VQ-VAE paper [1], where the goal was to let the model learn the smallest units of speech without any supervision. Interestingly, the model ended up learning units that closely aligned with the phonemes we recognize today.

[1]: https://arxiv.org/pdf/1711.00937
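
For context, the bottleneck in that paper is a nearest-neighbor lookup into a learned codebook, which is what forces the model to commit to a small discrete inventory of units. A minimal numpy sketch of just the quantization step (training, the losses, and the straight-through gradient all omitted):

    import numpy as np

    # VQ-VAE bottleneck: snap each encoder output frame to its nearest
    # codebook vector. The discrete indices are the learned "units"
    # that, per the paper, ended up tracking phoneme-like segments.
    def quantize(z: np.ndarray, codebook: np.ndarray):
        # z: (T, D) encoder frames; codebook: (K, D) learned entries
        dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)   # (T,) one discrete unit per frame
        return codebook[idx], idx

    rng = np.random.default_rng(0)
    z = rng.normal(size=(5, 8))          # 5 frames of 8-dim features
    codebook = rng.normal(size=(32, 8))  # K=32 discrete units
    zq, idx = quantize(z, codebook)
    print(idx)                           # five unit ids, one per frame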


The gibberish the model produces shows that something quite different is being encoded.

It completely fails with Spanish. There is no reason why it would if it were using phonemes.

The authors think the footnote explains the poor accuracy, but what it describes is in fact how phonemes supposedly work.



