
This is fascinating.

But I'm very curious what the emotional "parameters" are. There are literally at least a thousand different ways of saying "I love you" (serious-romantic, throwaway to a buddy, reassuring a scared child, sarcastic, choking up, full of gratitude, irritated, self-questioning, dismissive, etc. ad infinitum). Anyone who's worked as an actor and done script analysis knows there are hundreds of variables that go into a line reading. Just three words, by themselves, can communicate roughly an entire paragraph's worth of meaning solely by the exact way they're said -- which is one of the things that makes acting, and directing actors, such a rewarding challenge.

Obviously it's far too complex to infer from text alone. So I'm curious how the team has simplified it. What are the emotional dimensions you can specify, and how did they choose those dimensions over others? Are they geared towards the kind of "everyday" expression in a normal conversation between friends, or towards the more "dramatic" or "high comedy" intensity that much of film and TV leans towards?




I've always imagined that this tech would need a markup language. Instead of a script that an actor needs to interpret, the script writer (or an editor, or a translator) would mark up the text.


There is Speech Synthesis Markup Language (SSML). Amazon Polly and Google Text-to-Speech support it, although the best neural-model-based voices only support a small subset.
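
For a rough sense of what that looks like in practice, here's a minimal sketch of sending SSML prosody markup to Amazon Polly with boto3. The voice, region, and prosody values are illustrative, not from this thread, and neural voices may ignore or reject some of these attributes:

    # Minimal sketch: SSML prosody markup sent to Amazon Polly via boto3.
    # Voice, region, and prosody values are illustrative assumptions;
    # neural voices may ignore or reject some of these attributes.
    import boto3

    polly = boto3.client("polly", region_name="us-east-1")

    ssml = """
    <speak>
      <prosody rate="slow" pitch="-10%" volume="soft">I love you.</prosody>
      <break time="400ms"/>
      <prosody rate="fast" pitch="+15%">I love you!</prosody>
    </speak>
    """

    response = polly.synthesize_speech(
        Text=ssml,
        TextType="ssml",       # the input is SSML, not plain text
        VoiceId="Joanna",      # illustrative voice choice
        OutputFormat="mp3",
    )

    with open("line_reading.mp3", "wb") as f:
        f.write(response["AudioStream"].read())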


Ah thank you, that's very interesting.

So that's not markup along "emotional" lines, but rather in terms of "technical" attributes such as speed, pitch, volume, pauses between words, and so on.

Obviously coding those things in XML manually would be a nightmare. Now I find myself wondering 1) whether these technical parameters can be used to synthesize speech that sounds like a reasonable approximation of emotion (or whether they're insufficient because changes in resonance and timbre are crucial too), and 2) whether there are tools that can translate, say, 100 different basic emotional descriptions ("excitedly curious", "depressed but making an effort to show interest", etc.) into the appropriate technical parameters so it would be usable.
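
As a back-of-the-envelope illustration of idea 2), here's a hypothetical sketch of such a translation layer. The emotion labels, parameter values, and helper function are all made up, and it deliberately sidesteps the resonance/timbre problem:

    # Hypothetical sketch of idea 2): translate coarse emotion labels into
    # SSML prosody settings. Labels and values are invented for illustration;
    # changes in resonance and timbre are out of reach for prosody markup.
    from xml.sax.saxutils import escape

    EMOTION_PRESETS = {
        "excitedly curious": {"rate": "110%", "pitch": "+20%", "volume": "loud"},
        "depressed but making an effort to show interest": {"rate": "85%", "pitch": "-10%", "volume": "soft"},
        "reassuring a scared child": {"rate": "80%", "pitch": "-5%", "volume": "x-soft"},
    }

    def render_ssml(text: str, emotion: str) -> str:
        """Wrap plain text in a <prosody> element chosen from the preset table."""
        p = EMOTION_PRESETS[emotion]
        return (
            f'<speak><prosody rate="{p["rate"]}" pitch="{p["pitch"]}" '
            f'volume="{p["volume"]}">{escape(text)}</prosody></speak>'
        )

    print(render_ssml("I love you.", "reassuring a scared child"))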

Anyways, just a fascinating area of study.


Festival TTS already implements that[0].

[0] https://www.cs.cmu.edu/~awb/festival_demos/sable.html


I hear the same expression at different "strengths". There is no play, no motion. The expression should change after a response, but it doesn't. There is no dialogue. To me it sounds flat and boring. I'd rather not take part in a dialogue like that.

We can express emotions without words:

xxx: Distress

yyy: Support

xxx: Hope

It maps onto music, and we have a vocabulary to describe it. The track I'm listening to right now is sorrowful and hopeful all the way through. That may be a good start: write the classification first.

The examples you gave feel to me like they live on the same scale, just at extreme values -- so even harder.

I'd imagine it working like Auto-Tune: enhancing human input.



