Voice and the uncanny valley of AI (ben-evans.com)
40 points by kawera on June 24, 2017 | 14 comments



Having developed and sold virtual agents for Watson, this was one of the major issues I faced. I had to make it clear to our customers (and especially to the sales and marketing teams) that agents only work in a narrow domain, whose boundaries need to be made clear to the user. But that often got lost in translation and expectations were never met (especially in the early days of Watson, when our PR was out of control).

This said, it's not as limited as the author makes it out to be. There are classes of open requests that can be handled directly from content (i.e. not like an expert system or IVR, but more like a very smart NL search) without having to identify a precise intent. The Jeopardy system was a perfect example of that, and Google is now able to answer many question types (not just factoids but also questions with long answers, like how-tos) directly from content.
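
To make that concrete, here's a minimal sketch of the idea (toy corpus and scoring, invented for illustration; nothing like the real ranking these systems use):

    # Toy "answer directly from content": rank passages by term overlap
    # with the question, instead of mapping the question to a predefined intent.
    passages = [
        "To reset the router, hold the recessed button for ten seconds.",
        "The warranty covers parts and labor for one year.",
        "Billing questions can be directed to the accounts team.",
    ]

    def answer(question):
        q_terms = set(question.lower().split())
        # Score each passage by how many question terms it shares.
        scored = [(len(q_terms & set(p.lower().split())), p) for p in passages]
        best_score, best_passage = max(scored)
        return best_passage if best_score > 0 else "No answer found."

    print(answer("how do I reset my router"))
    # -> To reset the router, hold the recessed button for ten seconds.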

Dialog is the real limiting factor. Right now most dialog in production is scripted, with some smart features like slot filling, but it's still much closer to an expert system than to a statistical, example-based system.
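
For anyone unfamiliar with the term, here is a minimal sketch of scripted dialog with slot filling (hypothetical booking slots; production systems layer statistical intent and entity models on top of something like this):

    # Scripted dialog with slot filling: the "script" is a fixed set of slots,
    # and the system keeps prompting until every slot is filled.
    SLOTS = {
        "city": "Which city are you flying to?",
        "date": "What date do you want to travel?",
        "passengers": "How many passengers?",
    }

    def next_prompt(filled):
        # Return the prompt for the first unfilled slot, or None when complete.
        for slot, prompt in SLOTS.items():
            if slot not in filled:
                return prompt
        return None

    filled = {"city": "Lisbon"}    # extracted from the user's first utterance
    print(next_prompt(filled))     # -> What date do you want to travel?
    filled["date"] = "June 24"
    filled["passengers"] = "2"
    print(next_prompt(filled))     # -> None: all slots filled, execute the booking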


I'm not sure uncanny valley is the right term for this, though I see the analogy.

Rather, I think it's yet another case where success (or at least impressive advances) in a narrow domain gets conflated with something much broader. Because if a human can do A, they're clearly at least on the cusp of doing B and C too.

In the case of our digital personal assistants, they've actually gotten pretty good at voice recognition, at least given certain parameters of accents, etc. But what that means is that they're good at recognizing the appropriate wizard incantations and taking the corresponding action. Get off script? It's worse than talking to one of those outsourced call centers we all hate.
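
To caricature the point (a made-up phrase table, not how any shipping assistant actually works): the "incantation" model is essentially a lookup from known phrasings to actions, so anything off-script falls straight through:

    # Hypothetical assistant: a lookup from known phrasings to actions.
    # On-script requests work; even a mild paraphrase falls through.
    COMMANDS = {
        "set a timer for five minutes": "TIMER(5 minutes)",
        "what's the weather today": "WEATHER(today)",
    }

    def handle(utterance):
        action = COMMANDS.get(utterance.lower().strip())
        return action or "Sorry, I didn't understand that."

    print(handle("Set a timer for five minutes"))  # -> TIMER(5 minutes)
    print(handle("Could you time five minutes?"))  # -> Sorry, I didn't understand that.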


I'd say he is spot on with the analogy. He's describing a situation where the actual solution to the problem at hand (a quality Voice experience) is just the opening of a door to a much more complex charade than originally understood. I spent a decade becoming an expert at creating digital doubles of people, which is where the term "uncanny valley" originates. The 3D-graphics-person situation is very similar. Once you develop a method of generating a likeness of someone, it's not right because the model doesn't have the real person's hair; then, after creating their hair, the human model needs clothing models in the style that person would actually wear. But it's still not right, because the character doesn't exhibit the specific facial expressions the real person uses that are characteristic of their personality. After that come their characteristic body movements: how they stand, sit, how they idle.

This is very similar to the issue with quality Voice. It's not just creating a voice, just as it's not just creating a 3D model. It is, in fact, creating a model of reality that the software holds, one that can be queried both by external humans and by the software model itself, since that self-conversation within the software is necessary to fulfill the simulation and, finally, to generate an experience that pierces the uncanny valley for both voice and avatars. But it's going to require 2-200 times more computational capacity than we're throwing around now. The devil is in the details.


Uncanny valley is about knowing something is off, even though it's almost impossible to see what, and feeling eerie about it.

I don't see the analogy here, and as a user of Google Home with my entire family, it's never been anything close to what the uncanny valley is about.


There is another usage I've seen in AI discussions involving the uncertainty of the gap between current tech and general AI, and the chance we may cross it accidentally.


I don't really disagree, but the argument in the piece isn't so much about creating a quality digital voice (which is its own problem) as about "understanding" verbal requests in a broad context.


> In the case of our digital personal assistants they've actually gotten pretty good at voice recognition

Maybe so, but it has been months since my smartphone last detected me screaming "OK Google" at it. No reboots, resets, or recording sessions with different tones could fix it.

Admittedly, I am singling out one app on one phone, and I don't know how Alexa, Siri and the others perform, but I remain unconvinced by all this AI and the current assistant tech.


The microphones on the home devices definitely help. But we're getting there in voice recognition. The NLP is a lot tougher and may be further out than a lot of people assume.


To be honest, it's only the detection of "OK Google" that is broken. Manual activation of speech recognition does work as intended, or at least as expected.


Are you sure there isn't a setting someplace? Also I believe in the case of an iPhone it needs to be plugged in.


Maybe. There are some settings related to Google Maps and driving mode, but frankly... I stopped looking. If it's there, and half-hours here and there of digging through menus didn't turn it up, I consider it broken through bad UI.


You wouldn't happen to be the actor in that Burger King commercial, would you?


IVR is mentioned but not described in the article. It stands for interactive voice response, which is the industry term for phone trees, of all things.

An interesting tidbit that could have benefited from some fleshing out in the article, especially since it so accurately describes the current state of voice assistants.
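
For anyone who hasn't met the term, a phone tree is just a fixed menu keyed on digit presses; as a toy sketch (hypothetical menu):

    # An IVR "phone tree" is a fixed decision tree keyed on digit presses.
    PHONE_TREE = {
        "prompt": "Press 1 for billing, 2 for support.",
        "1": {"prompt": "Press 1 for invoices, 2 for payments."},
        "2": {"prompt": "Please hold for the next available agent."},
    }

    def navigate(node, presses):
        for key in presses:
            node = node.get(key, node)  # unrecognized input just repeats the menu
        return node["prompt"]

    print(navigate(PHONE_TREE, ["1"]))  # -> Press 1 for invoices, 2 for payments.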


I remember when voice recognition was barely usable (the '90s and early 2000s); then, all of a sudden, within the last decade it started showing up everywhere and it was amazingly good. Siri can understand most of what I say to it, even while driving with music in the background.

What are the successful algorithms that made this leap forward possible? HMMs? Neural networks? More data and compute? What changed since Dragon NaturallySpeaking in 1997? Can anyone recommend overview papers or blogs on this topic that I could use to get up to speed?



