> *It is possible to ask for sensitive data such as the user’s password from any...

pimlottc · on Oct 21, 2019

I'll take a stab at it. What's different here is that an audio-only UI makes it particularly hard for the user to know what program they're interacting with.

Visual UIs generally offer a host of cues to indicate what program is running, and take special efforts to make security-sensitive interactions and dialogs hard to fake. Using these techniques in a voice UI is tricky. There's no good way to tell where the last output came from, or where the next input is going. How can a user be certain that a request for privileged information is coming from a trusted source? In this example, Google clearly tried to create a signature sound (the "Bye earcon") that lets the user know when an app has exited, but an app was able to fake it. The attack leverages the user's trust that was built up by Google.

I think this article provides a useful example that highlights the particular difficulties securing a voice UI system from phishing attacks.

jen_h · on Oct 21, 2019

It's not really any different -- changing server behavior after app review is possible for any server-based app. The one thing they exploited that is unique, though, is that when a user talks to a smart device, they generally don't know at any given time if their commands are going to Google/Amazon or are being passed to Google/Amazon and the third-party developer.

As far as many users are concerned, they're talking to Alexa. The third party app is Alexa, too.

And because of the opaque single-dimensional nature of voice interfaces, even a savvy user doesn't know who's really receiving their intent -- there are enough glitches where you think you're sending to the active skill, but you're back in Alexa's lobby again, so the inverse case the researchers are playing with is a good vector.

I think they could solve some of this because Amazon/Google are gatekeepers -- they get user input no matter where it goes -- they could easily automate detecting anomalous user input and flag for review (that would of course miss the first victims, but it's better than nothing).

I think the "Who's listening?" part is a little harder to solve. Maybe by forcing the third party app to always announce itself as itself? But that does add some friction to the "experience" they want to provide...however, a little friction is better if it means protecting your users.

jen_h · on Oct 21, 2019

Okay, just thought of something here — force third party apps to use a different voice.

Developers (on Alexa, at least) can optionally do this now with SSML, but making it a requirement would be an audio cue to users that the “actor” has changed — without adding any delay to the interaction.

jka · on Oct 20, 2019

A decent chunk of computer security work seems to be around finding these same exploits in new devices, since we haven't convinced ourselves to stop reinventing and selling new connectivity devices :)

i.e. - it's not a new technique, but a new instance of the problem, and that makes it worthwhile (especially for something widely used in private environments) to explore and expose.

It'd be nice if we could reach some kind of device/phone capability plateau and reduce consumption of new equipment. And ideally settle on a small set of software to use on those, which could be hardened and made reliable over time.

Until then, ...

imglorp · on Oct 20, 2019

It's not, just another channel with its own inherent rate of success, maybe around non-technical users who might mistakenly trust the speaker but not a person on the phone.

Back in 1966, the makers of the Eliza AI chatbot program were shocked to learn people inherently trusted the program and told it things they didn't want other people to hear. So I propose vishing capitalizes on this phenomena.

https://en.wikipedia.org/wiki/ELIZA