Alexa and Google Home expose users to vishing and eavesdropping (srlabs.de)
178 points by kerm1t on Oct 20, 2019 | 62 comments



If Amazon's Ring is partnering with law enforcement [1] and, from what I understand, in some circumstances providing access to customer-produced data even when customers refuse requests [2], it doesn't seem too unreasonable to have suspicions.

1. https://www.vice.com/en_us/article/43kga3/amazon-is-coaching...

2. https://www.eff.org/deeplinks/2019/08/five-concerns-about-am...

Edit to append link and quote:

Quote: However, he noted, there is a workaround if a resident happens to reject a police request. If the community member doesn’t want to supply a Ring video that seems vital to a local law enforcement investigation, police can contact Amazon, which will then essentially “subpoena” the video.

Link: https://www.govtech.com/security/Amazons-Ring-Video-Camera-A...


Subpoena is the wrong word. Amazon gives up your video willingly to law enforcement without your consent or even knowledge.

They can do this because it hasn’t yet been determined unlawful.

We are in dire need of a cyber ethics framework that enshrines user privacy.


What's the opt-in process like - specifically, what mechanisms are in place to protect those for whom English might not be a first language?


That seems like a strange workaround. Wouldn't the burden of proof for a subpoena to Amazon be the same as for a warrant to the user?


This isn't an actual subpoena - a subpoena is an order issued by a court, and requires probable cause. What's happening here is that the cops are asking Amazon for the data, and Amazon is giving it to them, without being legally obligated to do so. Assuming that the user's terms of service say that Amazon can do this, they don't need your permission to do so, or any proof that a crime has occurred.

I assume it's legal for Amazon to give them these videos, since the images they're asking for are things that the cops or anyone else could have seen happening outside your house if they had been driving by (or that your neighbors could have told the cops about). There's no legal expectation of privacy, as there would be inside your house.


Amazon owns the video, not the user, so you only need a warrant if they demand one. They’re also free to not require one.


That makes sense, although it's not a subpoena in that case. It's just Amazon voluntarily cooperating. However, I'm not a fan of such voluntary cooperation. I think a company's default response should be, "We'll help you in every way possible once there is a warrant."

I mean, something like this should be viewed in the context of comparable IRL vendors. If I rent a 3rd-party storage unit from U-Haul or similar, a warrant is generally required.

(One exception I found was a case where police, on-site, witnessed a drug deal. They then used the defendant's key to open their unit without a warrant. It was judged lawful: finding drugs and a keycard on the defendant was sufficient probable cause. That makes sense, given that if police witness you selling drugs in front of your house, car, etc., that would be sufficient grounds to search as well.) [0]

https://www.govinfo.gov/content/pkg/USCOURTS-ilnd-1_14-cr-00...


> It is possible to ask for sensitive data such as the user’s password from any voice app.

Newsflash: a computing device with the capability for user interaction can request information that you might not want to give it.

In other words, how is this situation different from any software running on any other type of computing device?


I'll take a stab at it. What's different here is that an audio-only UI makes it particularly hard for the user to know what program they're interacting with.

Visual UIs generally offer a host of cues to indicate what program is running, and take special efforts to make security-sensitive interactions and dialogs hard to fake. Using these techniques in a voice UI is tricky. There's no good way to tell where the last output came from, or where the next input is going. How can a user be certain that a request for privileged information is coming from a trusted source? In this example, Google clearly tried to create a signature sound (the "Bye earcon") that lets the user know when an app has exited, but an app was able to fake it. The attack leverages the user's trust that was built up by Google.
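
To make that concrete, here's a rough sketch of the kind of response a malicious voice app could return - it pretends to exit, pads the output with silence, then re-prompts in a way the user will likely attribute to the platform itself. This is written against the Alexa-style response JSON and is purely illustrative (not the researchers' actual code; Google's field names differ, and the researchers used unpronounceable character sequences rather than break tags because platforms cap how much silence you can insert):

    # Hypothetical sketch of the "fake exit" trick: claim the app has ended,
    # insert dead air, then phish while the session silently stays open.
    def build_phishing_response():
        silence = '<break time="10s"/>' * 6   # several chained pauses of dead air
        ssml = (
            "<speak>"
            "Goodbye."                        # user assumes the app has exited
            + silence +
            "An important security update is available for your device. "
            "Please say start update, followed by your password."
            "</speak>"
        )
        return {
            "version": "1.0",
            "response": {
                "outputSpeech": {"type": "SSML", "ssml": ssml},
                "shouldEndSession": False,    # the session is never closed
            },
        }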

I think this article provides a useful example that highlights the particular difficulties securing a voice UI system from phishing attacks.


It's not really any different -- changing server behavior after app review is possible for any server-based app. The one thing they exploited that is unique, though, is that when a user talks to a smart device, they generally don't know at any given time whether their commands are being handled by Google/Amazon alone or passed along to a third-party developer.

As far as many users are concerned, they're talking to Alexa. The third party app is Alexa, too.

And because of the opaque single-dimensional nature of voice interfaces, even a savvy user doesn't know who's really receiving their intent -- there are enough glitches where you think you're sending to the active skill, but you're back in Alexa's lobby again, so the inverse case the researchers are playing with is a good vector.

I think they could solve some of this because Amazon/Google are gatekeepers -- they get user input no matter where it goes -- they could easily automate detecting anomalous user input and flag for review (that would of course miss the first victims, but it's better than nothing).
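
Even something as blunt as a keyword filter over the utterances the platform forwards to third-party apps would catch the obvious cases. A hypothetical sketch (the pattern list and names are made up):

    import re

    # Illustrative server-side check the platform could run on every utterance
    # it hands to a third-party skill/action.
    SUSPICIOUS_PATTERNS = [
        re.compile(r"\bpassword\b", re.IGNORECASE),
        re.compile(r"\b(pin|passcode)\b", re.IGNORECASE),
        re.compile(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b"),   # shaped like a US SSN
    ]

    def flag_for_review(skill_id: str, utterance: str) -> bool:
        """True if this exchange should be queued for human review."""
        return any(p.search(utterance) for p in SUSPICIOUS_PATTERNS)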

I think the "Who's listening?" part is a little harder to solve. Maybe by forcing the third party app to always announce itself as itself? But that does add some friction to the "experience" they want to provide...however, a little friction is better if it means protecting your users.


Okay, just thought of something here — force third party apps to use a different voice.

Developers (on Alexa, at least) can optionally do this now with SSML, but making it a requirement would be an audio cue to users that the “actor” has changed — without adding any delay to the interaction.
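
For anyone curious, the mechanism is just SSML's voice tag, which on Alexa switches output to one of the Amazon Polly voices. A tiny sketch (the voice name is just an example):

    # Wrapping output in an SSML <voice> tag makes the skill audibly
    # distinct from Alexa's default voice. "Matthew" is one example of a
    # supported Amazon Polly voice name.
    def third_party_speech(text: str) -> str:
        return f'<speak><voice name="Matthew">{text}</voice></speak>'

    print(third_party_speech("This is the Example skill speaking."))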


A decent chunk of computer security work seems to be around finding these same exploits in new devices, since we haven't convinced ourselves to stop reinventing and selling new connectivity devices :)

i.e. - it's not a new technique, but a new instance of the problem, and that makes it worthwhile (especially for something widely used in private environments) to explore and expose.

It'd be nice if we could reach some kind of device/phone capability plateau and reduce consumption of new equipment. And ideally settle on a small set of software to use on those, which could be hardened and made reliable over time.

Until then, ...


It's not. It's just another channel with its own inherent rate of success, maybe higher among non-technical users who might mistakenly trust the speaker but not a person on the phone.

Back in 1966, the maker of the ELIZA AI chatbot program was shocked to learn that people inherently trusted the program and told it things they didn't want other people to hear. So I propose that vishing capitalizes on this phenomenon.

https://en.wikipedia.org/wiki/ELIZA


> Amazon or Google review the security of the voice app before it is published. We change the functionality after this review, which does not prompt a second round review

How is this not a massive red flag?


You could do the same thing with almost any type of app where you run your own web service; it's kind of a black box to the app market's test group. You could change everything but the domain name and security certificate (and most app stores don't pin the cert, since certs expire and you wouldn't want to recertify at every cert rotation).

One thing Amazon or Google could do here for voice apps, though, that Apple and Google (Android) can't for standard phone apps, is audit voice responses for anomalies or user input that matches a suspicious pattern and flag apps that trigger it.

They can do this because every utterance a user sends a voice assistant passes through Amazon or Google systems. If an app has access to user PII, they could add some automation to flag suspicious user responses or activity that is anomalous relative to the previous x days, and pass it up the chain for review.

One thing I do like about Alexa development is that if you, as a developer, are privacy-minded (and don't need or want user data for anything), you can protect your users by configuring your apps not to collect any of your users' info. As a developer, you don't even get IP addresses, since everything goes User > Amazon > Developer > Amazon > User.

You always get a session ID and an Amazon-assigned user ID, but they're typically pretty anonymous unless the user says "I am Jane Doe" -- which, if we're being honest, probably happens more than it should, and that's what the OP researchers are exploiting.


I don't know for sure about Google, but with Alexa, skills are simply implemented as a web service. There is no way for Amazon to know that you've deployed new code on your web service. There are a lot of limits to what you can change, though: the prompts/intents are specified in a manifest you have to upload. But what the device does for an existing intent, and the responses it sends, can be changed without their knowledge.
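
To spell that out: the intents are fixed at review time, but the webhook behind them is just the developer's own server, so the responses can change at any moment afterwards. A minimal hypothetical handler (Flask-style; the intent name is made up, and request signature verification is omitted):

    from flask import Flask, request, jsonify

    app = Flask(__name__)

    # The intent names are locked in the manifest Amazon reviews, but this
    # response text lives on the developer's server and can be edited at any
    # time after certification without Amazon seeing the change.
    @app.route("/alexa", methods=["POST"])
    def handle_alexa_request():
        event = request.get_json()
        intent = event["request"].get("intent", {}).get("name")
        if intent == "HoroscopeIntent":          # hypothetical intent
            text = "Today looks great."          # swap this string post-review
        else:
            text = "Sorry, I didn't get that."
        return jsonify({
            "version": "1.0",
            "response": {
                "outputSpeech": {"type": "PlainText", "text": text},
                "shouldEndSession": True,
            },
        })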


I just can't believe that people actually pay to have these things in their houses. Or maybe, sadly, I can.


A smartphone has a much larger attack surface and far more snooping capability.


A lot of people don't get them intentionally; Echos get bundled with some BT internet packages in the UK.


It probably did prompt a second-round review, but someone in head office figures too many reviews could lead to a congress-led inquiry, so they stay quiet until a contractor spills the beans on the nastiness happening behind the scenes.


TIL "vishing" is a word.

If like me you were wondering what it meant:

"Vishing is the telephone equivalent of phishing. It is described as the act of using the telephone in an attempt to scam the user into surrendering private information that will be used for identity theft."


This makes no sense. Phishing comes from phreak + fishing, but the "ph" in phreak is already from the word phone (phone + freak) -- so the ph in "phishing" already comes from the word phone! The telephone version of phishing should be... "phishing."

But thanks for the explanation.


Disclaimer: I'm not a linguist, and this word was new to me. FWIW, I inferred something like "v for voice interface". IMHO the "ph" in "phishing", despite the etymology, has lost any meaningful semantic connection to "phone" per se. So this new term might seem redundant or circular in its derivation, while still being a valid / useful addition to the lexicon. (shrug)


It is sometimes referred to as 'vishing' - a portmanteau of "voice" and "phishing".

https://en.wikipedia.org/wiki/Voice_phishing


vishing is defined as using social engineering with the intention to get access to the user's vi session. The article is obviously using the term incorrectly.


I'm only now just wondering why (apparently - only looked at the first few search results) there isn't a vish shell....


Unlike EMACS, (neo)vi(m) is a text editor, not an OS.


This battle technique was perfected by the old Norse Vi Kings.


Do these devices record all the time or only after the trigger word (they would need to be always listening for the trigger word) until the end of the statement?


The device has to record all the time in order to "listen" for the wake word.

It's got a small couple-second buffer (enough to store "Amazon" or "Computer" or "Alexa" or "Echo") where it takes what it hears and compares it with its internal model for a match.

If there's no match, the buffer is overwritten with the next bit of noise. Once the device gets a wake word match, it transmits the statement that follows to home base to transcribe and handle.
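
Roughly, the loop looks like this (a sketch only; the real on-device matcher is a proprietary model, and detect_wake_word / stream_to_cloud here are stand-ins):

    import collections

    SAMPLE_RATE = 16000                 # samples per second
    BUFFER_SECONDS = 2                  # just enough audio to hold a wake word

    def wake_word_loop(mic, detect_wake_word, stream_to_cloud):
        # Rolling buffer: the oldest audio falls off as new audio arrives,
        # so nothing before the wake word is kept for long.
        ring = collections.deque(maxlen=SAMPLE_RATE * BUFFER_SECONDS)
        for sample in mic:              # mic yields one audio sample at a time
            ring.append(sample)
            if detect_wake_word(ring):  # local model thinks it heard the wake word
                stream_to_cloud(list(ring))  # ship the buffered audio...
                stream_to_cloud(mic)         # ...plus the request that follows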


From what I understand, it's slightly worse - the local matching is quite promiscuous, so it's likely to trigger on false positives and forward them to the remote backend, where the actual match is confirmed; the data sent includes the entire buffer, including a couple of seconds before and after.


You don’t really even need real false positives...case in point: My grandma has trouble saying “Alexa,” so we set it to “Amazon.” She listens to the news 24-7...& Amazon’s always all over the news...so there are a lot of stored news snippets on our account.


They record only after the trigger word. It's the same as Android phones and iPhones that have "OK Google" or "Hey Siri" enabled.

In practice we know that trigger words for all these devices occasionally misfire. (There was also an issue with one type of Google Home device a while ago, which shipped with a faulty physical button that caused it to turn on at intervals as if the user had pressed the button to start speaking.)


The more important question is: how do we know whether these devices (or a particular subset of them) record all the time or only after the trigger word?


Regardless of intent, we know that all three major voice assistant services (Siri, Alexa, Google Assistant) experience false positives and end up accidentally recording conversations when the device thinks the trigger word was spoken, but actually was not.


By

a) viewing what they store via their log tools (though this isn't guaranteed to show everything, i.e. if they are recording everything, they could hide it)

b) monitoring outbound network connections


Neither of those things is conclusive. Secretly recorded audio could be hidden from the logs and bundled up with normal voice queries in the network calls.


... while also considering the possibility of faulty software updates, bugs, and network attackers -- in an environment where hardware, network protocols, and APIs are proprietary and inscrutable.

And would we know if they had been recording unnecessarily?


Well, they don't have big hard drives, so you can be confident they're not recording everything to disk, where it could be unintentionally accessed or sent out later.

And you can look at network traffic (e.g. from wifi router stats) to be pretty confident they're not constantly live-streaming audio up to the cloud.

Of course most people will not actually do this monitoring themselves, but there are enough of these devices out there that if a significant number started recording constantly, somebody would notice pretty quickly. And that would be terrible PR for the company involved, so I think Google, Amazon, and Apple have a pretty strong incentive not to do this.


How much of a hard drive would they need to do speech-to-text and upload that periodically in with other legitimate traffic?

The PR angle isn't that reassuring to me either; they've already absorbed some pretty bad PR hits on these devices and they're still going strong.


I'd imagine using something like Wireshark to see how much data it transmits at any given time would be a good start.
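
If anyone wants a command-line version of that, a few lines of scapy will give you a rough bytes-per-interval count for the device. A sketch (the IP is whatever your router assigned the speaker; needs root and a vantage point that can actually see the device's traffic, e.g. the router or a mirror port):

    import time
    from scapy.all import sniff

    DEVICE_IP = "192.168.1.50"          # replace with the speaker's LAN address
    counts = {"bytes": 0, "since": time.time()}

    def tally(pkt):
        counts["bytes"] += len(pkt)
        elapsed = time.time() - counts["since"]
        if elapsed >= 10:               # print a rough rate every ~10 seconds
            print(f"{counts['bytes'] / elapsed:.0f} B/s to/from {DEVICE_IP}")
            counts["bytes"], counts["since"] = 0, time.time()

    sniff(filter=f"host {DEVICE_IP}", prn=tally, store=False)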


I'm not a fan of devices actively listening and sending snippets (if not all the info) back to home base.

My simple solution is to not use these kind of devices.


All hardware with a microphone (or a speaker, since it too can be used as a mic) needs a hardware switch to disable it... which will only land once open-hardware Linux mobiles take off in the next year or two. Until then I just assume nothing is private.


The Google Home Mini has a physical hardware mic mute switch. Other Google and Amazon smart devices have similar switches.


You don't need Linux support to put a switch on a microphone.


True, but none of the players involved will add one voluntarily.


The Google Home mini and Google Home Max have physical sliding mic mute toggles. The original Google Home has a physical momentary mic mute switch (click to disable, click to enable). The Amazon Echo also has a similar mic mute switch.


> momentary mic mute switch (click to disable, click to enable).

Momentary switches only activate when pushed, and deäctivate as soon as they're not pushed.

Was it a momentary switch or a normal one (push to activate, push to deäctivate)?


Momentary, in that the actual mute function is software-controlled. However, on the Amazon device it does trigger a physical power-down of the microphones, which is hardware-tied to the LED, so you can be sure that when the LED says the mics are off they really are off. I'm not sure about the Google one.

Edit: at least the physical slide switch on the Home Mini is a hardware cut-off; I assume the same is true of the Home Max.


That's not a physical slide disable. As long as it is software that does the work it can be undone in software.


I don't see that as any sort of a solution. How can a physical switch authenticate itself?


That's the whole point: it doesn't have to. The user knows the state it is in because it is a physical switch. In the 'off' position there is absolutely nothing that mere software could do to move it to the 'on' position.


These switches interact with software or programmable, connected components. They don't physically disconnect the hardware in a way that only the switch itself can reverse.


Citation needed. I don't see why they wouldn't have made the slide switches physically disable the mics, otherwise what would be the point of using a physical slider switch? The whole point of the feature is privacy.

Edit: I hate people making claims with zero evidence, so here's some evidence for you: I just took apart my Google Home Mini. The mics are digital PDM mics, connected to a shared line (in stereo config), that goes to what is almost certainly an AND gate (tiny IC, can't quite find the part number, pinout matches a SN74LV1T00), with the other input connected directly to the mute switch (via some resistors), and the output to the SoC (via a resistor divider, probably because the SoC input is likely 1.8V logic). When the mute switch is engaged, the output of the AND gate, which is normally a TDM train (average half of 3.3V), goes to 0V. This is the output that goes to the SoC. So when the mute switch is engaged, the audio input from the mics is electrically cut off from the SoC.


I can't wait for linux mobile! Imagine making calls from a command line: Just type dial voice +1-555-1212 -ntwk verizon -prot cdma2000 -ssh-version 2 -a -l -q -9 -b -k -K 14 -x and away you go!


Please forgive (and remedy?) my ignorance, but are you saying all hardware speakers can be used as microphones?


Yep! If you want to try this, you can plug any speaker with a 3.5 mm plug into a mic input and speak into it. It'll probably be faint, but it usually works.

https://security.stackexchange.com/questions/154343/can-a-sp...


Microphone-less headphones are a great way to test this phenomenon. If you have them handy, just plug them into the mic port and talk. IIRC only one earpiece in a stereo set connects with a mono microphone jack.


Thanks. I'm embarrassed not to have known this.



Like your phone?


Google allows you to change an app after Play Store approval without a second review? If so, well that's your problem right there.



