Example 1, if we're referring to the likes of Siri and Alexa, isn't thanks to improvements in personal computer technology - those platforms send your speech recording off to a massive datacenter for processing. No need for 10 GHz processors there.
Example 2 requires the use of depth-sensing and heat-sensitive cameras to avoid trivial "show a photo of an authorized user" attacks - that's not really CPU-dependent either.
Not sure if Siri and Alexa (and Google Now) send every recording to a data center - the .NET speech recognition libraries ship with Windows, so all audio data stays local to your PC, afaik. I'd expect Cortana leverages these libraries as well, instead of sending all data to a remote server.
You can build basic functionality into a speech bot in PowerShell ("PowerShiri, what time is it?" "PowerShiri, what is the weather?") as a weekend project: https://news.ycombinator.com/item?id=11663029
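For the curious, here's a minimal sketch of the idea using the System.Speech assembly that ships with Windows. The "PowerShiri" phrase and the canned reply are placeholders I made up, not the linked project's actual code:

    # Minimal local speech bot sketch (assumes Windows + .NET System.Speech)
    Add-Type -AssemblyName System.Speech

    # Recognizer listens on the default microphone, entirely offline
    $rec = New-Object System.Speech.Recognition.SpeechRecognitionEngine
    $rec.SetInputToDefaultAudioDevice()

    # Constrain recognition to a hardcoded phrase
    $choices = New-Object System.Speech.Recognition.Choices
    $choices.Add("PowerShiri what time is it")
    $gb = New-Object System.Speech.Recognition.GrammarBuilder
    $gb.Append($choices)
    $rec.LoadGrammar((New-Object System.Speech.Recognition.Grammar($gb)))

    # Synthesizer speaks the reply
    $tts = New-Object System.Speech.Synthesis.SpeechSynthesizer

    while ($true) {
        $result = $rec.Recognize()   # blocks until a phrase is heard
        if ($result -and $result.Text -eq "PowerShiri what time is it") {
            $tts.Speak("It is " + (Get-Date -Format "h:mm tt"))
        }
    }

Which is sort of the point of this subthread: recognition like this runs entirely on the local machine, no datacenter involved.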
Cortana sends it off to a server like everyone else. Those libraries are probably not Microsoft's latest and greatest.
Modern speech models are quite big - not so big that you couldn't load one on your desktop, but big enough that you'd notice. Couple that with the fact that the search or other service is going to happen on a server anyway, and client-side processing doesn't make sense beyond a few functions.
Alexa can only recognize the wake word locally, which is also why you can only choose between three of them - the device isn't able to detect anything else. After that, the command is sent to Amazon's datacenter for analysis.
I don't know of any smartphone or similar device that would interpret voice commands locally.
Android has had face unlock built in for some time. One naive way it mitigates the "unlock with a photo" vulnerability is an option that requires you to blink. Not as robust as 3D, heat-sensitive cameras, but at least it's not as trivial as showing a photo to beat it.
The impressive amount of processing power available on many smartphones today certainly contributes to this being a practical unlock method.
You're right - you also need to sweep a pencil across the photo's eyes to defeat that. If you're interested in more detail, Google starbug's excellent talk "Ich sehe, also bin ich ... du" ("I see, therefore I am ... you" - there's an English translation).
Siri sends data to the cloud to process what you're asking for, but the actual speech-to-text can work locally these days. Try it - put your iPhone in airplane mode and fire up dictation - works like a charm.
How much do you pay for data, and how much do you dictate, for this to be an actual gripe?
I assume Google encodes the data in something like iSAC [1], which needs 32 kbit/s for good-quality speech, so an hour of dictating is 3600 × 32 / (8 × 1024) ≈ 14 MB.
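Back of the envelope, assuming that 32 kbit/s figure holds:

    # hypothetical estimate: one hour of speech at 32 kbit/s
    $seconds = 3600; $kbps = 32
    $mb = $seconds * $kbps / (8 * 1024)   # kbit -> kB -> MB
    "{0:N1} MB" -f $mb                    # ~14.1 MB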