Voice interfaces are likely to be a big part of our future, and to build the best voice ML tech you need a ton of real-world data. I wouldn't be overly surprised to see them pay people to own and use these things.
I agree on the land grab. But on data, we're talking about orders of magnitude: accents, regional phrasing, idioms, and code-switching between languages.
The models are moderately good at basic speech-to-text and some grammatical parsing, but there's a long way to go. The simplicity of the current command set isn't an indicator of their aspirations. Eventually you'd want a system capable of understanding any utterance by any human, at least as well as another human could, and certainly not just in command syntax.