Since this is explicitly targeted at "the next billion users," do we have any sense of how well-optimized this is on non-English audio corpora? I can't imagine that a model trained primarily on English/Western phonemes would perform as well on the rest of the world.
Unlikely; there's too much pride in each local language. They might all converge on English over a couple of generations, though, but more for commercial reasons.