Hacker News

"Duh, they want to train a speech-to-text parser, and that's the best way to do it."

I'm not sure I agree. But what do I know? The problem, to me, is individual speakers: they vary so greatly that any single prediction based on a population-level phoneme histogram will be very noisy. Speech recognition already works decently when individuals train their own classifiers, and I'm not sure why products don't coalesce around those speaker-specific voice profiles. This strikes me as a case where one massive effort to collect many examples is actually more problematic than helpful. It may work for easily discriminated phonemes (e.g., one, two, three, four), but the spoken lexicon is a minefield of starlust knights. Context helps greatly, but that's much more data than would be acquired in a 411 call.
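To make the point concrete, here's a toy sketch (my own, not from the thread, with made-up numbers): two hypothetical speakers each keep their two vowels well separated along a single acoustic feature, but their ranges overlap once pooled, so a single population-wide decision boundary is noisier than per-speaker ones.

```python
import random

random.seed(0)

def samples(mean, n=200, sd=1.0):
    """Draw n Gaussian samples of a 1-D acoustic feature (think: a formant)."""
    return [random.gauss(mean, sd) for _ in range(n)]

# Hypothetical speakers: each separates vowel v1 from v2 by 4 units,
# but speaker B's whole range is shifted up by 3 relative to A.
speakers = {
    "A": {"v1": samples(4), "v2": samples(8)},
    "B": {"v1": samples(7), "v2": samples(11)},
}

def accuracy(threshold, v1, v2):
    """Classify by threshold: below -> v1, at-or-above -> v2."""
    correct = sum(x < threshold for x in v1) + sum(x >= threshold for x in v2)
    return correct / (len(v1) + len(v2))

def mean(xs):
    return sum(xs) / len(xs)

# Per-speaker model: threshold at the midpoint of that speaker's vowel means.
per_speaker = [
    accuracy((mean(d["v1"]) + mean(d["v2"])) / 2, d["v1"], d["v2"])
    for d in speakers.values()
]

# Pooled "population" model: one threshold from everyone's data, applied to all.
all_v1 = speakers["A"]["v1"] + speakers["B"]["v1"]
all_v2 = speakers["A"]["v2"] + speakers["B"]["v2"]
pooled = accuracy((mean(all_v1) + mean(all_v2)) / 2, all_v1, all_v2)

print(f"worst per-speaker accuracy: {min(per_speaker):.2f}")
print(f"pooled accuracy:            {pooled:.2f}")
```

With these (invented) numbers, speaker A's v2 and speaker B's v1 straddle the pooled threshold, so the pooled model misclassifies a large fraction of exactly the sounds each per-speaker model gets right.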

We shall see...



