Building those models sounds like a lot of continuous, boring work, and is probably best done by a company, unless you find some crowd-sourced way of building them.
Exactly. See the other comment in response to mine. If Mozilla can do this for speech recognition, then we can do it for image recognition/transcription tasks.