I wonder: how far along, quality-wise, are open source alternatives for speech recognition and synthesis? I'm not saying this is the only feature required, but it is a starting point. I have been playing with the Google APIs and they are great, but it would be even better not to rely on an external API.
Not very good, unfortunately. There is pocketsphinx, which sort of works but not nearly as well as the online ones, and there are some other research projects that are very hard to even set up.
A while ago there was an article on here about a free (but not open) recognition engine that worked on a Raspberry Pi, but I forgot the name. The founders were hanging around here back then, so maybe they can chime in? I haven't had a chance to try it though.
There was (or is) one called Mycroft, which promised big things but didn't deliver compelling recognition or synthesis, although the recognition seems better than the synthesis from what I have seen. They did open source much of the work, at least initially. There is a video showing how bad the UX was for a simple question about beans: https://youtu.be/D5J7vVQNkCw