Hi all, author here. Besides the tech of Mimic 3 itself, I'm interested in training voices in as many (human) languages as possible. All it takes is one person willing to donate a dataset for everyone to benefit!
...well, that and a bunch of stuff with phonemes. But I'll do that part :)
The Mozilla Common Voice dataset is awesome - however, it's useful for the opposite purpose - speech-to-text. This is because it is a lot of different people using a range of hardware, speaking similar phrases.
For good text-to-speech you need one person speaking different phrases but very consistently. Here's an example dataset from Thorsten, a German open voice enthusiast: https://openslr.org/95/
What does it take to add Chinese and Japanese to this? Surely it's a lot more than just training sets right? I have an android phone without access to google tts, so this might actually potentially be a nice alternative.
They want you to make good quality audio recordings of you speaking about 20 000 phrases. It could take 40 to 80 hours of speaking and recording, maximum 4 hours per day.
The amount of data depends on if there's a voice for the language already. If so, about 2 hours of data is usually good enough. Otherwise, 10-20 hours usually does it.
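To put those figures side by side, here is a rough back-of-the-envelope sketch. It only uses the numbers quoted in the two comments above (20,000 phrases, 40 to 80 hours, 2 and 10-20 hours), and it assumes the 4-hours-per-day ceiling applies to every case; the function names are made up for illustration.

```python
def seconds_per_phrase(total_hours, phrases):
    """Average time budget per phrase, including retakes and pauses."""
    return total_hours * 3600 / phrases


def recording_days(total_hours, max_hours_per_day=4):
    """Minimum number of days if you never exceed the daily ceiling."""
    return total_hours / max_hours_per_day


# 20,000 phrases spread over 40 to 80 hours of recording.
print(seconds_per_phrase(40, 20_000), seconds_per_phrase(80, 20_000))

# Days needed for each of the estimates mentioned in the thread.
for hours in (2, 10, 20, 40, 80):
    print(f"{hours:>2} h of audio -> {recording_days(hours):.1f} days at 4 h/day")
```

So the larger estimate works out to roughly 7 to 14 seconds per phrase and 10 to 20 recording days, while the 10-20 hour figure would be 2.5 to 5 days under the same assumption.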
Does anyone know just how much of the total functionality of Mycroft is actually running on the Raspberry Pi? I asked this question four years ago on Reddit (I'll paste the response below) and now I wonder if things have changed, particularly with regard to speech to text.
There are several ‘layers’ to a voice assistant:
Wake Word - that detects when you are speaking to the device. This is local to the device and we use PocketSphinx.
Speech to text - that detects what you say to determine Intents - we currently use a cloud service for this.
Intent matching - this is done locally using our own open source software - Adapt and Padatious
Skills - Intents then match to Skills. Some Skills require internet connectivity.
Text to Speech - We use our own software called Mimic for this, it’s local to the device.
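As a way of reading the list above, here is a minimal sketch that writes the layers down as data and asks which ones leave the device. It only encodes what the quoted (four-year-old) response says; the `Layer` class and the "local" / "cloud" / "mixed" labels are my own shorthand, not anything from Mycroft's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    where: str  # "local", "cloud" or "mixed" (shorthand for this sketch only)
    note: str


LAYERS = [
    Layer("wake word", "local", "PocketSphinx"),
    Layer("speech to text", "cloud", "cloud service"),
    Layer("intent matching", "local", "Adapt and Padatious"),
    Layer("skills", "mixed", "some Skills require internet connectivity"),
    Layer("text to speech", "local", "Mimic"),
]


def cloud_layers(layers):
    """Names of the layers that always need a remote service."""
    return [layer.name for layer in layers if layer.where == "cloud"]


def fully_offline(layers):
    """True only if every layer is marked as local."""
    return all(layer.where == "local" for layer in layers)


print(cloud_layers(LAYERS))
print(fully_offline(LAYERS))
```

By that description, speech to text is the only layer that always needs the network, which is why it is the part being asked about.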
> In order to provide an additional layer of privacy for our users, we proxy all STT requests through Mycroft's servers. This prevents Google's service from profiling Mycroft users or connecting voice recordings to their identities. Only the voice recording is sent to Google, no other identifying information is included in the request. Therefore Google's STT service does not know if an individual person is making thousands of requests, or if thousands of people are making a small number of requests each.
Well, unless Google does voiceprint analysis. But Google wouldn't do that, would they? /s
Beyond that, if I'm reading right, local STT will still require a separate STT server. It won't run on the Mark II itself, right?
The blogpost linked in this submission says the following:
> Mimic 3: Mycroft’s newer, better, privacy-focused neural text-to-speech (TTS) engine. In human terms, that means it can run completely offline and sounds great. To top it all off, it’s open source.
If "skills" are what I think they are (something like external commands, for example "Play X on Spotify"), then my understanding would be that everything but those runs offline and local-only.
But if things like `speech to text` require an internet connection and send the data to some cloud service, then the entire value proposition of this product falls apart.
I hope that's really not the case, as that would be outright lying and false advertisement.
The Raspberry Pi 3 that is used in older products doesn't have enough power to be all offline. Maybe you could set up a server at home (but they won't help you!), but you cannot do it on the hardware they have. The next gen Mycroft 2 (should ship this fall - first announced many years ago) will have a Pi 4 which might have enough power to run offline, this isn't clear yet.
> might have enough power to run offline, this isn't clear yet
It's very unclear and misleading to put "it can run completely offline" if you're not 100% sure it can actually run "completely offline", hardware be damned.
If they're marketing it as "fully offline", they ought to be doing the speech to text bit locally now. Worked on part of a platform which could use Rasa for this a couple of years ago, but running on something a bit more powerful than a Raspberry Pi!
I've been dying to replace my Echos with an open source smart speaker, but half of them use AWS or Azure for speech to text and speech synthesis, so really all you are in control of is the software that runs on the device itself. So this is a cool step in the right direction.
The Rhasspy [0] author recently got hired by Mycroft to work on satellites and fully local operation. Rhasspy requires a lot of manual work, but replacing Alexa is already possible. I’m somewhat stuck with the current hardware availability issues, but I have a Pi 3 satellite that does wakeword detection (this is supposed to be handled by a Pi Zero 2 W in the future) and sends the voice to the MQTT server running on a Pi 4. The data gets picked up by the Rhasspy instance also running there, which does STT and intent recognition, sends the intent to Home Assistant and then does TTS back to the satellite.
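For anyone trying to picture that flow, here is a toy sketch of the message hops. It is not Rhasspy code and not a real MQTT client: the `Bus` class is an in-memory stand-in for the broker, the topic names are invented for illustration, and the "STT" step just reads a fake transcript out of the payload.

```python
from collections import defaultdict


class Bus:
    """Tiny in-memory stand-in for an MQTT broker (hypothetical, no network)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # topics in the order they were published

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, payload):
        self.log.append(topic)
        for handler in list(self.handlers[topic]):
            handler(payload)


bus = Bus()
spoken = []  # what the satellite would play back


# Base station (the Pi 4 in the comment): STT -> intent -> reply via TTS.
def speech_to_text(audio):
    bus.publish("text/recognized", audio["fake_transcript"])


def match_intent(text):
    intent = "LightsOff" if "lights off" in text else "Unknown"
    bus.publish("intent/matched", intent)


def handle_intent(intent):
    # A real setup would forward the intent to the home automation system here.
    bus.publish("tts/say", f"Handled {intent}")


bus.subscribe("audio/captured", speech_to_text)
bus.subscribe("text/recognized", match_intent)
bus.subscribe("intent/matched", handle_intent)
bus.subscribe("tts/say", spoken.append)  # satellite speaker

# Satellite (the Pi 3): wake word fired locally, audio is shipped to the base.
bus.publish("audio/captured", {"fake_transcript": "turn the lights off"})
print(bus.log)
print(spoken)
```

The point is only that the satellite needs to do very little (wake word, capture, playback) while everything else sits behind the broker.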
My main software issue is currently how to replicate the music functionality. Playing music at the satellite that requested it, lowering the volume when it recognizes the wakeword. Preselection of "commands" for band and genre names should be easily scriptable afterwards.
In a quiet room, I have no issues with wakeword detection using a PlayStation Eye camera (I wanted the Seeed USB microphone array, but between discovering it and starting to buy hardware the supply chain bit once again).
Playing music from a Plex server is a major use case for me, and I have given up on Rhasspy because I couldn’t get all the pieces to work together (I have the mic array HAT and a Synology I can run recognition on). Do you have a write-up of your setup?
> Playing music from a Plex server is a major use case for me, and I have given up on Rhasspy because I couldn’t get all the pieces to work together (I have the mic array HAT and a Synology I can run recognition on). Do you have a write-up of your setup?
I have not yet managed it / worked enough on it (the lack of HW makes everything theoretical, which kills my motivation). The way I understand it, there’ll either be a casting server on the satellite, or a PulseAudio/PipeWire server reachable via the network. But I have next to no experience with consumer Linux, so the configuration of those parts is… hard.
But there are many tutorials for playing multi-room audio (with icecast or something), I just assumed it would be easier without multi-room as I don’t need it, but it turns out it’s not ;)
Yeah we aren't using the seeed array in the final Mark II. But we have used the same XMOS XVF-3510 to perform acoustic echo cancellation. That means, even with music blasting out of the speakers, you can still wake the device from across the room.
In a simple fashion you can think of it as subtracting the audio being output from the audio coming in from the microphone.
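Taking that "subtraction" picture literally, a minimal sketch looks like the following. This is the idealised version only: it assumes the playback reaches the mic with no delay and a known, fixed gain, whereas real echo cancellers have to estimate the delay and room response adaptively. The function name and sample values are made up for the example.

```python
def cancel_echo(mic, playback, echo_gain=1.0):
    """Subtract the (scaled) playback signal from the microphone signal."""
    if len(mic) != len(playback):
        raise ValueError("signals must be the same length")
    return [m - echo_gain * p for m, p in zip(mic, playback)]


# What the user says, and what the speaker is playing at the same time.
voice = [0.0, 0.5, -0.5, 0.25]
music = [1.0, -1.0, 0.5, 0.0]

# Assumption: the mic hears the voice plus the music at half volume.
echo_gain = 0.5
mic = [v + echo_gain * m for v, m in zip(voice, music)]

recovered = cancel_echo(mic, music, echo_gain)
print(recovered)
```

With the gain known exactly, what is left after the subtraction is just the voice, which is what the wake word detector then gets to listen to.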
They're also just really not great. I tested out Mycroft a couple years ago and found that the success rate for getting it to understand its wake word and listen for commands was under 10%. Maybe if you buy their prepackaged product, it works better, but that's not something I want to do. I just want to run it on a Pi 4 (which they claim works) with a mic array.
Yeah I think there are two sides to this coin (and just for clarity - all of this relates to Picroft, not Mimic 3 the TTS engine that just launched). The audio hardware makes a huge difference to audio input which is why we've developed the custom SJ201 board that's in the Mark II. But even on DIY units we have been making big improvements on the wake word detection by better balancing our training data sets. Once the Mark II is shipping there are additional wake word improvements on the roadmap. Eventually the system will optimize for the users of each device. So the wake word model on your device wouldn't be exactly the same as the model on mine. We've also ported the Wake Word model to Tensorflow Lite which means it uses a small fraction of the system resources that it used to :D
We're also about to make some bigger changes to mycroft-core that will help to support a broader range of hardware in a more consistent way. So whilst you could try it again today and I can guarantee it's better than the last time you used it, if you want a DIY system instead of a Mark II - I'd suggest adding a reminder to check it again in a couple of months once these bigger changes land.
I find Siri really useful - for a very limited set of tasks where recognition is about 100% and being hands free has a benefit. Typically this is starting exercise workouts and countdown timers. For more general tasks the recognition is still good (for me, seems to vary by voice) but even at 90% there will be one mistake in most requests.
They’ve done a lot of work in the last year on the software side. Might be worth revisiting. They’re tentatively on track to (finally!) ship in September of this year.
This looks awesome and I love seeing FOSS, privacy-first equivalents of Big Tech. The video was really, really cute – and you could hear the improvements of the tech as time went on. I must confess that my initial thought about watching it was that it was something to help blind or partially sighted people, however, as a document-to-words reader. Only later did I twig that they are essentially Alexa-speaker-alikes.
Therefore, I'll ask the question I always think of when I see smart speakers: what exactly is their use case? I've never used voice assistants. I've never had a PA. I have a variety of good, dumb speakers. If I am cooking, I have the radio on in the background and a smartphone in my pocket if I desperately wish to change something. I've always thought that the voice recognition was cool, but I've just never quite recognised a position where I would use it!
For the record, I live in a house with at least two raspberry pis on all the time (one as a DTV tuner) so I am far from a luddite in that regard. I just genuinely don't really know what use-case a smart speaker solves. Please enlighten me!
I use mine for at least half a dozen timers on an average day. The more you use it, the more often you get the impulse to just set another timer, be it for "remember to stop playing that game and be productive" or remembering to leave the house on time, because it's so simple. I also use it to turn off/on the tv and lights. Not much of a point if it's a single one, but helpful when it's a number of lights, e.g. when we go to bed or leave the house, and you can address them all with a group ("alexa turn off downstairs"/"alexa turn off everything").
And to play music. Just asking it to play a track and then asking it to play similar music (very hit and miss), for example, and then asking what's playing, all without having to reach for my phone, finding an app etc.
My experience was that I bought my first one mostly because I wanted something to play music on in the living room anyway and didn't really care about getting a full on stereo setup as I'm not very picky about the sound quality, but I was curious. I never use voice assistants on my phone. But I found myself using it more and more as I got used to being able to turn things on/off without reaching for anything or when my hands were otherwise full.
It's not something I'd have the slightest difficulty of living without, but it feels like it's decreasing friction for a lot of small things.
I now have four - one by my desk, one in the living room, one in my son's bedroom and one in mine.
Saying “hey dingus, add $item to the shopping list” is a killer feature for me. It’s so much easier than adding something manually on your phone (especially if you have your hands full cooking). Reminders and timers are something I use too. It’s also pretty good for playing music when you don’t have something particular in mind: “play some classical music” for example.
It’s definitely something I could live without, but even Apple’s speaker is pretty cheap. Especially so if it’s your main speaker (if you care about audio quality it might be a problem, but I don’t so it’s not).
What an awesome project! And AGPL is really perfect for this kind of work.
What's the BOM look like? I'd love to understand more about the design. The software's open source, right? After a brief skim I didn't see a repo link. Does anyone know where the source is? Do they use an AI accelerator DSP/TPU or just plain-old-software-on-a-CPU?
Regarding the BOM, I assume you mean the Mark II? That you can find here:
https://github.com/MycroftAI/hardware-mycroft-mark-II/tree/m...
We actually ended up designing our own RPi daughterboard called the SJ201. It's mostly an audio front end with an XMOS XVF-3510 and dual mics, but also includes a 23W amp, some LEDs for feedback, buttons, a hardware mic switch, GPIO breakout and power management (amongst other things).
No Android release yet unfortunately, but you can drop your email in the bottom of this page and select the platforms you are interested in to be notified about specific future releases:
https://mycroft.ai/mimic-3/
That's a wonderfully effective marketing video. It's funny, gives me a background on the technology itself, and effectively highlights the new features.
How do you actually use it on a project? I see where you can order a dedicated piece of hardware, but I'd love to download this and replace pyttsx3 on my homemade IoT linux server.
But all I see is documentation, discussion of what they used to build it, and.... where's the actual software?!?
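Not an answer to where the source lives, but for the "replace pyttsx3" part, here is a hedged sketch of what calling a locally running Mimic 3 web server from Python might look like. Port 59125 is the one used by the Docker command elsewhere in this thread; the `/api/tts` path and the "POST plain text, get WAV bytes back" behaviour are my assumptions and should be checked against the Mimic 3 documentation. The function names are made up.

```python
import urllib.request


def build_tts_request(text, host="localhost", port=59125):
    """Build a POST request carrying the text to speak.

    Assumption: the server exposes an /api/tts endpoint that accepts the
    text as the request body. Verify against the Mimic 3 docs.
    """
    url = f"http://{host}:{port}/api/tts"
    return urllib.request.Request(url, data=text.encode("utf-8"), method="POST")


def synthesize(text, out_path="speech.wav", **kwargs):
    """Send the request and save whatever audio bytes come back."""
    with urllib.request.urlopen(build_tts_request(text, **kwargs)) as response:
        audio = response.read()
    with open(out_path, "wb") as wav_file:
        wav_file.write(audio)
    return out_path


# Example (needs the server to be running, so it is left commented out):
# synthesize("Hello from my IoT server.")
```

If that assumption holds, swapping out pyttsx3 would mostly mean replacing its `say()` call with something like `synthesize()` plus whatever you already use to play a WAV file.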
Any chance you'll add just the SJ201 board on its own to your store? I'd love to experiment with my own case designs, but already have too many PCBA projects on my TODO bench, and am totally okay with paying a premium for a pre-assembled RPi daughter board.
We do get this request a bit but not yet at the scale where it is economically viable for us to do so.
I would however point out that the Mark II is completely hackable. So whilst it's not an SJ201 on its own, you can absolutely pull the whole thing apart, use it in other enclosures and even put it all back together again.
Another one of the reasons we made it is because having a single daughterboard greatly simplified production and made the Mark II more robust overall. There's no longer the possibility of a loose wire to the power supply or amp after it gets kicked around in the back of a delivery van. They're all in one and connected via the 40 pin GPIO header, but absolutely removable from the Mark II unit itself.
How well does Mycroft integrate with home assistant?
I've previously used rhasspy (by the author of this feature apparently, he got a job at mycroft and stopped development of rhasspy) with some custom scripts as glue to home assistant as my local-only smart home solution.
Can I do this with mycroft? Does it maybe come with a Home Assistant integration? My main worry is that mycroft seems to be doing a bit too much for my taste. Is there functionality overlap/conflict between HA and Mycroft?
I just tried the docker command from the readme to test it and unfortunately it's not working :(
docker run -it -p 59125:59125 -v "${HOME}/.local/share/mycroft/mimic3:/home/mimic3/.local/share/mycroft/mimic3" 'mycroftai/mimic3'
UI loads but when I click on the speak button I get this error: PermissionError: [Errno 13] Permission denied: '/home/mimic3/.local/share/mycroft/mimic3/voices'
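One possible cause, which is a guess rather than something confirmed in this thread: if the host directory in the `-v` mount does not exist yet, Docker creates it owned by root, and the non-root user inside the container then cannot create the `voices` folder. A sketch of creating the directory yourself, with permissive access, before running the container (the helper name is made up):

```python
import os
from pathlib import Path


def prepare_voice_dir(base=None):
    """Create the host-side Mimic 3 data directory and make it writable.

    Defaults to the host path used in the docker command above. The 0o777
    mode is deliberately permissive so the container user can write to it;
    tighten it if that is a concern on your machine.
    """
    if base is None:
        base = Path.home() / ".local" / "share" / "mycroft" / "mimic3"
    base = Path(base)
    base.mkdir(parents=True, exist_ok=True)
    os.chmod(base, 0o777)
    return base


# Usage: call prepare_voice_dir() once, then re-run the docker command.
```

If a failed run has already created the directory as root, the `chmod` here will itself fail with a permission error, and you would need to fix the ownership with elevated privileges (or delete the directory) first.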
On Android, years ago I could install IVONA (https://nextup.com/ivona/) for free. It worked totally offline and was better than the embedded Google TTS and the Mycroft voices from the YouTube presentation. Now it looks like the app is not in the app store any more. It was purchased by Amazon and disappeared from the Google store.
Oh, I've been waiting to try this out! Love to see the multitude of ways to interact with it. The webserver seems really nice. Gonna try it out with Home Intent to see how it works with a full voice assistant on a pi 4.
Very cool, thank you! I am using the mimic3 cli and using --voices to get a list of voices, but after that I can't figure out how to find the list of speakers for that voice.
edit: Ah I see they are in speakers.txt in the git repo for the voice.
How would the licensing of the audio output work? If I wanted to use this to record some TTS style files for a game - would there need to be attribution? It wouldn't be shipping or using the actual code, just the final audio.
The audio for all of the voices produced from public data sets is licensed as Creative Commons Attribution Share Alike 4.0 International. This is what a lot of the original data is licensed as.
I have an idea for an OpenAI GPT3's integrated chat assistant. Would this product be a good fit to do it? I mean, hardware wise and then put custom software on top without too much tinkering?
When people think about using GPT-3 for real products, I always wonder about the running costs. These models are so big that even inference can be fairly costly to run. Any thoughts on that?
Davinci is expensive, but the other models can be quite affordable even with finetuning. I don't see problems for a personal assistant based on them (edit: from a personal usage point-of-view. If you are making a product for others to use, that's different :)
Since when are they having concrete plans and shipping windows to sell Mark II?
A few years ago I signed up to be informed when Mark II finally sells. I already had given up.
I ordered the Mark II via the original Kickstarter campaign, somewhere back in the mists of time.
To be fair, they've communicated fairly well recently. There were periods when things went too quiet, but in the last year or so we've been getting monthly updates that have real information buried in the usual hype and boosterism.
I think the shipping schedule they're working towards now (September) is unlikely to actually hit the promised date: they've used up most if not all the slack already, but it seems like obstacles are falling regularly at least. The latest update (this week) mentioned that they need to redo their FCC certification tests, for instance, but they've found and fixed the problem already.
I'm somewhat confident of a late-2022/early-2023 delivery (vs the original campaign's promised Dec 2018).
It's all a question of what data we have at the moment.
If you know of any good Portuguese voice datasets we'd love to train one. At some point we'll start getting new professionally recorded data for each language too.
I really like the fact that it can be used offline. Sadly it seems to be AGPL licensed, which makes it pretty much untouchable by anything adjacent to any sort of business which employs lawyers.
There's a big difference between pure AGPL and bait-and-switch AGPL. IMO using AGPL to steer people towards paid proprietary licenses is not good for open source.
It only steers you to proprietary licences when you want to use other people's code without sharing with others the benefits that you got by using open source software.
Use others to gain an advantage, then pull up the ladder after you.
I'm not surprised that large corporations want to have their cake, eat it and not pay for it, but it's not a compelling argument for people producing open source software.
The linked post seems to indicate there are commercial licensing options. Of course, businesses would prefer to use this for free, but I'm sure they can pay for a license that doesn't bind them to the terms of the AGPL.
Big companies want to pay and have “support.” Lawyers don’t like licenses, they want a proper contract.
But if it’s AGPL, I’ll move on. They’re too idealistic to deal with megacorp. They don’t sell through intermediaries, can’t/won’t be insured enough, etc.
Inevitably the small company sees $$$ and goes nuts trying to sell even when warned not to. Then the risk management and contract stuff kicks in, drags out, and kills it.
The small company blew a bunch of cash on nothing. Or they base sales projections on it. I’ve seen a couple go under.
Those are people’s livelihoods and I can’t do that to them.
That’s not limited to AGPL, but the license is a signal of that type of business. And they never do the smart thing with megacorp and sell through an established intermediary.
I mean, that's fine, though, right? If they don't have the headcount to sell support contracts, and that's not the kind of company they want to build, that's their choice. I sympathize; running a company like that doesn't sound particularly pleasant. You quickly lose focus on building the actual technology and spend too much time on enterprise sales. And if they don't want to give away their software under terms that let people put it behind a web service without releasing changes, that's also their choice.
Most companies are fine with BSD/MIT software with absolutely no support. Your argument holds absolutely no water compared to what they actually want which is to take other people's work and close it up for profit.
If they want a license to close it up, they can pay for it.
But why? It's released as a docker container, say I deploy this container, as-is, with Cloud Run or Lambda or whatever, since I'm not "linking" my code against it, I'm just sending it SSML requests as one of several TTS backends, no virality provision applies, I just have to provide users with a download link to the same upstream code I am using. I imagine a lawyer working at a technology company can understand the use case I just described and pattern-match it against the AGPL requirements, what am I missing?
Sadly this is the same sentiment felt by FOSS people around proprietary software and even stuff like MIT-licensed things. MIT is cool and all, but it can be used to build proprietary software, which is ideologically troubling.
Guess both sides have strong feelings about why something is "bad".
I personally only contribute to AGPL software, and try to use AGPL/SSPL and similarly licensed software.
Guess the "corporate mentality" and "FOSS mentality" don't meet unless it's a corp built around FOSS products and ideology.
Look at Valve and their Deck compared to the Nintendo Switch.
Nice, but the choice to implement a resource intensive AI project (CPU/RAM) on a Pi in Python is baffling. I could not think of a worse implementation language for a resource constrained environment ¯\_(ツ)_/¯.
Python is only really the glue here. The models are trained in PyTorch and exported to Microsoft's ONNX Runtime (C++). So the bulk of the inference CPU cycles are outside Python.