Even if true, I'm not sure why you wrote such a snarky comment. This is obviously a side project of the author and just serves as inspiration; it's not like he's asking for money for it or anything.
They're not commenting on the quality of Jarvis (an excellent project), but on the headline. One of the key features of the Echo, and one which is essential for something like this, is a good microphone. Most off-the-shelf ones are useless for this, so the Echo provides fantastic value by delivering not only the voice recognition/action platform but also the hardware to make it work well.
I'd challenge someone to make a practical device that leverages Jarvis. Most microphones just aren't up to it.
I will say, I find the trend of HN posts that call something a "clone" of a commercial product while missing significant features kind of frustrating. It tends to devalue the work that's actually involved in creating successful products - and thus the work of software engineering.
It does have wake word detection, and even has that code right there in the article. It matches on jarvis/nervous/service/travis as the first word in the phrase, because those are the words that get returned by the speech recognition when he says 'jarvis'.
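Something in this spirit is only a few lines; here's a sketch of that style of matching (the helper name is mine, not the author's exact code):

    // Treat common misrecognitions of "jarvis" as the wake word,
    // since those are what the recognizer actually returns
    const WAKE_WORDS = new Set(["jarvis", "nervous", "service", "travis"]);

    function startsWithWakeWord(transcript) {
      // Grab the first run of letters, ignoring punctuation and case
      const first = transcript.toLowerCase().match(/[a-z]+/);
      return first !== null && WAKE_WORDS.has(first[0]);
    }

    // startsWithWakeWord("Travis, what's the weather?") -> true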
Some of that stuff isn't too hard if you can narrow down the domain of words you need to recognize. For example, say you want a hot word of "computer", like in Star Trek: you can literally filter the output of the recognizer (e.g. pocketsphinx) with grep and sed, and it works not too badly. For the natural language part, you can get pretty far with a simple parser like the old Infocom games used, especially if your domain is limited. I'm making an open-source multiplayer networked starship bridge simulator, kind of like Star Trek, using pocketsphinx for speech recognition, and it's working OK (not perfect, but OK). Here is a demo: https://www.youtube.com/watch?v=tfcme7maygw
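The Node equivalent of that grep/sed trick is just a few lines. A rough sketch, assuming pocketsphinx_continuous is installed and prints one hypothesis per line (this varies by version and log settings); handleCommand is a made-up stand-in:

    const { spawn } = require("child_process");
    const readline = require("readline");

    // Run pocketsphinx against the live microphone
    const ps = spawn("pocketsphinx_continuous", ["-inmic", "yes"]);
    const rl = readline.createInterface({ input: ps.stdout });

    rl.on("line", (line) => {
      const text = line.trim().toLowerCase();
      // The "grep": only react to hypotheses starting with the hot word
      if (text.startsWith("computer ")) {
        // The "sed": strip the hot word before handing the command off
        handleCommand(text.slice("computer ".length));
      }
    });

    function handleCommand(cmd) {
      console.log("command:", cmd); // e.g. "raise shields"
    }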
You have to use a decent acoustic model - not the one in the demo. If you do, I think it works 'pretty well' as a proof of concept. That said, I'm not recommending Sphinx as a recognition framework - it's way behind the times in 2016 - but this is the only 'in the wild' demo of this I've seen on the web, so I felt it was worth mentioning.
I thought that "beamforming" only applied to actively emitting signals, so that the waveforms would cancel/reinforce each other to get the desired "direction".
I have no idea how that works for microphones. Google is not very helpful; I get lots of hits for products.
A much more intuitive name for beamforming is spatial filtering [1]. It just means using multiple receivers along with knowledge about their locations to filter out noise and other signals you don't want. The term also applies to emitters like phased radar arrays or MIMO cell towers, which can use spatial filtering for beamforming, but it's a general technique.
It's a reciprocal process, meaning that it works for transmit as well as receive. In the receive direction, the antennas/microphones sample the incoming waves at various spatial points, and then the cancelling/reinforcing occurs when the received signals are phase shifted or delayed and summed.
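A toy delay-and-sum sketch of the receive direction, assuming you already have per-channel sample buffers and whole-sample delays computed from the array geometry (all names here are illustrative):

    // Toy delay-and-sum beamformer (receive side).
    // channels: one Float32Array of samples per microphone
    // delays:   per-channel delay in whole samples, derived from mic
    //           spacing, look direction, speed of sound and sample rate
    function delayAndSum(channels, delays) {
      const n = channels[0].length;
      const out = new Float32Array(n);
      for (let i = 0; i < n; i++) {
        let sum = 0;
        for (let c = 0; c < channels.length; c++) {
          const j = i - delays[c]; // shift so the look direction lines up
          if (j >= 0 && j < n) sum += channels[c][j];
        }
        // Aligned signals reinforce; off-axis noise tends to cancel
        out[i] = sum / channels.length;
      }
      return out;
    }

Real implementations use fractional delays and adaptive weighting, but this is the cancel/reinforce idea in its simplest form.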
Of course this was more of a fun example of something you can do with the PageNodes platform than an actual Echo replacement. And a way of getting started connecting services.
Definitely move on to some dedicated hardware if you're serious about this sort of thing.
Development kits are expensive due to low sales volume, no cost optimization, and buyers' price insensitivity. There is a huge amount of value in having a known good implementation on hand when designing hardware or firmware. Additionally, having the development kit means your firmware team can start developing before your hardware arrives.
If you have a good relationship with them, sales reps will often give or lend development kits.
Even that XMos dev kit won't give you all you need for the beam-forming part of the microphone array:
"Customer adds own DSP to create differentiated product"
The DSP is where you'd implement the beam-forming algorithm to get one clear audio stream rather than 8...
I hope a Chinese manufacturer makes one soon. The teardown of the Echo put the parts at a bit more than that, but you could do it cheaper: a multichannel ADC and half a dozen MEMS mics. I guess the drivers would be the time-consuming bit.
The board actually doesn't appear to have an ADC; the microphones output pulse-density-encoded digital data which can be directly received and interpreted by the processor.
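Decoding PDM is conceptually just measuring pulse density. A crude sketch, assuming the bitstream is packed MSB-first into bytes (real designs use proper CIC/FIR decimation filters, not a plain average):

    // Crude PDM-to-PCM: the density of 1-bits over a window is the amplitude.
    // pdmBytes: Uint8Array of the raw 1-bit stream, MSB first
    // decimation: PDM bits per output PCM sample (e.g. 64)
    function pdmToPcm(pdmBytes, decimation = 64) {
      const totalBits = pdmBytes.length * 8;
      const out = new Float32Array(Math.floor(totalBits / decimation));
      for (let s = 0; s < out.length; s++) {
        let ones = 0;
        for (let b = 0; b < decimation; b++) {
          const bit = s * decimation + b;
          if (pdmBytes[bit >> 3] & (0x80 >> (bit & 7))) ones++;
        }
        out[s] = (2 * ones) / decimation - 1; // map density [0,1] to [-1,1]
      }
      return out;
    }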
Every comment that starts with "curious" seems to be spam on HN. Is this always the same guy, or a bot? Almost every HN topic contains one of these "curious" comments...
It's a common English idiom, and even though I've been here over twice as long as your account date, I can't say I've noticed an infestation of "curious" comments... though given that curiosity is a hacker virtue, perhaps it's not surprising there's more of it here. I'm still genuinely curious how GGP would have dev boards priced instead -- the market is pretty much limited to students (the ones at good schools have the schools pay for the boards, or the schools get discounts) and professionals (who again expense them through their company), so what's the incentive to lower prices?
I've got one knocking about in a drawer. Let me know if you get beamforming working on it. I think there are multi-channel audio drivers for Linux/RPi for the PS3 Eye.
I'm not sure why posters on HN are so eager to shit on other people's work. Maybe something to do with arrogance or insecurity (leading to a need to bash others)?
Or maybe calling it an "Amazon Echo clone," which insinuates it covers almost all the bases the Echo does, was a bit premature? It could have just been called a "voice-controlled PDA".
I don't think it's "shitting on" it (although text is notorious for making things seem far more serious or critical than they actually are), just clarifying the remaining differences.
If you are unaware, OP runs the best coworking meetup(s) in Phoenix. If you're a Phoenix dev and not coming to coffee and code, then you're missing out!
Louis/Alyson (since Jarvis was her project, I think): welcome to the #1 on HN club ;-)
I have to agree, it's nice that Phoenix has a pretty active Node community... though it feels like everyone is too busy working to hit a lot of the meetings. I'd love to catch a coffee and code meetup, but I work too far away and am in the office for morning standups, etc., around that time.
Just the same, I always get ideas from the Phoenix/AZ Node user group meetings... it's also nice to see when someone demos an idea you had, such as routing redux at the server via websockets.
What is the current state of offline, non-cloud-connected speech-to-text?
My phone has a voice processing chip, and it recognizes my speech pretty well, but I still can't figure out if it's completely disconnected from the cloud (despite intentionally not agreeing to the privacy policy).
Kaldi is pretty good. Not sure if you can run it on a phone but definitely on a single desktop, purely local processing.
Results depend on the trained model; I think the TED-LIUM one is alright. And of course on the quality of the input signal - far-field/noisy audio is much more prone to errors, which is where the mic array on the Echo helps a lot.
I'm curious, does anybody know if there's a simple way to wire this up to Home Assistant (https://home-assistant.io)? My first thought was MQTT, but for some reason PageNodes doesn't have any MQTT output support, which is kind of odd for something claiming to be an IoT connectivity platform.
HA does have a REST API, but with the way PageNodes works you'd have to hardcode the HA password right into the PN workflow. Have you considered adding the equivalent of environment variables, which could be set in a PN account and used as placeholders in workflows?
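For reference, the hardcoding problem looks like this with HA's (legacy) API-password auth - the host, password, and entity here are placeholders:

    // Minimal Home Assistant service call with the legacy API password.
    // The password sits right in the flow - exactly the problem above.
    fetch("https://ha.example.com:8123/api/services/light/turn_on", {
      method: "POST",
      headers: {
        "x-ha-access": "MY_HARDCODED_PASSWORD",
        "Content-Type": "application/json"
      },
      body: JSON.stringify({ entity_id: "light.living_room" })
    });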
That's a good question. Our storage is local IndexedDB. And the site is HTTPS, so no one should see your flow if you don't share it.
That said there's nothing stopping you from reaching out to another secure service or plugin before making requests.
OK, so it's less "prototype in the browser, then offload to a server once it works", and more for local "app"-type things? Interesting idea; it has some limitations but also opens up tons of interactions that are harder for a server-based solution (webcam, ...).
Bingo! It's always evolving, too. We do use a lot of experimental flags from the browser, which helps us work with up-and-coming features and makes learning about new APIs very easy.
Octoblu confuses me. I can't find anything about their pricing scheme (if any), while all their professional partners listed make me think there definitely has to be one hidden in there somewhere.
    // Speak "Happy birthday" (生日快乐) with the browser's Web Speech API
    const speech = new SpeechSynthesisUtterance("生日快乐");
    // A hardcoded voice index is fragile; look up a Chinese voice by language
    speech.voice = speechSynthesis.getVoices().find(v => v.lang.startsWith("zh"));
    speechSynthesis.speak(speech);
This article is a bit out of my comfort zone (I'm not a web app developer); however, it does link to a GitHub repository by Amazon, which I was unaware of, that shows how to configure a Raspberry Pi as an Echo clone in quite a lot of detail. This is something I can do, and I've bookmarked it for a rainy day. So for that alone, thanks for the submission!
Here's the previous HN discussion on that [0]. Keep in mind that the DIY Echo project doesn't support "always-listening" with a wake word; instead you have to press a button to activate the voice control. Not really that inconvenient, though, and some people do prefer a button to something that's always listening.
Unfortunately, I think my first chatbot, which I wrote about a year ago and named Jarvis (based on Hubot by GitHub), will be cloned by a million of these projects...
I admit, it's not the most creative name; I just thought it would be cool a year ago to feel like Iron Man as I asked Jarvis to deploy my application to production...
* Beamforming microphone array (this is a real clone: http://www.xmos.com/products/microphones )
* Wake-word / hot-word detection ("Ok Google", "Alexa", etc.)
* Intent recognition / NLU
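On that last point, even a toy matcher shows what intent recognition/NLU means here - everything in this sketch is made up and far simpler than what the Echo does:

    // Toy intent recognizer: map a transcript to an intent plus slot values
    const INTENTS = [
      { name: "set_timer", pattern: /set a timer for (\d+) (seconds|minutes)/i },
      { name: "play_music", pattern: /play (.+)/i }
    ];

    function recognizeIntent(transcript) {
      for (const { name, pattern } of INTENTS) {
        const m = transcript.match(pattern);
        if (m) return { intent: name, slots: m.slice(1) };
      }
      return { intent: "unknown", slots: [] };
    }

    // recognizeIntent("set a timer for 5 minutes")
    // -> { intent: "set_timer", slots: ["5", "minutes"] }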