Jarvis: an Amazon Echo clone in your browser (iceddev.com)
251 points by monteslu on April 21, 2016 | 75 comments



Yeah it's a clone of the trivial parts of the Echo, but not the difficult parts that are necessary to make it great, specifically:

* Beamforming microphone array (this is a real clone: http://www.xmos.com/products/microphones )

* Wake-word / hot-word detection ("Ok Google", "Alexa", etc.)

* Intent recognition / NLU


Even if true, not sure why you wrote such a snarky comment. This is obviously a side project of the author and just serves as inspiration; it's not like he is asking for money for it or anything.


I think it's probably because the headline called it a clone, which sort of implies 1:1 feature parity. It's neat, but it's not a clone.


There are plenty of "iPhone clones" that don't come close to feature parity, so it seems a bit picky.


Pedantic stupidity.


No. Wow. Clarification is not pedantry.


Did not sound snarky at all to me. Very factual, and had a very minimal, neutral tone to it.

Also informative to people who don't know echo at all.


They're not commenting on the quality of Jarvis (an excellent project), but on the headline. One of the key features of the Echo, and one which is essential for something like this, is a good microphone. Most off-the-shelf ones are useless for this, so the Echo provides fantastic value by delivering not only the voice-recognition/action platform but also the hardware to make it work well.

I'd challenge someone to make a practical device that leverages Jarvis. Most microphones just aren't up to it.


I will say, I find the trend of HN posts that say "clone" of some commercial product but are missing significant features to be kind of frustrating. It tends to devalue the work that's actually involved in creating successful projects - and thus the work of software engineering.


Reminds me of the Coding Horror post "Code: It's Trivial": http://blog.codinghorror.com/code-its-trivial/


It is PR to sell one of their products.



It does have wake word detection, and even has that code right there in the article. It matches on jarvis/nervous/service/travis as the first word in the phrase, because those are the words that get returned by the speech recognition when he says 'jarvis'.
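
For the curious, a minimal sketch of that kind of check, assuming Chrome's prefixed Web Speech API (handleCommand here is a hypothetical stand-in for whatever acts on the rest of the phrase):

  // Treat any of the recognizer's usual mis-hearings of "jarvis"
  // as the wake word, then pass the rest of the phrase along.
  var WAKE_WORDS = ['jarvis', 'nervous', 'service', 'travis'];
  var recognition = new webkitSpeechRecognition();
  recognition.continuous = true;
  recognition.onresult = function (event) {
    var result = event.results[event.results.length - 1][0];
    var words = result.transcript.trim().toLowerCase().split(/\s+/);
    if (WAKE_WORDS.indexOf(words[0]) !== -1) {
      handleCommand(words.slice(1).join(' ')); // hypothetical handler
    }
  };
  recognition.start();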


That ambiguity in hotword detection makes me nervous.


Some of that stuff isn't too hard if you can narrow down the domain of words you need to recognize. For example, say you wanted a hot word of "computer", like in Star Trek: you can literally filter the output of the recognizer (e.g. pocketsphinx) with grep and sed, and it works reasonably well. For the natural-language part, you can get pretty far with a simple parser like the old Infocom games used, especially if your domain is limited. I'm making an open-source multiplayer networked starship bridge simulator, kind of like Star Trek, using pocketsphinx for speech recognition, and it's working OK (not perfect, but OK). Here is a demo: https://www.youtube.com/watch?v=tfcme7maygw
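
To make the "simple parser" idea concrete, here's a toy sketch in JavaScript; the grammar and command names are made up, but the verb-object structure is the point:

  // Two-word, Infocom-style command parser over a limited domain.
  var GRAMMAR = {
    raise: ['shields'],
    fire: ['phasers', 'torpedoes'],
    set: ['course']
  };
  function parse(transcript) {
    var words = transcript.toLowerCase().trim().split(/\s+/);
    var verb = words[0], object = words[1];
    if (GRAMMAR[verb] && GRAMMAR[verb].indexOf(object) !== -1) {
      return { verb: verb, object: object };
    }
    return null; // not a recognized command
  }

  parse('fire phasers'); // { verb: 'fire', object: 'phasers' }
  parse('make it so');   // null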


Actually, somebody even cross-compiled pocketsphinx to JavaScript with Emscripten for this purpose:

https://syl22-00.github.io/pocketsphinx.js/live-demo.html

This works pretty well, all in the browser, especially if you drop in some better acoustic models.


Yeah, I wouldn't call that "pretty well" - I said "not a number" and it output "one two one" on the digits example.

Maybe it just wasn't trained well enough to reject non-number inputs, but... yeah, it doesn't exactly change my experience that Sphinx is awful.


You have to use a decent acoustic model - not the one in the demo. If you do, I think it works "pretty well" as a proof of concept. That said, I'm not recommending Sphinx as a recognition framework - it is way behind the times in 2016 - but this is the only "in the wild" demo of this I've seen on the web, so I felt it was worth mentioning.


I thought that "beamforming" only applied to actively emitting signals, so that the waveforms would cancel/reinforce each other, to get the desired "direction". I have no idea how that works for microphones. Google is not very helpful, I get lots of hits for products.


A much more intuitive name for beamforming is spatial filtering [1]. It just means using multiple receivers along with knowledge about their location to filter out noise and other signals you don't want. The term also applies to emitters like phased radar arrays or MIMO cell towers, which can use spatial filtering for beamforming, but it's a general technique.

[1] https://en.m.wikipedia.org/wiki/Beamforming


Thanks, I've always seen a beamforming option in my DD-WRT router but didn't know the exact technical implementation.


Thank you!


It's a reciprocal process, meaning that it works for transmit as well as receive. In the receive direction, the antennas/microphones sample the incoming waves at various spatial points, and then the cancelling/reinforcing occurs when the received signals are phase shifted or delayed and summed.
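
To sketch what "delayed and summed" means in code (JavaScript, with made-up parameters, a uniform linear array assumed, and delays rounded to whole samples; real beamformers interpolate fractional delays):

  // Delay-and-sum beamformer: delay each mic's signal so a wavefront
  // arriving from the steering angle lines up across channels, then average.
  function delayAndSum(channels, micSpacing, angleRad, sampleRate) {
    var SPEED_OF_SOUND = 343; // m/s
    var out = new Float32Array(channels[0].length);
    channels.forEach(function (signal, i) {
      // Extra distance the wave travels to reach mic i
      var delaySec = (i * micSpacing * Math.sin(angleRad)) / SPEED_OF_SOUND;
      var delaySamples = Math.round(delaySec * sampleRate);
      for (var n = 0; n < out.length; n++) {
        var m = n - delaySamples;
        if (m >= 0 && m < signal.length) {
          out[n] += signal[m] / channels.length;
        }
      }
    });
    return out;
  }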


Of course this was more of a fun example of something you can do with the PageNodes platform than an actual Echo replacement. And a way of getting started connecting services.

Definitely move on to some dedicated hardware if you're serious about this sort of thing.



Development kits are expensive due to low sales volume, no cost optimization, and buyers' price insensitivity. There is a huge amount of value in having a known good implementation on hand when designing hardware or firmware. Additionally, having the development kit means your firmware team can start developing before your hardware arrives.

If you have a good relationship with them, sales reps will often give or lend development kits.


Thx :)

My comment was more about the parent's comment of it being "a real clone" when a Dot is 1/8th the price of this.


Even that XMOS dev kit won't give you all you need for the beamforming part of the microphone array: "Customer adds own DSP to create differentiated product." The DSP is where you'd implement the beamforming algorithm to get one clear audio stream rather than eight...


Wow, that microphone array you linked costs US$750! Are there any cheaper alternatives?


Not that I've found yet, but the actual hardware on that board costs something like $10-15, so someone could easily make one.

The high price is because it's a development board. (I think it's silly to price development boards this high, but it is very common.)


I hope a Chinese manufacturer makes one soon. The teardown for the Echo put the parts at a bit more than that, but you could do it cheaper: a multichannel ADC and half a dozen MEMS mics. I guess the drivers would be the time-consuming bit.


The board actually doesn't appear to have an ADC; the microphones output pulse-density-encoded digital data which can be directly received and interpreted by the processor.
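
For the curious, turning that pulse-density stream into PCM amounts to low-pass filtering and decimating. A naive boxcar sketch of the idea (real designs use proper CIC/FIR filter chains):

  // Naive PDM-to-PCM conversion: average (low-pass) each run of 1-bit
  // samples, then keep one output sample per run (decimation).
  function pdmToPcm(bits, decimation) {
    var out = new Float32Array(Math.floor(bits.length / decimation));
    for (var i = 0; i < out.length; i++) {
      var ones = 0;
      for (var j = 0; j < decimation; j++) {
        ones += bits[i * decimation + j];
      }
      out[i] = (ones / decimation) * 2 - 1; // map density [0,1] to [-1,1]
    }
    return out;
  }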


I'm curious, how would you recommend they be priced?


Every comment that starts with "curious" seems to be spam on HN. Is this always the same guy or a bot? Almost every HN topic contains one of these "curious" comments...


It's a common English idiom, and even though I've been here over twice as long as your account date, I can't say I've noticed an infestation of "curious" comments... though given that curiosity is a hacker virtue, perhaps it's not surprising there may be more of it here. I'm still genuinely curious how GGP would have dev boards priced instead -- the market is pretty much limited to students (the ones at good schools have the schools pay for the boards, or the schools get discounts) or professionals (who again expense them through their company), so what's the incentive to lower prices?


XMOS mainly sells development kits to showcase their chips - that's why they're so expensive.


I think the Microsoft Kinect comes with a microphone array.


PS3 Eye too, 4 microphones AFAIK. Those had a big discount on Amazon recently.

EDIT: Still are, 5 bucks.

http://www.amazon.com/PlayStation-Eye-3/dp/B000VTQ3LU


I've got one knocking about in a drawer. Let me know if you get beamforming working on it. I think there are multi-channel audio drivers for Linux/RPi for the PS3 Eye.


Welp, looks like I just bought 3, haha! Thanks!


Yes, the Kinect has an array of 4 mics and can do audio beamforming. Plus there's a terrific SDK for Windows.


I said it in a thread about the Kinect - I think it's only 135 (ish) degrees of reception, and sometimes it can be more finicky than that.

I'd love to get my hands on an open source array.


I'm guessing that XCore product is well out of the price range of hobbyists, which is unfortunate.


I'm not sure why posters on HN are so eager to shit on other people's work. Maybe something to do with arrogance or insecurity (leading to a need to bash on others)?


Or maybe calling it an "Amazon Echo clone," which insinuates it covers almost all the bases the Echo does, was a bit premature? It could have just been called a "voice-controlled PDA."

I don't think it's "shitting on" it (although text is notorious for making things seem far more serious or critical than they actually are), just clarifying the remaining differences.


I didn't mean to shit on it, only the disingenuous title.


Hey, it's monteslu!

If you are unaware, OP runs the best coworking meetup(s) in Phoenix. If you're a Phoenix dev and not coming to coffee and code, then you're missing out!

Louis/Alyson (since Jarvis was her project, I think): welcome to the #1 on HN club ;-)

/me snark


I have to agree, it's nice that Phoenix has a pretty active Node community... though it feels like everyone is too busy working to hit a lot of the meetings. I'd love to catch a coffee and code meetup, but I work too far away and am in the office for morning standups etc. around that time.

Just the same, always get ideas from the Phoenix/AZ node user group meetings... it's also nice to see when someone demos an idea you had... such as routing redux at the server via websockets.


You should try and make it out at least once. I loved it when I used to live there and definitely regret not going more since moving.


What is the current state of offline, non-cloud-connected speech-to-text?

My phone has a voice processing chip, and it recognizes my speech pretty well, but I still can't figure out if it's completely disconnected from the cloud (despite intentionally not agreeing to the privacy policy).

His demo is just a shim for Amazon's API...


Kaldi is pretty good. Not sure if you can run it on a phone but definitely on a single desktop, purely local processing.

Results depend on the trained model; I think the TED-LIUM one is alright. And of course on the quality of the input signal - far-field/noisy audio is much more prone to errors, which is where the mic array on the Echo helps a lot.

Here's a relatively easy way to set it up

https://github.com/alumae/kaldi-gstreamer-server
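
Once it's running, you can talk to it from a browser over its websocket interface. A rough sketch (the port, path, and response shape are what I recall from that project's README; double-check against your version):

  var ws = new WebSocket('ws://localhost:8888/client/ws/speech');
  ws.onmessage = function (event) {
    // Partial and final hypotheses arrive as JSON messages
    var msg = JSON.parse(event.data);
    if (msg.result && msg.result.hypotheses) {
      console.log(msg.result.hypotheses[0].transcript);
    }
  };
  // stream raw 16 kHz, 16-bit mono PCM chunks:
  // ws.send(audioChunk);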


> My phone has a voice processing chip, and it recognizes my speech pretty well, but I still can't figure out if it's completely disconnected from the cloud

Does it work when you are in airplane mode?


It does.

There's an offline language pack installed, though.


I'm curious, anybody know if there's a simple way to wire this up to Home Assistant (https://home-assistant.io)? My first thought was MQTT, but for some reason PageNodes doesn't have any MQTT output support, which is kind of odd for something claiming to be an IoT connectivity platform.


Pure MQTT runs over raw TCP, which browsers don't support without an extension. I'm trying to keep this purely web-based as long as I can :)

Some MQTT servers tunnel messages via websockets, server-sent events, and REST calls, which are supported by PageNodes.
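
For example, with a broker that has a websocket listener enabled, the browser-side MQTT.js client (https://github.com/mqttjs/MQTT.js) works without any extension. A sketch; the broker URL and topics are just placeholders:

  // MQTT.js speaks MQTT over websockets in the browser
  var client = mqtt.connect('ws://broker.example.com:8080');
  client.on('connect', function () {
    client.subscribe('jarvis/commands');
    client.publish('jarvis/status', 'online');
  });
  client.on('message', function (topic, message) {
    console.log(topic + ': ' + message.toString());
  });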


HA does have a REST API, but with the way PageNodes works you'd have to hardcode the HA password right into the PN workflow. Have you considered adding the equivalent of environment variables, which can be set in a PN account and used as placeholders in workflows?


That's a good question. Our storage is local IndexedDB, and the site is served over HTTPS, so no one should see your flow if you don't share it. That said, there's nothing stopping you from reaching out to another secure service or plugin before making requests.


Am I interpreting it right that PageNodes basically aims to be Node-RED, but in the browser?


The goals are similar. PageNodes does its best to leverage newer browser capabilities: WebRTC, WebUSB, ServiceWorker, offline support, etc.


Ok, so less "prototype in the browser, then offload to a server once it works", but more for local "app" type things? Interesting idea, has some limitations but also opens up tons of interactions that are harder for a server-based solution (webcam, ...)


Bingo! It's always evolving too. We do use a lot of experimental browser flags, which helps us work with up-and-coming features and makes learning about new APIs very easy.


You could use the Octoblu Meshblu node in PageNodes to connect Jarvis to an MQTT platform.


Octoblu confuses me. I can't find anything about their pricing scheme (if any), but the list of professional partners makes me think there definitely has to be one hidden in there somewhere.


I don't believe there is any pricing scheme at all as of yet. It's a great opportunity really, check it out!


Does anyone know how the Google TTS voices in the example are exposed by Google? Is there an API / service for these? I haven't been able to find it.

I was able to use them on PageNodes via the "espeak" output just fine, but would like to use them directly in my own apps.


The source for pagenodes is here: https://github.com/monteslu/pagenodes and you can look at src/editor/nodeDefs/core/espeak.js as an example.

Also MDN has some pretty good documentation on the WebSpeech API https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...



Super helpful. Thanks!

In Chrome the following works as a basic sample:

  // Note: in Chrome, getVoices() can return an empty array until the
  // 'voiceschanged' event has fired, so run this after that event.
  var voices = speechSynthesis.getVoices();
  var speech = new SpeechSynthesisUtterance("生日快乐");
  // Pick a Chinese voice by language tag rather than a hard-coded index
  speech.voice = voices.find(function (v) { return v.lang === 'zh-CN'; }) || voices[0];
  speechSynthesis.speak(speech);


This article is a bit out of my comfort zone (I'm not a web app developer); however, it does link to a GitHub repository by Amazon, which I was unaware of, showing how to configure a Raspberry Pi as an Echo clone in quite a lot of detail. This is something I can do, and I have bookmarked it for a rainy day. So for that alone, thanks for the submission!


Here's the previous HN discussion on that [0]. Keep in mind that the DIY Echo project doesn't support always listening for a wake word; instead you have to press a button to activate the voice control. Not really that inconvenient though, and some people do prefer a button to something always listening.

[0] https://news.ycombinator.com/item?id=11362460


Thanks!


Funny ad


Hi Louie.

- Rick


Unfortunately, I think my first chatbot, which I wrote about a year ago and named Jarvis (based on GitHub's Hubot), will be cloned by a million of these projects...

I admit, it's not the most creative name; I just thought it would be cool a year ago to feel like Iron Man as I asked Jarvis to deploy my application to production...


Are you trying to imply that this project is copying your name...?



