Even if true, I'm not sure why you wrote such a snarky comment. This is obviously a side project of the author and just serves as inspiration; it's not like he's asking for money for it or anything.
They're not commenting on the quality of Jarvis (an excellent project), but on the headline. One of the key features of the Echo, and one which is essential for something like this, is a good microphone. Most off-the-shelf ones are useless for this, so the Echo provides fantastic value by delivering not only the voice recognition/action platform but also the hardware to make it work well.
I'd challenge someone to make a practical device that leverages Jarvis. Most microphones just aren't up to it.
I will say, I find the trend of HN posts that call something a "clone" of a commercial product while missing significant features kind of frustrating. It tends to devalue the work that's actually involved in creating successful products - and thus the work of software engineering.
It does have wake word detection, and even has that code right there in the article. It matches on jarvis/nervous/service/travis as the first word in the phrase, because those are the words that get returned by the speech recognition when he says 'jarvis'.
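Something in this spirit is only a few lines; here's a sketch of that style of matching (the helper name is mine, not the author's exact code):

    // Treat common misrecognitions of "jarvis" as the wake word,
    // since those are what the recognizer actually returns
    const WAKE_WORDS = new Set(["jarvis", "nervous", "service", "travis"]);

    function startsWithWakeWord(transcript) {
      // Grab the first run of letters, ignoring punctuation and case
      const first = transcript.toLowerCase().match(/[a-z]+/);
      return first !== null && WAKE_WORDS.has(first[0]);
    }

    // startsWithWakeWord("Travis, what's the weather?") -> true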
Some of that stuff isn't too hard if you can narrow down the domain of words you need to recognize. For example, say you want a hot word of "computer", like in Star Trek: you can literally filter the output of the recognizer (e.g. pocketsphinx) with grep and sed, and it works not too badly. For the natural language part, you can get pretty far with a simple parser like the old Infocom games used, especially if your domain is limited. I'm making an open-source multiplayer networked starship bridge simulator, kind of like Star Trek, using pocketsphinx for speech recognition, and it's working OK (not perfect, but OK). Here is a demo: https://www.youtube.com/watch?v=tfcme7maygw
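The Node equivalent of that grep/sed trick is just a few lines. A rough sketch, assuming pocketsphinx_continuous is installed and prints one hypothesis per line (this varies by version and log settings); handleCommand is a made-up stand-in:

    const { spawn } = require("child_process");
    const readline = require("readline");

    // Run pocketsphinx against the live microphone
    const ps = spawn("pocketsphinx_continuous", ["-inmic", "yes"]);
    const rl = readline.createInterface({ input: ps.stdout });

    rl.on("line", (line) => {
      const text = line.trim().toLowerCase();
      // The "grep": only react to hypotheses starting with the hot word
      if (text.startsWith("computer ")) {
        // The "sed": strip the hot word before handing the command off
        handleCommand(text.slice("computer ".length));
      }
    });

    function handleCommand(cmd) {
      console.log("command:", cmd); // e.g. "raise shields"
    }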
You have to use a decent acoustic model - not the one in the demo. If you do, I think it works 'pretty well' as a proof of concept. That said, I'm not recommending Sphinx as a recognition framework - it's way behind the times in 2016 - but this is the only 'in the wild' demo of this I've seen on the web, so I felt it was worth mentioning.
I thought that "beamforming" only applied to actively emitting signals, so that the waveforms would cancel/reinforce each other to get the desired "direction".
I have no idea how that works for microphones. Google is not very helpful; I get lots of hits for products.
A much more intuitive name for beamforming is spatial filtering [1]. It just means using multiple receivers along with knowledge about their locations to filter out noise and other signals you don't want. The term also applies to emitters like phased radar arrays or MIMO cell towers, which can use spatial filtering for beamforming, but it's a general technique.
It's a reciprocal process, meaning that it works for transmit as well as receive. In the receive direction, the antennas/microphones sample the incoming waves at various spatial points, and then the cancelling/reinforcing occurs when the received signals are phase shifted or delayed and summed.
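A toy delay-and-sum sketch of the receive direction, assuming you already have per-channel sample buffers and whole-sample delays computed from the array geometry (all names here are illustrative):

    // Toy delay-and-sum beamformer (receive side).
    // channels: one Float32Array of samples per microphone
    // delays:   per-channel delay in whole samples, derived from mic
    //           spacing, look direction, speed of sound and sample rate
    function delayAndSum(channels, delays) {
      const n = channels[0].length;
      const out = new Float32Array(n);
      for (let i = 0; i < n; i++) {
        let sum = 0;
        for (let c = 0; c < channels.length; c++) {
          const j = i - delays[c]; // shift so the look direction lines up
          if (j >= 0 && j < n) sum += channels[c][j];
        }
        // Aligned signals reinforce; off-axis noise tends to cancel
        out[i] = sum / channels.length;
      }
      return out;
    }

Real implementations use fractional delays and adaptive weighting, but this is the cancel/reinforce idea in its simplest form.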
Of course this was more of a fun example of something you can do with the PageNodes platform than an actual Echo replacement. And a way of getting started connecting services.
Definitely move on to some dedicated hardware if you're serious about this sort of thing.
Development kits are expensive due to low sales volume, no cost optimization, and buyers' price insensitivity. There is a huge amount of value in having a known good implementation on hand when designing hardware or firmware. Additionally, having the development kit means your firmware team can start developing before your hardware arrives.
If you have a good relationship with them, sales reps will often give or lend development kits.
Even that XMos dev kit won't give you all you need for the beam-forming part of the microphone array:
"Customer adds own DSP to create differentiated product"
The DSP is where you'd implement the beam-forming algorithm to get one clear audio stream rather than 8...
I hope a Chinese manufacturer makes one soon. The teardown of the Echo put the parts at a bit more than that, but you could do it cheaper: a multichannel ADC and half a dozen MEMS mics. I guess the drivers would be the time-consuming bit.
The board actually doesn't appear to have an ADC; the microphones output pulse-density-encoded digital data which can be directly received and interpreted by the processor.
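Decoding PDM is conceptually just measuring pulse density. A crude sketch, assuming the bitstream is packed MSB-first into bytes (real designs use proper CIC/FIR decimation filters, not a plain average):

    // Crude PDM-to-PCM: the density of 1-bits over a window is the amplitude.
    // pdmBytes: Uint8Array of the raw 1-bit stream, MSB first
    // decimation: PDM bits per output PCM sample (e.g. 64)
    function pdmToPcm(pdmBytes, decimation = 64) {
      const totalBits = pdmBytes.length * 8;
      const out = new Float32Array(Math.floor(totalBits / decimation));
      for (let s = 0; s < out.length; s++) {
        let ones = 0;
        for (let b = 0; b < decimation; b++) {
          const bit = s * decimation + b;
          if (pdmBytes[bit >> 3] & (0x80 >> (bit & 7))) ones++;
        }
        out[s] = (2 * ones) / decimation - 1; // map density [0,1] to [-1,1]
      }
      return out;
    }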
Every comment that starts with "curious" seems to be spam on HN. Is this always the same guy, or a bot? Almost every HN topic contains one of these "curious" comments...
It's a common English idiom, and even though I've been here over twice as long as your account date, I can't say I've noticed an infestation of "curious" comments... though given that curiosity is a hacker virtue, perhaps it's not surprising there's more of it here. I'm still genuinely curious how GGP would have dev boards priced instead -- the market is pretty much limited to students (the ones at good schools have the schools pay for the boards, or the schools get discounts) and professionals (who again expense them through their company), so what's the incentive to lower prices?
I've got one knocking about in a drawer. Let me know if you get beamforming working on it. I think there are multi-channel audio drivers for Linux/RPi for the PS3 Eye.
I'm not sure why posters on HN are so eager to shit on other people's work. Maybe something to do with arrogance or insecurity (leading to a need to bash others)?
Or maybe calling it an "Amazon Echo clone," which insinuates it covers almost all the bases the Echo does, was a bit premature? It could have just been called a "voice-controlled PDA".
I don't think it's "shitting on" it (although text is notorious for making things seem far more serious or critical than they actually are), just clarifying the remaining differences.
If you are unaware, OP runs the best coworking meetup(s) in Phoenix. If you're a Phoenix dev and not coming to coffee and code, then you're missing out!
Louis/Alyson (since Jarvis was her project, I think): welcome to the #1 on HN club ;-)
I have to agree, it's nice that Phoenix has a pretty active Node community... though it feels like everyone is too busy working to hit a lot of the meetings. I'd love to catch a coffee and code meetup, but I work too far away and am in the office for morning standups, etc., around that time.
Just the same, I always get ideas from the Phoenix/AZ Node user group meetings... it's also nice to see when someone demos an idea you had, such as routing redux at the server via websockets.
What is the current state of offline, non-cloud-connected speech-to-text?
My phone has a voice processing chip, and it recognizes my speech pretty well, but I still can't figure out if it's completely disconnected from the cloud (despite intentionally not agreeing to the privacy policy).
Kaldi is pretty good. Not sure if you can run it on a phone but definitely on a single desktop, purely local processing.
Results depend on the trained model; I think the TED-LIUM one is alright. And of course on the quality of the input signal - far-field/noisy audio is much more prone to errors, which is where the mic array on the Echo helps a lot.
I'm curious, does anybody know if there's a simple way to wire this up to Home Assistant (https://home-assistant.io)? My first thought was MQTT, but for some reason PageNodes doesn't have any MQTT output support, which is kind of odd for something claiming to be an IoT connectivity platform.
HA does have a REST API, but with the way PageNodes works you'd have to hardcode the HA password right into the PN workflow. Have you considered adding the equivalent of environment variables, which could be set in a PN account and used as placeholders in workflows?
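For reference, the hardcoding problem looks like this with HA's (legacy) API-password auth - the host, password, and entity here are placeholders:

    // Minimal Home Assistant service call with the legacy API password.
    // The password sits right in the flow - exactly the problem above.
    fetch("https://ha.example.com:8123/api/services/light/turn_on", {
      method: "POST",
      headers: {
        "x-ha-access": "MY_HARDCODED_PASSWORD",
        "Content-Type": "application/json"
      },
      body: JSON.stringify({ entity_id: "light.living_room" })
    });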
That's a good question. Our storage is local IndexedDB. And the site is HTTPS, so no one should see your flow if you don't share it.
That said there's nothing stopping you from reaching out to another secure service or plugin before making requests.
OK, so it's less "prototype in the browser, then offload to a server once it works", and more for local "app"-type things? Interesting idea; it has some limitations but also opens up tons of interactions that are harder for a server-based solution (webcam, ...).
Bingo! It's always evolving, too. We do use a lot of experimental flags from the browser, which helps us work with up-and-coming features and makes learning about new APIs very easy.
Octoblu confuses me. I can't find anything about their pricing scheme (if any), while all their professional partners listed make me think there definitely has to be one hidden in there somewhere.
    // Speak "Happy birthday" (生日快乐) with the browser's Web Speech API
    const speech = new SpeechSynthesisUtterance("生日快乐");
    // A hardcoded voice index is fragile; look up a Chinese voice by language
    speech.voice = speechSynthesis.getVoices().find(v => v.lang.startsWith("zh"));
    speechSynthesis.speak(speech);
This article is a bit out of my comfort zone (I'm not a web app developer); however, it does link to a GitHub repository by Amazon, which I was unaware of, that shows how to configure a Raspberry Pi as an Echo clone in quite a lot of detail. This is something I can do, and I've bookmarked it for a rainy day. So for that alone, thanks for the submission!
Here's the previous HN discussion on that [0]. Keep in mind that the DIY Echo project doesn't support "always-listening" with a wake word; instead you have to press a button to activate the voice control. Not really that inconvenient, though, and some people do prefer a button to something that's always listening.
Unfortunately, I think my first chatbot, which I wrote about a year ago and named Jarvis (based on Hubot by GitHub), will be cloned by a million of these projects...
I admit, it's not the most creative name; I just thought it would be cool a year ago to feel like Iron Man as I asked Jarvis to deploy my application to production...
* Beamforming microphone array (this is a real clone: http://www.xmos.com/products/microphones )
* Wake-word / hot-word detection ("Ok Google", "Alexa", etc.)
* Intent recognition / NLU
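On that last point, even a toy matcher shows what intent recognition/NLU means here - everything in this sketch is made up and far simpler than what the Echo does:

    // Toy intent recognizer: map a transcript to an intent plus slot values
    const INTENTS = [
      { name: "set_timer", pattern: /set a timer for (\d+) (seconds|minutes)/i },
      { name: "play_music", pattern: /play (.+)/i }
    ];

    function recognizeIntent(transcript) {
      for (const { name, pattern } of INTENTS) {
        const m = transcript.match(pattern);
        if (m) return { intent: name, slots: m.slice(1) };
      }
      return { intent: "unknown", slots: [] };
    }

    // recognizeIntent("set a timer for 5 minutes")
    // -> { intent: "set_timer", slots: ["5", "minutes"] }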