Speech-to-Text-WaveNet: End-to-end sentence level English speech recognition

gambler · on Nov 25, 2016

"Some of Deepmind's recent papers are tricky to reproduce. The Paper also omitted specific details about the implementation, and we had to fill the gaps in our own way."

So, I'm not the only one seeing this issue. It seems like many recent AI papers want to look as impressive as possible, wile giving you as little implementation info as possible. This bothers me, because it opposes the very purpose of research publication.

deepnotderp · on Nov 25, 2016

This is more specific to deepmind actually, Facebook and others have been pretty good about publishing code.

CamperBob2 · on Nov 25, 2016

Unfortunately I think you'll find similar complaints in every scientific field. Often, results either aren't described well enough to be reproduced, they're too expensive or difficult to reproduce, or they rely on closed-source software and/or inadequately-documented hardware.

bmc7505 · on Nov 25, 2016

A few weeks ago, a deep learning researcher at one of the world's leading speech groups told me off-the-record that offline, human-parity speech recognition would be "coming soon" to mobile devices. Not sure s/he realized just how soon that would be. Even though state-of-the-art ASR is really expensive to train, recognition is extremely cheap to run, even on lower-power devices. [1][2] With specialized silicon, you can do this, continuously, for free, on something like a smartwatch. You don't need to open a websocket or call an API running on some beefy server to do this, speech-to-text is now a basic commodity. Fully offline, ubiquitous speech recognition is right around the corner. With human-level speech synthesis [3], speech applications are going to get very interesting, very quickly.

[1] http://niclane.org/pubs/deepx_ipsn.pdf

[2] https://www.ibr.cs.tu-bs.de/Cosdeo2016/talks/invitedTalk.pdf

[3] https://github.com/ibab/tensorflow-wavenet

braindead_in · on Nov 25, 2016

A consumer focused human parity ASR service will disrupt so many industries, including mine. I run a human powered transcription service where we transcribe files with high accuracy. I am just waiting for the day when our transcribers can work off a auto-generated transcript instead of typing it all up manually. I'll pay good money for a service where I can just send a file and get a 80-90% accurate transcript with speaker diarization.

skoocda · on Nov 25, 2016

We've chatted - just an update that I'm implementing diarization this weekend :)

imaginenore · on Nov 25, 2016

I hope you realize your business is about to go out of business. The only reason you can charge people now is because the automatic recognition sucks compared to humans.

braindead_in · on Nov 25, 2016

We do super-human-parity transcripts. Our transcripts are insanely accurate, even for challenging files. I'm sure computers will be able to do that one day, but Singularity would have already happened by then, wiping out many businesses. I for one look forward to Singularity and hope that we will contribute to it in some way.

imaginenore · on Nov 25, 2016

What's super-human parity? And how do you achieve it using humans?

epistasis · on Nov 25, 2016

Presumably more accurate than a single human, and you can do it with multiple humans and reaching a consensus. I remember an anecdote in physics class where an experiment required counting a certain number of events in time. A single person would occasionally blink and miss an event. But if you had two people, and you count how many people observed each event, you can solve for super-human accuracy using the estimated error rates of each person.

See also this usage in the context of ML:

https://arxiv.org/pdf/1602.05314v1.pdf

gwern · on Nov 26, 2016

Ensembles are well-known to be more accurate. But this is not an advantage exclusive to humans: NNs ensembled will do better than any of the individual NNs.

There's no reason one couldn't train 5 or 10 RNNs for transcription and ensemble them. (Indeed, one cute trick this ICLR was how to get an ensemble of NNs for free so you don't have to spend 5 or 10x time training: simply lower the learning rate during training until it stops improving, save the model, then jack the learning rate way up for a while and start lowering it until it stops improving, save that model, and when finished, now you have _n_ models you can ensemble.) And computing hardware is cheaper than humans, so it will be cheaper to have 5 or 10 RNNs process an audio file than it would be to have 2 or 3 humans independently check, so the ensembling advantage is actually bigger for the NNs in this scenario.

Humans still have the advantage of more semantic understanding, but RNNs can be trained on much larger corpuses and read all related transcripts, so even there the human advantage is not guaranteed.

visarga · on Nov 26, 2016

Yeah, but you don't want to run an ensemble of 10 RNNs on your phone, or in the cloud for that matter, when you got billions of queries. It's too expensive.

In practice the ensemble model is compactly transferred into a single network. In order to do that, they train a new network to copy the outputs of the ensemble, exploiting "dark knowledge".

Recurrent Neural Network Training with Dark Knowledge Transfer - https://arxiv.org/abs/1505.04630v5

bertiewhykovich · on Nov 25, 2016

I'm rooting against you, pal. Cheers.

kcorbitt · on Nov 25, 2016

This is really exciting. I previously worked at a startup for that could have benefited enormously from even 90% accurate speech recognition. As of six months ago when I last looked, there were no open source speech-to-text libraries with anything approaching the performance of the proprietary work by Google, Microsoft, Baidu, etc. The closest thing was CMU Sphinx, but its accuracy was unacceptable.

Props to the author, and especially to the DeepMind researchers who published their work! I look forward to living in a world where this type of technology is ubiquitous and mostly commoditized.

bmc7505 · on Nov 25, 2016

The CMU Sphinx project as it stands is basically dead. Even though they recently implemented some sequence-to-sequence deep learning techniques for g2p [1], the core stack is still based on an ancient GMM/HMM pipeline, and current state of the art projects (even open source ones) have leapfrogged it in terms of accuracy. If you're implementing offline speech recognition today, start with something like this or Kaldi-ASR [2]. It will take a bit of work to get your models to running on a mobile device, but the end result will be much more usable.

[1] http://cmusphinx.sourceforge.net/2016/04/grapheme-to-phoneme...

[2] http://kaldi-asr.org/

snadal · on Nov 25, 2016

We've worked in the past with CMU Sphinx too, and it is absolutely amazing the advances in this area in the last months.

A little bit off-topic, but do you know any recent work or paper for speech recognition in language teaching area ? (I mean, analysing and rating accuracy of speaker, detect incorrect pronunciation of phones, and so on)

bmc7505 · on Nov 26, 2016

> Do you know any recent work or paper for speech recognition in language teaching area?

What you're describing is called "speech verification". Language education is an application I'm personally very interested in, and one that almost no one discusses in the speech community (I assume because of machine translation), so if you find any research papers please let me know! I wrote a little about it: http://breandan.net/2014/02/09/the-end-of-illiteracy/

The task is actually much simpler than STT. You display some text on the screen, wait for an audio sample, then check the model's confidence that the sample matches the text. If the confidence is lower than some threshold, then you play the correct pronunciation through the speaker. The trick is doing this rapidly, so a fast local recognizer is key. I've got a little prototype on Android, and it's pretty neat for learning new words. I'd like to get it working for reading recitation, but that's a lot of work.

snadal · on Nov 26, 2016

Hey, thank you for the link to you article. I've read it throughly and I cannot agree more. And that was written two years and a half ago, before the AI "explosion" that we saw later.

Actually, checking against confidence is something that we've tried to play with, but to my knowledge there is not a model that allows you to compare speech confidence against an specific text. Public APIs like MS ProjectOxford.ai can return a confidence, but against the "recognised" text, not against a predefined text.

Going further, this kind of approach can be very effective on words and small sentences, but I'd really love to see which specific phones the learner is failing, which can help in analysing full speaking exercises.

It works, but I am sure it should be possible to do better

brandoncarl · on Nov 26, 2016

To the authors: did you any of your own recordings? I've used my own and clips online, in WAV and other formats, at various sampling rates.

All of the results come back gibberish. The results in the training data seem just fine. Curious if you've tested the above to ensure it didn't overfit.

craigbaker · on Nov 25, 2016

Is this really speech recognition from raw waveforms? It looks like they're extracting MFCC features from the raw audio, and using that as input to the neural network. I thought that the point of WaveNet was that it took the raw waveform directly as input, unlike previous architectures which first extract spectral features such as MFCCs to use as the input.

bmc7505 · on Nov 25, 2016

Apparently, they tried to use the raw audio waveform with the original setup from the WaveNet paper but couldn't get it to train on their TitanX, so they used MFCCs instead. It's not exactly clear why this is the case.

"Second, the Paper added a mean-pooling layer after the dilated convolution layer for down-sampling. We extracted MFCC from wav files and removed the final mean-pooling layer because the original setting was impossible to run on our TitanX GPU." [1]

[1] https://github.com/buriburisuri/speech-to-text-wavenet#speec...

RandomInteger4 · on Nov 25, 2016

How much Bandwidth is consumed from voice communications such as when speaking to someone on Skype or over the phone, vs. the same words transmitted via text?

Perhaps future communication applications can have a WaveNet on either end, which learns the voice of the person you're communicating with and then only sends text after a certain point in the conversation?

I'm coming at this from a point of ignorance though, so correct me if I've made erroneous assumptions.

dest · on Nov 25, 2016

text communication is much lighter (a few bytes/s vs kb/s) but you may miss the non verbal contents of voice

RandomInteger4 · on Nov 26, 2016

By non-verbal do you mean like ambient sound? Dogs barking, child yelling, garbage truck garbage trucking? I don't know. If they can do voice, then it might be possible to do ambient sounds of there is a separate nets trained with a library of ambient sounds where it's tuned not to be the same every time the sound plays like how when you have tiled graphics, there are algorithms that remove the unnatural sameness from one tile to the next.

This could have interesting implications for Foley-artists of the 21st century.

How likely would such a tech help lower budget companies who want to implement voice communication within their software, say for video games or similar?

Hmm, now this has me wondering what implications this has for voice acting as well.

EDIT: We can call the ambient sound symbols sent over the wire "Soundmojis" or "amojis" or "audiomojis"

dest · on Nov 27, 2016

I was thinking about the voice intonation. For example the sentences "this is really great" or "how do you do? -> I'm fine, thank you" can have opposite meanings depending on the intonation. This explains a lot of the misunderstandings on written forums.

It should be possible to train a neural network to catch those special intonations, but it is IMHO substantially harder than the initial project, with uncertain results.

RandomInteger4 · on Nov 29, 2016

Oh, right. I can't believe I forgot about intonation ... I should really get out and talk to people via voice more ...

kondro · on Nov 26, 2016

Less than 8kbps in most voice. It pales in comparison to the quantity of bandwidth consumed each day on video.

throwaway13337 · on Nov 25, 2016

This seems super useful for most speech recognition - understanding context.

It doesn't seem like the mainstream engines (Alexa, Google Voice, Siri) are context aware. Why not?

doublerebel · on Nov 25, 2016

Context involves location, which 99% of the time those bots don't take into consideration. Context does not involve knowing everything about your email or being able to search the entire web. It's much more connected to what you just did and where and when you are doing it.

This is what I'm solving at Optik. Helping you manage the things that you care about in the place that you are, and NOT exposing your personal details to cloud computation.

EGreg · on Nov 25, 2016

Also why can't we track emails sent from our iOS device like we can with desktop GMail plugins??

teajunky · on Nov 25, 2016

Wow train.py contains only 83 lines of code (including a few empty lines and commets). And recognize.py is only litte bit longer with 108 lines. Very impressive.

bra-ket · on Nov 25, 2016

typical of machine learning, a whole lot of talking about a few lines of code

hyperbovine · on Nov 25, 2016

FFT is 4 lines, what is your point.

IshKebab · on Nov 26, 2016

Can someone explain why MFCC is used rather than allowing the neural network to learn from the raw waveform? I looked back in the literature and the intention of MFCC & PLP seems to be to remove speaker-dependent features from the audio in order to reduce the dimensionality of the input. But I though the whole point of neural nets is that they can learn from very high dimensional inputs no?

I had a go at implementing wave->phoneme recognition using a simple neural net and it seemed to work pretty well.

Karlozkiller · on Nov 25, 2016

This is exactly what I would have wanted for my master thesis about half a year ago, where I wanted to use s2t with good control over the system without having to implement everything myself.

echelon · on Nov 25, 2016

Did the original WaveNet text to speech demo come with a paper or source code? (I didn't see either.) I'm interested in techniques, particularly neural network-related, to improve the quality of my Donald Trump text to speech engine [1].

Does anyone on HN do active research in this field? Could I pick your brain for a survey of the best papers (especially review papers) on the subject?

[1] http://jungle.horse

bmc7505 · on Nov 26, 2016

> Did the original WaveNet text to speech demo come with a paper or source code?

Paper, yes. [1] Source code, no.

[1] https://arxiv.org/pdf/1609.03499.pdf

londons_explore · on Nov 25, 2016

Looking at the training loss graph, it looks like training for more time would produce even better results...

Anyone want to volunteer a few weeks of GPU time to train this better?

gwern · on Nov 26, 2016

Training loss pretty much always decreases. NNs are extremely powerful models, so they can overfit most data. What you want to see is the validation loss graph.

mo1ok · on Nov 25, 2016

This is awesome. I was just reading the waveNet paper and wondering how would go about a DIY approach...

EGreg · on Nov 25, 2016

Does this require an internet connection, though? Relative to say OpenEars?

amelius · on Nov 25, 2016

Perhaps now finally Linux could get a speech recognition input device.