Microsoft Research makes breakthrough in audio speech recognition (technet.com)
138 points by sparknlaunch on June 22, 2012 | 44 comments



The most interesting bit for me is at the end of another blog entry:

http://blogs.technet.com/b/inside_microsoft_research/archive...

"An intern at Microsoft Research Redmond, George Dahl, now at the University of Toronto,

http://www.cs.toronto.edu/~gdahl/

contributed insights into the working of DNNs and experience in training them. His work helped Yu and teammates produce a paper called Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition.

http://research.microsoft.com/pubs/144412/DBN4LVCSR-TransASL...

In October 2010, Yu presented the paper during a visit to Microsoft Research Asia. Seide was intrigued by the research results, and the two joined forces in a collaboration that has scaled up the new, DNN-based algorithms to thousands of hours of training data."


For people interested, some (currently) undocumented research code in Python implementing DNNs is also on my website. The code is only an initial release; I will improve it later, but if I waited until it wasn't embarrassing I would never release it, so I just posted it.


Thank you for doing so!

Overwhelmingly, it is my experience that researchers in computational disciplines publish papers with half-finished code "available on request" -- and requests are often ignored. It's refreshing to hear someone say, "Yes, the code needs work, but it should be available."


The demo site (http://www.msravs.com/audiosearch_demo/) blocks browsers other than IE and Firefox based on the user agent string. Use WebKit's developer tools to change your user agent and you'll be able to get in.


Why alienate such a large segment of users after pouring so much money into their technology? The web is getting weirder.

If a company invests in multiple markets, they should be prepared to do well in some markets and badly in others. Bing isn't as good as Google. Android isn't as well-designed as Metro. Yes, Android stole Apple's market, and, yes, Apple stole someone else's market. The large technology companies are deadlocked on multiple fronts. That fuels fierce competition and inspires excellence and choice. However, companies should accept they just aren't the best at everything. Let us make our own choices based on what's best for us.


I think you are attributing to malice what is probably just laziness. It is fairly common for modern websites to drop the ball on support of some browser or other. I doubt Microsoft as a corporation made a deliberate decision to support IE and Firefox but not Chrome or Safari or Opera or whatever.


It's one thing to not test a site in a particular browser and to just put up an unobtrusive warning saying that some things might not work perfectly. It's quite another to actively block access based on the user agent string.


Chrome is more popular than Internet Explorer is!


"Android stole Apple's market" eh? Must be why Apple is losing money hand over fist. Apple fans can be annoying but fandroids are often detached from reality.


Android market share is increasing at the expense of iPhone market share. That doesn't mean either of them are suffering. http://arstechnica.com/gadgets/2011/04/developer-frustration...


Imagine the power of this for students. This would have made school so much easier. Simply record every lecture and then use this to search for keywords.

Awesome.


On a different note, imagine the power of this for DRM control and censorship.

But impressive and very useful.


What? What do speech recognition and fingerprinting have in common? I don't see how this research applies to DRM...

Censorship, maybe. And even then, you can't filter conversations in real-time, only maybe 'flag' people with forbidden words.


Pretty much all DRM'd content has unique patterns.

Want to prohibit videos of the StarCraft game? Simply search for a few sentences like "more vespene gas" and "require more minerals".

Want to find online copies of "Aliens"? Just enter a few catchphrases or part of a dialogue like "They come mostly at night. Mostly."

> And even then, you can't filter conversations in real-time, only maybe 'flag' people with forbidden words.

Yes, it's really reassuring to know it's not in real time.


Those don't really require textual matching, just regular audio fingerprinting. In fact, text matching would also hit StarCraft or movie podcasts where people are quoting the source.


With audio fingerprinting, the content provider must fingerprint its own audio and get access to fingerprints of the internet's audio/video. That means a partnership between, e.g., YouTube and a studio. I'm fairly sure this is limited to studios above a certain size, with resources for programming and API work, plus a fair bit of paperwork and robustness testing, since there are ways to mess with the technique.

With this technique you just enter a few words and look at what comes out.

You're suggesting that the first option is easier?


Yes. Not only easier, but more reliable. The examples you gave are perfectly static sound bites; they don't change. It doesn't make sense to transcribe them to text, just match the audio. SoundHound/Shazam/etc. do this easily. I'm pretty sure YouTube already has some similar mechanism in place.

This technology gets a lot more interesting if you want to search for people talking about you or your products.


On an immediately useful practical note, OneNote also has this functionality (obviously not as powerful). I've used it to record a meeting's audio synced to my notes and then search the audio to jump exactly to where someone mentioned something and review the context. Saved my ass on at least one occasion.


Can someone please explain senones to me? Can't find much on Google.

The article says that they are a fragment of a phoneme, but how small a fragment are we talking? 2-3 per phoneme, or many more?

Also - I'd be curious how much the phoneme in a word can vary based on accent.


http://cmusphinx.sourceforge.net/wiki/tutorialconcepts

"Speech is a continuous audio stream where rather stable states mix with dynamically changed states. In this sequence of states, one can define more or less similar classes of sounds, or phones.

Words are understood to be built of phones, but this is certainly not true. The acoustic properties of a waveform corresponding to a phone can vary greatly depending on many factors - phone context, speaker, style of speech and so on. The so called coarticulation makes phones sound very different from their “canonical” representation. Next, since transitions between words are more informative than stable regions, developers often talk about diphones - parts of phones between two consecutive phones. Sometimes developers talk about subphonetic units - different substates of a phone. Often three or more regions of a different nature can easily be found.

The number three is easily explained. The first part of the phone depends on its preceding phone, the middle part is stable, and the next part depends on the subsequent phone. That's why there are often three states in a phone selected for HMM recognition.

Sometimes phones are considered in context. There are triphones or even quinphones. But note that unlike phones and diphones, they are matched with the same range in waveform as just phones. They just differ by name. That's why we prefer to call this object senone. A senone's dependence on context could be more complex than just left and right context. It can be a rather complex function defined by a decision tree, or in some other way."


Thanks. So senones are not just fragments of phones - two senones could sound exactly the same, but be classified differently depending on their context within the audio stream.


Senones are just tied triphone HMM states. A context dependent HMM recognizer has a 3-5 state HMM for every context dependent phone. Conceptually, each different HMM state in each different phone HMM has its own Gaussian mixture model, but this is awful because many of them don't get much data assigned to them. So people share parameters for different HMM states based on a data driven decision tree that clusters states together. Those clustered or tied states are sometimes called senones.
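
To make that concrete, here is a toy Python sketch of the state tying described above. The triphones, the single phonetic question, and the tying rule are all made up for illustration; real systems (HTK, Kaldi, etc.) build the decision tree from data.

    # Toy illustration of "senones" as tied triphone HMM states.
    # Each context-dependent phone has 3 HMM states; states that answer the same
    # phonetic questions get tied together and share one set of emission parameters.

    # (left context, center phone, right context, HMM state index) -- made-up data
    triphone_states = [
        ("k", "ae", "t", 1), ("b", "ae", "t", 1), ("k", "ae", "p", 1), ("k", "ae", "n", 1),
        ("k", "ae", "t", 2), ("b", "ae", "t", 2), ("k", "ae", "p", 2), ("k", "ae", "n", 2),
    ]

    STOPS = {"p", "t", "k", "b", "d", "g"}

    def tie(state):
        """A stand-in for a real decision tree: one hand-written question per split."""
        left, center, right, idx = state
        if idx == 2:                      # stable middle region: ignore context
            return (center, idx, "any-context")
        return (center, idx, "stop-on-right" if right in STOPS else "other-right")

    senones = {}
    for s in triphone_states:
        senones.setdefault(tie(s), []).append(s)

    for senone_id, members in sorted(senones.items()):
        print(senone_id, "<-", members)

    # Each key is one "senone": all of its member states share a single GMM
    # (or, in the DNN-based system, a single output unit of the network).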


For those keeping score, google's image feature extractor shares the same core principles as microsoft's speech recognizer.

EDIT: by keeping score I mean keeping track of which techniques are being used where.


Am I the only one who gets tired of people keeping score like this? Can't we just accept that many of the large companies are seriously innovative?

(Sorry, I know I'm being cranky)


What are these core principles?


The main characters of both papers are many-layered neural network architectures, autoencoders and stochastic gradient descent. The interesting thing is that all these ideas are from the '80s, but the breakthrough was in how to use unsupervised learning to seed neural networks so that a many-layered neural network did not get mired in local optima.

The key idea is that if you train each layer in an unsupervised manner and then feed its outputs as features for the next layer it performs better when you go on to train it in a supervised way. That is, back-propagation on the pre-trained Neural net, learns a far more robust set of weights than without pretraining. Stochastic gradient descent is a very simple technique that is useful for optimization when you are working with massive data.

The architecture Dahl used stacks RBM layers (very similar to autoencoders) to seed a regular ole, but many-layered, feedforward network. SGD is used to do the back-propagation. The RBMs themselves are trained with a generative technique; see contrastive divergence for more.

The Google architecture is more complex and based on biological models. It is not trying to learn an explicit classifier; instead they train a many-layered autoencoder network to learn features. I only skimmed the paper, but they have multiple layers specialized for particular types of processing (think Photoshop, not Intel), and using SGD they optimize an objective that essentially learns an effective decomposition of the data.

The main takeaway is if you can find an effective way to build layered abstractions then you will learn robustly.
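
If it helps, here is a minimal numpy sketch of that recipe. This is not the code from either paper: layer sizes, learning rate, epochs, and the fake input data are arbitrary, and a real acoustic front end would typically use a Gaussian-Bernoulli RBM for the first (real-valued) layer rather than the binary RBMs sketched here.

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    def train_rbm(data, n_hidden, epochs=5, lr=0.05, batch=64):
        """One binary RBM trained with 1-step contrastive divergence (CD-1)."""
        n_visible = data.shape[1]
        W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        b_v = np.zeros(n_visible)                 # visible biases
        b_h = np.zeros(n_hidden)                  # hidden biases
        for _ in range(epochs):
            for i in range(0, len(data), batch):
                v0 = data[i:i + batch]
                h0 = sigmoid(v0 @ W + b_h)                        # positive phase
                h_sample = (rng.random(h0.shape) < h0).astype(float)
                v1 = sigmoid(h_sample @ W.T + b_v)                # reconstruction
                h1 = sigmoid(v1 @ W + b_h)                        # negative phase
                W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
                b_v += lr * (v0 - v1).mean(axis=0)
                b_h += lr * (h0 - h1).mean(axis=0)
        return W, b_h

    def pretrain(data, layer_sizes):
        """Greedy layer-wise pre-training: each trained RBM's hidden activations
        become the input "data" for the next RBM in the stack."""
        weights, x = [], data
        for n_hidden in layer_sizes:
            W, b_h = train_rbm(x, n_hidden)
            weights.append((W, b_h))
            x = sigmoid(x @ W + b_h)              # deterministic up-pass
        return weights

    # Fake "acoustic feature" data in [0, 1] just to make the sketch runnable.
    features = rng.random((1024, 39))             # e.g. 39-dim MFCC-like frames
    stack = pretrain(features, layer_sizes=[256, 256, 256])

    # `stack` would now initialize a feedforward DNN; a softmax output layer over
    # the senones is added on top and the whole network is fine-tuned with
    # supervised back-propagation / SGD on labelled frames.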


There's a very good Google Tech Talk by Geoff Hinton (who has worked closely with Dahl on a lot of this research and developed some of the key algorithms in this field) that explains how to build deep belief networks using layers of RBMs: http://www.youtube.com/watch?v=AyzOUbkUf3M

That video focuses on handwritten digit recognition, but it's great for understanding the basics. There's a second Google Tech Talk video from a few years later that talks directly about phoneme recognition as well: http://www.youtube.com/watch?v=VdIURAu1-aU


Do you mean the system behind the google images search?



This seems very related to this talk by Andrew Ng: http://www.youtube.com/watch?v=ZmNOAtZIgIk. It is a 40-minute talk, but he explains very simply how all of this works for images, with some examples for the audio case. It is incredible how, using these deep learning techniques, we can teach these "neural networks" to recognize such complicated patterns. It is like reverse-engineering the brain's algorithms.

BTW, I took his Coursera course on Machine Learning and it was great! I highly recommend it for picking up basic ML knowledge.


Are you still able to access the course materials? I took the course as well (and enjoyed it!) but I'd like to access the PDFs, especially.


Yes, I downloaded all the PDFs. Email me (check my profile) and I will share them with you via Dropbox ;)


How does this compare to Microsoft's Old HTK (HMM Toolkit)? The language used on the website seems to point to a lot of the same things. Is this breaking it down to actual IPA phonemes?

I'm mostly curious because I used the HTK for my thesis and would like to know how they compare (besides, one being just 'newer').


This approach still uses HMMs; it's just that the observation probabilities now come from a DNN (deep neural network) instead of a GMM (Gaussian mixture model). "Senones" are not new: HTK can use various context-dependent phoneme models, and the HMM states (typically 3) within each context-dependent phoneme essentially boil down to what they call a "senone" here. Interestingly, they use GMMs to bootstrap the DNN training, which I suppose you could avoid once you have a reasonable DNN lying around.

The main difference here is hooking the DNN output up to an HMM decoder, replacing the GMMs, and, possibly even more important, the training process they use to get the DNN fairly efficiently. That's the biggest thing: GMMs, at least the last time I looked, can be trained and adapted much more quickly than a DNN.
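
For what it's worth, the "hook the DNN output to an HMM decoder" step usually comes down to converting the network's senone posteriors into scaled likelihoods by dividing by the senone priors. This is a standard hybrid-system trick, not necessarily the paper's exact formulation, and the numbers below are made up:

    import numpy as np

    def scaled_log_likelihoods(dnn_posteriors, senone_priors, eps=1e-10):
        """The decoder wants p(frame | senone); the DNN gives p(senone | frame).
        By Bayes' rule, p(frame | senone) is proportional to
        p(senone | frame) / p(senone), so dividing by the priors (estimated from
        the training alignment) gives a scaled likelihood that can replace the
        GMM score in Viterbi decoding."""
        return np.log(dnn_posteriors + eps) - np.log(senone_priors + eps)

    # Toy example: 2 frames, 3 senones (softmax outputs and priors are made up).
    posteriors = np.array([[0.7, 0.2, 0.1],
                           [0.1, 0.8, 0.1]])
    priors = np.array([0.5, 0.3, 0.2])
    print(scaled_log_likelihoods(posteriors, priors))

    # The HMM topology, lexicon, and language model are untouched -- only the
    # per-frame acoustic score changes.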


(I'm not an expert)

I think HTK doesn't use neural networks at all. What it does is simply compute the MFCCs of the sound signal and use them as input to a chain of HMM models. Well, "simply" that, plus the dozens of refinements and tweaks needed to make it work well.

Here, I guess they do some sort of preprocessing of the sound features with their deep neural networks before feeding the whole thing to the HMMs.
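
Roughly, that conventional front end looks like this (librosa is used here purely for illustration, HTK's HCopy does the equivalent; the file name and frame settings are placeholders):

    import numpy as np
    import librosa

    # Load a mono utterance at 16 kHz (file name is a placeholder).
    signal, sr = librosa.load("utterance.wav", sr=16000)

    # 13 MFCCs per frame, 25 ms windows with a 10 ms hop -- the classic recipe.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)

    # Append delta and delta-delta features for the usual 39-dim frame vectors.
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    features = np.vstack([mfcc, delta, delta2]).T        # shape: (frames, 39)

    # In the classic HTK pipeline these frames are the observations scored by the
    # GMM-HMMs; in the hybrid system above, these (or filterbank features) are
    # what gets fed to the DNN.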


Vlingo, Siri, and others have been doing speaker-independent, auto-adapting speech recognition for years, so talk of systems requiring 'training' and of improvements there makes this article sound five years old. Great to see innovation in this space, but the article is very light on detail.


It is my understanding (albeit based on limited knowledge) that Siri, like other Nuance-powered systems that call out to a server, is actually "trained" continuously by the huge amount of sample speech received from real users.

The true "breakthrough" here would be if Microsoft made a voice recognition system that could run entirely on a device (no internet connection needed) and accurately understand speech without terabytes of training data or a local user training session. I can't tell from the article if this is what Microsoft is claiming.

Also, it appears that "Deep Neural Network" isn't the most common term of art here. DNN appears to be a synonym for "Deep Belief Network".[1] Can anyone confirm?

[1] http://www.scholarpedia.org/article/Deep_belief_networks


I believe that in this system, "deep neural network" just means a regular feed-forward network that has a larger number of hidden layers. There is a relationship to DBNs though, because they initialize the weights of the neural net by doing unsupervised pre-training with a set of DBNs.


The term "Deep Belief Network" has been abused in the literature (not pointing fingers, I've done it too). The DNNs used mean a neural net pre-trained with RBMs. Sometimes, when people say DBN, that is also what they mean. But really a DBN is a particular graphical model with undirected connections between the top two layers and directed connections everywhere else. The confusion comes from the pre-training procedure. The pre-training creates a DBN, which is then used to initialize the weights of a standard feedforward neural net. Then the DBN is discarded. It is a somewhat pedantic distinction. Since DBN is already an overloaded acronym (Dynamic Bayes Net) in the speech community and not entirely accurate for the pedantic reason I just mentioned, we decided to go with the DNN acronym.


As you might guess, they are not claiming this.

They basically are using a new (in the context of speech rec) technique that seems to improve accuracy by 16% relative on their test data (and using their code :-)). It's a really great result, but it doesn't change the basic nature of a state of the art speech recognizer at all -- you still need to train and adapt it -- and it still needs lots and lots of data.
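
For anyone unsure what "16% relative" means (the numbers here are hypothetical): if the baseline word error rate were 25%, a 16% relative reduction would bring it to 0.25 × 0.84 = 0.21, i.e. 21% WER, whereas a 16-point absolute reduction would mean 9% WER.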


That jumped out at me as well. Speaker-independent systems are most certainly not limited to small vocabularies or pre-baked input patterns anymore. There is certainly room for a great deal of improvement, but it's in accuracy, not simply the ability to do generalized speaker-independent input at all.


Vlingo has LITERALLY never gotten anything I said right, ever. Just a data point.





