Hacker News
The Visual Microphone: Passive Recovery of Sound from Video (csail.mit.edu)
121 points by csom on March 24, 2015 | 38 comments



Patent pending? The approach should be trivially obvious to anyone familiar with the basics.

What happened to the MIT License and liberating knowledge? Does MIT not do that anymore?


"It is a rare mind indeed that can render the hitherto non-existent blindingly obvious. The cry 'I could have thought of that' is a very popular and misleading one, for the fact is that they didn't, and a very significant and revealing fact it is too."


It is a very good idea, but there's another problem at hand: it's not novel. I remember hearing about this many years ago, though that was using vibrations in windows to recover the sound inside a building. IMO there's nothing to patent here. Unless this is the same group, it's just appropriating someone else's work.

Besides, random people do think of cool stuff all the time. They just don't normally patent it or start a business based on it. I thought of real-time music streaming to phones as a subscription service way before Spotify was a thing, but there wasn't much that 15-year-old me could do about it. To this day I still have no idea what I would do with a similarly good idea if I got one again.


It was a plot point in the movie Eagle Eye, which came out in 2008. Maybe there's something more specific in the patent though.

And I don't know if there was a real service for this, but the idea of music streaming to phones is pretty old. Peter Schickele used it on his parody album Two Pianos Are Better Than One, which came out in 1994. It was called "Inter-Ear TelecommuniCulturePhone - Trademark!"


>vibrations in windows to get the sound inside a building

That is a totally unrelated problem. Recovering sound from a series of still photos is a completely different issue from recovering sound from vibrations. Actually executing it is more different still, as you run into all the implementation issues and physical bugaboos. When someone doesn't just have some vague notion, but actually implements a wholly new technique to do something previously impossible, whether it was imagined by others or not, is a patent really that absurd?


The key difference is that a laser microphone is measuring the actual vibrations caused by the sounds as they're being produced. This is using silent video and knowledge about the acoustic properties of the depicted objects and environment to simulate/recreate the sound that was not captured. You can't point a laser microphone at the past.


I remember the same thing being demonstrated on a mass-market network in the late '90s / early '00s. It was probably the Discovery Channel, or similar, and likely a 'spy gadget' type show.

I'd guess someone in the military or intelligence community implemented initial prototypes not long after lasers became available.


I can only work on one thing at a time. That I didn't implement this only really means that I prioritized implementing something else over it.

Do you believe that implementation has no time-cost?

Also, I don't think your appropriated quote addresses the implications of the MIT License, its history and associated institution, or the liberation of knowledge.


Not to put too fine a point on it, but could you give me an example of an invention throughout history that you would not class as obvious? I'm wondering what your standards are.


You can't patent an idea. The patent is for the method of extracting sound from tiny vibrations in a video. The method is novel and non-obvious, and follows from their work in motion magnification.



I'll partially copy two interesting comments from that thread:

> [A video at] 60 frames per second only allowed them to identify the speaker and the number of people in the room.

> The demo in the video is based on a high-speed (1000+fps) recording by a special camera, not on 'normal' video.


The first set of demos in the video, anyway. At the end they show the results of a special technique that uses 60fps video.


A standard mobile phone today can do 240 fps. We're only a few years away from 1000+ fps in everyday devices.


Really? Isn't speed limited by the amount of light that can be received in a small lens like on a phone?


>Isn't speed limited by the amount of light that can be received in a small lens like on a phone?

It's also limited by sensor noise and efficiency.

Presumably, we'll push all three of these limits over the next few years.


The trick is to add the signal of many pixels together, forming a larger effective sample. The output video is very low-resolution and has a low dynamic range after noise removal, but can get hundreds of FPS in decent lighting.
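The pixel-summing trick above can be sketched in a few lines (a toy illustration with made-up numbers, not the paper's actual pipeline): average each N×N patch into one "super-pixel", trading spatial resolution for noise.

```python
import numpy as np

def bin_pixels(frame, block=8):
    """Average each block x block patch into one 'super-pixel'.

    Averaging block**2 noisy pixels cuts the noise standard deviation
    by roughly a factor of block, at the cost of spatial resolution.
    """
    h, w = frame.shape
    h, w = h - h % block, w - w % block            # crop to a multiple of block
    patches = frame[:h, :w].reshape(h // block, block, w // block, block)
    return patches.mean(axis=(1, 3))

# Toy example: a 64x64 frame that is a faint constant signal plus noise.
rng = np.random.default_rng(0)
frame = 0.1 + rng.normal(0.0, 1.0, size=(64, 64))
binned = bin_pixels(frame, block=8)
print(binned.shape)                 # (8, 8)
print(frame.std() > binned.std())   # True: binning suppressed the noise
```

The real method presumably does something far more careful per pixel, but this shows why low resolution plus decent lighting can still yield a usable high-rate signal.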


Well that's pretty scary. I can see one application of this already: casinos in the US are legally prohibited from recording audio on their floors, but have perfectly positioned cameras everywhere. Beyond that, I'm guessing every spy agency on earth will be buying solutions based on this.

It would be interesting to know what the genesis of this project was - for example if the NSA or CIA was involved in suggesting to a professor that MIT take a look at this area. This is a very mission-specific technology.


This same researcher has been working in this area for a while, and this is not an unexpected extension of things he's developed previously, like taking standard video and determining the pulse of any people present (search for 'Eulerian video magnification'). He's been working on what you can recover from the variations in video which others have written off as 'noise' for a while. I wouldn't be surprised if he just stumbled upon the effect sound had while doing other analysis.

I haven't looked, but if the NSA was involved with this, it's usually easy to find out: just look for any grants involved and look up their source. They don't usually hide their involvement in funding research. Pretty much any study done in the past 10 years about manipulating social graphs was funded by the NSA.


> This is a very mission-specific technology.

No, it's really not; there are a ton of engineering and scientific uses where it may be useful to measure acoustic emissions off a vibrating surface, but where it may be infeasible or too resource-intensive to attach or deploy conventional acoustic transducers. For example, characterizing sound sources in moving vehicles like trains (which currently require microphone arrays and a lot of post-processing) or wind turbines (which require expensive sound intensity measurement equipment).

If the cost of high-speed cameras comes down, this could be a valuable alternative.


Spy agencies already have a laser they can put on a window to recover audio from the vibration and other similar devices. This isn't really that different. It seems behind current spook hardware, honestly.


But this is passive, which would undoubtedly have advantages for spies.


could you do it the other way around?

how accurately can we recreate a 3d space from sound? what assumptions/information would you need to make it more accurate?


Yes. Look at the setup/calibration involved with Soundbar type audio systems.


awesome, thanks for the lead!

i will look it up, i am mostly curious about its resolution. for instance, my unqualified hunch is that the algorithm couldn't detect the size of the dog in my room based on a microphone recording.

i guess the more calibration involved the easier the problem becomes. but that is no fun. :)


William T. Freeman is an outstanding vision researcher. His list of publications (http://billf.mit.edu/publications/all) is full of these simple, clever solutions for problems slightly outside the mainstream. I really admire his work.


Rubinstein was also behind the work on using pixel intensity variations to visualize subtle changes; they used it to extract heart rates. I'm guessing similar methods are used here, but now to recover vibrations induced by sound. Interesting work.


Someone give this to the writers of crappy TV shows that use the "enhance that photo" line; I'm sure they'll flip out at the whole new storylines this creates.

CSI has forever been changed. Bet it shows up next season on multiple TV crime shows.


This is an interesting extension of the ideas from an earlier paper by the same authors: http://people.csail.mit.edu/mrub/vidmag/


They're making progress. At 5000 FPS, it's not surprising that they can recover audio. But from 60 FPS, that's striking. That works because some imagers don't take the whole frame at once.


Almost all consumer cameras have rolling shutters. In fact, for this experiment, the crappier the camera the better, as rolling shutter is less pronounced in a lot of higher-end cameras. I'd suspect they might even get better sound recovery with a GoPro than with a DSLR.
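The rolling-shutter point is easy to quantify with hypothetical numbers (assuming the row readout spans the whole frame period, which real sensors only approximate): each row is exposed at a slightly different time, so rows act as extra temporal samples.

```python
# Hypothetical sensor: 60 fps, 720 rows read out sequentially.
fps = 60
rows = 720

frame_period = 1.0 / fps             # ~16.7 ms between frame starts
row_readout = frame_period / rows    # time between successive row exposures

# Treating each row as its own temporal sample, the effective 1-D
# sampling rate along the readout direction is:
effective_rate = fps * rows
print(effective_rate)   # 43200 samples/s
```

That's why 60 fps footage can carry far more than 30 Hz of acoustic bandwidth.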


And I suspect that you could combine video feeds from several lower-speed cameras to get an effective 1000 fps.


Yes, but you have to precisely stagger the start of the exposures of each camera, and that's hard to do on consumer hardware.


Um, why would you have to precisely stagger?

I suspect that you have enough information to actually align the videos after the fact.

10 videos at 250 fps would probably give a sufficiently even distribution of samples.


Wouldn't it be hard to interleave the frames of these videos given different starting times and angles (ignoring camera movement)? It would be easy if the videos had synchronized timestamps, but that might not always be the case.


Any in-frame motion probably allows you to align the streams frame-to-frame after the fact. This is existing technology, and gives you timestamp-to-frame alignment.

If you are reconstructing sound, you can now fuzz the time alignments to give the maximum signal for the maximum time (non-correlation will damp to random noise quickly). This allows you to pairwise reconstruct time alignments.

At that point, you put them all together and run your detailed analysis.

Now, I didn't say this was EASY. :) Or cheap. Or real-time.

Just that it is possible.
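The merging step above can be illustrated with a toy simulation (all numbers invented, and the alignment is pretended solved; the real version would search alignments by maximizing recovered signal, as described): several low-rate streams with staggered start times, merged by timestamp, behave like one stream at roughly k times the rate.

```python
import numpy as np

rate = 250.0       # per-camera frame rate (Hz)
k = 4              # number of cameras
duration = 1.0     # seconds
rng = np.random.default_rng(1)

# Each camera starts at a random sub-frame offset (unknown in practice;
# here we pretend alignment has already been solved).
offsets = rng.uniform(0.0, 1.0 / rate, size=k)

def sample(offset):
    t = offset + np.arange(int(duration * rate)) / rate
    return t, np.sin(2 * np.pi * 40.0 * t)     # a 40 Hz test tone

streams = [sample(o) for o in offsets]

# Merge by timestamp: the union of samples is an irregular stream at
# roughly k * rate samples per second.
t_all = np.concatenate([t for t, _ in streams])
x_all = np.concatenate([x for _, x in streams])
order = np.argsort(t_all)
t_all, x_all = t_all[order], x_all[order]

print(len(t_all) / duration)   # 1000.0 effective samples/s
```

With unknown offsets, you'd fuzz candidate alignments and keep whichever maximizes signal energy, since misalignment smears everything toward noise.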


I wonder how this would perform with the iPhone 6's (or others') high-speed camera.


Slowly, the creative imaginings are coming to life. Eagle Eye...



