Hacker News
An AI system for editing music in videos (news.mit.edu)
102 points by benryon on July 7, 2018 | 11 comments



Link to the Official Page with paper and demo: http://sound-of-pixels.csail.mit.edu/


Does anyone know of anything similar that can extract the human voice from a video that has other noises, including fans, people coughing, an electric generator, etc.?


The human ear.

From what I can hear, most of what's being done is selective filtering, leaving behind some artifacts and much less fidelity. Similar feats by human engineers have been around for decades: analog filters or Fourier analysis can be used to split frequency segments into channels (e.g. vocoding), then keep only the useful channels.
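A minimal sketch of that channel idea in Python, using an STFT to split the signal into frequency channels and keeping only a rough voice band. The 300-3400 Hz range, the file names, and the parameters are illustrative assumptions, not anything from the paper:

    # Split audio into frequency channels and keep only a rough "voice" band.
    # Band limits, file names, and nperseg are illustrative guesses.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft, istft

    rate, audio = wavfile.read("mixture.wav")        # hypothetical input file
    if audio.ndim > 1:                               # mix down to mono
        audio = audio.mean(axis=1)

    f, t, spec = stft(audio, fs=rate, nperseg=1024)  # frequency channels over time

    keep = (f >= 300) & (f <= 3400)                  # keep a rough voice band
    spec[~keep, :] = 0                               # zero out the other channels

    _, filtered = istft(spec, fs=rate, nperseg=1024) # back to a waveform
    wavfile.write("filtered.wav", rate, filtered.astype(np.int16))

This is exactly the kind of static filtering the comment describes: it helps when the noise lives in different bands than the voice, but anything sharing the voice band (coughs, other speakers) comes straight through.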


I'd imagine one could apply this network to a video of someone speaking.


Yup, I believe they announced this at the Google I/O keynote this year, actually -- they mentioned that while audio alone might not be enough, looking at mouth movements can give the AI enough information to work out what is being said by whom.

https://www.youtube.com/watch?v=ogfYd705cRs&t=7m0s


Reflecting on your last phrase: I was watching esports the other day; the player was talking with very loose mouth movements, and I wondered if he was avoiding being lip-read.

I imagine a lip-reading AI would get wide usage. Managers will be wearing face masks to hide their lips. (It's probably doable now to listen in with a spy mic, but that's obvious in a way that using a normal video camera isn't.)


Julia Probst is a German deaf blogger [1] who is famous [2] for lip-reading the tactical commands soccer coaches give their teams during a match and posting them on Twitter. She has even been hired by TV sports channels to provide the commentators with inside information.

[1] https://twitter.com/einaugenschmaus [2] http://www.sueddeutsche.de/sport/lippenlesen-im-fussball-die...



This might be unrelated, but I want to ask this: when on a phone call, the sounds of peripheral objects (traffic, horns, fans, keyboard clicks) often seem more obvious than the voice of the person. Do you notice the same? Is there a scientific explanation for this?


Many phone systems use some form of automatic ducking: the system tries to identify when someone is speaking and raises their volume while lowering the volume of everyone else. The objective is to increase overall intelligibility, but it's not perfect. A sudden change in volume on one line, e.g. caused by a car horn or keyboard typing, can trick it into thinking someone else has started speaking.
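A toy sketch of that ducking logic in Python, just to make the failure mode concrete. The frame size, floor gain, and "loudest line is the speaker" heuristic are my own assumptions, not how any real phone system does it:

    # Naive ducking: boost whichever line has the most short-term energy,
    # attenuate everyone else. A car horn or keyboard burst on one line makes
    # that line "loudest", so everyone else gets ducked -- the failure mode
    # described above. Parameters are illustrative guesses.
    import numpy as np

    def duck(lines, frame=160, floor_gain=0.2):
        """lines: 2D array (n_lines, n_samples). Returns the ducked mix."""
        n_lines, n_samples = lines.shape
        mix = np.zeros(n_samples)
        for start in range(0, n_samples, frame):
            chunk = lines[:, start:start + frame]
            energy = (chunk ** 2).mean(axis=1)       # short-term energy per line
            loudest = energy.argmax()                 # guess who is "speaking"
            gains = np.full(n_lines, floor_gain)      # duck everyone else...
            gains[loudest] = 1.0                      # ...keep the loudest line at full volume
            mix[start:start + frame] = (gains[:, None] * chunk).sum(axis=0)
        return mix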


I am so excited to see this. I've been waiting for the day we get a tool that can extract individual instrumental parts from music. This isn't there yet, but it's a step in the right direction. If we get there, music copyright will have another good fight on its hands.



