Hacker News
An AI system for editing music in videos (news.mit.edu)
102 points by benryon on July 7, 2018 | 11 comments



Link to the Official Page with paper and demo: http://sound-of-pixels.csail.mit.edu/


Does anyone know of anything similar that can extract the human voice from a video that has other noises, including fans, people coughing, an electric generator, etc.?


The human ear.

From what I can hear, most of what's being done is selective filtering, leaving behind some artifacts and much less fidelity. Similar feats by human engineers have been around for decades: analog filters or Fourier analysis can be used to split frequency segments into channels (e.g. vocoding), then keep only the useful channels.
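A minimal sketch of that channel idea in Python, using an STFT to split the signal into frequency channels and keeping only a rough voice band. The 300-3400 Hz range, the file names, and the parameters are illustrative assumptions, not anything from the paper:

    # Split audio into frequency channels and keep only a rough "voice" band.
    # Band limits, file names, and nperseg are illustrative guesses.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft, istft

    rate, audio = wavfile.read("mixture.wav")        # hypothetical input file
    if audio.ndim > 1:                               # mix down to mono
        audio = audio.mean(axis=1)

    f, t, spec = stft(audio, fs=rate, nperseg=1024)  # frequency channels over time

    keep = (f >= 300) & (f <= 3400)                  # keep a rough voice band
    spec[~keep, :] = 0                               # zero out the other channels

    _, filtered = istft(spec, fs=rate, nperseg=1024) # back to a waveform
    wavfile.write("filtered.wav", rate, filtered.astype(np.int16))

This is exactly the kind of static filtering the comment describes: it helps when the noise lives in different bands than the voice, but anything sharing the voice band (coughs, other speakers) comes straight through.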


I'd imagine one could apply this network to a video of someone speaking.


Yup, I believe they announced this at the Google I/O keynote this year, actually -- they mentioned that while audio alone might not be enough, looking at mouth movements can give the AI enough information to work out what is being said by whom.

https://www.youtube.com/watch?v=ogfYd705cRs&t=7m0s


Reflecting on your last phrase: I was watching esports the other day; the player was talking with very loose mouth movements, and I wondered if he was avoiding being lip-read.

I imagine a lip-reading AI would get wide usage. Managers will be wearing face masks to hide their lips. (It's probably doable now to listen in with a spy mic, but that's obvious in a way that using a normal video camera isn't.)


Julia Probst is a German deaf blogger [1] who is famous [2] for lip-reading the tactical commands soccer coaches give their teams during a match and posting them on Twitter. She has even been hired by TV sports channels to provide the commentators with inside information.

[1] https://twitter.com/einaugenschmaus [2] http://www.sueddeutsche.de/sport/lippenlesen-im-fussball-die...



This might be unrelated, but I want to ask this: when on a phone call, the sounds of peripheral objects (traffic, horns, fans, keyboard clicks) often seem more obvious than the voice of the person. Do you notice the same? Is there a scientific explanation for this?


Many phone systems use some form of automatic ducking: the system tries to identify when someone is speaking and raises their volume while lowering the volume of everyone else. The objective is to increase overall intelligibility, but it's not perfect. A sudden change in volume on one line, e.g. caused by a car horn or keyboard typing, can trick it into thinking someone else has started speaking.
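A toy sketch of that ducking logic in Python, just to make the failure mode concrete. The frame size, floor gain, and "loudest line is the speaker" heuristic are my own assumptions, not how any real phone system does it:

    # Naive ducking: boost whichever line has the most short-term energy,
    # attenuate everyone else. A car horn or keyboard burst on one line makes
    # that line "loudest", so everyone else gets ducked -- the failure mode
    # described above. Parameters are illustrative guesses.
    import numpy as np

    def duck(lines, frame=160, floor_gain=0.2):
        """lines: 2D array (n_lines, n_samples). Returns the ducked mix."""
        n_lines, n_samples = lines.shape
        mix = np.zeros(n_samples)
        for start in range(0, n_samples, frame):
            chunk = lines[:, start:start + frame]
            energy = (chunk ** 2).mean(axis=1)       # short-term energy per line
            loudest = energy.argmax()                 # guess who is "speaking"
            gains = np.full(n_lines, floor_gain)      # duck everyone else...
            gains[loudest] = 1.0                      # ...keep the loudest line at full volume
            mix[start:start + frame] = (gains[:, None] * chunk).sum(axis=0)
        return mix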


I am so excited to see this. I've been waiting for the day we get a tool that can extract individual instrumental parts from music. This isn't there yet, but it's a step in the right direction. If we get there, music copyright will have another good fight on its hands.



