I'm sorry, but is the Deep Learning hype strong enough to warp people's sensory perception? Every sample on this page sounds terrible IMHO, and is pretty much what you would get if you spent 10 minutes implementing the most naive spectrogram resynthesis you could think of. Granted, there is great promise in finding the "manifold of music", which seems to be the goal here, but what they show is nowhere near that promise.
Agreed. The texture is nice - I enjoy a lo-fi sound - but the fun of sound engineering is building your own signal paths to modulate or destroy sound interactively. The more abstracted the sound generation method, the more of a toy and the less of a tool it is, because the rising non-linearities make it increasingly difficult to pursue a specific objective. This has always been a limiting factor for FM, where undirected noodling can certainly yield interesting results, but not very controllable ones beyond 3 or 4 operators.
I do think it's interesting and valuable work. But it's worth bearing in mind that there's no shortage of great resynthesis tools already, and that musicians are besieged with offers from technologists for Sounds! That! Have! Never! Been! Possible! Before! While you can always rely on Jordan Rudess to provide a celebrity endorsement to the keyboard collector crowd, most hobbyist musicians eventually get over chasing novelty and end up reducing their equipment load to a smaller number of really well-engineered devices or software tools that they really like and get to know inside out.
I've read the articles about NSynth with interest, but I can't figure out why they're using 8-bit audio and low sample rates. Surely it's not so much more computationally intensive that they can't tinker at 8 bits and then do a render at a higher resolution once they've settled on some parameters they like.
Possibly the same reason all the Style Transfer implementations use very low resolution images? All the neural net applications I've seen seem to have problems with high resolutions in any form.
The 8-bit is actually reasonable: they have one output per possible sample value, so 16-bit would mean 65,536 outputs... They could probably do a secondary step that adds the less significant bits. The low sample rate is probably because the architecture was originally used for speech, and a lot of speech databases are recorded at 16 kHz.
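To make the "one output per value" point concrete: WaveNet-style models don't use raw linear 8-bit PCM but mu-law companding, which squeezes floats in [-1, 1] into 256 discrete classes with finer resolution near zero, and the network predicts a 256-way softmax over those classes. A minimal NumPy sketch of that companding step (the constants follow the standard mu-law formula; this is an illustration, not the NSynth code):

```python
import numpy as np

MU = 255  # mu-law parameter giving 256 levels (8-bit)

def mulaw_encode(x, mu=MU):
    """Map float samples in [-1, 1] to integer classes 0..mu."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int32)  # round to class index

def mulaw_decode(c, mu=MU):
    """Invert the companding back to floats in [-1, 1]."""
    y = 2 * (c.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

samples = np.linspace(-1, 1, 11)
codes = mulaw_encode(samples)       # integers in 0..255: the softmax classes
recon = mulaw_decode(codes)
assert codes.min() >= 0 and codes.max() <= 255
assert np.max(np.abs(recon - samples)) < 0.02  # coarse, but perceptually OK
```

The same scheme at 16 bits would need a 65,536-way softmax per sample, which is why the commenter's suggested "add the less significant bits in a second step" is a plausible refinement.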
Yeah - granted, there are neural resynthesis packages that do function, but they are just waaay too slow for realtime audio production at the moment (and probably will be for a long time, now that Moore's law is dead).
I feel stupid and don't get what this is all about. So there is something that synthesizes sounds when you feed it audio files? I don't get what is happening here. I tried semi-hard to understand, but I figure someone can give the big picture that I think I'm missing.
Could this approach be used for media compression? I've wondered how compressible a popular-music track could be if you had a sufficiently rich language to describe it. This seems like a method to answer that question.
Or sheet music. It always amazed me that humans came up with any solution at all to "here's a piece of paper, tell me what your song sounds like" to say nothing of one that actually works to some degree.
I've always wondered how much classical music sounds the way it does because sheet music is the way it is.
An example of this is Chinese guqin tablature. It can be centuries old and includes a lot of detail on where to place fingers and how to strike the strings, which can give you hints about pitch and timbre when combined with knowing the tuning, strings, etc. But the tablature has almost nothing to say about the LENGTH of each note, so rhythm has to be inferred by the performer from what they know about the culture.
Program change + general MIDI instrument set is implementation-dependent but was pretty common in the 90s, and encodes timbre in an extremely limited way.
Now of course nobody outside of fringe artists really uses it.
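As an illustration of just how limited that timbre encoding is: in MIDI, the entire timbre of a part is a single Program Change message selecting one of 128 General MIDI presets. The raw message is two bytes. A tiny sketch (instrument numbers follow the GM Level 1 sound set; this builds the bytes by hand rather than using a MIDI library):

```python
def program_change(channel: int, program: int) -> bytes:
    """Build the raw two-byte MIDI Program Change message.

    Status byte is 0xC0 plus the channel (0-15); the data byte
    is the GM program number (0-127).
    """
    assert 0 <= channel <= 15 and 0 <= program <= 127
    return bytes([0xC0 | channel, program])

# GM program 57 ("Trumpet") is 56 when zero-indexed on the wire.
msg = program_change(channel=0, program=56)
assert msg == b'\xc0\x38'
```

Two bytes to say "this part sounds like a trumpet" - which is exactly why it's both a remarkably compact timbre code and a hopelessly coarse one.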
It calls to mind the old joke about how someone wrote a compressor that turns Microsoft Word from a 20MB file into a 1 byte file, except the compressor is 20MB. (Adjust the file name and size until it's funny. When I first heard it, 20MB was an extraordinarily large size.)
But in this case you could imagine the right balance where it does end up with a significant savings.
Would anything approaching typical bitrates used in audio codecs imply an enormous dictionary? Also I wonder if any statement could be made about the learnability of codecs, e.g., are Fourier transforms something deep networks can arrive at?
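On the Fourier-transform question: the DFT is a linear map, so a single dense layer can represent it exactly - whether gradient descent would actually *find* those weights from data is the open part of the question. A NumPy sketch showing that one matrix multiply reproduces the FFT (this is just the textbook DFT matrix, not a trained network):

```python
import numpy as np

N = 64
n = np.arange(N)
# DFT matrix: W[k, t] = exp(-2*pi*i*k*t / N), i.e. fixed "weights"
# that a complex-valued (or paired real/imag) linear layer could hold.
W = np.exp(-2j * np.pi * np.outer(n, n) / N)

x = np.random.default_rng(0).standard_normal(N)
assert np.allclose(W @ x, np.fft.fft(x))  # one matmul equals the FFT
```

So the representational question is settled trivially; the learnability question - and whether a network would prefer some other basis entirely - is not.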
I'm just starting to learn TensorFlow from a developer, non-data-scientist point of view. This is great. From a layman's view, it seems it would need a training pass dedicated to eliminating the noise and static.
https://github.com/wildsparx/synthem80
Unlike NSynth, synthem80 is directed at a specific and humble goal: making early-80s-style arcade sounds. It uses a mini-language to control an engine similar to the one in Pac-Man.
For instance, the sound when Pacman eats a ghost: