Noob question: It intuitively seems to me that by feeding in the raw text of a structured format (such as the music notation in the article) we're making the algorithm unnecessarily learn the syntax in addition to the interesting stuff, i.e. the high-level musical patterns. What kind of results would you expect from running the same experiment, but with an input encoding more specialized to the problem domain? Would the performance benefits be significant?
As far as music notation goes, this is very unstructured notation--aside from the initial metadata, the data boils down to hitting one of 7 notes of the scale across 2 octaves, with an optional length parameter. Reading the full documentation indicates there's more complexity, but the RNN appears to have used none of it.
That the result is largely consonant appears to be primarily due to the structure of the format: if you don't know that you can create accidentals off of a normal scale, play chords, create multiple competing melodies, etc., then you can't use them to create dissonance. This means that a pure random string largely amounts to hitting random white keys on a piano, which never really sounds grating (although it doesn't sound good either, just bland)... which also happens to describe the samples that have been selected for us!
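To make that concrete, here's a toy sketch (mine, not from the article) of what a "pure random string" in this restricted format amounts to: uniform choices over plain scale letters across two octaves plus an optional length, i.e. random white keys.

    import random

    # Hypothetical sketch: sample uniformly from the note letters the format
    # exposes (7 scale degrees, lower/upper octave) plus an optional length
    # digit. No accidentals, no chords, no second voice, so the worst case
    # is bland rather than dissonant.
    NOTES = list("CDEFGAB") + list("cdefgab")   # two octaves of the plain scale
    LENGTHS = ["", "2", "3"]                    # optional duration multiplier

    def random_tune(n_notes=32):
        return " ".join(random.choice(NOTES) + random.choice(LENGTHS)
                        for _ in range(n_notes))

    print(random_tune())  # e.g. "c2 F a B3 d ..."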
We want the model to be better, right? I just want to express the input data using as few bytes as possible. Do you mean that if we were able to somehow express in 1 byte what used to take 100, you wouldn't expect it to noticeably improve the model?
Not appreciably. Neural Networks' job (in some configurations) is to approximate their input in as few bytes as possible. There's a vague equivalence between machine learning and lossy compression.
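As a rough illustration of that equivalence (my own toy example, not from the article): a model that predicts the next character well needs few bits per character, which is exactly the game a compressor plays. Here's a sketch comparing a weak unigram character model against gzip on a stand-in ABC-ish corpus:

    import gzip
    import math
    from collections import Counter

    # stand-in corpus; any text works
    text = ("X:1\nT:Example\nK:C\nCDEF GABc | cBAG FEDC |\n" * 200).encode()

    # bits/char under a weak unigram character model (i.e. what an ideal
    # entropy coder driven by that model would need)
    counts = Counter(text)
    total = sum(counts.values())
    bpc_unigram = -sum(c / total * math.log2(c / total) for c in counts.values())

    # bits/char gzip actually achieves by exploiting longer-range structure
    bpc_gzip = 8 * len(gzip.compress(text)) / len(text)

    print(f"unigram: {bpc_unigram:.2f} bits/char, gzip: {bpc_gzip:.2f} bits/char")

A trained char-RNN's per-character cross-entropy slots into the same bits-per-character comparison; lower loss means it has, in effect, compressed the training corpus better.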
I wondered the same thing. If the encoding was completely dense such that all possible outputs were valid, would the network express more complex compositional behavior by avoiding the need for binary feedback about syntax validity?
I think the main benefits of the ABC syntax are that it is already a very dense encoding and has an established library of music to train on.
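For what it's worth, here's a minimal sketch (names and numbers made up for illustration) of what a fully dense encoding could look like: every integer in the vocabulary decodes to a valid note, so there is literally no syntax for the network to get wrong.

    # every (pitch, duration) pair maps to one integer, and every integer in
    # [0, 56) maps back to a valid note
    PITCHES = list("CDEFGABcdefgab")       # 14 scale tones across two octaves
    DURATIONS = [1, 2, 3, 4]               # quarter-note multiples

    def encode(pitch, dur):
        return PITCHES.index(pitch) * len(DURATIONS) + DURATIONS.index(dur)

    def decode(token):
        return PITCHES[token // len(DURATIONS)], DURATIONS[token % len(DURATIONS)]

    # round-trip check: the encoding is dense, no invalid tokens exist
    assert all(encode(*decode(t)) == t
               for t in range(len(PITCHES) * len(DURATIONS)))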
A while back I created a program which uses genetic algorithms to generate melodies. I used the generated melodies as inspiration for music composition. The idea (never implemented, though) was to add a fitness function based on a neural network trained on other melodies or on user input. More information can be found here: http://jcraane.blogspot.nl/2009/06/melody-composition-using-...
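For anyone curious what that looks like in practice, here's a stripped-down sketch of the genetic-algorithm idea (details are my own guesses, not taken from the linked program); the hand-written fitness function is exactly where a trained neural-network critic could be plugged in instead.

    import random

    # melodies are lists of MIDI pitches drawn from one octave of C major
    SCALE = [60, 62, 64, 65, 67, 69, 71, 72]

    def random_melody(length=16):
        return [random.choice(SCALE) for _ in range(length)]

    def fitness(melody):
        # reward stepwise motion, penalise big leaps and repeated notes;
        # a neural-network critic would replace this hand-written score
        score = 0
        for a, b in zip(melody, melody[1:]):
            leap = abs(a - b)
            score += 2 if 1 <= leap <= 2 else (-1 if leap > 7 else 0)
            score -= 1 if leap == 0 else 0
        return score

    def evolve(generations=200, pop_size=50):
        pop = [random_melody() for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            parents = pop[: pop_size // 2]          # keep the fitter half
            children = []
            while len(children) < pop_size - len(parents):
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, len(a))
                child = a[:cut] + b[cut:]           # one-point crossover
                if random.random() < 0.3:           # occasional mutation
                    child[random.randrange(len(child))] = random.choice(SCALE)
                children.append(child)
            pop = parents + children
        return max(pop, key=fitness)

    print(evolve())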
Another (hilarious) tool which demonstrates the difference in quality of lightly-trained RNNs and strongly-trained RNNs is RoboRosewater, which generates Magic: the Gathering cards using networks of varying quality/sanity, indicated by the card art: https://twitter.com/RoboRosewater
The authors of these music generators should submit some of the compositions to online music libraries, to song competitions, etc., and see if they get accepted! À la what happened back in the day with peer-reviewed journals and: http://www.elsewhere.org/journal/pomo/
I am a neural network noob and only know the basic feedforward network.
So the training set is just text files containing songs? How does it test if the output is correct or not? If I understand correctly the goal here was just to produce outputs in the correct format. If one wanted to train for quality as well would one need to grade every output the network produces by hand?
In the training phase the network learns to predict the next note from the past few beats. As training data it uses a bunch of text-formatted MIDI files. In generative mode, they just feed the generated notes back into the network at each step. The same approach can be used to generate DeepDrumpf[1] or any kind of time series.
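Roughly, the generative loop looks like this (a sketch where a stub stands in for the trained network's forward pass; the stub just returns a uniform distribution, so the actual output quality comes entirely from training):

    import numpy as np

    VOCAB = sorted(set("ABCDEFGabcdefg|:23 \n"))
    char_to_ix = {c: i for i, c in enumerate(VOCAB)}
    ix_to_char = {i: c for c, i in char_to_ix.items()}

    def rnn_step(x_ix, hidden):
        # a real char-RNN would update `hidden` and emit next-char probabilities
        probs = np.full(len(VOCAB), 1.0 / len(VOCAB))
        return probs, hidden

    def sample(seed="C", length=200, temperature=1.0):
        hidden = None
        out = list(seed)
        x = char_to_ix[seed[-1]]
        for _ in range(length):
            probs, hidden = rnn_step(x, hidden)
            p = probs ** (1.0 / temperature)   # temperature reshapes the distribution
            p /= p.sum()
            x = np.random.choice(len(VOCAB), p=p)   # sample next char
            out.append(ix_to_char[x])               # ...and feed it back in
        return "".join(out)

    print(sample())

Training is the mirror image of this loop: at each step the loss compares the predicted distribution against the actual next character in the corpus, so no hand-grading of outputs is needed just to get the format right.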
I believe Karpathy's char-rnn is from 2015 (that is at least the year on the 100-line gist and the "unreasonable effectiveness of ..." blog post [0]), but char-level RNN language models date back to at least 2011 with [1].
So I don't think we should call him the inventor, though he definitely popularised it with his great writing and examples.
[0] http://karpathy.github.io/2015/05/21/rnn-effectiveness/
[1] Sutskever, Ilya, James Martens, and Geoffrey E. Hinton. "Generating text with recurrent neural networks." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
These things generate tunes which sound OK for a few seconds, but after tens of seconds, you realize there's no higher level structure at all. It's just random.
The music came out quite fun-sounding. I could almost imagine hearing at least some passages of it in video games. Perhaps some old-school Zelda/JRPG game that would suit the folky quality of the music.
The bass line was generally quite simplistic; I wonder what would happen if you codified Gradus ad Parnassum and taught the RNN counterpoint [0].
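By "codified" I mean something like turning the rules into checkable code. A toy example (hypothetical, not from the article): flagging parallel perfect fifths/octaves between two voices given as MIDI pitches.

    def parallel_perfects(melody, bass):
        # return indices where both voices move and land on the same
        # perfect interval (octave/unison = 0, fifth = 7 semitones mod 12)
        bad = []
        for i in range(1, min(len(melody), len(bass))):
            prev = (melody[i - 1] - bass[i - 1]) % 12
            curr = (melody[i] - bass[i]) % 12
            moved = melody[i] != melody[i - 1] and bass[i] != bass[i - 1]
            if moved and prev == curr and curr in (0, 7):
                bad.append(i)
        return bad

    print(parallel_perfects([72, 74, 76], [60, 62, 64]))  # parallel octaves -> [1, 2]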
Fux wouldn't apply here because the bass/chord accompaniment shapes are idiomatic, and Fuxian counterpoint wouldn't be.
In fact this kind of toy ML can never produce competent music. The best it can do is produce short workable snatches that sound like cut and paste snippets of the training source - before losing the plot in the next bar or two.
Training an RNN on a huge set of tunes and expecting it to produce examples of equivalent musicality is a fundamentally unworkable idea.
I'd suggest that anyone who doesn't see why this must be true doesn't understand ML well enough to know when it can and can't be used effectively.
It's worth asking in what other domains are trivial RNNs being misapplied to produce trivially poor models.
It's one thing to make bad music. It's another to, say, run a trading strategy, or make marketing decisions based on oversimplified ML models that produce misleading results because they're not sophisticated enough to recognise all the critical structures in the data set.
Is there a better music database to work from for music generation? I'm surprised there isn't a massive db of 19th century sheet music or player piano rolls somewhere.
MuseScore has a user database with lots of music (no song count that I could find quickly, but I'd guess tens to hundreds of thousands) that ought to be convertible to appropriate formats.
IMSLP (<http://imslp.org/wiki/Main_Page>) has about 110,000 works uploaded. However, these are mostly PDFs (some of them are scans), which would be very hard to extract useful data from.
Such data may include multiple voices, which makes it harder for a neural network to learn a pleasant-sounding song.
The patterns that RNNs tease out of training data seem to transcend mimicry. An artist may compose a song that sounds just like another song, but it is not simply mimicry, it is influence. It is a learned set of patterns that are composed somewhat randomly. The human mind has this illusion that what it hears is subjectively good, but that intuition is built out of experiences just like the RNN.