Very cool! One minor nitpick -- the author mentions that this is 'completely unsupervised'. It's true that the author didn't need to manually classify the data, but someone did.
So, I believe that this is actually supervised learning, as the author is training a classifier on preexisting labels (the genres).
I believe that unsupervised learning would not make use of a target variable at all. If the network architecture terminated at the fully connected layer, and then propagated that layer backwards to reconstruct the input (something like Contrastive Divergence), that would be an unsupervised method.
You're correct of course. But it's cool that you can learn a useful embedding (in this case into a 128-dimensional space) with only relatively few (in this case 9) class labels.
In my opinion, the results are not quite as exciting as they might seem at first glance. The hip-hop and minimal house classifications perform almost randomly (a random classifier would have an accuracy of 50%). The claim of music genre subjectivity is not fully appropriate for the categories used in this work: the presented genres are quite distinct, and they have objective differences. Knowing only the BPM and rhythm structure of the tracks would be sufficient to classify most of the mentioned genres. Also, the article lacks a critical analysis of the results. The network may not have learned to analyze structural properties of the music; if that is true, then what is it classifying exactly? An averaged spectral envelope or spectral distribution? In that case the network will fail if you feed it a filtered piece of music. There is a nice paper on issues like these called “A Simple Method to Determine if a Music Information Retrieval System is a Horse”; you may want to check it out: https://www.researchgate.net/publication/265645782
I understand this is an educational project, but nevertheless it's published, hence open to criticism ;)
"The hip-hop and minimal house classification perform almost randomly (the random classifier would have accuracy of 50%). "
You are assuming that this is a series of binary classifiers. It is multiclass classification, so the base rate for nine classes is 11%.
I agree that the resulting application is rather primitive. Although it was interesting for me, as someone who has just learned the theory behind ML, to see how an ML application is built end to end.
I expect that real-world applications would encompass a lot of knowledge that you normally learn only after you've developed the first version of the app and started using it. I wonder if there are articles out there that share ordered and filtered information on that.
> It did a really good job classifying trance music while at the other end of the scale was hip hop / R&B with 61%, which is still almost 6 times better than randomly assigning a genre to the image. I suspect that there’s some crossover between hip hop, breakbeat and dancehall and that might have resulted in a lower classification accuracy.
The first step in analyzing this is to make a confusion matrix [1]. It would be nice if the article included one.
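For anyone who hasn't built one before, a minimal sketch with scikit-learn; the genres and predictions below are made up, just to show the call:

    # Toy confusion matrix with scikit-learn (invented labels and predictions).
    from sklearn.metrics import confusion_matrix

    genres = ["trance", "hip-hop", "minimal house"]   # illustrative subset of the nine genres
    y_true = ["trance", "hip-hop", "hip-hop", "minimal house", "trance"]
    y_pred = ["trance", "trance", "hip-hop", "minimal house", "trance"]

    cm = confusion_matrix(y_true, y_pred, labels=genres)
    print(cm)   # rows = true genre, columns = predicted genre

Reading down a column shows which true genres get confused into that prediction, which is exactly the crossover question raised in the article.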
This is interesting, but fairly easy to confuse. It would be especially interesting to see what results come up when you use modified "artistic" spectrograms like that of Windowlicker by Aphex Twin [1]. One thing I've learned from years of working with audio and images is that image representations of audio are horrible representations of it (other than for temporal changes).
It also doesn't help that music from every genre is becoming more homogenized as time goes by [1]. If you're comparing by similarity, then this is only going to get more difficult.
> image representations of audio are horrible representations of it
The spectrogram is just a series of FFTs taken over time; encoding it as a bitmap doesn't really change this, aside from precision issues.
Any other representation of the audio is derived from either the original time-domain signal or the FFT.
Indeed, humans can't reliably map raw waveforms or spectrograms to intuitive musical phenomena.
But a CNN should be able to derive meaningful features from these basic representations on its own.
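To make the "series of FFTs" point concrete, here is a minimal scipy sketch; the sine wave just stands in for a real track:

    # A spectrogram is windowed FFT magnitudes stacked over time.
    import numpy as np
    from scipy.signal import spectrogram

    fs = 22050                                  # sample rate (assumption)
    t = np.arange(0, 3.0, 1.0 / fs)
    x = np.sin(2 * np.pi * 440 * t)             # stand-in for a loaded audio signal

    f, times, Sxx = spectrogram(x, fs=fs, nperseg=1024, noverlap=512)
    image = 10 * np.log10(Sxx + 1e-10)          # dB scale; this 2D array is the "image" a CNN sees
    print(image.shape)                          # (frequency bins, time frames)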
I'm curious to hear more about why image representations of music are horrible. What are the problems or limitations? Is there a better way to perform a similar kind of "dimensionality reduction" on music?
The greatest value of a music recommendation engine, IMO, is cross-genre discovery.
The history of recording industry "Genres" has close ties to cultural segregation. Pandora's Music Genome approach is optimized to break the genre barrier.
It'd be interesting to see how many "Down tempo" songs shared characteristics with "R&B", for example. I think the Author's approach could still be applied.
This is a really cool project. The hardest part of DJing is knowing which songs have similar sonic profiles and would mix well together. I would love to see this put to use in personal music collections, or in a Traktor playlist, and be able to sort songs by their similarity.
Yeah, sorry, old-timer here. I loathe "genres" and strip them off my purchased music.
Is R.E.M. "Alternative", "Rock", "College"? Maybe you consider an album like "Reckoning" from R.E.M. "Rock" but then it includes a track like "Rockville" that is perhaps "Country"?
Genre makes sense for "Soundtrack" or perhaps "Classical"? But beyond that it's just mental gymnastics.
And given how fondness for music is qualitative, I've always been suspicious of any sort of algorithm that tries to recommend music based on fast Fourier transforms. Maybe AI isn't for everything....
http://benanne.github.io/2014/08/05/spotify-cnns.html (Recommending music on Spotify with deep learning) uses CNNs trained on spectrograms + similarity data from collaborative-filtering to predict per-song vectors.
Interesting. You didn't specify, but I'm guessing you did 3x3 convolutions on the spectrograms? Also, how did you choose the convolution size, number of conv/pooling layers, etc.? Did you consider asymmetric convolution/pooling layers to account for the differences between the frequency and time dimensions?
There are a number of interesting directions you could go with that data set. One interesting possibility is to make a convolutional autoencoder, then use that to apply "deep dreaming" filters to music. Another interesting evolution would be to handle the frequency dimension using a 1D convolution, and run an RNN on top of that to deal with time.
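Not the author's actual architecture, but a hedged Keras sketch of the asymmetric-pooling idea (all layer sizes are invented): pool aggressively along time while keeping most of the frequency resolution.

    # Toy CNN over spectrograms with asymmetric pooling (time pooled harder than frequency).
    import tensorflow as tf
    from tensorflow.keras import layers

    def build_model(n_freq_bins=128, n_time_frames=256, n_genres=9):
        inputs = tf.keras.Input(shape=(n_freq_bins, n_time_frames, 1))
        x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
        x = layers.MaxPooling2D(pool_size=(1, 4))(x)     # keep frequency detail, shrink time
        x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(2, 4))(x)
        x = layers.Flatten()(x)
        embedding = layers.Dense(128, activation="relu")(x)   # the 128-d space mentioned upthread
        outputs = layers.Dense(n_genres, activation="softmax")(embedding)
        return tf.keras.Model(inputs, outputs)

    model = build_model()
    model.summary()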
Music recommendation is a relatively easy problem on one level, and a huge problem on another. If you are recommending music to a neophyte of a certain genre, we've clearly been able to do this for a while in a way that has real value. But if you're trying to recommend music for someone who is an expert/aficionado of a certain genre, this inevitably annoys that sort of person. For the second type of recommendation, it's hard to provide results of actual interest. Instead, you wind up getting recommendations for pale imitations of things you like. The second problem might require something close to hard sentient AI to accomplish.
This is pretty cool. Maybe I'm missing something, but what's the point in the initial genre training?
He's taking 185000 samples, and finding similar "looking" samples elsewhere in other songs, and then making recommendations based on that. I don't see what that could possibly have to do with genre labels, unless we're under the assumption that finding a match between a Drum & Bass song and one that seems similar with a tag of Trance is somehow a bad match? (which very well could be the case, but seems like a big assumption to make off the bat)
Are these recommendations siloed to the current genre or are they allowed to span genres?
Very cool post! :)
"Simple" method (good ol' spectrograms, and something people can realistically actually reproduce without requiring a GPU farm), and great results!
That's not how you build a recommendation engine... You build a recommendation engine by creating an embedding for each song from which users prefer it, as you would for words in word2vec.
This is how Amazon and Youtube do it.
Couldn't you view the output of the last layer of the convnet as the embedding in this case? Yes, this was a different approach than leveraging user preferences, but I don't see why it's inherently the wrong approach.
My first thought was to wonder how an LSTM would do. One might think it would be a better representation for music? There are some models which use convolutional layers along with an LSTM for video representation (eg [1]) and it would be interesting to see if convolutions are useful for capturing similar themes of music.
I wonder if one could build a music embedding (word2vec style) and use similarities in the embedding space as recommendations? The obvious objective function would be skip-gram, but there might be more interesting objectives there too.
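A hedged sketch of that idea with gensim 4.x, treating playlists as "sentences" and track IDs as "words"; the playlist data here is invented:

    # Skip-gram embedding over playlists; nearest neighbours in the space act as recommendations.
    from gensim.models import Word2Vec

    playlists = [
        ["track_12", "track_98", "track_7"],
        ["track_98", "track_7", "track_55"],
        ["track_3", "track_12", "track_55"],
    ]

    model = Word2Vec(sentences=playlists, vector_size=64, window=3, sg=1, min_count=1)
    print(model.wv.most_similar("track_98", topn=3))   # closest tracks in the embedding space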
I could be totally off on this, but his encoding is an image and LSTM is for time series, which would require a different representation.
I completely agree LSTM would be useful as it would by default require a different representation. I think most commenters agree this representation is overly simplistic. Amazed it works as well as it does!
> Don't you guys realize that putting everything from Monteverdi to Bach, Mozart, Beethoven, Brahms, Moussorgsky, Stravinsky, and Bernstein in the same "Classical" bucket makes no sense?
> (Particularly when you have ultra fine-grained categories for popular music!)
My understanding of convolutions is that it's a way of extracting patterns from images. To convert audio into an image and then create convolutions from that seems... convoluted, if you will. I imagine a better way would be to think of what the equivalent of a convolution would be in the audio space? I.e. noise detection, treble/bass filters, etc.?
Convolution is generic signal processing. It's quite common to use a one-dimensional convolution for audio filters; it would work perfectly fine as a bass filter, for example.
However, 2D conv+maxpool is an image processing technique that gets you translation invariance. Fine for the time dimension of the spectrogram, but rather dubious for the frequency axis. Surely you'd want to distinguish if some feature happens at a high or low frequency?
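For illustration, a tiny numpy/scipy sketch of a 1D convolution acting as a crude bass (low-pass) filter; the cutoff and tap count are arbitrary:

    # FIR low-pass filter applied as a literal 1D convolution.
    import numpy as np
    from scipy.signal import firwin

    fs = 44100
    taps = firwin(numtaps=101, cutoff=200, fs=fs)   # low-pass, 200 Hz cutoff
    x = np.random.randn(fs)                         # stand-in for one second of audio
    bass_only = np.convolve(x, taps, mode="same")   # the convolution itself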
> Fine for the time dimension of the spectrogram, but rather dubious for the frequency axis.
MFCCs[1] are exactly that, a type of convolution along the frequency axis of a Fourier transform, and are highly apt features for music classification tasks.
It makes sense if you think of timbre as a time-varying relationship between the harmonics of a single pitch; translation invariance along the frequency axis can tell you that there are partials typical of, e.g., a guitar or a flute, without caring what particular pitch those instruments are playing.
And timbre is a bigger source of variety in popular music than e.g. the particular notes used.
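If anyone wants to try MFCCs on their own tracks, a minimal librosa sketch (the file path is hypothetical):

    # MFCCs as timbre-oriented features for a track.
    import librosa

    y, sr = librosa.load("some_track.mp3", mono=True)    # assumes an available audio decoder
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape: (20 coefficients, time frames)
    print(mfcc.shape)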
Why not check which are the top 3 most played songs among the 1000 users most similar to the current user, and then recommend to the current user the most played songs from those 1000 similar users that they have not listened to yet?
As far as I can see this would be superior to any existing A.I. recommendation algorithm.
What you're describing is also an "A.I." It's called collaborative filtering, and your algorithm (picking the top 3 from the 1000 most similar users) would give results heavily biased towards popular songs; there are better approaches in that field.
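For the curious, a toy user-based collaborative filtering sketch in numpy; the play counts are invented and the "1000 similar users" is reduced to a single nearest neighbour:

    # Cosine similarity between users' play-count vectors, then recommend unheard tracks.
    import numpy as np

    # rows = users, columns = tracks, values = play counts (toy data)
    plays = np.array([
        [5, 0, 2, 0],
        [4, 1, 0, 0],
        [0, 3, 0, 4],
    ])

    target = 0
    norms = np.linalg.norm(plays, axis=1, keepdims=True)
    sims = (plays @ plays[target]) / (norms.flatten() * norms[target] + 1e-9)
    sims[target] = -1                               # exclude the user themselves

    neighbour = int(np.argmax(sims))                # most similar other user
    unheard = (plays[target] == 0)
    scores = plays[neighbour] * unheard             # only tracks the target hasn't played
    print("recommend track", int(np.argmax(scores)))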
My 1 min effort description would be biased towards popular songs, but you can easily change that by selecting songs that are not popular, but that occupy a lot of playtime with a user.
Warning: this comment has little to do with the article, beyond being a rant on the approach taken by all recommendation engines I've seen.
This is an interesting approach, but the objective is similar to most recommendation engines: "Find me something similar to something I like". Sometimes that's a good requirement (e.g. when trying to queue up the next song in a playlist, it's good to have some similarity to the song you're currently listening to). However, when trying to discover new music it's generally a bad approach, since (depending on how the requirement is tackled) you'll get recommendations that tend towards some median; i.e.:
- Other songs by the same artist
- Songs by artists who have collaborated with the current artist
- Popular songs (i.e. if almost everyone has a Beatles album in their playlist, getting "people who bought this also bought" recommendations for anything would list the Beatles, since technically that's true; it's just uninteresting)
- Songs in the same genre
- Songs with a similar sound / structure
i.e. it tends to list things which you're likely to be aware of anyway. Also this means you'll get lots of songs with little variety between them; making your playlists monotonous.
What I'd be really interested in seeing is an engine which finds things on the periphery; i.e. figures out the things that are likely to appeal to you because of the more unique things you're interested in, or the popular things that you dislike. That way you're likely to get a more eclectic mix of suggestions, and broaden your musical awareness. This would likely produce a lot more false positives initially, as it's expanding your taste range rather than narrowing in on some "ideal" average, so it may stray into unknowns; but once you've heard and rated something in this new area, that data can quickly feed back into the algorithm and thus you learn of things you'd previously never have discovered.
> Popular songs (i.e. if almost everyone has a Beatles album in their playlist, getting "people who bought this also bought" recommendations for anything would list the Beatles
I've been learning recommendation engines by looking at peoples' Steam games libraries.
One feature of the data set is that many, many people own multiple versions of Counter-Strike as well as Team Fortress 2. So "a high number of people who bought [almost any game] also bought Counter-Strike: Global Offensive" is a recurring problem with a naive recommender.
What I've been learning how to do is weight recommendations by how 'surprising' they are, for want of a more accurate term. If 80% of people who own Game A also own Game B, but only 5% of the total population owns Game B, then we should upweight that relationship.
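In other words, something like lift: P(B|A) / P(B). A toy sketch with the numbers from above:

    # "Surprise" weighting as lift = P(owns B | owns A) / P(owns B).
    def lift(p_b_given_a, p_b):
        return p_b_given_a / p_b

    # 80% of Game A owners also own Game B, but only 5% of all users own Game B.
    print(lift(0.80, 0.05))   # 16.0 -> strongly upweight this pairing

    # A ubiquitous title: 90% co-ownership against an 85% base rate.
    print(lift(0.90, 0.85))   # ~1.06 -> barely more informative than chance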
What you complained about is exactly what I would have complained about with Spotify's suggestions some time ago.
But as of 3-6 months ago, those daily mixes started including some really interesting new songs that I wouldn't have found otherwise. Sometimes it seems to go back to that "safe zone", but it's been such a better experience that I've been telling all my friends to try it.
I really would like to know more about their process to improve the recommendation system.
Couldn't agree more. Spotify seems to be solving this problem. No other rec engine I've seen is as good at finding artists I've never heard of who I really dig.
My problem with all music recommendation engines: for many intellectual music aficionados, the lyrical content - what is being verbally described in the music - is what I seek out and hang on to in my preferred music. When I listen to my collection, the genres are all over the place and I don't even know them. I listen to the words and treat the music as emphasis for the words. I'll have ska, 30's jazz, hip hop, and classic rock all in the same mix and it works because the lyrical content offers different takes on the same things. In fact, new friends are sometimes dizzy from my music choices, and then at some point they hear the thematic concept of my mixes and get it.
Not something I'd ever thought of (I tend to tune out the words / mostly treat them as another instrument; unless listening to something especially witty).
Great suggestion / I guess this leads to the idea of needing a meta recommendation engine; i.e. some way to decide what recommendation engine best works for you; selecting from one that follows lyrical themes, another that discovers "out there" content, one for similar content, etc.
Having a selection with lyrical continuation from one song into another is also very typical in reggae.
Reggae as "genre" itself is also quite varied in what goes under its label. There are also other factors that play a big weight on how good matches they are to reference material. Producer and decade make a huge difference but also what's known as "riddim" name should give clues.
Which just means "people similar to you who like x also like y".
This does make a lot more sense than analyzing the audio of the music, IMO. For example, YouTube does this okay: if you look for a Mazzy Star song after watching Rick and Morty (a TV show), it will recommend other Rick and Morty soundtracks even if the style is completely different. This isn't something you can predict with just audio data.
I totally agree with you.
Sometimes you love a song from the first listen; it grabs you even if it came from an artist that you don't know.
My dream is a suggestion engine that "examines" the melody, the harmony, and the frequencies that make up the song, and finds songs that are similar based on those parameters.
Probably some signal analysis could help in finding out why you like those songs.
Amazon tends to recommend things I like, and the majority of my purchases have been because of their recommendations (good job Amazon, your software is doing its job, raising sales). I think books and music are different though. Books have well-defined categories. If I buy a pop-psych book, say "Blink", and then I am recommended "Peak: Secrets from the New Science of Expertise", it's very likely I'll buy that too if I am interested in the subject.
If however I want recommendations for new Metal music, and my previous selection was Metallica, then if you play me some Megadeth I am going to hate it and not be interested in it at all!
There is a simple music recommender webapp shown in the video. From your model you got a Python function that maps one song (e.g. by artist, title) to other songs. What is the fastest way to build this interactive webapp (for internal, experimental use)?
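Not the author's setup, but one quick way would be a minimal Flask wrapper around that function; the recommend import below is hypothetical:

    # Tiny internal-use web endpoint around a song-to-songs function.
    from flask import Flask, jsonify, request

    from mymodel import recommend   # assumed: returns a list of (artist, title) pairs

    app = Flask(__name__)

    @app.route("/similar")
    def similar():
        artist = request.args.get("artist", "")
        title = request.args.get("title", "")
        return jsonify(recommend(artist, title))

    if __name__ == "__main__":
        app.run(debug=True)   # experimental use only

Then you can hit /similar?artist=...&title=... from a browser or curl while iterating on the model.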