Music Genre Recognition

machine_learning

Introduction

This is the first machine learning model to be documented on this blog. It was deployed using Streamlit and is based on the model developed for my dissertation. Its purpose is to perform Music Genre Recognition (MGR); in simple terms, it classifies music according to its genre.

You can try the model out here. Currently a username and password are required, but if you contact me via LinkedIn I can provide these.

MGR is a widely studied academic discipline, with Google Scholar returning 652,000 results for the search term "music genre recognition". A comprehensive literature review revealed that the most common dataset used when designing a model for MGR purposes was the GTZAN dataset. Accuracies in the mid-90% range were deemed achievable when evaluating a model on a subsample of the GTZAN dataset.

One particular gap in knowledge in this area centres on explainability. Therefore the novel component of this research was to implement a SHAP analysis of the predictions made by the model, enhancing its accountability and the trust in its predictions. This was not part of the model I deployed, however; I chose to keep it simple and focus on making predictions, since this was the first time I had used Streamlit.

This blog post won't reiterate too much of the content of the dissertation and will only briefly touch on the code (which can be found here). Instead it will focus on the real world value of the model and discuss some improvements that can be made.

Dataset

As mentioned above, the GTZAN dataset was used. It consists of 1000 audio snippets, each 30 seconds long, spread across 10 genres of music. The genres are as follows:

Class | Genre
------|----------
0     | rock
1     | jazz
2     | blues
3     | hiphop
4     | metal
5     | pop
6     | classical
7     | country
8     | reggae
9     | disco

The first thing to be said about this list of genres is that they are very broad; no zydeco or crunkwave here 😂. This is a good thing though: I would argue that these genres encapsulate maybe 90% of all the music I can think of, the notable exceptions being folk and dance/electronic. Another positive is that it keeps things relatively simple, requiring a less complex model, less space to store the data and shorter training times. So the GTZAN dataset is a good starting point for MGR, which is why it is referred to on its Kaggle data card as "the MNIST of sounds" (a nod to the famous and ubiquitous MNIST dataset of handwritten digits).
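
In code, this class-to-genre mapping is just an index-to-label lookup used to decode the model's output. A minimal sketch (the name GENRE_LABELS is my own placeholder; the ordering follows the table above):

# class index -> genre label, following the table above
GENRE_LABELS = {
    0: "rock",
    1: "jazz",
    2: "blues",
    3: "hiphop",
    4: "metal",
    5: "pop",
    6: "classical",
    7: "country",
    8: "reggae",
    9: "disco",
}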

Implementation

The basic idea behind the implementation is to use the Librosa Python library to convert the audio into spectrograms, because the audio needs to be turned into visual information before it can be processed by a transfer-learning-enhanced convolutional neural network (CNN). Something worth noting at this point is that, before training the model, I augmented the data by subdividing each clip into 3 second segments, resulting in 10,000 samples of 3 seconds duration.

import librosa

# parameters used throughout (the exact values here are assumed/typical defaults)
SAMPLE_RATE = 22050
SEGMENT_DURATION = 3   # seconds per segment
NUM_SEGMENTS = 10      # a 30 second clip -> ten 3 second segments
N_FFT = 2048
HOP_LENGTH = 512
N_MELS = 128
samples_per_segment = SAMPLE_RATE * SEGMENT_DURATION

# load the audio with librosa
signal, sr = librosa.load("30s_audio.wav", sr=SAMPLE_RATE)

# create segments
for s in range(NUM_SEGMENTS):
    start_sample = samples_per_segment * s
    finish_sample = start_sample + samples_per_segment

    # generate a mel spectrogram for each segment
    mel_spectrogram_split = librosa.feature.melspectrogram(
        y=signal[start_sample:finish_sample],
        sr=sr,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH,
        n_mels=N_MELS,
    )
    # convert power to decibels, then normalise
    log_mel_spectrogram_split = librosa.power_to_db(mel_spectrogram_split)
    mel_spectrogram_split_normalized = librosa.util.normalize(
        log_mel_spectrogram_split
    )
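
The dissertation covers the network architecture in detail, so I won't repeat it here, but the general shape of a transfer learning CNN on these spectrograms looks something like the sketch below. The VGG16 backbone, input shape and head layers are assumptions for the sake of illustration, not necessarily what my model uses.

import tensorflow as tf

NUM_CLASSES = 10
# assumed input: n_mels x time frames, tiled to 3 channels for the pretrained backbone
INPUT_SHAPE = (128, 130, 3)

# pretrained backbone with ImageNet weights and its classification head removed
base_model = tf.keras.applications.VGG16(
    include_top=False,
    weights="imagenet",
    input_shape=INPUT_SHAPE,
)
base_model.trainable = False  # freeze the pretrained feature extractor

# small classification head trained on the spectrogram segments
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)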

Since the dataset now consists of 3 second spectrograms, 3 second spectrograms also need to be extracted from the audio the user uploads for prediction. To achieve this I took the middle 30 seconds of the song, as I thought this would be the most representative of the music (an assumption, I know), and then split this into 10 spectrograms of 3 seconds duration. The final prediction is a vote based on the most common prediction across these 10 spectrograms, and the probability is the average probability. The code can also output multiple predictions with their probabilities in the case of a tie, but this is rare.
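
A minimal sketch of that voting step is below. The function and argument names are my own placeholders, and the logic is simplified relative to the deployed code:

import numpy as np

def predict_genre(segment_spectrograms, model, class_names):
    # class probabilities for each of the 10 segments, shape (10, num_classes)
    probs = model.predict(segment_spectrograms)

    # predicted class index for each segment
    segment_classes = probs.argmax(axis=1)

    # the most common class across the segments wins the vote
    counts = np.bincount(segment_classes, minlength=len(class_names))
    winners = np.flatnonzero(counts == counts.max())  # more than one entry means a tie

    # report each winning class with its average probability across segments
    return [(class_names[c], float(probs[:, c].mean())) for c in winners]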

Results

Test Set

Training the model locally pre-deployment took about 2 hours. On the test set (which consisted of 20% of the dataset) it achieved 91% accuracy. This is a promising result, up there with some of the best results from the literature review.

Real World Music

This section is where I have the most to say about the model. Does a model displaying 91% accuracy on the test set have real-world value? That is to say, does it produce satisfying results when predicting the genres of songs that users actually upload? Putting aside the fact that there are only 10 genres in the dataset (ideally there would be more... lots more), I would honestly conclude that the model is very limited.

It correctly predicts that a Brahms viola sonata is classical music with a probability of 99.5%. It also correctly predicts that Iron Maiden is metal and that Kurt Vile's "Pretty Pimpin" is rock music, along with many other examples where I would agree with the predictions.

However, the model does not classify Elmore James as blues; it incorrectly classifies his music as metal. It classifies King Gizzard's "Yours" as disco when in my opinion it should be either rock or pop. The most egregious misclassification I've come across is Union Station's "Man of Constant Sorrow" (an archetypal country song), which is classified as either classical or reggae. Yes, the probabilities are often lower when the predictions are wrong, but sometimes they aren't.

I would estimate how often it produces what I class as a correct prediction, but the thing is... I know when it's going to be correct because I've inspected the dataset. I know that anything that sounds to me like more traditional jazz (for example 60's bebop) is going to be predicted as jazz, and that music that sounds like 90's grunge (e.g. Pearl Jam or Nirvana) is going to come out as rock with a high probability.

Conclusion

So to wrap things up, what we have here is a model that has displayed a high degree of accuracy on the test set but, due to the limited nature of the dataset, does not generalise well to real-world music. Why is this happening? Take reggae as an example. More than a third of the audio labelled reggae in the dataset consists of Bob Marley songs, and the model accordingly responds well to Bob Marley songs, but what about Damian Marley? Damian Marley's music is often classified as belonging to the dub genre. According to Wikipedia, "Dub is a musical style that grew out of reggae in the late 1960s and early 1970s". However, the model classifies "Welcome to Jamrock" by Damian Marley as pop.

My point is that the nuances of musical genre are not captured by the GTZAN dataset. I wouldn't expect them to be; there are, after all, only 100 songs from each genre. A more robust model would require a far larger dataset, ideally with more labelled genres.

This project has been enlightening, however. I shouldn't undersell the model... it's decent; in fact, transfer learning has extracted a lot of predictive power from a small dataset. If I were to do this again I would probably look at using the Spotify million song dataset, which should be more up to date with subgenres. I also learnt Streamlit in the process, which will be an invaluable library for deploying future projects.

I hope you enjoy using the model, and don't forget you can ask me for login details.