The rise of music streaming services has made music ubiquitous. We listen to music during our commute, while we exercise or work, or simply to relax. The ongoing disruption to our daily lives has not diminished the role of music in eliciting emotion and helping us process our thoughts, as exemplified by the emergence of “Zoom concerts”.

One key feature of these services is their playlists, often grouped by genre. The genre labels could come from manual tagging by the people who publish the songs, but that approach does not scale well and can be gamed by artists hoping to capitalize on the popularity of a specific genre. A better option is to rely on automated music genre classification. With my two collaborators Wilson Cheung and Joy Gu, we sought to compare different methods of classifying music samples into genres. In particular, we evaluated the performance of standard machine learning versus deep learning approaches. What we found is that feature engineering is crucial and that domain knowledge can really boost performance.

After describing the data source we used, I give a brief overview of our methods and their results. In the last part of this article, I spend more time explaining how the TensorFlow framework in Google Colab can perform these tasks efficiently on GPU or TPU runtimes thanks to the TFRecord format. All the code is available here, and we are happy to share our more detailed report with anyone interested.

Data Source

Predicting the genre of an audio sample is a supervised learning problem (for a good primer on supervised vs. unsupervised learning, I recommend Devin’s article on the topic). In other words, we needed data that contains labeled examples. The FreeMusicArchive is a repository of audio segments with relevant labels and metadata, which was originally collected for a paper at the International Society for Music Information Retrieval Conference (ISMIR) in 2017.

We focused our analysis on the small subset of the data provided. It contains 8,000 audio segments, each 30 seconds in length and classified as one of eight distinct genres:

  • Hip-Hop
  • Pop
  • Folk
  • Experimental
  • Rock
  • International
  • Electronic
  • Instrumental

Each genre comes with 1,000 representative audio segments. At a sample rate of 44,100 Hz, each 30-second sample contains more than 1 million data points, which adds up to more than 10¹⁰ data points across the dataset. Using all of this data in a classifier is a challenge, which we will discuss more in upcoming sections.
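As a quick sanity check on those numbers, here is the back-of-the-envelope arithmetic in Python:

```python
# Back-of-the-envelope data volume for the FMA "small" subset:
# 8,000 clips, 30 seconds each, sampled at 44,100 Hz.
sample_rate = 44_100        # samples per second
clip_seconds = 30
num_clips = 8_000

points_per_clip = sample_rate * clip_seconds   # 1,323,000 per clip
total_points = points_per_clip * num_clips     # 10,584,000,000 (> 10^10)

print(f"{points_per_clip:,} data points per clip")
print(f"{total_points:,} data points in total")
```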

For instructions on how to download the data, please refer to the README included in the repository. We are very grateful to Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson for putting this data together and making it freely available, but we can only imagine the insights that would become available at the scale of the data owned by Spotify or Pandora Radio. With this data, we can describe various models to perform the task at hand.
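To give a sense of what working with the labels looks like, here is a minimal sketch of loading the track metadata with pandas. The file path and the multi-indexed column names follow the FMA metadata archive as I understand it; treat them as assumptions and check them against the README for your download.

```python
import pandas as pd

# Assumed path after extracting the FMA metadata archive; adjust to your setup.
TRACKS_CSV = "fma_metadata/tracks.csv"

# The metadata uses a two-level column header and the track ID as the index.
tracks = pd.read_csv(TRACKS_CSV, index_col=0, header=[0, 1])

# Keep only the 8,000 clips of the "small" subset and their top-level genre.
small = tracks[tracks[("set", "subset")] == "small"]
labels = small[("track", "genre_top")]

print(labels.value_counts())  # expect roughly 1,000 clips per genre
```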

Model Description

I will keep the theoretical details to a minimum, but will link to relevant resources whenever possible. In addition, our report contains a lot more information than what I can include here, in particular around feature engineering, so do let me know in the comments if you would like me to share it with you.

Standard Machine Learning

We used logistic regression, k-nearest neighbors (kNN), Gaussian Naive Bayes, and support vector machines (SVM); a short code sketch follows the list:

  • SVM tries to find the best decision boundary by maximizing the margin around the training data. The kernel trick defines non-linear boundaries by implicitly projecting the data into a high-dimensional space
  • kNN assigns a label based on a majority vote of the k closest training samples
  • Naive Bayes predicts the probability of different classes based on features. The conditional independence assumption greatly simplifies calculations
  • Logistic regression also predicts the probability of different classes, by applying the logistic function to a linear combination of the features
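To make the comparison concrete, here is a minimal scikit-learn sketch of how these four classifiers could be fit on a pre-computed feature matrix. The feature matrix X and labels y below are random placeholders; in our case they would hold engineered audio features and the eight genre labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Placeholder data: n_samples x n_features engineered audio features
# and the corresponding genre labels (0-7). Replace with real features.
X = np.random.rand(800, 40)
y = np.random.randint(0, 8, size=800)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "Gaussian Naive Bayes": GaussianNB(),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}

for name, model in models.items():
    # Standardizing features matters for kNN, SVM, and logistic regression.
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    print(f"{name}: test accuracy = {pipeline.score(X_test, y_test):.3f}")
```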

Deep Learning

For deep learning, we leverage the TensorFlow framework (see more details in the second part of this article). We built different models based on the type of input.

With raw audio, each example is a 30-second audio sample, or approximately 1.3 million data points. These floating-point values (positive or negative) represent the wave displacement at a given point in time. To keep the computation manageable, we could only use less than 1% of these data points per sample. With these features and the associated label (one-hot encoded), we can build a convolutional neural network. The general architecture is as follows:
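Our report describes the exact architecture we trained; as an illustration, here is a minimal Keras sketch of a 1D convolutional network on subsampled raw audio. The input length (roughly 1% of the ~1.3 million raw points) and the layer sizes are illustrative assumptions, not our final configuration.

```python
import tensorflow as tf

NUM_CLASSES = 8
INPUT_LEN = 13_000  # roughly 1% of the ~1.3M raw samples per clip (illustrative)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(INPUT_LEN, 1)),  # raw waveform, one channel
    tf.keras.layers.Conv1D(16, kernel_size=9, strides=4, activation="relu"),
    tf.keras.layers.MaxPooling1D(4),
    tf.keras.layers.Conv1D(32, kernel_size=9, strides=4, activation="relu"),
    tf.keras.layers.MaxPooling1D(4),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

# One-hot encoded labels pair with categorical cross-entropy.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```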
