Calculating Audio Song Similarity Using Siamese Neural Networks

Introduction

At AI Music, where our back catalogue of content grows every day, it is becoming increasingly necessary for us to build more intelligent systems for searching and querying the music. One such system relies on the ability to define and quantify the degree of similarity between songs. The core methodology described here tackles the concept of acoustic similarity.

Searching for a song using descriptive tags often introduces the issue of semantic inconsistency. Tags can be highly subjective, varying with a listener's age group, culture, and personal preferences. For example, descriptors such as ‘bright’ or ‘cold’ could mean entirely different things to different people. Music can also sit in blurry areas when it comes to genre. A song such as Sabotage by the Beastie Boys is primarily known as a Hip-Hop/Rap song, yet it contains many of the sonic qualities we would traditionally attribute to a Rock song. The ability to use an example reference track to retrieve a similar song, or a ranked list of similar songs, from a large catalogue avoids such issues.

However, when we perceive two or more songs as similar to one another, what does this actually mean? This perceived similarity is often very difficult to define, as it comprises a number of different aspects, such as genre, instrumentation, mood and tempo. To complicate matters further, similarity tends to arise from an unrestricted combination of such characteristics. With song similarity being such a subjective concept, how do we tackle the issue of defining a ground truth?

How did we approach the problem?

Traditional methods for determining the similarity between songs require you to select and extract musical features from the audio. The distance between these features in the resulting feature space is then presumed to reflect the perceptual similarity of the respective tracks. One problem with this approach is determining which features best map to perceived similarity. At AI Music, we tackle this problem with an approach based on Siamese Neural Networks (SNNs).
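To make the feature-space idea concrete, here is a minimal sketch of the traditional approach (not our production code). The two feature vectors are hypothetical hand-picked summaries of a song, and cosine similarity stands in for whatever distance measure one might choose:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical hand-crafted features for two songs, e.g.
# [mean spectral flatness, tempo (BPM), mean spectral centroid (Hz)].
# Deciding *which* features belong here is exactly the hard part.
song_a = np.array([0.42, 118.0, 1500.0])
song_b = np.array([0.40, 120.0, 1480.0])

print(cosine_similarity(song_a, song_b))
```

Note that the result depends entirely on the chosen features and their scaling, which is precisely why hand-picking them is fragile and why a learned representation is attractive.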

The SNN architecture is based on a Convolutional Neural Network (CNN), which means we first need to transform the audio into an image. The most common image representation of audio is a waveform, where the signal amplitude is plotted against time. For our application we use a different visual representation of the audio known as a spectrogram, specifically a mel spectrogram.

  • A spectrogram uses the Fourier transform to produce the frequency distribution of a signal over time.
  • A mel spectrogram is a spectrogram whose frequencies are mapped to the mel scale.
  • The mel scale is logarithmically spaced, giving a representation that correlates more closely with human hearing.

We have chosen mel spectrograms as they have been found to be good representations of the timbre of a sound, and are therefore better suited to capturing the acoustic characteristics of a song.


Figure 1: Comparison of waveform, spectrogram and mel spectrogram

As Figure 1 shows, relevant musical information is revealed far more clearly in the mel spectrogram.
