Imagine walking into a crowded, noisy cafeteria and being able to “unmix” everything you hear back into the individual people and sounds it came from. This is the problem of audio source separation: decomposing a mixed audio signal into the original sources that produced it.

Audio source separation, also known as the Cocktail Party Problem, is one of the biggest problems in audio because of its practical use in so many situations: isolating the vocals from a song, helping hearing-impaired listeners follow a speaker in a noisy room, cleaning up the voice in a phone call made while riding a bike into the wind, and so on.

We propose a way to tackle this problem using recent advances in deep learning.

Our Data

When approaching this problem, we used the UrbanSound8K and LibriSpeech datasets. UrbanSound8K provided a variety of background noises, while LibriSpeech provided recordings of people reading audiobooks without any background noise. We generated the tones for tone separation ourselves.
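As a rough sketch of how such speech-plus-noise mixtures can be generated on the fly (the torchaudio calls and the signal-to-noise-ratio scaling here are an illustration, not our exact pipeline):

import torch
import torchaudio

def make_mixture(speech_path, noise_path, snr_db=5.0):
    # Load a LibriSpeech utterance and an UrbanSound8K clip, then mix them.
    speech, sr = torchaudio.load(speech_path)
    noise, nsr = torchaudio.load(noise_path)
    if nsr != sr:
        noise = torchaudio.functional.resample(noise, nsr, sr)
    # Loop or trim the noise so it matches the speech length
    if noise.shape[-1] < speech.shape[-1]:
        reps = speech.shape[-1] // noise.shape[-1] + 1
        noise = noise.repeat(1, reps)
    noise = noise[..., : speech.shape[-1]]
    # Collapse both clips to mono
    speech = speech.mean(0, keepdim=True)
    noise = noise.mean(0, keepdim=True)
    # Scale the noise to hit the requested signal-to-noise ratio (in dB)
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise, speech, scale * noise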

Our Approach

After doing initial research on the topic, we decided to work with spectrograms rather than raw waveforms. Although Lyrebird’s MelGAN and OpenAI’s brand-new Jukebox have found success generating raw audio waveforms, the time frame of our project and the depth of our network did not permit us to work with raw audio, given its complexity. Using spectrograms (raw STFT outputs), we lose no information about the audio waveform and can reconstruct it with minimal loss.
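For reference, the round trip between a waveform and its complex spectrogram takes only a few lines of PyTorch. The FFT size and hop length below are illustrative values, not the exact settings we tuned:

import torch

N_FFT = 1024   # FFT window size (illustrative)
HOP = 256      # hop length between frames (illustrative)

def wave_to_spec(wave):
    # wave: (batch, samples) -> complex spectrogram (batch, freq bins, frames)
    window = torch.hann_window(N_FFT, device=wave.device)
    return torch.stft(wave, n_fft=N_FFT, hop_length=HOP,
                      window=window, return_complex=True)

def spec_to_wave(spec, length=None):
    # Inverse STFT: reconstructs the waveform from the complex spectrogram
    window = torch.hann_window(N_FFT)
    return torch.istft(spec, n_fft=N_FFT, hop_length=HOP,
                       window=window, length=length)

# Sanity check: the reconstruction error is negligible
wave = torch.randn(1, 16000)  # one second of fake 16 kHz audio
recon = spec_to_wave(wave_to_spec(wave), length=wave.shape[-1])
print((wave - recon).abs().max())  # on the order of 1e-6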

To add novelty, we experimented with several loss functions and training architectures, outlined below.

The Model

We tried many different models for this experiment, from standard convolutional neural networks to fully connected (dense) networks, but we settled on two models, A and B. Both use the U-Net-style architecture described below.

The U-Net architecture initially found success in image segmentation and was later adopted for spectrogram analysis as well. The following picture summarizes the architecture: downsampling layers progressively reduce the height and width of the image, and upsampling layers then restore them. Skip (residual) connections are added between each downsampling layer and the corresponding upsampling layer of the same shape.

Figure: U-Net model adapted for audio. Retrieved from Choi et al. (cited in the references).

We implemented Model A from scratch in PyTorch based on the diagram above from Choi et al., cited in the references. The idea closely follows the U-Net: downsampling and upsampling with skip connections. The neural transform layers change only the number of channels in the image, while the downsample/upsample layers change its height and width by a factor of 2 (a simple strided convolution). The number of neural transform and downsampling layers varied from 7 to 17.
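A stripped-down sketch of that structure in PyTorch is below. Plain convolutions stand in for the neural transform blocks, additive skip connections are used for simplicity, and the depth and channel counts are illustrative only:

import torch
import torch.nn as nn

class Downsample(nn.Module):
    # Halves height and width with a strided convolution; channels unchanged
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
    def forward(self, x):
        return self.conv(x)

class Upsample(nn.Module):
    # Doubles height and width with a transposed convolution; channels unchanged
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2, padding=1)
    def forward(self, x):
        return self.conv(x)

class TinyUNet(nn.Module):
    # Simplified U-Net: plain convs stand in for the neural transform layers
    def __init__(self, in_ch=2, base=24, depth=3):
        super().__init__()
        self.inp = nn.Conv2d(in_ch, base, 3, padding=1)
        self.downs = nn.ModuleList([Downsample(base) for _ in range(depth)])
        self.ups = nn.ModuleList([Upsample(base) for _ in range(depth)])
        self.out = nn.Conv2d(base, in_ch, 3, padding=1)

    def forward(self, x):
        x = self.inp(x)
        skips = []
        for down in self.downs:
            skips.append(x)          # save the activation for the skip connection
            x = down(x)
        for up in self.ups:
            x = up(x) + skips.pop()  # add back the matching-resolution activation
        return self.out(x)

spec = torch.randn(1, 2, 512, 128)   # (batch, real/imag channels, freq, time)
print(TinyUNet()(spec).shape)        # torch.Size([1, 2, 512, 128])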

Each neural transform layer is a dense block of four Batch-Norm, ReLU, and Convolution operations, illustrated below. “Dense” here means that each smaller block receives the outputs of all previous blocks as input, essentially appending channels onto a “global state” variable. The following picture illustrates the idea.

Figure: Neural Transform Layer diagram. Retrieved from this paper.
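A sketch of such a dense block in PyTorch might look like the following (the growth rate and kernel size are our own illustrative choices):

import torch
import torch.nn as nn

class NeuralTransform(nn.Module):
    # Dense block: 4 units of BatchNorm -> ReLU -> Conv, each unit seeing
    # the concatenation of the block input and all previous unit outputs.
    def __init__(self, in_ch, growth=12, n_units=4):
        super().__init__()
        self.units = nn.ModuleList()
        ch = in_ch
        for _ in range(n_units):
            self.units.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1),
            ))
            ch += growth   # the "global state" grows by `growth` channels per unit

    def forward(self, x):
        state = x
        for unit in self.units:
            out = unit(state)
            state = torch.cat([state, out], dim=1)  # append onto the global state
        return state

x = torch.randn(1, 2, 512, 128)
print(NeuralTransform(in_ch=2)(x).shape)   # torch.Size([1, 50, 512, 128])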

Model B was a smaller version of Model A without the batch-norm layers, an approach that found success in the blog post by Belz cited below.

The input to both model A and model B is a spectrogram of the mixed signal. For model A, the output is a spectrogram of the predicted voice, and the target is the spectrogram of the isolated voice. Model B differs in that, instead of predicting the voice, it outputs a spectrogram of the predicted noise, with the spectrogram of the true noise as the target. The voice is then recovered by subtracting the predicted noise spectrogram from the input mixed spectrogram.
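In code, the two training setups differ only in the target and in the final subtraction. The mean-squared-error criterion below is just a placeholder, since we experimented with several losses:

import torch.nn as nn

def training_step(model, mixed_spec, target_spec, predict_noise=False):
    # Shared step: model A predicts the voice, model B predicts the noise
    criterion = nn.MSELoss()  # placeholder; we tried several loss functions
    pred = model(mixed_spec)
    loss = criterion(pred, target_spec)
    # For model B, the voice is recovered by subtracting the predicted noise
    voice = mixed_spec - pred if predict_noise else pred
    return loss, voice

# Model A: training_step(model_a, mixed_spec, voice_spec)
# Model B: training_step(model_b, mixed_spec, noise_spec, predict_noise=True)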

Figure: Magnitude spectrogram description. Source: Vincent Belz, via Medium.

We used two different types of spectrograms. One was the raw STFT output, a complex-valued spectrogram represented as a two-channel image holding the real and imaginary parts. The other involved feeding only the magnitude of the STFT as a one-channel image and predicting a magnitude spectrogram at the output. The phase used for reconstruction was simply the phase of the initial STFT (the phase is assumed unchanged), as shown in the figure above.
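Concretely, the two representations and the phase-reuse reconstruction look roughly like this (the function names are ours, and the complex spectrogram is assumed to come from torch.stft with return_complex=True):

import torch

def complex_to_two_channel(spec):
    # Complex STFT -> two-channel "image": channel 0 = real part, channel 1 = imaginary part
    return torch.stack([spec.real, spec.imag], dim=1)

def magnitude_and_phase(spec):
    # Magnitude-only variant: one-channel image for the network, phase kept aside
    return spec.abs().unsqueeze(1), torch.angle(spec)

def reconstruct_from_magnitude(pred_mag, mixed_phase):
    # Reattach the phase of the original mixed STFT to the predicted magnitude;
    # the resulting complex spectrogram can then go through torch.istft
    return torch.polar(pred_mag.squeeze(1), mixed_phase)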
