Stop using Image Interpolation for Neural Audio Synthesis

The different types of neural upsamplers, and which one you should use in your deep learning audio synthesis project.

In this story I want to advance your current understanding of neural upsamplers in the context of audio synthesis, and provide a simple Subpixel1D Keras layer implementation to use as a drop-in replacement for many of the tasks we discuss today.

We all know that up- and downsampling are important operations in deep learning for computer vision, e.g., in tasks like image super-resolution or image generation. The same holds true for audio synthesis with popular architectures like GANs, U-Nets, or autoencoders. While downsampling is a relatively simple operation, finding an upsampling strategy that does not introduce image or audio artifacts has always been difficult. For a primer on 2-dimensional checkerboard artifacts in computer vision tasks, read this great post [1].

Now let us dive deeper into 1-dimensional audio upsampling. In the audio domain, we mainly use three upsampling techniques [2]:

  1. Transposed convolutions (widely used; see the sketch after this list)
  2. Interpolation + convolution (often used; see the sketch after this list)
  3. Subpixel convolutions (rarely used in audio, but prominent in vision tasks; a layer sketch follows further below)
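
To make the first two options concrete, here is a minimal sketch of both in TensorFlow Keras. The filter count, kernel size, and upsampling factor are illustrative assumptions, not values taken from any of the papers below.

```python
import tensorflow as tf

# Illustrative toy settings (not taken from any of the cited papers).
FILTERS, KERNEL, R = 64, 9, 2   # output channels, kernel size, upsampling factor

# 1. Transposed convolution: the upsampling kernel is learned directly;
#    a stride of R multiplies the temporal resolution by R in one step.
transposed = tf.keras.layers.Conv1DTranspose(
    filters=FILTERS, kernel_size=KERNEL, strides=R, padding="same")

# 2. Interpolation + convolution: repeat every time step R times
#    (nearest-neighbor interpolation), then mix with a regular convolution.
interp_conv = tf.keras.Sequential([
    tf.keras.layers.UpSampling1D(size=R),
    tf.keras.layers.Conv1D(filters=FILTERS, kernel_size=KERNEL, padding="same"),
])

x = tf.random.normal([1, 1024, 32])   # (batch, time, channels)
print(transposed(x).shape)            # (1, 2048, 64)
print(interp_conv(x).shape)           # (1, 2048, 64)
```

Note that strided transposed convolutions are exactly the setup that can produce the checkerboard-style artifacts mentioned above [1], which is one reason the interpolation-first variant exists.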

Examples of their usage can be found in many publications, such as Demucs (music source separation) [3], MelGAN (waveform synthesis) [4], SEGAN (speech enhancement) [5], Conv-TasNet (speech separation) [6], and Wave-U-Net (source separation) [7].

TensorFlow Keras provides a fourth upsampling option, the UpSampling1D layer. However, as of now (March 2021) this layer is still outrageously slow on GPU, even though the corresponding issue has been closed.
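
Since subpixel convolution is the least familiar of the three, here is a minimal sketch of what the Subpixel1D layer from the introduction could look like. The reshape-based periodic shuffle below is one common convention; treat the class name, default factor, and usage pattern as illustrative assumptions.

```python
import tensorflow as tf

class SubPixel1D(tf.keras.layers.Layer):
    """1D subpixel (periodic shuffle) upsampling.

    Rearranges (batch, time, channels * r) into (batch, time * r, channels),
    the 1D analogue of the pixel shuffle used in image super-resolution.
    """

    def __init__(self, r=2, **kwargs):
        super().__init__(**kwargs)
        self.r = r

    def call(self, x):
        batch = tf.shape(x)[0]
        time = tf.shape(x)[1]
        channels = x.shape[-1]  # must be statically known and divisible by r
        # Split the channel axis into r phase groups, then interleave the
        # groups along the time axis.
        x = tf.reshape(x, [batch, time, self.r, channels // self.r])
        return tf.reshape(x, [batch, time * self.r, channels // self.r])

# Usage: a Conv1D first produces r times the target channel count,
# then SubPixel1D trades those channels for temporal resolution.
upsample = tf.keras.Sequential([
    tf.keras.layers.Conv1D(filters=64 * 2, kernel_size=9, padding="same"),
    SubPixel1D(r=2),
])
print(upsample(tf.random.normal([1, 1024, 32])).shape)  # (1, 2048, 64)
```

Because the shuffle is just two reshapes, it adds virtually no overhead on top of the convolution itself.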

