Deep learning has changed the game in speech recognition with the introduction of end-to-end models. These models take in audio and directly output transcriptions. Two of the most popular end-to-end models today are Deep Speech by Baidu and Listen Attend Spell (LAS) by Google. Both Deep Speech and LAS are recurrent neural network (RNN) based architectures with different approaches to modeling speech recognition.

Deep Speech uses the Connectionist Temporal Classification (CTC) loss function to predict the speech transcript, while LAS uses a sequence-to-sequence network architecture for its predictions.
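To make the CTC side concrete, here is a minimal sketch of how CTC training looks in PyTorch with nn.CTCLoss. The shapes and sizes (a batch of 4 utterances, 100 output time steps, a 28-class character vocabulary with the blank at index 0) are illustrative assumptions, not values from any particular model.

```python
import torch
import torch.nn as nn

# CTC loss with index 0 reserved for the blank token.
ctc_loss = nn.CTCLoss(blank=0)

batch, time_steps, n_classes, target_len = 4, 100, 28, 20

# nn.CTCLoss expects log-probabilities shaped (time, batch, classes).
log_probs = torch.randn(
    time_steps, batch, n_classes, requires_grad=True
).log_softmax(dim=2)

# Dummy integer targets (indices 1..27) and per-example lengths.
targets = torch.randint(1, n_classes, (batch, target_len), dtype=torch.long)
input_lengths = torch.full((batch,), time_steps, dtype=torch.long)
target_lengths = torch.full((batch,), target_len, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back into the network as usual
```

The appeal of CTC is visible here: the targets are just character indices with their lengths, and the loss handles the alignment between audio frames and characters internally, so no frame-level labels are needed.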

These models simplified speech recognition pipelines by taking advantage of the capacity of deep learning systems to learn from large datasets. With enough data, you should, in theory, be able to build a super robust speech recognition model that accounts for all the nuance in speech, without spending a ton of time and effort hand-engineering acoustic features or dealing with the complex pipelines of older GMM-HMM architectures.

Deep learning is a fast-moving field, and Deep Speech and LAS-style architectures are already becoming outdated. You can read about where the industry is moving in the Latest Advancements section below.

How to Build Your Own End-to-End Speech Recognition Model in PyTorch

Let’s walk through how to build your own end-to-end speech recognition model in PyTorch. The model we’ll build is inspired by Deep Speech 2 (Baidu’s second revision of their now-famous model) with some personal improvements to the architecture.
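Before diving into the details, here is a condensed sketch of the overall shape of a Deep Speech 2-style model: a convolutional front end over the spectrogram, a stack of bidirectional GRUs for temporal context, and a per-time-step linear classifier. The layer sizes below are placeholder choices for illustration, not the exact hyperparameters of the full model.

```python
import torch
import torch.nn as nn

class SpeechRecognitionModel(nn.Module):
    def __init__(self, n_feats=128, rnn_dim=512, n_class=29):
        super().__init__()
        # CNN extracts local spectral features from the spectrogram
        self.cnn = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)
        self.fc = nn.Linear(32 * (n_feats // 2), rnn_dim)
        # Bidirectional GRUs model long-range temporal context
        self.rnn = nn.GRU(rnn_dim, rnn_dim, num_layers=3,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(rnn_dim * 2, n_class)

    def forward(self, x):          # x: (batch, 1, n_feats, time)
        x = self.cnn(x)            # (batch, 32, n_feats//2, time//2)
        x = x.permute(0, 3, 1, 2)  # (batch, time//2, 32, n_feats//2)
        x = x.flatten(2)           # (batch, time//2, 32 * n_feats//2)
        x = self.fc(x)
        x, _ = self.rnn(x)
        # one score per character class at each time step;
        # apply log_softmax before feeding this to CTC loss
        return self.classifier(x)  # (batch, time//2, n_class)
```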

The output of the model will be a probability matrix over characters, and we’ll use that matrix to decode the most likely characters spoken in the audio. You can find the full code and run it with GPU support on Google Colaboratory.
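As a sketch of that decoding step, here is a minimal greedy CTC decoder. It assumes index 0 is the blank token and that a hypothetical char_map dictionary maps the remaining indices to characters; beam search with a language model typically decodes better, but greedy decoding keeps the idea clear.

```python
import torch

def greedy_decode(log_probs, char_map, blank=0):
    """log_probs: (time, n_classes) log-probabilities for one utterance."""
    best = torch.argmax(log_probs, dim=1).tolist()
    out, prev = [], blank
    for idx in best:
        # CTC rule: collapse repeated indices, then drop blanks
        if idx != prev and idx != blank:
            out.append(char_map[idx])
        prev = idx
    return "".join(out)

# Hypothetical usage with a tiny 4-class vocabulary:
char_map = {1: "h", 2: "i", 3: " "}
log_probs = torch.log_softmax(torch.randn(50, 4), dim=1)
print(greedy_decode(log_probs, char_map))
```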
