Philian Mateo

Top 3 Deep Learning Frameworks for End-to-End Speech Recognition

Introduction

Speech recognition is invading our lives. It’s built into our phones (Siri), our game consoles (Kinect), our smartwatches (Apple Watch), and even our homes (Amazon Echo). But speech recognition has been around for decades, so why is it just now hitting the mainstream?

The reason is that deep learning finally made speech recognition accurate enough to be useful outside of carefully controlled environments. In this blog post, we’ll walk through 3 popular deep learning approaches for performing end-to-end speech recognition.

Speech Recognition - The Classic Way

In the era of OK Google, I might not really need to define ASR, but here’s a basic description: a person (or an audio source) speaks some text, and one or more microphones capture the resulting audio signals. These signals are passed into an ASR system, whose job is to infer the original transcript that the person spoke or that the device played.

So why is ASR important?

Firstly, it’s a very natural interface for human communication: you don’t need a mouse or a keyboard, and there’s hardly anything new to learn, since most people pick up speech as part of natural development. That makes it a convenient way to interact with devices such as cars, handheld phones, and chatbots.

So how is this done classically?

Classically, a speech recognition system is built as a generative model of speech. A language model first produces a sequence of words. Then, for each word, a pronunciation model describes how that particular word is spoken. Typically this is written out as a sequence of phonemes, the basic units of sound, but for our purposes we’ll just say a sequence of tokens: clusters of sounds defined by linguistics experts.

Then, the pronunciation models are fed into an acoustic model, which defines how a given token sounds. These acoustic models are used to describe the data itself. Here the data is X, a sequence of frames of audio features x1 through xT. Typically, these features are something that signal processing experts have defined (such as the frequency components of the captured audio waveforms).
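To make this concrete, here’s a minimal sketch of a typical hand-engineered front end that turns a waveform into log-mel spectrogram frames, using librosa. The file name, sample rate, window/hop sizes, and number of mel bands are illustrative assumptions, not values taken from this article.

```python
import numpy as np
import librosa

# Illustrative front end: log-mel spectrogram frames.
# "utterance.wav" and the parameter values below are assumptions for this sketch.
waveform, sample_rate = librosa.load("utterance.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate,
                                     n_fft=400, hop_length=160, n_mels=80)
log_mel = np.log(mel + 1e-6)   # shape (80, T): one 80-dimensional feature vector per frame x_t
```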

Each of these different components in this pipeline uses a different statistical model:

In the past, language models were typically N-gram models, which worked very well for simple problems with limited speech input data. They are essentially tables describing the probabilities of token sequences.

The pronunciation models were simple lookup tables mapping words to pronunciations, each with an associated probability; for a realistic vocabulary, these tables become very large.

Acoustic models were built using Gaussian Mixture Models with very specific architectures associated with them.

The speech pre-processing (feature extraction) was fixed in advance.

Once we have this kind of model built, we perform recognition by doing inference on the received data: you get a waveform, compute its features X, and search for the transcript Y that has the highest probability given X.
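Written out with Bayes’ rule, classical decoding searches for the transcript that maximizes the product of the acoustic model p(X|Y) and the language model p(Y); the denominator p(X) doesn’t depend on Y and can be dropped:

```latex
\begin{aligned}
\hat{Y} &= \arg\max_{Y} \, p(Y \mid X) \\
        &= \arg\max_{Y} \, p(X \mid Y)\, p(Y)
\end{aligned}
```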


The Neural Network Invasion

Over time, researchers started noticing that each of these components could work more effectively if we used neural networks.

Instead of the N-gram language models, we can build neural language models and use them to rescore the hypotheses produced by a first-pass speech recognition system.

Looking at the pronunciation models, a neural network can predict the pronunciation of a sequence of characters it has never seen before.

For acoustic models, we can build deep neural networks (such as LSTM-based models) that classify the features of the current frame with much better accuracy.

Interestingly enough, even the speech pre-processing steps were found to be replaceable with convolutional neural networks on raw speech signals.

However, there’s still a problem. There are neural networks in each component, but they’re trained independently with different objectives. Because of that, the errors in one component may not behave well with the errors in another component. So that’s the basic motivation for devising a process where you can train the entire model as one big component itself.

These so-called end-to-end models encompass more and more components in the pipeline discussed above. The 2 most popular ones are (1) Connectionist Temporal Classification (CTC), which is in wide usage these days at Baidu and Google, but it requires a lot of training; and (2) Sequence-To-Sequence (Seq-2-Seq), which doesn’t require manual customization.

The basic motivation is that we want to do end-to-end speech recognition. We are given the audio X — which is a sequence of frames from x1 to xT, and the corresponding output text Y — which is a sequence of y1 to yL. Y is just a text sequence (transcript) and X is the audio processed spectrogram. We want to perform speech recognition by learning a probabilistic model p(Y|X): starting with the data and predicting the target sequences themselves.

1. Connectionist Temporal Classification

The first of these models is called Connectionist Temporal Classification (CTC) ([1], [2], [3]). X is a sequence of data frames with length T: x1, x2, …, xT, and Y is the output tokens of length L: y1, y2, …, yL. Because of the way the model is constructed, we require T to be greater than L.

This model has a very specific structure that makes it suitable for speech:

You get the spectrogram at the bottom (X). You feed it into a bi-directional recurrent neural network, so the hidden state at any time step depends on the entirety of the input data. As such, the network can compute a fairly complicated function of the entire data X.

This model, at the top, has a softmax at every time frame corresponding to the input. The softmax is computed over a vocabulary of your choosing — in this case, the lowercase letters a to z and some punctuation symbols. The vocabulary for CTC is all of that plus one extra token called the blank token.

Each frame of the prediction is basically producing a log probability for a different token class at that time step. In the case above, a score s(k, t) is the log probability of category k at time step t given the data X.

In a CTC model, if you look at the softmax outputs produced by the recurrent neural network over all time steps, you can compute the probability of the transcript from these individual per-frame distributions.
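As a rough sketch of what such a model might look like in PyTorch: a bidirectional LSTM over spectrogram frames with a per-frame softmax over the characters plus the blank, trained with the built-in CTC loss. The layer sizes, vocabulary, and shapes below are illustrative assumptions, not the setup from any of the cited papers.

```python
import torch
import torch.nn as nn

VOCAB = list("abcdefghijklmnopqrstuvwxyz '")       # characters; the blank token gets the last index

class CTCAcousticModel(nn.Module):
    """Bidirectional RNN over spectrogram frames with a per-frame softmax over
    the characters plus the CTC blank token."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=3,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, len(VOCAB) + 1)   # +1 for the blank token

    def forward(self, x):                           # x: (batch, T, feat_dim)
        h, _ = self.rnn(x)
        return self.proj(h).log_softmax(dim=-1)     # s(k, t): per-frame log-probabilities

model = CTCAcousticModel()
x = torch.randn(4, 200, 80)                         # fake batch: 4 utterances, 200 frames each
log_probs = model(x).transpose(0, 1)                # nn.CTCLoss expects (T, batch, classes)
targets = torch.randint(0, len(VOCAB), (4, 30))     # fake character targets, 30 tokens each
loss = nn.CTCLoss(blank=len(VOCAB))(
    log_probs, targets,
    input_lengths=torch.full((4,), 200, dtype=torch.long),
    target_lengths=torch.full((4,), 30, dtype=torch.long),
)
loss.backward()                                     # gradients flow back through the whole network
```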

Let’s take a look at an example (below). The CTC model represents all the paths through the space of per-frame softmax outputs, reading off one symbol at each time step.

In this example, the CTC model goes through 2 C symbols, then a blank symbol, then 2 A symbols, then another blank symbol, then a T symbol, and finally a blank symbol again.

When you collapse these paths, the rule is that repeated symbols in consecutive time steps merge into one, and blanks are then removed. Therefore, you end up with many different frame-level paths that represent the same output sequence.

For the example above, we could have cc <b> aa <b> t <b>, or cc <b> <b> a <b> t <b>, or cccc <b> aaaa <b> tttt <b>. Even though there’s an exponential number of paths that produce the same output sequence, a dynamic programming algorithm lets you sum over all of them exactly. Because of dynamic programming, it’s possible to compute both the log probability p(Y|X) and its gradient exactly. This gradient can be backpropagated to a neural network whose parameters can then be adjusted by your favorite optimizer!
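Here’s a tiny Python illustration of that collapsing rule (merge consecutive repeats, then drop blanks); the function name and the <b> blank marker are just for this example:

```python
def collapse_ctc_path(path, blank="<b>"):
    """Map a frame-level CTC path to its output string: first merge repeated
    consecutive symbols, then drop the blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Paths from the example above collapse to the same transcript:
assert collapse_ctc_path(["c", "c", "<b>", "a", "a", "<b>", "t", "<b>"]) == "cat"
assert collapse_ctc_path(["c", "c", "<b>", "<b>", "a", "<b>", "t", "<b>"]) == "cat"
```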

Below are some results for CTC, which show how the model functions on given audio. A raw waveform is aligned at the bottom, and the corresponding predictions are outputted at the top. You can see that it produces the symbol H at the beginning. At a certain point, it gets a very high probability, which means that the model is confident that it hears the sound corresponding to H.

However, there are some drawbacks to CTC models. They often misspell words and struggle with grammar. So if you have some way to take the different hypotheses produced by the model and re-rank them with a language model, the results get much better.
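A hedged sketch of what such n-best re-ranking might look like; the scoring weights, the toy hypotheses, and the stand-in language model below are made up for illustration:

```python
def rescore(hypotheses, lm_logprob, alpha=0.5, beta=1.0):
    """Combine each hypothesis's acoustic/CTC score with an external language-model
    score and a length bonus; alpha and beta would normally be tuned on held-out data."""
    def combined(hyp):
        text, ctc_logprob = hyp
        return ctc_logprob + alpha * lm_logprob(text) + beta * len(text.split())
    return max(hypotheses, key=combined)

# Toy usage: an n-best list of (text, CTC log-probability) pairs and a fake LM.
nbest = [("i red a book", -4.1), ("i read a book", -4.3)]
best = rescore(nbest, lm_logprob=lambda t: -2.0 if "read a" in t else -6.0)
print(best)   # ('i read a book', -4.3): the language model overturns the raw CTC ranking
```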

Google actually fixed these problems by integrating a language model as part of the CTC model itself during training. That’s the kind of production model currently being deployed with OK Google.

2. Online Sequence-to-Sequence

Online sequence-to-sequence models are designed to overcome the limits of sequence-to-sequence models—you don’t want to wait for the entire input sequence to arrive, and you also want to avoid using the attention model itself over the entire sequence. Essentially, the intention is to produce the outputs as the inputs arrive. It has to solve the following problem: is the model ready to produce an output now that it’s received this much input?

The most notable online seq-2-seq model is called a Neural Transducer [5]. You take the input as it comes in, and every so often, at a regular interval, you run a seq-2-seq model on what’s been received in the last block. As seen in the architecture below, the attention (instead of looking at the entire input) focuses only on a little block of encoder states, and the transducer produces the output symbols for that block.
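To make the block-wise idea concrete, here is a toy sketch of an encoder that consumes the input in fixed-size blocks while carrying its recurrent state forward, so each block only sees audio that has already arrived. This is a simplification for illustration, not the Neural Transducer architecture from [5]; dimensions and block size are assumptions.

```python
import torch
import torch.nn as nn

class BlockStreamingEncoder(nn.Module):
    """Consume the input in fixed-size blocks, carrying the RNN state across blocks,
    so each block's outputs depend only on audio received so far (causality)."""
    def __init__(self, feat_dim=80, hidden=256, block_size=16):
        super().__init__()
        self.block_size = block_size
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)   # uni-directional: no future context

    def forward(self, frames):                     # frames: (1, T, feat_dim), simulating a stream
        state, blocks = None, []
        for start in range(0, frames.size(1), self.block_size):
            chunk = frames[:, start:start + self.block_size]
            out, state = self.rnn(chunk, state)    # the state carries over between blocks
            blocks.append(out)                     # a decoder would attend only within this block
        return blocks

encoder = BlockStreamingEncoder()
per_block_states = encoder(torch.randn(1, 100, 80))   # 100 frames -> 7 blocks of encoder states
```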

The nice thing about the neural transducer is that it maintains causality: the outputs depend only on the audio received so far, not on future input. However, it also introduces an alignment problem: you know you have to produce some symbols as outputs, but you don’t know which input chunk those symbols should be aligned to.

You can actually make this model better by incorporating convolutional neural networks, borrowed from computer vision. The paper [6] uses CNNs on the encoder side of the speech architecture.

You take the traditional pyramid model, and instead of building the pyramid by simply stacking 2 neighboring frames together, you put a fancier architecture on top when you do the stacking. More specifically, you can stack the frames as feature maps and put a CNN on top. For the speech recognition problem, the frequency bands and time steps of the features correspond to a natural substructure of the input data, and the convolutional architecture essentially exploits that substructure.

3. Sequence-To-Sequence

An alternative approach to speech processing is the sequence-to-sequence model that makes next-step predictions. Say you’re given some data X and the symbols produced so far, y1 through yi; the model predicts the probability of the next symbol, p(y(i+1) | y1, …, yi, X). The goal is to learn a very good model for this distribution.
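In other words, the model factorizes the probability of the whole transcript with the chain rule, one next-symbol prediction at a time:

```latex
p(Y \mid X) = \prod_{i=1}^{L} p\left(y_i \mid y_1, \ldots, y_{i-1}, X\right)
```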

In this architecture, you have a neural network (the decoder of the sequence-to-sequence model) that looks at the entire input (processed by the encoder). It feeds the past symbols that have already been produced into a recurrent neural network and then predicts the next token as the output.

So this model does speech recognition with the sequence-to-sequence framework. In translation, the X would be the source language. In the speech domain, the X would be a huge sequence of audio that’s now encoded with a recurrent neural network.

What it needs to function is the ability to look at different parts of temporal space, because the input is really long. Intuitively, translation results get worse as the source sentence becomes longer. That’s because it’s really difficult for the model to look in the right place. Turns out, that problem is aggravated a lot more with audio streams that are much longer. Therefore, you would need to implement an attention mechanism if you want to make this model work at all.

In the example, you’re trying to produce the 1st character, C. You create an attention vector that looks at different parts of the input time steps, then shift the attention and produce the next character (which is A).

If you keep doing this over the entire input stream, the attention moves forward through the input, guided entirely by what the model has learned. In this example, it produces the output sequence “cancel, cancel, cancel.”
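At its core, each attention step is just a softmax-weighted average of the encoder states. A minimal dot-product version (content-based scoring is only one of several possible scoring functions) might look like this; shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    """One attention step: score every encoder time step against the current decoder
    state, normalize with a softmax, and return the weighted context vector."""
    # decoder_state: (hidden,)   encoder_states: (T, hidden)
    scores = encoder_states @ decoder_state        # (T,) dot-product scores
    weights = F.softmax(scores, dim=0)             # attention distribution over input time steps
    context = weights @ encoder_states             # (hidden,) weighted sum of encoder states
    return context, weights

# Toy usage: 50 encoder states of size 256 and one decoder state.
context, weights = attend(torch.randn(256), torch.randn(50, 256))
```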

The Listen, Attend, and Spell [4] model is the canonical model for the seq-2-seq category. Let’s look at the diagram below taken from the paper:

In the Listener architecture, you have an encoder structure. For every time step of the input, it produces a vector representation that encodes the input and is represented as h_t at time step t.

In the Speller architecture, you have a decoder architecture. You generate the next character c_t at every time step t.

The LAS model uses a hierarchical encoder in place of a plain recurrent neural network. Instead of processing one frame per time step, it collapses neighboring frames as they’re fed into the next layer. This reduces the number of time steps to be processed, making the processing faster.
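A minimal sketch of one such pyramidal layer, assuming each pair of neighboring frames is concatenated before a bidirectional LSTM (the dimensions and the pairing factor of 2 are illustrative):

```python
import torch
import torch.nn as nn

class PyramidalBiLSTM(nn.Module):
    """One layer of a Listener-style pyramid: concatenate each pair of neighboring
    frames before the BiLSTM, halving the number of time steps."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.LSTM(input_dim * 2, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                           # x: (batch, T, input_dim)
        b, t, d = x.shape
        t = t - (t % 2)                             # drop the last frame if T is odd
        x = x[:, :t].reshape(b, t // 2, d * 2)      # stack neighboring frames
        out, _ = self.rnn(x)
        return out                                  # (batch, T/2, 2 * hidden_dim)

layer = PyramidalBiLSTM(input_dim=80, hidden_dim=256)
h = layer(torch.randn(4, 200, 80))                  # 200 frames in, 100 time steps out
```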

So what are the limitations of this model?

One of the big limitations preventing its use in an online system is that the output is conditioned on the entire input. That means if you put the model in a real-world speech recognition system, you’d have to wait for the entire audio to be received before outputting any symbols.

Another limitation is that the attention model itself is a computational bottleneck since every output token pays attention to every input time step. This makes it harder and slower for the model to do its learning.

Further, as the input utterance gets longer, the word error rate goes up.

Conclusion

You should now generally be up to speed on the 3 most common deep learning-based frameworks for performing automatic speech recognition in a variety of contexts. 

Thanks for reading.

Originally published on heartbeat.fritz.ai

#deep-learning #machine-learning #data-science




