宇野  和也


Indian Accent Speech Recognition

Traditional ASR (Signal Analysis, MFCC, DTW, HMM & Language Modelling) and DNNs (Custom Models & Baidu DeepSpeech Model) on Indian Accent Speech

Courtesy: Speech and Music Technology Lab, IIT Madras


Even for an officially recognized Indian-English accent, accent-less enunciation is a myth. Regardless of racial stereotypes, our speech is naturally shaped by the vernacular we speak, and the Indian vernaculars are numerous! How, then, does a computer decipher speech from different Indian states, which even Indians from other states find hard to understand?

**ASR (Automatic Speech Recognition)** takes continuous audio speech and outputs the equivalent text. In this blog, we will explore some challenges in speech recognition, with a focus on speaker-independent recognition, both in theory and practice.

The **challenges in ASR** include:

  • Variability of volume
  • Variability of speaking speed
  • Variability of speaker
  • Variability of **pitch**
  • Word boundaries: we speak words without pauses.
  • **Noise**, like background sounds, audience chatter, etc.

Let's address **each of the above problems** in the sections below.

The complete source code of the above studies can be found here.

Models in speech recognition can conceptually be divided into:

  • Acoustic model: turns sound signals into some kind of phonetic representation.
  • Language model: houses domain knowledge of words, grammar, and sentence structure for the language.

Signal Analysis

When we speak, we create sinusoidal vibrations in the air. Higher pitches vibrate faster, with a higher frequency, than lower pitches. A microphone transduces the acoustical energy in those vibrations into electrical energy.

If we say “Hello World”, the corresponding signal would contain two blobs of energy.

Some of the vibrations in the signal have higher amplitude. The amplitude tells us how much acoustical energy is in the sound.

Our speech is made up of many frequencies at the same time, i.e. it is a sum of all those frequencies. To analyze the signal, we use the component frequencies as features. The **Fourier transform** is used to break the signal into these components.
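
As a quick illustration of this decomposition, here is a minimal numpy sketch. The two-tone signal and the 8 kHz sample rate are made-up values for the example, not real speech:

```python
import numpy as np

sr = 8000                      # sample rate in Hz (assumed for this example)
t = np.arange(sr) / sr         # one second of time samples
# a toy "speech" signal: the sum of two sinusoids at 440 Hz and 1000 Hz
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

spectrum = np.abs(np.fft.rfft(signal))          # magnitude of each frequency component
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)  # frequency (Hz) of each FFT bin

# the two strongest components recovered from the mixed signal
top2 = sorted(freqs[np.argsort(spectrum)[-2:]])
print(top2)  # the 440 Hz and 1000 Hz components we mixed in
```

The Fourier transform recovers exactly the frequencies we summed, which is why it works as a feature extractor for the many simultaneous frequencies in speech.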

We can use this splitting technique to convert the sound to a Spectrogram, where **frequency** on the vertical axis is plotted against time. The intensity of shading indicates the amplitude of the signal.

Spectrogram of the hello world phrase

To create a Spectrogram,

  1. **Divide the signal** into time frames.
  2. Split each frame's signal into frequency components with an FFT.
  3. Each time frame is now represented by a **vector of amplitudes**, one per frequency.

One-dimensional vector for one time frame

If we line up the vectors again in their time series order, we can have a visual picture of the sound components, the Spectrogram.
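
Those steps can be sketched in a few lines of numpy. The frame length and hop size below are arbitrary illustrative choices, not values from this post:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Frame the signal, FFT each frame, and stack the
    per-frame amplitude vectors into a time-frequency matrix."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # one amplitude-per-frequency vector per time frame, lined up in time order
    return np.array([np.abs(np.fft.rfft(f)) for f in frames]).T

sr = 8000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 440 * t)     # one second of a 440 Hz tone
S = spectrogram(sig)
print(S.shape)  # (frequency bins, time frames) = (129, 61)
```

Each column of `S` is one time frame's amplitude vector; plotting `S` with shading by magnitude gives the Spectrogram image.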

Spectrogram can be lined up with the original audio signal in time

Next, we’ll look at feature extraction techniques that reduce the noise and dimensionality of our data.

Unnecessary information is encoded in the Spectrogram

Feature Extraction with MFCC

Mel Frequency Cepstrum Coefficient (MFCC) analysis is the reduction of an audio signal to its essential speech-component features using both Mel frequency analysis and Cepstral analysis. The range of frequencies is reduced and binned into groups of frequencies that humans can distinguish. The signal is further separated into source and filter so that variations between speakers unrelated to articulation can be filtered away.

a) Mel Frequency Analysis

Only **those frequencies humans can hear** are important for recognizing speech. We can split the frequencies of the Spectrogram into bins relevant to our own ears and filter out sound that we can’t hear.

Frequencies above the black line will be filtered out
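
The mel scale that defines these perceptual bins can be computed directly. The 2595/700 constants below are one standard formulation of the mel scale (other variants exist):

```python
import numpy as np

def hz_to_mel(f):
    """Map a frequency in Hz onto the perceptual mel scale."""
    return 2595 * np.log10(1 + f / 700)

# Doubling the frequency does NOT double the mel value: equal Hz steps
# shrink on the mel scale, mirroring how human pitch resolution
# worsens at higher frequencies.
for f in (500, 1000, 2000, 4000, 8000):
    print(f, "Hz ->", round(float(hz_to_mel(f)), 1), "mel")
```

This compression is why mel binning keeps fine resolution at low (speech-relevant) frequencies and coarser resolution higher up.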

b) Cepstral Analysis

We also need to separate the elements of sound that are speaker-independent. We can think of a human voice production model as a combination of source and filter, where the source is unique to an individual and the filter is the articulation of words that we all use when speaking.

Cepstral analysis relies on this model to separate the two. The cepstrum can be extracted from a signal algorithmically. Thus, we drop the component of speech unique to individual vocal cords and preserve the shape of the sound made by the vocal tract.

Cepstral analysis combined with Mel frequency analysis gives you 12 or 13 MFCC features related to speech. **Delta and Delta-Delta MFCC features** can optionally be appended to the feature set, effectively doubling (or tripling) the number of features, up to 39, which typically gives better results in ASR.

Thus MFCC (Mel-Frequency Cepstral Coefficients) feature extraction:

  • reduces the dimensionality of our data, and
  • squeezes noise out of the system.
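
The whole pipeline can be sketched in numpy/scipy as below. This is a minimal illustration, not a production implementation: the 26-filter bank and 13 coefficients are common defaults (assumptions here, not values from this post), and the random "power spectrum" input is a stand-in for real framed audio:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):  return 2595 * np.log10(1 + f / 700)
def mel_to_hz(m):  return 700 * (10 ** (m / 2595) - 1)

def mfcc(power_frames, sr, n_filters=26, n_ceps=13):
    """Minimal MFCC sketch: mel filterbank -> log energy -> DCT."""
    n_fft = (power_frames.shape[1] - 1) * 2
    # triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, power_frames.shape[1]))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    log_energy = np.log(power_frames @ fbank.T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]

# toy input: 61 time frames of a 129-bin power spectrum
rng = np.random.default_rng(0)
frames = rng.random((61, 129))
feats = mfcc(frames, sr=8000)          # 13 MFCCs per frame
delta = np.gradient(feats, axis=0)     # delta features (frame-to-frame change)
ddelta = np.gradient(delta, axis=0)    # delta-delta features
full = np.hstack([feats, delta, ddelta])
print(full.shape)  # (61, 39): 13 + 13 + 13 features per frame
```

The deltas here use a simple gradient for illustration; real toolkits typically use a regression over a small window of neighboring frames.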

So there are two acoustic feature options for speech recognition:

  • Spectrograms
  • Mel-Frequency Cepstral Coefficients (MFCCs)

When you construct your pipeline, you will be able to choose to use either spectrogram or MFCC features. Next, we’ll look at sound from a language perspective, i.e. the phonetics of the words we hear.

Phonetics

Phonetics is the study of sound in human speech. Linguistic analysis is used to break down human words into their smallest sound segments.


phonemes define the distinct sounds

  • A phoneme is the smallest sound segment that can be used to distinguish one word from another.
  • A grapheme, in contrast, is the smallest distinct written unit in a language. E.g., English has 26 letters plus a space (27 graphemes).

Unfortunately, we can’t map phonemes to graphemes one-to-one, as some letters map to multiple phonemes and some phonemes map to many letters. For example, the letter C sounds different in cat, chat, and circle.

Phonemes are often a useful intermediary between speech and text. If we can successfully produce an acoustic model that decodes a sound signal into phonemes, the remaining task is to map those phonemes to their matching words. This step is called Lexical Decoding, named so because it is based on a lexicon, or dictionary, of the data set.
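
A toy illustration of lexical decoding is below. The three-word lexicon and its ARPAbet-style phoneme symbols are hypothetical, and real decoders are probabilistic rather than a greedy dictionary lookup:

```python
# Hypothetical lexicon: phoneme sequences (ARPAbet-like symbols) -> words.
LEXICON = {
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
    ("K", "AE", "T"): "cat",
}

def lexical_decode(phonemes):
    """Greedily match the longest known phoneme sequence at each position."""
    words, i = [], 0
    while i < len(phonemes):
        for j in range(len(phonemes), i, -1):   # try longest match first
            word = LEXICON.get(tuple(phonemes[i:j]))
            if word:
                words.append(word)
                i = j
                break
        else:
            i += 1                              # skip an unmatchable phoneme
    return words

print(lexical_decode(["HH", "AH", "L", "OW", "W", "ER", "L", "D"]))
# → ['hello', 'world']
```

Even this toy version shows why the phoneme intermediary helps: the decoder compares short symbol sequences against a dictionary instead of raw audio against every word.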

If we want to train on a limited vocabulary of words, we might just skip the phonemes. If we have a large vocabulary, converting to smaller units first reduces the total number of comparisons needed.

Acoustic Models and the Trouble with Time

With feature extraction, we’ve addressed noise problems as well as variability of speakers. But we still haven’t solved the problem of matching variable lengths of the same word.

Dynamic Time Warping (DTW) calculates the similarity between two signals, even if their time lengths differ. This can be used to align the sequence data of a new word to its most similar counterpart in a dictionary of word examples.
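
The classic dynamic-programming form of DTW can be sketched as follows. For simplicity the sequences here are scalars; a real ASR system would align sequences of MFCC vectors, with a vector distance in place of the absolute difference:

```python
import numpy as np

def dtw_distance(a, b):
    """Cost of the best alignment between two sequences of possibly
    different lengths, via the standard DTW recurrence."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: diagonal match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

slow = [1, 1, 2, 2, 3, 3]        # the same "word", spoken slowly
fast = [1, 2, 3]                 # ...and spoken quickly
print(dtw_distance(slow, fast))  # → 0.0: DTW aligns them perfectly
print(dtw_distance(fast, [3, 2, 1]))  # a reversed signal: nonzero distance
```

Stretching a word in time costs nothing under DTW, which is exactly the property needed to match variable-length utterances of the same word.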

2 signals mapped with Dynamic Time Warping

#deep-speech #speech #deep-learning #speech-recognition #machine-learning

