Gizzy Berry

Gizzy Berry


How to Build Your Own End-to-End Speech Recognition Model in PyTorch

Deep Learning has changed the game in speech recognition with the introduction of end-to-end models. These models take in audio, and directly output transcriptions. Two of the most popular end-to-end models today are Deep Speech by Baidu, and Listen Attend Spell (LAS) by Google. Both Deep Speech and LAS, are recurrent neural network (RNN) based architectures with different approaches to modeling speech recognition.

Deep Speech uses the Connectionist Temporal Classification (CTC) loss function to predict the speech transcript. LAS uses a sequence to sequence network architecture for its predictions.

These models simplified speech recognition pipelines by taking advantage of the capacity of deep learning system to learn from large datasets. With enough data, you should, in theory, be able to build a super robust speech recognition model that can account for all the nuance in speech without having to spend a ton of time and effort hand engineering acoustic features or dealing with complex pipelines in more old-school GMM-HMM model architectures, for example.

Deep learning is a fast-moving field, and Deep Speech and LAS style architectures are already quickly becoming outdated. You can read about where the industry is moving in the Latest Advancement Section below.

How to Build Your Own End-to-End Speech Recognition Model in PyTorch

Let’s walk through how one would build their own end-to-end speech recognition model in PyTorch. The model we’ll build is inspired by Deep Speech 2 (Baidu’s second revision of their now-famous model) with some personal improvements to the architecture.

The output of the model will be a probability matrix of characters, and we’ll use that probability matrix to decode the most likely characters spoken from the audio. You can find the full code and also run the it with GPU support on Google Colaboratory.

#deep-learning #machine-learning #artificial-intelligence #tensorflow #python

What is GEEK

Buddha Community

How to Build Your Own End-to-End Speech Recognition Model in PyTorch
Hermann  Frami

Hermann Frami


A Simple Wrapper Around Amplify AppSync Simulator

This serverless plugin is a wrapper for amplify-appsync-simulator made for testing AppSync APIs built with serverless-appsync-plugin.


npm install serverless-appsync-simulator
# or
yarn add serverless-appsync-simulator


This plugin relies on your serverless yml file and on the serverless-offline plugin.

  - serverless-dynamodb-local # only if you need dynamodb resolvers and you don't have an external dynamodb
  - serverless-appsync-simulator
  - serverless-offline

Note: Order is important serverless-appsync-simulator must go before serverless-offline

To start the simulator, run the following command:

sls offline start

You should see in the logs something like:

Serverless: AppSync endpoint: http://localhost:20002/graphql
Serverless: GraphiQl: http://localhost:20002


Put options under custom.appsync-simulator in your serverless.yml file

| option | default | description | | ------------------------ | -------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------- | | apiKey | 0123456789 | When using API_KEY as authentication type, the key to authenticate to the endpoint. | | port | 20002 | AppSync operations port; if using multiple APIs, the value of this option will be used as a starting point, and each other API will have a port of lastPort + 10 (e.g. 20002, 20012, 20022, etc.) | | wsPort | 20003 | AppSync subscriptions port; if using multiple APIs, the value of this option will be used as a starting point, and each other API will have a port of lastPort + 10 (e.g. 20003, 20013, 20023, etc.) | | location | . (base directory) | Location of the lambda functions handlers. | | refMap | {} | A mapping of resource resolutions for the Ref function | | getAttMap | {} | A mapping of resource resolutions for the GetAtt function | | importValueMap | {} | A mapping of resource resolutions for the ImportValue function | | functions | {} | A mapping of external functions for providing invoke url for external fucntions | | dynamoDb.endpoint | http://localhost:8000 | Dynamodb endpoint. Specify it if you're not using serverless-dynamodb-local. Otherwise, port is taken from dynamodb-local conf | | dynamoDb.region | localhost | Dynamodb region. Specify it if you're connecting to a remote Dynamodb intance. | | dynamoDb.accessKeyId | DEFAULT_ACCESS_KEY | AWS Access Key ID to access DynamoDB | | dynamoDb.secretAccessKey | DEFAULT_SECRET | AWS Secret Key to access DynamoDB | | dynamoDb.sessionToken | DEFAULT_ACCESS_TOKEEN | AWS Session Token to access DynamoDB, only if you have temporary security credentials configured on AWS | | dynamoDb.* | | You can add every configuration accepted by DynamoDB SDK | | rds.dbName | | Name of the database | | rds.dbHost | | Database host | | rds.dbDialect | | Database dialect. Possible values (mysql | postgres) | | rds.dbUsername | | Database username | | rds.dbPassword | | Database password | | rds.dbPort | | Database port | | watch | - *.graphql
- *.vtl | Array of glob patterns to watch for hot-reloading. |


    location: '.webpack/service' # use webpack build directory
      endpoint: 'http://my-custom-dynamo:8000'


By default, the simulator will hot-relad when changes to *.graphql or *.vtl files are detected. Changes to *.yml files are not supported (yet? - this is a Serverless Framework limitation). You will need to restart the simulator each time you change yml files.

Hot-reloading relies on watchman. Make sure it is installed on your system.

You can change the files being watched with the watch option, which is then passed to watchman as the match expression.


      - ["match", "handlers/**/*.vtl", "wholename"] # => array is interpreted as the literal match expression
      - "*.graphql"                                 # => string like this is equivalent to `["match", "*.graphql"]`

Or you can opt-out by leaving an empty array or set the option to false

Note: Functions should not require hot-reloading, unless you are using a transpiler or a bundler (such as webpack, babel or typescript), un which case you should delegate hot-reloading to that instead.

Resource CloudFormation functions resolution

This plugin supports some resources resolution from the Ref, Fn::GetAtt and Fn::ImportValue functions in your yaml file. It also supports some other Cfn functions such as Fn::Join, Fb::Sub, etc.

Note: Under the hood, this features relies on the cfn-resolver-lib package. For more info on supported cfn functions, refer to the documentation

Basic usage

You can reference resources in your functions' environment variables (that will be accessible from your lambda functions) or datasource definitions. The plugin will automatically resolve them for you.

      Ref: MyBucket # resolves to `my-bucket-name`

      Type: AWS::DynamoDB::Table
        TableName: myTable
      Type: AWS::S3::Bucket
        BucketName: my-bucket-name

# in your appsync config
    name: dynamosource
        Ref: MyDbTable # resolves to `myTable`

Override (or mock) values

Sometimes, some references cannot be resolved, as they come from an Output from Cloudformation; or you might want to use mocked values in your local environment.

In those cases, you can define (or override) those values using the refMap, getAttMap and importValueMap options.

  • refMap takes a mapping of resource name to value pairs
  • getAttMap takes a mapping of resource name to attribute/values pairs
  • importValueMap takes a mapping of import name to values pairs


      # Override `MyDbTable` resolution from the previous example.
      MyDbTable: 'mock-myTable'
      # define ElasticSearchInstance DomainName
        DomainEndpoint: 'localhost:9200'
      other-service-api-url: ''

# in your appsync config
    name: elasticsource
      # endpoint resolves as 'http://localhost:9200'
          - ''
          - - https://
            - Fn::GetAtt:
                - ElasticSearchInstance
                - DomainEndpoint

Key-value mock notation

In some special cases you will need to use key-value mock nottation. Good example can be case when you need to include serverless stage value (${self:provider.stage}) in the import name.

This notation can be used with all mocks - refMap, getAttMap and importValueMap

      Fn::ImportValue: other-service-api-${self:provider.stage}-url

      - key: other-service-api-${self:provider.stage}-url
        value: ''


This plugin only tries to resolve the following parts of the yml tree:

  • provider.environment
  • functions[*].environment
  • custom.appSync

If you have the need of resolving others, feel free to open an issue and explain your use case.

For now, the supported resources to be automatically resovled by Ref: are:

  • DynamoDb tables
  • S3 Buckets

Feel free to open a PR or an issue to extend them as well.

External functions

When a function is not defined withing the current serverless file you can still call it by providing an invoke url which should point to a REST method. Make sure you specify "get" or "post" for the method. Default is "get", but you probably want "post".

        url: http://localhost:3016/2015-03-31/functions/addUser/invocations
        method: post
        method: post

Supported Resolver types

This plugin supports resolvers implemented by amplify-appsync-simulator, as well as custom resolvers.

From Aws Amplify:

  • NONE

Implemented by this plugin

  • HTTP

Relational Database

Sample VTL for a create mutation

#set( $cols = [] )
#set( $vals = [] )
#foreach( $entry in $ctx.args.input.keySet() )
  #set( $regex = "([a-z])([A-Z]+)")
  #set( $replacement = "$1_$2")
  #set( $toSnake = $entry.replaceAll($regex, $replacement).toLowerCase() )
  #set( $discard = $cols.add("$toSnake") )
  #if( $util.isBoolean($ctx.args.input[$entry]) )
      #if( $ctx.args.input[$entry] )
        #set( $discard = $vals.add("1") )
        #set( $discard = $vals.add("0") )
      #set( $discard = $vals.add("'$ctx.args.input[$entry]'") )
#set( $valStr = $vals.toString().replace("[","(").replace("]",")") )
#set( $colStr = $cols.toString().replace("[","(").replace("]",")") )
#if ( $valStr.substring(0, 1) != '(' )
  #set( $valStr = "($valStr)" )
#if ( $colStr.substring(0, 1) != '(' )
  #set( $colStr = "($colStr)" )
  "version": "2018-05-29",
  "statements":   ["INSERT INTO <name-of-table> $colStr VALUES $valStr", "SELECT * FROM    <name-of-table> ORDER BY id DESC LIMIT 1"]

Sample VTL for an update mutation

#set( $update = "" )
#set( $equals = "=" )
#foreach( $entry in $ctx.args.input.keySet() )
  #set( $cur = $ctx.args.input[$entry] )
  #set( $regex = "([a-z])([A-Z]+)")
  #set( $replacement = "$1_$2")
  #set( $toSnake = $entry.replaceAll($regex, $replacement).toLowerCase() )
  #if( $util.isBoolean($cur) )
      #if( $cur )
        #set ( $cur = "1" )
        #set ( $cur = "0" )
  #if ( $util.isNullOrEmpty($update) )
      #set($update = "$toSnake$equals'$cur'" )
      #set($update = "$update,$toSnake$equals'$cur'" )
  "version": "2018-05-29",
  "statements":   ["UPDATE <name-of-table> SET $update WHERE id=$", "SELECT * FROM <name-of-table> WHERE id=$"]

Sample resolver for delete mutation

  "version": "2018-05-29",
  "statements":   ["UPDATE <name-of-table> set deleted_at=NOW() WHERE id=$", "SELECT * FROM <name-of-table> WHERE id=$"]

Sample mutation response VTL with support for handling AWSDateTime

#set ( $index = -1)
#set ( $result = $util.parseJson($ctx.result) )
#set ( $meta = $result.sqlStatementResults[1].columnMetadata)
#foreach ($column in $meta)
    #set ($index = $index + 1)
    #if ( $column["typeName"] == "timestamptz" )
        #set ($time = $result["sqlStatementResults"][1]["records"][0][$index]["stringValue"] )
        #set ( $nowEpochMillis = $util.time.parseFormattedToEpochMilliSeconds("$time.substring(0,19)+0000", "yyyy-MM-dd HH:mm:ssZ") )
        #set ( $isoDateTime = $util.time.epochMilliSecondsToISO8601($nowEpochMillis) )
        $util.qr( $result["sqlStatementResults"][1]["records"][0][$index].put("stringValue", "$isoDateTime") )
#set ( $res = $util.parseJson($util.rds.toJsonString($util.toJson($result)))[1][0] )
#set ( $response = {} )
#foreach($mapKey in $res.keySet())
    #set ( $s = $mapKey.split("_") )
    #set ( $camelCase="" )
    #set ( $isFirst=true )
    #foreach($entry in $s)
        #if ( $isFirst )
          #set ( $first = $entry.substring(0,1) )
          #set ( $first = $entry.substring(0,1).toUpperCase() )
        #set ( $isFirst=false )
        #set ( $stringLength = $entry.length() )
        #set ( $remaining = $entry.substring(1, $stringLength) )
        #set ( $camelCase = "$camelCase$first$remaining" )
    $util.qr( $response.put("$camelCase", $res[$mapKey]) )

Using Variable Map

Variable map support is limited and does not differentiate numbers and strings data types, please inject them directly if needed.

Will be escaped properly: null, true, and false values.

  "version": "2018-05-29",
  "statements":   [
    "UPDATE <name-of-table> set deleted_at=NOW() WHERE id=:ID",
    "SELECT * FROM <name-of-table> WHERE id=:ID and unix_timestamp > $ctx.args.newerThan"
  variableMap: {
    ":ID": $,
##    ":TIMESTAMP": $ctx.args.newerThan -- This will be handled as a string!!!


Author: Serverless-appsync
Source Code: 
License: MIT License

#serverless #sync #graphql 

Generis: Versatile Go Code Generator


Versatile Go code generator.


Generis is a lightweight code preprocessor adding the following features to the Go language :

  • Generics.
  • Free-form macros.
  • Conditional compilation.
  • HTML templating.
  • Allman style conversion.


package main;


import (


#define DebugMode
#as true

// ~~

#define HttpPort
#as 8080

// ~~

#define WriteLine( {{text}} )
#as log.Println( {{text}} )

// ~~

#define local {{variable}} : {{type}};
#as var {{variable}} {{type}};

// ~~

#define DeclareStack( {{type}}, {{name}} )
    // -- TYPES

    type {{name}}Stack struct
        ElementArray []{{type}};

    // -- INQUIRIES

    func ( stack * {{name}}Stack ) IsEmpty(
        ) bool
        return len( stack.ElementArray ) == 0;

    // -- OPERATIONS

    func ( stack * {{name}}Stack ) Push(
        element {{type}}
        stack.ElementArray = append( stack.ElementArray, element );

    // ~~

    func ( stack * {{name}}Stack ) Pop(
        ) {{type}}
            element : {{type}};

        element = stack.ElementArray[ len( stack.ElementArray ) - 1 ];

        stack.ElementArray = stack.ElementArray[ : len( stack.ElementArray ) - 1 ];

        return element;

// ~~

#define DeclareStack( {{type}} )
#as DeclareStack( {{type}}, {{type:PascalCase}} )

// -- TYPES

DeclareStack( string )
DeclareStack( int32 )


func HandleRootPage(
    response_writer http.ResponseWriter,
    request * http.Request
        boolean : bool;
        natural : uint;
        integer : int;
        real : float64;
        text : string;
        integer_stack : Int32Stack;

    boolean = true;
    natural = 10;
    integer = 20;
    real = 30.0;
    text = "text";
    escaped_url_text = "&escaped text?";
    escaped_html_text = "<escaped text/>";

    integer_stack.Push( 10 );
    integer_stack.Push( 20 );
    integer_stack.Push( 30 );

    #write response_writer
        <!DOCTYPE html>
        <html lang="en">
                <meta charset="utf-8">
                <title><%= request.URL.Path %></title>
                <% if ( boolean ) { %>
                    <%= "URL : " + request.URL.Path %>
                    <%@ natural %>
                    <%# integer %>
                    <%& real %>
                    <%~ text %>
                    <%^ escaped_url_text %>
                    <%= escaped_html_text %>
                    <%= "<%% ignored %%>" %>
                    <%% ignored %%>
                <% } %>
                Stack :
                <% for !integer_stack.IsEmpty() { %>
                    <%# integer_stack.Pop() %>
                <% } %>

// ~~

func main()
    http.HandleFunc( "/", HandleRootPage );

    #if DebugMode
        WriteLine( "Listening on http://localhost:HttpPort" );

        http.ListenAndServe( ":HttpPort", nil )


#define directive

Constants and generic code can be defined with the following syntax :

#define old code
#as new code

#define old code

#as new code


#define parameter

The #define directive can contain one or several parameters :

{{variable name}} : hierarchical code (with properly matching brackets and parentheses)
{{variable name#}} : statement code (hierarchical code without semicolon)
{{variable name$}} : plain code
{{variable name:boolean expression}} : conditional hierarchical code
{{variable name#:boolean expression}} : conditional statement code
{{variable name$:boolean expression}} : conditional plain code

They can have a boolean expression to require they match specific conditions :

HasText text
HasPrefix prefix
HasSuffix suffix
HasIdentifier text
expression && expression
expression || expression
( expression )

The #define directive must not start or end with a parameter.

#as parameter

The #as directive can use the value of the #define parameters :

{{variable name}}
{{variable name:filter function}}
{{variable name:filter function:filter function:...}}

Their value can be changed through one or several filter functions :

ReplacePrefix old_prefix new_prefix
ReplaceSuffix old_suffix new_suffix
ReplaceText old_text new_text
ReplaceIdentifier old_identifier new_identifier
AddPrefix prefix
AddSuffix suffix
RemovePrefix prefix
RemoveSuffix suffix
RemoveText text
RemoveIdentifier identifier

#if directive

Conditional code can be defined with the following syntax :

#if boolean expression
    #if boolean expression
    #if boolean expression

The boolean expression can use the following operators :

expression && expression
expression || expression
( expression )

#write directive

Templated HTML code can be sent to a stream writer using the following syntax :

#write writer expression
    <% code %>
    <%@ natural expression %>
    <%# integer expression %>
    <%& real expression %>
    <%~ text expression %>
    <%= escaped text expression %>
    <%! removed content %>
    <%% ignored tags %%>


  • There is no operator precedence in boolean expressions.
  • The --join option requires to end the statements with a semicolon.
  • The #writer directive is only available for the Go language.


Install the DMD 2 compiler (using the MinGW setup option on Windows).

Build the executable with the following command line :

dmd -m64 generis.d

Command line

generis [options]


--prefix # : set the command prefix
--parse INPUT_FOLDER/ : parse the definitions of the Generis files in the input folder
--process INPUT_FOLDER/ OUTPUT_FOLDER/ : reads the Generis files in the input folder and writes the processed files in the output folder
--trim : trim the HTML templates
--join : join the split statements
--create : create the output folders if needed
--watch : watch the Generis files for modifications
--pause 500 : time to wait before checking the Generis files again
--tabulation 4 : set the tabulation space count
--extension .go : generate files with this extension


generis --process GS/ GO/

Reads the Generis files in the GS/ folder and writes Go files in the GO/ folder.

generis --process GS/ GO/ --create

Reads the Generis files in the GS/ folder and writes Go files in the GO/ folder, creating the output folders if needed.

generis --process GS/ GO/ --create --watch

Reads the Generis files in the GS/ folder and writes Go files in the GO/ folder, creating the output folders if needed and watching the Generis files for modifications.

generis --process GS/ GO/ --trim --join --create --watch

Reads the Generis files in the GS/ folder and writes Go files in the GO/ folder, trimming the HTML templates, joining the split statements, creating the output folders if needed and watching the Generis files for modifications.



Author: Senselogic
Source Code: 
License: View license

#go #golang #code 

HI Python

HI Python


How to Use ASR System for Accurate Transcription Properties of Your Digital Product

Thanks to advances in speech recognition, companies can now build a whole range of products with accurate transcription capabilities at their heart. Conversation intelligence platforms, personal assistants and video and audio editing tools, for example, all rely on speech to text transcription. However, you often need to train these systems for every domain you want to transcribe, using supervised data. In practice, you need a large body of transcribed audio that’s similar to what you are transcribing just to get started in a new domain.

Recently, Facebook released wav2vec 2.0 which goes some way towards addressing this challenge. wav2vec 2.0 allows you to pre-train transcription systems using _audio only — _with no corresponding transcription — and then use just a tiny transcribed dataset for training.

In this blog, we share how we worked with wav2vec 2.0, with great results.

#speech-to-text-recognition #speech-recognition #machine-learning #artificial-intelligence #python #pytorch #speech-recognition-in-python #hackernoon-top-story

Edna  Bernhard

Edna Bernhard


Speech Analytics, Sound Analytics in TorchAudio

The landscape of DataScience is changing everyday. In the last few years we have seen numerous number of research and advancement in NLP and Computer Vision field. But there is a field which is still unexplored and has a lot of potential , the field is — SPEECH.

In the last tutorial we have learned about:

  1. What is a Sound wave?

2. The basic properties of sound wave

3. Feature Extraction from sound wave

4. Pre-Processing a sound wave

In this tutorial we would be looking into the practical application of it in Python. The two most popular libraries to help you in your journey are:

  1. Librosa
  2. TorchAudio

TorchAudio — It is a PyTorch domain library consisting of I/O, popular datasets and common audio transformations that can bring new speed and efficiency to your PyTorch projects. It is one of the powerful speech modulation software created by Facebook.

#speech-analytics #speech-recognition #machine-learning-ai #pytorch #speech #ai

宇野  和也

宇野 和也


Indian Accent Speech Recognition

Traditional ASR (Signal Analysis, MFCC, DTW, HMM & Language Modelling) and DNNs (Custom Models & Baidu DeepSpeech Model) on Indian Accent Speech

Courtesy_: _Speech and Music Technology Lab, IIT Madras

Image Courtesy

Notwithstanding an approved Indian-English accent speech, accent-less enunciation is a myth. Irregardless of the racial stereotypes, our speech is naturally shaped by the vernacular we speak, and the Indian vernaculars are numerous! Then how does a computer decipher speech from different Indian states, which even Indians from other states, find ambiguous to understand?

**ASR (Automatic Speech Recognition) **takes any continuous audio speech and output the equivalent text . In this blog, we will explore some challenges in speech recognition with focus on the speaker-independent recognition, both in theory and practice.

The** challenges in ASR** include

  • Variability of volume
  • Variability of words speed
  • Variability of Speaker
  • Variability of** pitch**
  • Word boundaries: we speak words without pause.
  • **Noises **like background sound, audience talks etc.

Lets address** each of the above problems** in the sections discussed below.

The complete source code of the above studies can be found here.

Models in speech recognition can conceptually be divided into:

  • Acoustic model: Turn sound signals into some kind of phonetic representation.
  • Language model: houses domain knowledge of words, grammar, and sentence structure for the language.

Signal Analysis

When we speak we create sinusoidal vibrations in the air. Higher pitches vibrate faster with a higher frequency than lower pitches. A microphone transduce acoustical energy in vibrations to electrical energy.

If we say “Hello World’ then the corresponding signal would contain 2 blobs

Some of the vibrations in the signal have higher amplitude. The amplitude tells us how much acoustical energy is there in the sound

Our speech is made up of many frequencies at the same time, i.e. it is a sum of all those frequencies. To analyze the signal, we use the component frequencies as features. **Fourier transform **is used to break the signal into these components.

We can use this splitting technique to convert the sound to a Spectrogram, where **frequency **on the vertical axis is plotted against time. The intensity of shading indicates the amplitude of the signal.

Spectrogram of the hello world phrase

To create a Spectrogram,

  1. **Divide the signal **into time frames.
  2. Split each frame signal into frequency components with an FFT.
  3. Each time frame is now represented with a** vector of amplitudes** at each frequency.

one dimensional vector for one time frame

If we line up the vectors again in their time series order, we can have a visual picture of the sound components, the Spectrogram.

Spectrogram can be lined up with the original audio signal in time

Next, we’ll look at Feature Extraction techniques which would reduce the noise and dimensionality of our data.

Unnecessary information is encoded in Spectrograph

Feature Extraction with MFCC

Mel Frequency Cepstrum Coefficient Analysis is the reduction of an audio signal to essential speech component features using both Mel frequency analysis and Cepstral analysis. The range of frequencies are reduced and binned into groups of frequencies that humans can distinguish. The signal is further separated into source and filter so that variations between speakers unrelated to articulation can be filtered away.

a) Mel Frequency Analysis

Only **those frequencies humans can hear are **important for recognizing speech. We can split the frequencies of the Spectrogram into bins relevant to our own ears and filter out sound that we can’t hear.

Frequencies above the black line will be filtered out

b) Cepstral Analysis

We also need to separate the elements of sound that are speaker-independent. We can think of a human voice production model as a combination of source and filter, where the source is unique to an individual and the filter is the articulation of words that we all use when speaking.

Cepstral analysis relies on this model for separating the two. The cepstrum can be extracted from a signal with an algorithm. Thus, we drop the component of speech unique to individual vocal chords and preserving the shape of the sound made by the vocal tract.

Cepstral analysis combined with Mel frequency analysis get you 12 or 13 MFCC features related to speech. **Delta and Delta-Delta MFCC features **can optionally be appended to the feature set, effectively doubling (or tripling) the number of features, up to 39 features, but gives better results in ASR.

Thus MFCC (Mel-frequency cepstral coefficients) Features Extraction,

  • Reduced the dimensionality of our data and
  • We squeeze noise out of the system

So there are 2 Acoustic Features for Speech Recognition:

  • Spectrograms
  • Mel-Frequency Cepstral Coefficients (MFCCs):

When you construct your pipeline, you will be able to choose to use either spectrogram or MFCC features. Next, we’ll look at sound from a language perspective, i.e. the phonetics of the words we hear.


Phonetics is the study of sound in human speech. Linguistic analysis is used to break down human words into their smallest sound segments.

phonemes define the distinct sounds

  • Phoneme is the smallest sound segment that can be used to distinguish one word from another.
  • Grapheme, in contrast, is the smallest distinct unit written in a language. Eg: English has 26 alphabets plus a space (27 graphemes).

Unfortunately, we can’t map phonemes to grapheme, as some letters map to multiple phonemes & some phonemes map to many letters. For example, the C letter sounds different in cat, chat, and circle.

Phonemes are often a useful intermediary between speech and text. If we can successfully produce an acoustic model that decodes a sound signal into phonemes the remaining task would be to map those phonemes to their matching words. This step is called Lexical Decoding, named so as it is based on a lexicon or dictionary of the data set.

If we want to train a limited vocabulary of words we might just skip the phonemes. If we have a large vocabulary, then converting to smaller units first, reduces the total number of comparisons needed.

Acoustic Models and the Trouble with Time

With feature extraction, we’ve addressed noise problems as well as variability of speakers. But we still haven’t solved the problem of matching variable lengths of the same word.

Dynamic Time Warping (DTW) calculates the similarity between two signals, even if their time lengths differ. This can be used to align the sequence data of a new word to its most similar counterpart in a dictionary of word examples.

2 signals mapped with Dynamic Time Warping

#deep-speech #speech #deep-learning #speech-recognition #machine-learning #deep learning