Julie  Donnelly

Julie Donnelly


11L – Speech recognition and Graph Transformer Networks

11L – Speech recognition and Graph Transformer Networks


  • 00:00 – Guest lecturer introduction
  • 01:10 – Outline
  • 02:36 – Modern speech recognition
  • 09:26 – Connectionist temporal classification
  • 54:44 – Decoding with beam search (inference)
  • 1:11:09 – Graph Transformer Networks

#developer #data-science

What is GEEK

Buddha Community

11L – Speech recognition and Graph Transformer Networks
Abdul  Larson

Abdul Larson


How to Explain Graph Neural Network — GNNExplainer

Graph Neural Network(GNN) is a type of neural network that can be directly applied to graph-structured data. My previous post gave a brief introduction on GNN. Readers may be directed to this post for more details.

Many research works have shown GNN’s power for understanding graphs, but the way how and why GNN works still remains a mystery for most people. Unlike CNN, where we can extract activation of each layer to visualize the decisions of the network, in GNN it is hard to get a meaningful explanation of what features the network has learnt. Why does GNN determine a node is class A instead of class B? Why does GNN determine a graph is a chemical or molecule? It seems like GNN sees some useful structural information and determines are made upon these observations. But now the problem is, what observations does GNN see?

What is GNNExplainer?

GNNExplainer is introduced in this paper.

Briefly speaking, it is trying to build a network to learn what a GNN has learnt.

The main principle of GNNExplainer is by reducing redundant information in a graph which does not directly impact the decisions. To explain a graph, we want to know what are the crucial features or structures in the graph that affect the decisions of a neural network. If a feature is important, then the prediction should be altered largely by removing or replacing this feature with something else. On the other hand, if removing or altering a feature does not affect the prediction outcome, the feature is considered not essential and thus should not be included in the explanation for a graph.

How does it work?

The primary objective for GNNExplainer is to generate a minimal graph that explains the decision for a node or a graph. To achieve this goal, the problem can be defined as finding a subgraph in the computation graph, that minimizes the difference in the prediction scores using the whole computation graph and the minimal graph. In the paper, this process is formulated as maximizing the mutual information(MI) between the minimal graph Gs and the computation graph G:

Image for post

Besides, there is a secondary objective: the graph needs to be minimal. Though it was also mentioned in the first objective, we need to have a method to formulate this objective as well. The paper addresses it by adding a loss for the number of edges. Therefore, the loss for GNNExplainer is literally the combination of prediction loss and edge size loss.

#graph #graph-neural-networks #graph-theory #pattern-recognition #machine-learning

How Graph Convolutional Networks (GCN) Work

In this post, we’re gonna take a close look at one of the well-known Graph neural networks named GCN. First, we’ll get the intuition to see how it works, then we’ll go deeper into the maths behind it.

Why Graphs?

Many problems are graphs in true nature. In our world, we see many data are graphs, such as molecules, social networks, and paper citations networks.

Image for post

Examples of graphs. (Picture from [1])

Tasks on Graphs

  • Node classification: Predict a type of a given node
  • Link prediction: Predict whether two nodes are linked
  • Community detection: Identify densely linked clusters of nodes
  • Network similarity: How similar are two (sub)networks

Machine Learning Lifecycle

In the graph, we have node features (the data of nodes) and the structure of the graph (how nodes are connected).

For the former, we can easily get the data from each node. But when it comes to the structure, it is not trivial to extract useful information from it. For example, if 2 nodes are close to one another, should we treat them differently to other pairs? How about high and low degree nodes? In fact, each specific task can consume a lot of time and effort just for Feature Engineering, i.e., to distill the structure into our features.

Image for post

Feature engineering on graphs. (Picture from [1])

It would be much better to somehow get both the node features and the structure as the input, and let the machine to figure out what information is useful by itself.

That’s why we need Graph Representation Learning.

Image for post

We want the graph can learn the “feature engineering” by itself. (Picture from [1])

Graph Convolutional Networks (GCNs)

Paper: Semi-supervised Classification with Graph Convolutional Networks(2017) [3]

GCN is a type of convolutional neural network that can work directly on graphs and take advantage of their structural information.

it solves the problem of classifying nodes (such as documents) in a graph (such as a citation network), where labels are only available for a small subset of nodes (semi-supervised learning).

Image for post

Example of Semi-supervised learning on Graphs. Some nodes dont have labels (unknown nodes).

#graph-neural-networks #graph-convolution-network #deep-learning #neural-networks

宇野  和也

宇野 和也


Indian Accent Speech Recognition

Traditional ASR (Signal Analysis, MFCC, DTW, HMM & Language Modelling) and DNNs (Custom Models & Baidu DeepSpeech Model) on Indian Accent Speech

Courtesy_: _Speech and Music Technology Lab, IIT Madras

Image Courtesy

Notwithstanding an approved Indian-English accent speech, accent-less enunciation is a myth. Irregardless of the racial stereotypes, our speech is naturally shaped by the vernacular we speak, and the Indian vernaculars are numerous! Then how does a computer decipher speech from different Indian states, which even Indians from other states, find ambiguous to understand?

**ASR (Automatic Speech Recognition) **takes any continuous audio speech and output the equivalent text . In this blog, we will explore some challenges in speech recognition with focus on the speaker-independent recognition, both in theory and practice.

The** challenges in ASR** include

  • Variability of volume
  • Variability of words speed
  • Variability of Speaker
  • Variability of** pitch**
  • Word boundaries: we speak words without pause.
  • **Noises **like background sound, audience talks etc.

Lets address** each of the above problems** in the sections discussed below.

The complete source code of the above studies can be found here.

Models in speech recognition can conceptually be divided into:

  • Acoustic model: Turn sound signals into some kind of phonetic representation.
  • Language model: houses domain knowledge of words, grammar, and sentence structure for the language.

Signal Analysis

When we speak we create sinusoidal vibrations in the air. Higher pitches vibrate faster with a higher frequency than lower pitches. A microphone transduce acoustical energy in vibrations to electrical energy.

If we say “Hello World’ then the corresponding signal would contain 2 blobs

Some of the vibrations in the signal have higher amplitude. The amplitude tells us how much acoustical energy is there in the sound

Our speech is made up of many frequencies at the same time, i.e. it is a sum of all those frequencies. To analyze the signal, we use the component frequencies as features. **Fourier transform **is used to break the signal into these components.

We can use this splitting technique to convert the sound to a Spectrogram, where **frequency **on the vertical axis is plotted against time. The intensity of shading indicates the amplitude of the signal.

Spectrogram of the hello world phrase

To create a Spectrogram,

  1. **Divide the signal **into time frames.
  2. Split each frame signal into frequency components with an FFT.
  3. Each time frame is now represented with a** vector of amplitudes** at each frequency.

one dimensional vector for one time frame

If we line up the vectors again in their time series order, we can have a visual picture of the sound components, the Spectrogram.

Spectrogram can be lined up with the original audio signal in time

Next, we’ll look at Feature Extraction techniques which would reduce the noise and dimensionality of our data.

Unnecessary information is encoded in Spectrograph

Feature Extraction with MFCC

Mel Frequency Cepstrum Coefficient Analysis is the reduction of an audio signal to essential speech component features using both Mel frequency analysis and Cepstral analysis. The range of frequencies are reduced and binned into groups of frequencies that humans can distinguish. The signal is further separated into source and filter so that variations between speakers unrelated to articulation can be filtered away.

a) Mel Frequency Analysis

Only **those frequencies humans can hear are **important for recognizing speech. We can split the frequencies of the Spectrogram into bins relevant to our own ears and filter out sound that we can’t hear.

Frequencies above the black line will be filtered out

b) Cepstral Analysis

We also need to separate the elements of sound that are speaker-independent. We can think of a human voice production model as a combination of source and filter, where the source is unique to an individual and the filter is the articulation of words that we all use when speaking.

Cepstral analysis relies on this model for separating the two. The cepstrum can be extracted from a signal with an algorithm. Thus, we drop the component of speech unique to individual vocal chords and preserving the shape of the sound made by the vocal tract.

Cepstral analysis combined with Mel frequency analysis get you 12 or 13 MFCC features related to speech. **Delta and Delta-Delta MFCC features **can optionally be appended to the feature set, effectively doubling (or tripling) the number of features, up to 39 features, but gives better results in ASR.

Thus MFCC (Mel-frequency cepstral coefficients) Features Extraction,

  • Reduced the dimensionality of our data and
  • We squeeze noise out of the system

So there are 2 Acoustic Features for Speech Recognition:

  • Spectrograms
  • Mel-Frequency Cepstral Coefficients (MFCCs):

When you construct your pipeline, you will be able to choose to use either spectrogram or MFCC features. Next, we’ll look at sound from a language perspective, i.e. the phonetics of the words we hear.


Phonetics is the study of sound in human speech. Linguistic analysis is used to break down human words into their smallest sound segments.

phonemes define the distinct sounds

  • Phoneme is the smallest sound segment that can be used to distinguish one word from another.
  • Grapheme, in contrast, is the smallest distinct unit written in a language. Eg: English has 26 alphabets plus a space (27 graphemes).

Unfortunately, we can’t map phonemes to grapheme, as some letters map to multiple phonemes & some phonemes map to many letters. For example, the C letter sounds different in cat, chat, and circle.

Phonemes are often a useful intermediary between speech and text. If we can successfully produce an acoustic model that decodes a sound signal into phonemes the remaining task would be to map those phonemes to their matching words. This step is called Lexical Decoding, named so as it is based on a lexicon or dictionary of the data set.

If we want to train a limited vocabulary of words we might just skip the phonemes. If we have a large vocabulary, then converting to smaller units first, reduces the total number of comparisons needed.

Acoustic Models and the Trouble with Time

With feature extraction, we’ve addressed noise problems as well as variability of speakers. But we still haven’t solved the problem of matching variable lengths of the same word.

Dynamic Time Warping (DTW) calculates the similarity between two signals, even if their time lengths differ. This can be used to align the sequence data of a new word to its most similar counterpart in a dictionary of word examples.

2 signals mapped with Dynamic Time Warping

#deep-speech #speech #deep-learning #speech-recognition #machine-learning #deep learning

SimGNN: Similarity Computation via Graph Neural Networks

This post will summarize the paper SimGNN which aims for fast graph similarity computation. Graphs are structures that are used to link different entities that we call nodes using relationships called edges. Graphs exist everywhere from bonds between the atoms to friends on Facebook, all these scenarios can be represented as a graph. One of the fundamental graph problems includes finding similarity between graphs. The similarity between graphs can be defined using these metrics :

  1. Graph Edit Distance
  2. Maximum Common Subgraph

However, currently available algorithms that are used to calculate these metrics have high complexities and it is not yet possible to compute exact GED using these for graphs having more than 16 nodes.

Some ways to compute these metrics are :

  1. Pruning verification Framework
  2. Approximating the GED in fast and heuristic ways

SimGNN follows another approach to tackle this problem i.e turning similarity computation problem into a learning problem.

Before getting into how SimGNN works, we must know the requirements to be satisfied by this model. It includes :

  1. Representation Invariant: Different representations of the same graph should give the same results.
  2. **Inductive: **Should be able to predict results for unseen graphs.
  3. Learnable: Must work on different similarity metrics like GED and MCS

**SimGNN Approach: **To achieve the above-stated requirements, SimGNN uses two strategies

  1. Design Learnable Embedding Function: This maps the graph into an embedding vector, which provides a global summary of a graph. Here, some nodes of importance are selected and used for embedding computation. (less time complexity)
  2. Pair-wise node comparison: The above embedding are too coarse, thus further compute the pairwise similarity scores between nodes from the two graphs, from which the histogram features are extracted and combined with the graph level information. (this is a time-consuming strategy)

#graph-edit-distance #machine-learning #graph-neural-networks #graph-convolution-network

Luna  Mosciski

Luna Mosciski


Do Graph Databases Scale?

Graph Databases are a great solution for many modern use cases: Fraud Detection, Knowledge Graphs, Asset Management, Recommendation Engines, IoT, Permission Management … you name it.

All such projects benefit from a database technology capable of analyzing highly connected data points and their relations fast – Graph databases are designed for these tasks.

But the nature of graph data poses challenges when it comes to buzzword alert scalability. So why is this, and are graph databases capable of scaling? Let’s see…

In the following, we will define what we mean by scaling, take a closer look at two challenges potentially hindering scaling with graph databases, and discuss solutions currently available.

What Is the “Scalability of Graph Databases”?

Let’s quickly define what we mean here by scaling, as it is not “just” putting more data on one machine or throwing it on various ones. What you want when working with large or growing datasets is also an acceptable performance of your queries.

So the real question here is: Are graph databases able to provide acceptable performance when data sets grow on a single machine or even exceed its capabilities?

You might ask why this is a question in the first place. If so, please read the quick recap about graph databases below. If you are already aware of the issues like supernodes and network hops, please skip the quick recap.

Quick Recap About Graph Databases

In a nutshell, graph databases store schema-free objects (vertices or nodes) where arbitrary data can be stored (properties) and relations between the objects (edges). Edges typically have a direction pointing from one object to another. Vertices and edges form a network of data points which is called a “graph”.

In discrete mathematics, a graph is defined as a set of vertices and edges. In computing, it is considered an abstract data type that is good at representing connections or relations – unlike the tabular data structures of relational database systems, which are ironically very limited in expressing relations.

As mentioned above, graphs consist of nodes aka vertices (V) connected by relations aka edges (E).

vertices connected by edges

Vertices can have an arbitrary amount of edges and form paths of arbitrary depth (length of a path).

vertices can have multiple edges and form multiple paths

A financial transaction use case from one bank account to the other could be modeled as a graph as well and look like the schema below. Here you might define bank accounts as nodes and bank transactions among other relationships as edges.

bank example

Storing accounts and transactions in this way will enable us to traverse the created graph is unknown or varying depth. Composing and running such queries in relational databases tends to be a complex endeavor. (Sidenote: With a multi-model database, we could also model the relationship between banks and their branches as a simple relation using joins to query). If you want to learn more about graph databases, you can check out this free course.

Graph databases provide various algorithms to query stored data and analyze relationships. Algorithms may include traversals, pattern matching, shortest path, or distributed graph processing like community detection, connected components, or centrality.

Most algorithms have one thing in common which is also the nature of the supernode and network hop problem. Algorithms traverse from one node via an edge to another node.

After this quick recap, let’s dive into the challenges. First off: celebrities.

Always These Celebrities

As described above, vertices or nodes can have an arbitrary amount of edges. A classic example of the supernode problem is celebrities in a social network. A supernode is a node in a graph dataset with unusually high amounts of incoming or outgoing edges.

For instance, Sir Patrick Stewart’s Twitter account has currently over 3.4 million followers.

patrick stewart's twitter

If we now model accounts and tweets as a graph and traverse through the dataset, we might have to traverse over Patrick Stewart’s account and the traversal algorithms would have to analyze all 3.4m edges pointing to Mr. Steward’s account. This will increase the query execution time significantly and may even exceed acceptable limits. Similar problems can be found in e.g. Fraud Detection (accounts with many transactions), Network Management (large IP hubs), and other cases.

Supernodes are an inherent problem of graphs and pose challenges for all graph databases. But there are two options to minimize the impact of supernodes.

Option 1: Splitting Up Supernodes

To be more precise, we can duplicate the node “Patrick Stewart” and split up a large number of edges by a certain attribute like the country the followers are from or some other form of grouping. Thereby, we minimize the impact of the supernode on traversal performance in case we can make use of these groupings in the queries we want to perform.

Option 2: Vertex-Centric Indexes

vertex-centric index stores information of an edge together with information about the node.

To stay in the example of Patrick Stewart’s Twitter account, depending on the use case, we could use

  • date/time information about when someone started to follow
  • the country the follower is from
  • follower counts of the follower
  • etc.

All these attributes might provide the selectivity we need to effectively use a vertex-centric index.

The query engine can then use the index to lower the number of linear lookups needed to perform the traversal. The same approach can be used e.g. Fraud Detection. Here financial transactions are the edges and we could use the transaction dates or amounts for achieving high selectivity.

There are cases where using the options described above is not suitable and one has to live with some degree of performance degradation when traversing over supernodes. Yet, in most cases, there are options to optimize performance. But there is another challenge, which has not been solved by most graph databases.

#graph database #graph analytics #arangodb #fraud detection #knowledge graph #network management #network latency