Whisper: Robust Speech Recognition via Large-Scale Weak Supervision


Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.



A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing a single model to replace many stages of a traditional speech-processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.


We used Python 3.9.9 and PyTorch 1.10.1 to train and test our models, but the codebase is expected to be compatible with Python 3.8-3.10 and recent PyTorch versions. The codebase also depends on a few Python packages, most notably HuggingFace Transformers for their fast tokenizer implementation and ffmpeg-python for reading audio files. You can download and install (or update to) the latest release of Whisper with the following command:

pip install -U openai-whisper

Alternatively, the following command will pull and install the latest commit from this repository, along with its Python dependencies:

pip install git+https://github.com/openai/whisper.git 

To update the package to the latest version of this repository, please run:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git

It also requires the command-line tool ffmpeg to be installed on your system, which is available from most package managers:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

You may also need Rust installed, in case tokenizers does not provide a pre-built wheel for your platform. If you see installation errors during the pip install command above, please follow the Getting Started page to install the Rust development environment. Additionally, you may need to configure the PATH environment variable, e.g. export PATH="$HOME/.cargo/bin:$PATH". If the installation fails with No module named 'setuptools_rust', you need to install setuptools_rust, e.g. by running:

pip install setuptools-rust

Available models and languages

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.

Size     Parameters  English-only model  Multilingual model  Required VRAM  Relative speed
tiny     39 M        tiny.en             tiny                ~1 GB          ~32x
base     74 M        base.en             base                ~1 GB          ~16x
small    244 M       small.en            small               ~2 GB          ~6x
medium   769 M       medium.en           medium              ~5 GB          ~2x
large    1550 M      N/A                 large               ~10 GB         1x

The .en models for English-only applications tend to perform better, especially for the tiny.en and base.en models. We observed that the difference becomes less significant for the small.en and medium.en models.
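For example, loading an English-only variant from Python only requires the model name from the table above (a minimal sketch):

import whisper

# English-only variants are selected by appending ".en" to the size name
model = whisper.load_model("base.en")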

Whisper's performance varies widely depending on the language. The figure below shows a WER (Word Error Rate) breakdown by language for the Fleurs dataset using the large-v2 model. More WER and BLEU scores corresponding to the other models and datasets can be found in Appendix D of the paper. The lower the WER, the better.

WER breakdown by language

Command-line usage

The following command will transcribe speech in audio files, using the medium model:

whisper audio.flac audio.mp3 audio.wav --model medium

The default setting (which selects the small model) works well for transcribing English. To transcribe an audio file containing non-English speech, you can specify the language using the --language option:

whisper japanese.wav --language Japanese

Adding --task translate will translate the speech into English:

whisper japanese.wav --language Japanese --task translate

Run the following to view all available options:

whisper --help

See tokenizer.py for the list of all available languages.
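If you prefer to inspect the supported languages programmatically, the tokenizer module exposes a mapping of language codes to names; a small sketch (the exact contents depend on the installed version):

from whisper.tokenizer import LANGUAGES

print(len(LANGUAGES))   # roughly 100 supported languages
print(LANGUAGES["ja"])  # "japanese"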

Python usage

Transcription can also be performed within Python:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

Internally, the transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
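The CLI options shown above map onto keyword arguments of transcribe(); for example, a sketch of transcribing and translating the Japanese audio from the command-line examples (the japanese.wav filename is reused from above):

import whisper

model = whisper.load_model("medium")

# language and task are forwarded to the decoder,
# mirroring the --language and --task CLI flags
result = model.transcribe("japanese.wav", language="Japanese", task="translate")
print(result["text"])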

Below is an example usage of whisper.detect_language() and whisper.decode() which provide lower-level access to the model.

import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)

More examples

Please use the 🙌 Show and tell category in Discussions for sharing more example usages of Whisper and third-party extensions such as web demos, integrations with other tools, ports for different platforms, etc.

Download Details:

Author: Openai
Source Code: https://github.com/openai/whisper 
License: MIT license


TA-Lib: Python wrapper for TA-Lib


This is a Python wrapper for TA-LIB based on Cython instead of SWIG. From the homepage:

TA-Lib is widely used by trading software developers requiring to perform technical analysis of financial market data.

  • Includes 150+ indicators such as ADX, MACD, RSI, Stochastic, Bollinger Bands, etc.
  • Candlestick pattern recognition
  • Open-source API for C/C++, Java, Perl, Python and 100% Managed .NET

The original Python bindings included with TA-Lib use SWIG which unfortunately are difficult to install and aren't as efficient as they could be. Therefore this project uses Cython and Numpy to efficiently and cleanly bind to TA-Lib -- producing results 2-4 times faster than the SWIG interface.

In addition, this project also supports the use of the Polars and Pandas libraries.


You can install from PyPI:

$ python3 -m pip install TA-Lib

Or checkout the sources and run setup.py yourself:

$ python setup.py install

It also appears possible to install via Conda Forge:

$ conda install -c conda-forge ta-lib


To use TA-Lib for Python, you need to have the underlying TA-Lib C library already installed. You should probably follow their installation directions for your platform, but some suggestions are included below for reference.

Some Conda Forge users have reported success installing the underlying TA-Lib C library using the libta-lib package:

$ conda install -c conda-forge libta-lib

Mac OS X

You can simply install using Homebrew:

$ brew install ta-lib

If you are using Apple Silicon, such as the M1 processors, and building mixed architecture Homebrew projects, you might want to make sure it's being built for your architecture:

$ arch -arm64 brew install ta-lib

And perhaps you can set these before installing with pip:

$ export TA_INCLUDE_PATH="$(brew --prefix ta-lib)/include"
$ export TA_LIBRARY_PATH="$(brew --prefix ta-lib)/lib"

You might also find this helpful, particularly if you have tried several different installations without success:

$ your-arm64-python -m pip install --no-cache-dir ta-lib


Windows

Download ta-lib-0.4.0-msvc.zip and unzip to C:\ta-lib.

This is a 32-bit binary release. If you want to use 64-bit Python, you will need to build a 64-bit version of the library. Some unofficial (and unsupported) instructions for building on 64-bit Windows 10, here for reference:

  1. Download and Unzip ta-lib-0.4.0-msvc.zip
  2. Move the Unzipped Folder ta-lib to C:\
  3. Download and Install Visual Studio Community (2015 or later)
    • Remember to Select [Visual C++] Feature
  4. Build TA-Lib Library
    • From Windows Start Menu, Start [VS2015 x64 Native Tools Command Prompt]
    • Move to C:\ta-lib\c\make\cdr\win32\msvc
    • Build the Library by running nmake

You might also try these unofficial windows binaries for both 32-bit and 64-bit:



Linux

Download ta-lib-0.4.0-src.tar.gz and:

$ tar -xzf ta-lib-0.4.0-src.tar.gz
$ cd ta-lib/
$ ./configure --prefix=/usr
$ make
$ sudo make install

If you build TA-Lib using make -jX it will fail but that's OK! Simply rerun make -jX followed by [sudo] make install.

Note: if your directory path includes spaces, the installation will probably fail with No such file or directory errors.


If you get a warning that looks like this:

setup.py:79: UserWarning: Cannot find ta-lib library, installation may fail.
warnings.warn('Cannot find ta-lib library, installation may fail.')

This typically means setup.py can't find the underlying TA-Lib library, a dependency which needs to be installed.

If you installed the underlying TA-Lib library with a custom prefix (e.g., with ./configure --prefix=$PREFIX), then when you go to install this python wrapper you can specify additional search paths to find the library and include files for the underlying TA-Lib library using the TA_LIBRARY_PATH and TA_INCLUDE_PATH environment variables:

$ export TA_LIBRARY_PATH=$PREFIX/lib
$ export TA_INCLUDE_PATH=$PREFIX/include
$ python setup.py install # or pip install ta-lib

Sometimes installation will produce build errors like this:

talib/_ta_lib.c:601:10: fatal error: ta-lib/ta_defs.h: No such file or directory
  601 | #include "ta-lib/ta_defs.h"
      |          ^~~~~~~~~~~~~~~~~~
compilation terminated.


common.obj : error LNK2001: unresolved external symbol TA_SetUnstablePeriod
common.obj : error LNK2001: unresolved external symbol TA_Shutdown
common.obj : error LNK2001: unresolved external symbol TA_Initialize
common.obj : error LNK2001: unresolved external symbol TA_GetUnstablePeriod
common.obj : error LNK2001: unresolved external symbol TA_GetVersionString

This typically means that it can't find the underlying TA-Lib library, a dependency which needs to be installed. On Windows, this could be caused by installing the 32-bit binary distribution of the underlying TA-Lib library, but trying to use it with 64-bit Python.

Sometimes installation will fail with errors like this:

talib/common.c:8:22: fatal error: pyconfig.h: No such file or directory
 #include "pyconfig.h"
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

This typically means that you need the Python headers, and should run something like:

$ sudo apt-get install python3-dev

Sometimes building the underlying TA-Lib library has errors running make that look like this:

../libtool: line 1717: cd: .libs/libta_lib.lax/libta_abstract.a: No such file or directory
make[2]: *** [libta_lib.la] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

This might mean that the directory path to the underlying TA-Lib library has spaces in the directory names. Try putting it in a path that does not have any spaces and trying again.

Sometimes you might get this error running setup.py:

/usr/include/limits.h:26:10: fatal error: bits/libc-header-start.h: No such file or directory
#include <bits/libc-header-start.h>

This is likely an issue with trying to compile for 32-bit platform but without the appropriate headers. You might find some success looking at the first answer to this question.

If you get an error on macOS like this:

code signature in <141BC883-189B-322C-AE90-CBF6B5206F67>
'python3.9/site-packages/talib/_ta_lib.cpython-39-darwin.so' not valid for
use in process: Trying to load an unsigned library)

You might look at this question and use xcrun codesign to fix it.

If you wonder why STOCHRSI gives you different results than you expect, probably you want STOCH applied to RSI, which is a little different than the STOCHRSI which is STOCHF applied to RSI:

>>> import talib
>>> import numpy as np
>>> c = np.random.randn(100)

# this is the library function
>>> k, d = talib.STOCHRSI(c)

# this produces the same result, calling STOCHF
>>> rsi = talib.RSI(c)
>>> k, d = talib.STOCHF(rsi, rsi, rsi)

# you might want this instead, calling STOCH
>>> rsi = talib.RSI(c)
>>> k, d = talib.STOCH(rsi, rsi, rsi)

If the build appears to hang, you might be running on a VM with not enough memory -- try 1 GB or 2 GB.

If you get "permission denied" errors such as this, you might need to give your user access to the location where the underlying TA-Lib C library is installed -- or install it to a user-accessible location.

talib/_ta_lib.c:747:28: fatal error: /usr/include/ta-lib/ta_defs.h: Permission denied
 #include "ta-lib/ta-defs.h"
compilation terminated
error: command 'gcc' failed with exit status 1

Function API

Similar to TA-Lib, the Function API provides a lightweight wrapper of the exposed TA-Lib indicators.

Each function returns an output array and has default values for its parameters, unless they are specified as keyword arguments. Typically, these functions have an initial "lookback" period (a required number of observations before an output is generated); output values within this period are set to NaN.

For convenience, the Function API supports both numpy.ndarray and pandas.Series and polars.Series inputs.
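For example, with a pandas.Series input the output is expected to come back as a pandas.Series as well; a minimal sketch with synthetic data:

import numpy as np
import pandas as pd
import talib

close = pd.Series(np.random.random(100))

# the result keeps the input type (and index)
sma = talib.SMA(close, timeperiod=10)
print(type(sma))
print(sma.head(12))  # the first 9 values fall inside the lookback period (NaN)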

All of the following examples use the Function API:

import numpy as np
import talib

close = np.random.random(100)

Calculate a simple moving average of the close prices:

output = talib.SMA(close)

Calculating Bollinger Bands, with a triple exponential moving average:

from talib import MA_Type

upper, middle, lower = talib.BBANDS(close, matype=MA_Type.T3)

Calculating momentum of the close prices, with a time period of 5:

output = talib.MOM(close, timeperiod=5)


The underlying TA-Lib C library handles NaN's in a sometimes surprising manner by typically propagating NaN's to the end of the output, for example:

>>> c = np.array([1.0, 2.0, 3.0, np.nan, 4.0, 5.0, 6.0])

>>> talib.SMA(c, 3)
array([nan, nan,  2., nan, nan, nan, nan])

You can compare that to a Pandas rolling mean, where their approach is to output NaN until enough "lookback" values are observed to generate new outputs:

>>> c = pandas.Series([1.0, 2.0, 3.0, np.nan, 4.0, 5.0, 6.0])

>>> c.rolling(3).mean()
0    NaN
1    NaN
2    2.0
3    NaN
4    NaN
5    NaN
6    5.0
dtype: float64

Abstract API

If you're already familiar with using the function API, you should feel right at home using the Abstract API.

Every function takes a collection of named inputs, either a dict of numpy.ndarray or pandas.Series or polars.Series, or a pandas.DataFrame or polars.DataFrame. If a pandas.DataFrame or polars.DataFrame is provided, the output is returned as the same type with named output columns.

For example, inputs could be provided for the typical "OHLCV" data:

import numpy as np

# note that all ndarrays must be the same length!
inputs = {
    'open': np.random.random(100),
    'high': np.random.random(100),
    'low': np.random.random(100),
    'close': np.random.random(100),
    'volume': np.random.random(100)
}

Functions can either be imported directly or instantiated by name:

from talib import abstract

# directly
SMA = abstract.SMA

# or by name
SMA = abstract.Function('sma')

From there, calling functions is basically the same as the function API:

from talib.abstract import *

# uses close prices (default)
output = SMA(inputs, timeperiod=25)

# uses open prices
output = SMA(inputs, timeperiod=25, price='open')

# uses close prices (default)
upper, middle, lower = BBANDS(inputs, 20, 2, 2)

# uses high, low, close (default)
slowk, slowd = STOCH(inputs, 5, 3, 0, 3, 0)

# uses high, low, open instead
slowk, slowd = STOCH(inputs, 5, 3, 0, 3, 0, prices=['high', 'low', 'open'])
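Similarly, a hedged sketch of passing a pandas.DataFrame to an abstract function; the lowercase OHLCV column names follow the dict example above, and the exact output column names come from the function's output_names:

import numpy as np
import pandas as pd
from talib import abstract

df = pd.DataFrame({
    'open': np.random.random(100),
    'high': np.random.random(100),
    'low': np.random.random(100),
    'close': np.random.random(100),
    'volume': np.random.random(100),
})

# with a DataFrame input, BBANDS should return a DataFrame with one
# named column per output (upper, middle and lower band)
bands = abstract.BBANDS(df, timeperiod=20)
print(bands.head())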

Streaming API

An experimental Streaming API was added that allows users to compute the latest value of an indicator. This can be faster than using the Function API, for example in an application that receives streaming data, and wants to know just the most recent updated indicator value.

import numpy as np
import talib
from talib import stream

close = np.random.random(100)

# the Function API
output = talib.SMA(close)

# the Streaming API
latest = stream.SMA(close)

# the latest value is the same as the last output value
assert abs(output[-1] - latest) < 0.00001

Supported Indicators and Functions

We can show all the TA functions supported by TA-Lib, either as a list or as a dict sorted by group (e.g. "Overlap Studies", "Momentum Indicators", etc):

import talib

# list of functions
print(talib.get_functions())

# dict of functions by group
print(talib.get_function_groups())

Indicator Groups

  • Overlap Studies
  • Momentum Indicators
  • Volume Indicators
  • Volatility Indicators
  • Price Transform
  • Cycle Indicators
  • Pattern Recognition
  • Statistic Functions

Overlap Studies

BBANDS               Bollinger Bands
DEMA                 Double Exponential Moving Average
EMA                  Exponential Moving Average
HT_TRENDLINE         Hilbert Transform - Instantaneous Trendline
KAMA                 Kaufman Adaptive Moving Average
MA                   Moving average
MAMA                 MESA Adaptive Moving Average
MAVP                 Moving average with variable period
MIDPOINT             MidPoint over period
MIDPRICE             Midpoint Price over period
SAR                  Parabolic SAR
SAREXT               Parabolic SAR - Extended
SMA                  Simple Moving Average
T3                   Triple Exponential Moving Average (T3)
TEMA                 Triple Exponential Moving Average
TRIMA                Triangular Moving Average
WMA                  Weighted Moving Average

Momentum Indicators

ADX                  Average Directional Movement Index
ADXR                 Average Directional Movement Index Rating
APO                  Absolute Price Oscillator
AROON                Aroon
AROONOSC             Aroon Oscillator
BOP                  Balance Of Power
CCI                  Commodity Channel Index
CMO                  Chande Momentum Oscillator
DX                   Directional Movement Index
MACD                 Moving Average Convergence/Divergence
MACDEXT              MACD with controllable MA type
MACDFIX              Moving Average Convergence/Divergence Fix 12/26
MFI                  Money Flow Index
MINUS_DI             Minus Directional Indicator
MINUS_DM             Minus Directional Movement
MOM                  Momentum
PLUS_DI              Plus Directional Indicator
PLUS_DM              Plus Directional Movement
PPO                  Percentage Price Oscillator
ROC                  Rate of change : ((price/prevPrice)-1)*100
ROCP                 Rate of change Percentage: (price-prevPrice)/prevPrice
ROCR                 Rate of change ratio: (price/prevPrice)
ROCR100              Rate of change ratio 100 scale: (price/prevPrice)*100
RSI                  Relative Strength Index
STOCH                Stochastic
STOCHF               Stochastic Fast
STOCHRSI             Stochastic Relative Strength Index
TRIX                 1-day Rate-Of-Change (ROC) of a Triple Smooth EMA
ULTOSC               Ultimate Oscillator
WILLR                Williams' %R

Volume Indicators

AD                   Chaikin A/D Line
ADOSC                Chaikin A/D Oscillator
OBV                  On Balance Volume

Cycle Indicators

HT_DCPERIOD          Hilbert Transform - Dominant Cycle Period
HT_DCPHASE           Hilbert Transform - Dominant Cycle Phase
HT_PHASOR            Hilbert Transform - Phasor Components
HT_SINE              Hilbert Transform - SineWave
HT_TRENDMODE         Hilbert Transform - Trend vs Cycle Mode

Price Transform

AVGPRICE             Average Price
MEDPRICE             Median Price
TYPPRICE             Typical Price
WCLPRICE             Weighted Close Price

Volatility Indicators

ATR                  Average True Range
NATR                 Normalized Average True Range
TRANGE               True Range

Pattern Recognition

CDL2CROWS            Two Crows
CDL3BLACKCROWS       Three Black Crows
CDL3INSIDE           Three Inside Up/Down
CDL3LINESTRIKE       Three-Line Strike
CDL3OUTSIDE          Three Outside Up/Down
CDL3STARSINSOUTH     Three Stars In The South
CDL3WHITESOLDIERS    Three Advancing White Soldiers
CDLADVANCEBLOCK      Advance Block
CDLBELTHOLD          Belt-hold
CDLBREAKAWAY         Breakaway
CDLCONCEALBABYSWALL  Concealing Baby Swallow
CDLCOUNTERATTACK     Counterattack
CDLDOJI              Doji
CDLDOJISTAR          Doji Star
CDLENGULFING         Engulfing Pattern
CDLEVENINGSTAR       Evening Star
CDLGAPSIDESIDEWHITE  Up/Down-gap side-by-side white lines
CDLHAMMER            Hammer
CDLHANGINGMAN        Hanging Man
CDLHARAMI            Harami Pattern
CDLHARAMICROSS       Harami Cross Pattern
CDLHIGHWAVE          High-Wave Candle
CDLHIKKAKE           Hikkake Pattern
CDLHIKKAKEMOD        Modified Hikkake Pattern
CDLHOMINGPIGEON      Homing Pigeon
CDLIDENTICAL3CROWS   Identical Three Crows
CDLINNECK            In-Neck Pattern
CDLKICKING           Kicking
CDLKICKINGBYLENGTH   Kicking - bull/bear determined by the longer marubozu
CDLLADDERBOTTOM      Ladder Bottom
CDLLONGLINE          Long Line Candle
CDLMARUBOZU          Marubozu
CDLMATCHINGLOW       Matching Low
CDLMATHOLD           Mat Hold
CDLMORNINGSTAR       Morning Star
CDLONNECK            On-Neck Pattern
CDLPIERCING          Piercing Pattern
CDLRICKSHAWMAN       Rickshaw Man
CDLRISEFALL3METHODS  Rising/Falling Three Methods
CDLSHOOTINGSTAR      Shooting Star
CDLSHORTLINE         Short Line Candle
CDLSPINNINGTOP       Spinning Top
CDLTAKURI            Takuri (Dragonfly Doji with very long lower shadow)
CDLTASUKIGAP         Tasuki Gap
CDLTHRUSTING         Thrusting Pattern
CDLTRISTAR           Tristar Pattern
CDLUNIQUE3RIVER      Unique 3 River
CDLXSIDEGAP3METHODS  Upside/Downside Gap Three Methods

Statistic Functions

BETA                 Beta
CORREL               Pearson's Correlation Coefficient (r)
LINEARREG            Linear Regression
LINEARREG_ANGLE      Linear Regression Angle
LINEARREG_INTERCEPT  Linear Regression Intercept
LINEARREG_SLOPE      Linear Regression Slope
STDDEV               Standard Deviation
TSF                  Time Series Forecast
VAR                  Variance

Download Details:

Author: TA-Lib
Source Code: https://github.com/TA-Lib/ta-lib-python 
License: View license


Create an Image Recognition App with Flutter


A new flutter plugin project.

Getting Started

This project is a starting point for a Flutter plug-in package, a specialized package that includes platform-specific implementation code for Android and/or iOS.

For help getting started with Flutter, view our online documentation, which offers tutorials, samples, guidance on mobile development, and a full API reference.

Use this package as a library

Depend on it

Run this command:

With Flutter:

 $ flutter pub add flutter_image_recognition

This will add a line like this to your package's pubspec.yaml (and run an implicit flutter pub get):

  flutter_image_recognition: ^0.0.1

Alternatively, your editor might support flutter pub get. Check the docs for your editor to learn more.

Import it

Now in your Dart code, you can use:

import 'package:flutter_image_recognition/flutter_plugin.dart'; 


import 'package:flutter/material.dart';
import 'dart:async';

import 'package:flutter/services.dart';
import 'package:flutter_plugin/flutter_plugin.dart';

void main() {
  runApp(const MyApp());
}

class MyApp extends StatefulWidget {
  const MyApp({Key? key}) : super(key: key);

  @override
  State<MyApp> createState() => _MyAppState();
}

class _MyAppState extends State<MyApp> {
  String _platformVersion = 'Unknown';

  @override
  void initState() {
    super.initState();
    initPlatformState();
  }

  // Platform messages are asynchronous, so we initialize in an async method.
  Future<void> initPlatformState() async {
    String platformVersion;
    // Platform messages may fail, so we use a try/catch PlatformException.
    // We also handle the message potentially returning null.
    try {
      platformVersion =
          await FlutterPlugin.platformVersion ?? 'Unknown platform version';
    } on PlatformException {
      platformVersion = 'Failed to get platform version.';
    }

    // If the widget was removed from the tree while the asynchronous platform
    // message was in flight, we want to discard the reply rather than calling
    // setState to update our non-existent appearance.
    if (!mounted) return;

    setState(() {
      _platformVersion = platformVersion;
    });
  }

  @override
  Widget build(BuildContext context) {
    return MaterialApp(
      home: Scaffold(
        appBar: AppBar(
          title: const Text('Plugin example app'),
        ),
        body: Center(
          child: Column(
            children: [
              Text('Running on: $_platformVersion\n'),
              MaterialButton(
                onPressed: () async {
                  String? message = await FlutterPlugin.sayHello('请说Hello');
                  debugPrint(message);
                },
                color: Colors.yellowAccent,
                child: const Text('sayHello'),
              ),
            ],
          ),
        ),
      ),
    );
  }
}

Download Details:

Author: Lawlignt

Source Code: https://github.com/Lawlignt/flutter_plugin



Face-api.js: JavaScript API for Face Detection and Face Recognition


JavaScript face recognition API for the browser and nodejs implemented on top of tensorflow.js core (tensorflow/tfjs-core)


Face Recognition


Face Landmark Detection


Face Expression Recognition


Age Estimation & Gender Recognition


Running the Examples

Clone the repository:

git clone https://github.com/justadudewhohacks/face-api.js.git

Running the Browser Examples

cd face-api.js/examples/examples-browser
npm i
npm start

Browse to http://localhost:3000/.

Running the Nodejs Examples

cd face-api.js/examples/examples-nodejs
npm i

Now run one of the examples using ts-node:

ts-node faceDetection.ts

Or simply compile and run them with node:

tsc faceDetection.ts
node faceDetection.js

face-api.js for the Browser

Simply include the latest script from dist/face-api.js.

Or install it via npm:

npm i face-api.js


face-api.js for Nodejs

We can use the equivalent API in a nodejs environment by polyfilling some browser specifics, such as HTMLImageElement, HTMLCanvasElement and ImageData. The easiest way to do so is by installing the node-canvas package.

Alternatively you can simply construct your own tensors from image data and pass tensors as inputs to the API.

Furthermore you want to install @tensorflow/tfjs-node (not required, but highly recommended), which speeds things up drastically by compiling and binding to the native Tensorflow C++ library:

npm i face-api.js canvas @tensorflow/tfjs-node

Now we simply monkey patch the environment to use the polyfills:

// import nodejs bindings to native tensorflow,
// not required, but will speed up things drastically (python required)
import '@tensorflow/tfjs-node';

// implements nodejs wrappers for HTMLCanvasElement, HTMLImageElement, ImageData
import * as canvas from 'canvas';

import * as faceapi from 'face-api.js';

// patch nodejs environment, we need to provide an implementation of
// HTMLCanvasElement and HTMLImageElement
const { Canvas, Image, ImageData } = canvas
faceapi.env.monkeyPatch({ Canvas, Image, ImageData })

Getting Started

Loading the Models

All global neural network instances are exported via faceapi.nets:

// ageGenderNet
// faceExpressionNet
// faceLandmark68Net
// faceLandmark68TinyNet
// faceRecognitionNet
// ssdMobilenetv1
// tinyFaceDetector
// tinyYolov2

To load a model, you have to provide the corresponding manifest.json file as well as the model weight files (shards) as assets. Simply copy them to your public or assets folder. The manifest.json and shard files of a model have to be located in the same directory / accessible under the same route.

Assuming the models reside in public/models:

await faceapi.nets.ssdMobilenetv1.loadFromUri('/models')
// accordingly for the other models:
// await faceapi.nets.faceLandmark68Net.loadFromUri('/models')
// await faceapi.nets.faceRecognitionNet.loadFromUri('/models')
// ...

In a nodejs environment you can furthermore load the models directly from disk:

await faceapi.nets.ssdMobilenetv1.loadFromDisk('./models')

You can also load the model from a tf.NamedTensorMap:

await faceapi.nets.ssdMobilenetv1.loadFromWeightMap(weightMap)

Alternatively, you can also create your own instances of the neural nets:

const net = new faceapi.SsdMobilenetv1()
await net.loadFromUri('/models')

You can also load the weights as a Float32Array (in case you want to use the uncompressed models):

// using fetch
net.load(await faceapi.fetchNetWeights('/models/face_detection_model.weights'))

// using axios
const res = await axios.get('/models/face_detection_model.weights', { responseType: 'arraybuffer' })
const weights = new Float32Array(res.data)

High Level API

In the following, input can be an HTML img, video, or canvas element, or the id of such an element.

<img id="myImg" src="images/example.png" />
<video id="myVideo" src="media/example.mp4" />
<canvas id="myCanvas" />
const input = document.getElementById('myImg')
// const input = document.getElementById('myVideo')
// const input = document.getElementById('myCanvas')
// or simply:
// const input = 'myImg'

Detecting Faces

Detect all faces in an image. Returns Array<FaceDetection>:

const detections = await faceapi.detectAllFaces(input)

Detect the face with the highest confidence score in an image. Returns FaceDetection | undefined:

const detection = await faceapi.detectSingleFace(input)

By default detectAllFaces and detectSingleFace utilize the SSD Mobilenet V1 Face Detector. You can specify the face detector by passing the corresponding options object:

const detections1 = await faceapi.detectAllFaces(input, new faceapi.SsdMobilenetv1Options())
const detections2 = await faceapi.detectAllFaces(input, new faceapi.TinyFaceDetectorOptions())

You can tune the options of each face detector as shown here.

Detecting 68 Face Landmark Points

After face detection, we can furthermore predict the facial landmarks for each detected face as follows:

Detect all faces in an image + computes 68 Point Face Landmarks for each detected face. Returns Array<WithFaceLandmarks<WithFaceDetection<{}>>>:

const detectionsWithLandmarks = await faceapi.detectAllFaces(input).withFaceLandmarks()

Detect the face with the highest confidence score in an image + computes 68 Point Face Landmarks for that face. Returns WithFaceLandmarks<WithFaceDetection<{}>> | undefined:

const detectionWithLandmarks = await faceapi.detectSingleFace(input).withFaceLandmarks()

You can also specify to use the tiny model instead of the default model:

const useTinyModel = true
const detectionsWithLandmarks = await faceapi.detectAllFaces(input).withFaceLandmarks(useTinyModel)

Computing Face Descriptors

After face detection and facial landmark prediction the face descriptors for each face can be computed as follows:

Detect all faces in an image + compute 68 Point Face Landmarks for each detected face. Returns Array<WithFaceDescriptor<WithFaceLandmarks<WithFaceDetection<{}>>>>:

const results = await faceapi.detectAllFaces(input).withFaceLandmarks().withFaceDescriptors()

Detect the face with the highest confidence score in an image + compute 68 Point Face Landmarks and face descriptor for that face. Returns WithFaceDescriptor<WithFaceLandmarks<WithFaceDetection<{}>>> | undefined:

const result = await faceapi.detectSingleFace(input).withFaceLandmarks().withFaceDescriptor()

Recognizing Face Expressions

Face expression recognition can be performed for detected faces as follows:

Detect all faces in an image + recognize face expressions of each face. Returns Array<WithFaceExpressions<WithFaceLandmarks<WithFaceDetection<{}>>>>:

const detectionsWithExpressions = await faceapi.detectAllFaces(input).withFaceLandmarks().withFaceExpressions()

Detect the face with the highest confidence score in an image + recognize the face expressions for that face. Returns WithFaceExpressions<WithFaceLandmarks<WithFaceDetection<{}>>> | undefined:

const detectionWithExpressions = await faceapi.detectSingleFace(input).withFaceLandmarks().withFaceExpressions()

You can also skip .withFaceLandmarks(), which will skip the face alignment step (less stable accuracy):

Detect all faces without face alignment + recognize face expressions of each face. Returns Array<WithFaceExpressions<WithFaceDetection<{}>>>:

const detectionsWithExpressions = await faceapi.detectAllFaces(input).withFaceExpressions()

Detect the face with the highest confidence score without face alignment + recognize the face expression for that face. Returns WithFaceExpressions<WithFaceDetection<{}>> | undefined:

const detectionWithExpressions = await faceapi.detectSingleFace(input).withFaceExpressions()

Age Estimation and Gender Recognition

Age estimation and gender recognition from detected faces can be done as follows:

Detect all faces in an image + estimate age and recognize gender of each face. Returns Array<WithAge<WithGender<WithFaceLandmarks<WithFaceDetection<{}>>>>>:

const detectionsWithAgeAndGender = await faceapi.detectAllFaces(input).withFaceLandmarks().withAgeAndGender()

Detect the face with the highest confidence score in an image + estimate age and recognize gender for that face. Returns WithAge<WithGender<WithFaceLandmarks<WithFaceDetection<{}>>>> | undefined:

const detectionWithAgeAndGender = await faceapi.detectSingleFace(input).withFaceLandmarks().withAgeAndGender()

You can also skip .withFaceLandmarks(), which will skip the face alignment step (less stable accuracy):

Detect all faces without face alignment + estimate age and recognize gender of each face. Returns Array<WithAge<WithGender<WithFaceDetection<{}>>>>:

const detectionsWithAgeAndGender = await faceapi.detectAllFaces(input).withAgeAndGender()

Detect the face with the highest confidence score without face alignment + estimate age and recognize gender for that face. Returns WithAge<WithGender<WithFaceDetection<{}>>> | undefined:

const detectionWithAgeAndGender = await faceapi.detectSingleFace(input).withAgeAndGender()

Composition of Tasks

Tasks can be composed as follows:

// all faces
await faceapi.detectAllFaces(input)
await faceapi.detectAllFaces(input).withFaceExpressions()
await faceapi.detectAllFaces(input).withFaceLandmarks()
await faceapi.detectAllFaces(input).withFaceLandmarks().withFaceExpressions()
await faceapi.detectAllFaces(input).withFaceLandmarks().withFaceExpressions().withFaceDescriptors()
await faceapi.detectAllFaces(input).withFaceLandmarks().withAgeAndGender().withFaceDescriptors()
await faceapi.detectAllFaces(input).withFaceLandmarks().withFaceExpressions().withAgeAndGender().withFaceDescriptors()

// single face
await faceapi.detectSingleFace(input)
await faceapi.detectSingleFace(input).withFaceExpressions()
await faceapi.detectSingleFace(input).withFaceLandmarks()
await faceapi.detectSingleFace(input).withFaceLandmarks().withFaceExpressions()
await faceapi.detectSingleFace(input).withFaceLandmarks().withFaceExpressions().withFaceDescriptor()
await faceapi.detectSingleFace(input).withFaceLandmarks().withAgeAndGender().withFaceDescriptor()
await faceapi.detectSingleFace(input).withFaceLandmarks().withFaceExpressions().withAgeAndGender().withFaceDescriptor()

Face Recognition by Matching Descriptors

To perform face recognition, one can use faceapi.FaceMatcher to compare reference face descriptors to query face descriptors.

First, we initialize the FaceMatcher with the reference data, for example we can simply detect faces in a referenceImage and match the descriptors of the detected faces to faces of subsequent images:

const results = await faceapi
  .detectAllFaces(referenceImage)
  .withFaceLandmarks()
  .withFaceDescriptors()

if (!results.length) {
  return
}

// create FaceMatcher with automatically assigned labels
// from the detection results for the reference image
const faceMatcher = new faceapi.FaceMatcher(results)

Now we can recognize a person's face shown in queryImage1:

const singleResult = await faceapi
  .detectSingleFace(queryImage1)
  .withFaceLandmarks()
  .withFaceDescriptor()

if (singleResult) {
  const bestMatch = faceMatcher.findBestMatch(singleResult.descriptor)
  console.log(bestMatch.toString())
}

Or we can recognize all faces shown in queryImage2:

const results = await faceapi
  .detectAllFaces(queryImage2)
  .withFaceLandmarks()
  .withFaceDescriptors()

results.forEach(fd => {
  const bestMatch = faceMatcher.findBestMatch(fd.descriptor)
  console.log(bestMatch.toString())
})

You can also create labeled reference descriptors as follows:

const labeledDescriptors = [
  new faceapi.LabeledFaceDescriptors(
    'obama',
    [descriptorObama1, descriptorObama2]
  ),
  new faceapi.LabeledFaceDescriptors(
    'trump',
    [descriptorTrump]
  )
]

const faceMatcher = new faceapi.FaceMatcher(labeledDescriptors)


Displaying Detection Results

Preparing the overlay canvas:

const displaySize = { width: input.width, height: input.height }
// resize the overlay canvas to the input dimensions
const canvas = document.getElementById('overlay')
faceapi.matchDimensions(canvas, displaySize)

face-api.js predefines some high-level drawing functions, which you can utilize:

/* Display detected face bounding boxes */
const detections = await faceapi.detectAllFaces(input)
// resize the detected boxes in case your displayed image has a different size than the original
const resizedDetections = faceapi.resizeResults(detections, displaySize)
// draw detections into the canvas
faceapi.draw.drawDetections(canvas, resizedDetections)

/* Display face landmarks */
const detectionsWithLandmarks = await faceapi
  .detectAllFaces(input)
  .withFaceLandmarks()
// resize the detected boxes and landmarks in case your displayed image has a different size than the original
const resizedResults = faceapi.resizeResults(detectionsWithLandmarks, displaySize)
// draw detections into the canvas
faceapi.draw.drawDetections(canvas, resizedResults)
// draw the landmarks into the canvas
faceapi.draw.drawFaceLandmarks(canvas, resizedResults)

/* Display face expression results */
const detectionsWithExpressions = await faceapi
  .detectAllFaces(input)
  .withFaceLandmarks()
  .withFaceExpressions()
// resize the detected boxes and landmarks in case your displayed image has a different size than the original
const resizedResults = faceapi.resizeResults(detectionsWithExpressions, displaySize)
// draw detections into the canvas
faceapi.draw.drawDetections(canvas, resizedResults)
// draw a textbox displaying the face expressions with minimum probability into the canvas
const minProbability = 0.05
faceapi.draw.drawFaceExpressions(canvas, resizedResults, minProbability)

You can also draw boxes with custom text (DrawBox):

const box = { x: 50, y: 50, width: 100, height: 100 }
// see DrawBoxOptions below
const drawOptions = {
  label: 'Hello I am a box!',
  lineWidth: 2
}
const drawBox = new faceapi.draw.DrawBox(box, drawOptions)
drawBox.draw(document.getElementById('myCanvas'))

DrawBox drawing options:

export interface IDrawBoxOptions {
  boxColor?: string
  lineWidth?: number
  drawLabelOptions?: IDrawTextFieldOptions
  label?: string
}

Finally you can draw custom text fields (DrawTextField):

const text = [
  'This is a textline!',
  'This is another textline!'
]
const anchor = { x: 200, y: 200 }
// see DrawTextField below
const drawOptions = {
  anchorPosition: 'TOP_LEFT',
  backgroundColor: 'rgba(0, 0, 0, 0.5)'
}
const drawBox = new faceapi.draw.DrawTextField(text, anchor, drawOptions)
drawBox.draw(document.getElementById('myCanvas'))

DrawTextField drawing options:

export interface IDrawTextFieldOptions {
  anchorPosition?: AnchorPosition
  backgroundColor?: string
  fontColor?: string
  fontSize?: number
  fontStyle?: string
  padding?: number
}

export enum AnchorPosition {
  TOP_LEFT = 'TOP_LEFT',
  TOP_RIGHT = 'TOP_RIGHT',
  BOTTOM_LEFT = 'BOTTOM_LEFT',
  BOTTOM_RIGHT = 'BOTTOM_RIGHT'
}

Face Detection Options


export interface ISsdMobilenetv1Options {
  // minimum confidence threshold
  // default: 0.5
  minConfidence?: number

  // maximum number of faces to return
  // default: 100
  maxResults?: number
}

// example
const options = new faceapi.SsdMobilenetv1Options({ minConfidence: 0.8 })


export interface ITinyFaceDetectorOptions {
  // size at which image is processed, the smaller the faster,
  // but less precise in detecting smaller faces, must be divisible
  // by 32, common sizes are 128, 160, 224, 320, 416, 512, 608,
  // for face tracking via webcam I would recommend using smaller sizes,
  // e.g. 128, 160, for detecting smaller faces use larger sizes, e.g. 512, 608
  // default: 416
  inputSize?: number

  // minimum confidence threshold
  // default: 0.5
  scoreThreshold?: number
}

// example
const options = new faceapi.TinyFaceDetectorOptions({ inputSize: 320 })


Utility Classes


export interface IBox {
  x: number
  y: number
  width: number
  height: number
}


export interface IFaceDetection {
  score: number
  box: Box
}


export interface IFaceLandmarks {
  positions: Point[]
  shift: Point
}


export type WithFaceDetection<TSource> = TSource & {
  detection: FaceDetection
}


export type WithFaceLandmarks<TSource> = TSource & {
  unshiftedLandmarks: FaceLandmarks
  landmarks: FaceLandmarks
  alignedRect: FaceDetection
}


export type WithFaceDescriptor<TSource> = TSource & {
  descriptor: Float32Array
}


export type WithFaceExpressions<TSource> = TSource & {
  expressions: FaceExpressions
}


export type WithAge<TSource> = TSource & {
  age: number
}


export type WithGender<TSource> = TSource & {
  gender: Gender
  genderProbability: number
}

export enum Gender {
  FEMALE = 'female',
  MALE = 'male'
}


Other Useful Utility

Using the Low Level API

Instead of using the high level API, you can directly use the forward methods of each neural network:

const detections1 = await faceapi.ssdMobilenetv1(input, options)
const detections2 = await faceapi.tinyFaceDetector(input, options)
const landmarks1 = await faceapi.detectFaceLandmarks(faceImage)
const landmarks2 = await faceapi.detectFaceLandmarksTiny(faceImage)
const descriptor = await faceapi.computeFaceDescriptor(alignedFaceImage)

Extracting a Canvas for an Image Region

const regionsToExtract = [
  new faceapi.Rect(0, 0, 100, 100)
]
// actually extractFaces is meant to extract face regions from bounding boxes
// but you can also use it to extract any other region
const canvases = await faceapi.extractFaces(input, regionsToExtract)

Euclidean Distance

// meant to be used for computing the euclidean distance between two face descriptors
const dist = faceapi.euclideanDistance([0, 0], [0, 10])
console.log(dist) // 10

Retrieve the Face Landmark Points and Contours

const landmarkPositions = landmarks.positions

// or get the positions of individual contours,
// only available for 68 point face landmarks (FaceLandmarks68)
const jawOutline = landmarks.getJawOutline()
const nose = landmarks.getNose()
const mouth = landmarks.getMouth()
const leftEye = landmarks.getLeftEye()
const rightEye = landmarks.getRightEye()
const leftEyeBrow = landmarks.getLeftEyeBrow()
const rightEyeBrow = landmarks.getRightEyeBrow()

Fetch and Display Images from a URL

<img id="myImg" src="">
const image = await faceapi.fetchImage('/images/example.png')

console.log(image instanceof HTMLImageElement) // true

// displaying the fetched image content
const myImg = document.getElementById('myImg')
myImg.src = image.src

Fetching JSON

const json = await faceapi.fetchJson('/files/example.json')

Creating an Image Picker

<img id="myImg" src="">
<input id="myFileUpload" type="file" onchange="uploadImage()" accept=".jpg, .jpeg, .png">
async function uploadImage() {
  const imgFile = document.getElementById('myFileUpload').files[0]
  // create an HTMLImageElement from a Blob
  const img = await faceapi.bufferToImage(imgFile)
  document.getElementById('myImg').src = img.src
}

Creating a Canvas Element from an Image or Video Element

<img id="myImg" src="images/example.png" />
<video id="myVideo" src="media/example.mp4" />
const canvas1 = faceapi.createCanvasFromMedia(document.getElementById('myImg'))
const canvas2 = faceapi.createCanvasFromMedia(document.getElementById('myVideo'))

Available Models

Face Detection Models

SSD Mobilenet V1

For face detection, this project implements an SSD (Single Shot Multibox Detector) based on MobileNetV1. The neural net computes the locations of each face in an image and returns the bounding boxes together with its probability for each face. This face detector aims at high accuracy in detecting face bounding boxes rather than low inference time. The size of the quantized model is about 5.4 MB (ssd_mobilenetv1_model).

The face detection model has been trained on the WIDERFACE dataset and the weights are provided by yeephycho in this repo.

Tiny Face Detector

The Tiny Face Detector is a very performant, realtime face detector, which is much faster, smaller and less resource-consuming than the SSD Mobilenet V1 face detector; in return, it performs slightly worse at detecting small faces. This model is extremely mobile- and web-friendly, so it should be your go-to face detector on mobile devices and resource-limited clients. The size of the quantized model is only 190 KB (tiny_face_detector_model).

The face detector has been trained on a custom dataset of ~14K images labeled with bounding boxes. Furthermore the model has been trained to predict bounding boxes, which entirely cover facial feature points, thus it in general produces better results in combination with subsequent face landmark detection than SSD Mobilenet V1.

This model is basically an even tinier version of Tiny Yolo V2, replacing the regular convolutions of Yolo with depthwise separable convolutions. Yolo is fully convolutional, thus can easily adapt to different input image sizes to trade off accuracy for performance (inference time).

68 Point Face Landmark Detection Models

This package implements a very lightweight and fast, yet accurate 68 point face landmark detector. The default model has a size of only 350kb (face_landmark_68_model) and the tiny model is only 80kb (face_landmark_68_tiny_model). Both models employ the ideas of depthwise separable convolutions as well as densely connected blocks. The models have been trained on a dataset of ~35k face images labeled with 68 face landmark points.

Face Recognition Model

For face recognition, a ResNet-34-like architecture is implemented to compute a face descriptor (a feature vector with 128 values) from any given face image, which is used to describe the characteristics of a person's face. The model is not limited to the set of faces used for training, meaning you can use it for face recognition of any person, for example yourself. You can determine the similarity of two arbitrary faces by comparing their face descriptors, for example by computing the euclidean distance or using any other classifier of your choice.

The neural net is equivalent to the FaceRecognizerNet used in face-recognition.js and the net used in the dlib face recognition example. The weights have been trained by davisking and the model achieves a prediction accuracy of 99.38% on the LFW (Labeled Faces in the Wild) benchmark for face recognition.

The size of the quantized model is roughly 6.2 MB (face_recognition_model).

Face Expression Recognition Model

The face expression recognition model is lightweight, fast and provides reasonable accuracy. The model has a size of roughly 310kb and it employs depthwise separable convolutions and densely connected blocks. It has been trained on a variety of images from publicly available datasets as well as images scraped from the web. Note that wearing glasses might decrease the accuracy of the prediction results.

Age and Gender Recognition Model

The age and gender recognition model is a multitask network, which employs a feature extraction layer, an age regression layer and a gender classifier. The model has a size of roughly 420kb and the feature extractor employs a tinier but very similar architecture to Xception.

This model has been trained and tested on the following databases with an 80/20 train/test split each: UTK, FGNET, Chalearn, Wiki, IMDB*, CACD*, MegaAge, MegaAge-Asian. The * indicates that these databases have been algorithmically cleaned up, since the initial databases are very noisy.

Total Test Results

Total MAE (Mean Age Error): 4.54

Total Gender Accuracy: 95%

Test results for each database

The - indicates that there are no gender labels available for these databases.

Database         UTK   FGNET  Chalearn  Wiki  IMDB*  CACD*  MegaAge  MegaAge-Asian
Gender Accuracy  0.93  -      0.94      0.95  -      0.97   -        -

Test results for different age category groups

Age Range        0-3   4-8   9-18  19-28  29-40  41-60  60-80  80+
Gender Accuracy  0.69  0.80  0.88  0.96   0.97   0.97   0.96   0.9



Click me for Live Demos!

Download Details:

Author: justadudewhohacks
Source Code: https://github.com/justadudewhohacks/face-api.js 
License: MIT license



Yoha: A Practical Hand Tracking Engine


A practical hand tracking engine.


npm install @handtracking.io/yoha

Please note:

  • You need to serve the files from node_modules/@handtracking.io/yoha since the library needs to download the model files from here. (Webpack Example)
  • You need to serve your page with https for webcam access. (Webpack Example)
  • You should use cross-origin isolation as it improves the engine's performance in certain scenarios. (Webpack Example)


Yoha is a hand tracking engine that is built with the goal of being a versatile solution in practical scenarios where hand tracking is employed to add value to an application. While ultimately the goal is to be a general-purpose hand tracking engine supporting any hand pose, the engine evolves around specific hand poses that users/developers find useful. These poses are detected by the engine, which allows you to build applications with meaningful interactions. See the demo for an example.

Yoha is currently in beta.

About the name: Yoha is short for ("Your Hand Tracking").

Language Support

Yoha is currently available for the web via JavaScript. More languages will be added in the future. If you want to port Yoha to another language and need help, feel free to reach out.

Technical Details

Yoha was built from scratch. It uses a custom neural network trained using a custom dataset. The backbone for the inference in the browser is currently TensorFlow.js.


  • Detection of 21 2D-landmark coordinates (single hand).
  • Hand presence detection.
  • Hand orientation (left/right hand) detection.
  • Inbuilt pose detection.

Supported Hand Poses:

  • Pinch (index finger and thumb touch)
  • Fist

Your desired pose is not on this list? Feel free to create an issue for it.


Yoha was built with performance in mind. It is able to provide a realtime user experience on a broad range of laptops and desktop devices. The performance on mobile devices is not great, which hopefully will change with the further development of inference frameworks like TensorFlow.js.

Please note that native inference speed cannot be compared with web inference speed. Put differently, if you were to run Yoha natively it would be much faster than via the web browser.

Minimal Example

git clone https://github.com/handtracking-io/yoha && \
cd yoha/example && \
yarn && \
yarn start

Drawing Demo

git clone https://github.com/handtracking-io/yoha && \
cd yoha && \
./download_models.sh && \
yarn && \
yarn start

Quick Links:

Download Details:

Author: Handtracking-io
Source Code: https://github.com/handtracking-io/yoha 
License: MIT license



7 Most Popular Speech Recognition Software

What is Speech Recognition Software and where is it used?

Speech recognition software converts spoken language into text using algorithms. Based on a report by Research and Markets, the speech recognition market will grow from USD 10.70 billion in 2020 to USD 27.155 billion in 2026.

Speech recognition software is not just used for complex processes; it is also part of day-to-day life. For example, voice assistants are now being used in smartphones, at home, and even in automated vehicles.

When it comes to businesses, it is used by customer service executives to process routine requests. In the healthcare and legal systems, it is used for documentation. Basically, speech recognition software helps companies improve communication by turning speech into data that is easy to search and manage. There are many advanced solutions that provide AI-based or biometric speech recognition.

Natural Language Processing (NLP) is the branch of artificial intelligence that analyzes language data. Speech recognition and AI play an important role in NLP models and help improve the accuracy of recognizing human language; with the use of AI, speech recognition becomes more accurate.

Which are the Top Speech Recognition Software?

Here are the top 7 speech recognition software:

  1. Braina
  2. Dragon Speech Recognition Solutions 
  3. Winscribe
  4. Gboard
  5. Windows 10 Speech Recognition
  6. Otter
  7. Speechnotes

Let’s talk about each in detail.

1. Braina

Braina is one of the most highly preferred speech recognition software products, adept at handling over 90 languages with a high level of accuracy. It uses an AI-based voice recognition system through which you can control apps and dictate text in any application or website. The best part is that it is compatible with Windows, iOS, and Android. It is available in 3 versions: Braina Lite, Braina PRO, and Braina PRO Lifetime. The first is free of cost, while the latter two require an annual and a lifetime subscription, respectively.

2. Dragon Speech Recognition Solutions

This is one of the best software products out there for students, healthcare workers, and legal professionals. Owned by Nuance, it also supports cloud document management and has an accuracy rate of 99%. It is available in mobile and desktop versions and supports languages like Dutch, English, Italian, French, German, and Spanish. Its Auto-Text feature lets you insert an entire address with just a single word. The Dragon Home and Professional versions are paid, and users are given a free 7-day trial to get to know the software.

3. Winscribe

Winscribe is owned by Nuance and provides documentation workflow management so that users can organize their text. It supports Android, iPhone, and PC devices. It provides fast, easy, and secure documentation solutions, aiming to give professionals more time for work that adds value to their organization. One thing to note is that Winscribe is a speech recognition and document management application meant for professionals at medium and large organizations.

4. Gboard

I am sure you are familiar with this one. It is an easy-to-use keyboard application that lets you dictate text, search emoji and GIFs while texting, and type with swipe-style input. Simple, accurate, and free of cost, it works well with both Android and iOS. It also offers the option of personalizing the app so that Gboard can recognize the user's voice usage patterns and improve on them, which promises higher accuracy over time.

5. Windows 10 Speech Recognition

If you are a Windows 10 user, this speech recognition software is already included on your PC; you do not need to install or download anything. Features like punctuation commands, commands for moving the cursor within your document, deleting words, and more make it an excellent choice, although the commands are available only in US English. The software supports Mandarin Chinese, English, French, Japanese, Portuguese, and German, to name a few languages.

6. Otter

This is a speech recognition service designed by Otter.ai. It uses AI-powered transcription that can be easily integrated with Zoom or Google Meet to record meetings or webinars. Otter supports a variety of accents, such as Indian, American, German, Swiss, British, and Chinese. The software is available in 3 packages: Essential Otter, Premium Otter, and Teams Otter.

7. Speechnotes

A powerful, speech-enabled online notepad, Speechnotes lets users type at the speed of speech and move swiftly between voice and keyboard input. The product has been featured on Product Hunt, Techpp.com, and other leading tech publications. It is powered by Google's Speech-to-Text engines and makes for a world-class business tool, with accuracy measured at over 90%. It comes in 2 versions: Basic and Premium. The Basic version is free of cost, and the Premium Chrome extension costs USD 9.9. The best part about this software is that it works on any website, has custom text stamps, and can export easily to Google Drive.

Begin Now

With the growing impact of speech recognition in various fields, and of AI in personal life as well as at work, there is high demand for AI engineers and machine learning engineers, who are needed to build a stronger relationship between software and humans.

You can begin your AI journey with the Post Graduate Program in Artificial Intelligence & Machine Learning: Business Applications, offered by the McCombs School of Business at The University of Texas at Austin. This is one of the best artificial intelligence training programs offered by one of the top universities in the world.

The 6-month online program helps you master the skills needed to build machine learning and deep learning models, with lectures from UT Austin faculty and live mentorship sessions by industry professionals. You also get access to video lectures, hands-on practical projects, assignments, and quizzes curated to carve your niche in the AI market. On completing the online program, you will also earn a certificate from UT Austin.

Original article source at: https://www.mygreatlearning.com



The 19 Best Image Recognition Apps in 2022


In the age of smartphones and image-centric social media, image recognition apps are becoming more and more popular. These apps allow users to search for and identify specific images and recognize objects or scenes in pictures. While there are many image recognition apps available on the market, the best ones are those that can accurately identify a wide range of images.

If you’re interested in learning more about the best image recognition apps available in 2022, be sure to read our article!

#artificial-intelligence #recognition #python #machine-learning #tensorflow #keras #pandas #numpy #business #startup #startups 


Speech Recognition Module for Python, Supporting Several Engines, APIs


Library for performing speech recognition, with support for several engines and APIs, online and offline.

Speech recognition engine/API support (the recognizers documented later in this README):

  • CMU Sphinx (works offline)
  • Google Speech Recognition
  • Google Cloud Speech API
  • Wit.ai
  • Microsoft Bing Voice Recognition
  • api.ai
  • Houndify API
  • IBM Speech to Text

Quickstart: pip install SpeechRecognition. See the "Installing" section for more details.

To quickly try it out, run python -m speech_recognition after installing.
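In code, a minimal sketch of the same microphone demo might look like this (assuming PyAudio is installed and a working default microphone is available; the free Google Web Speech API is used for illustration):

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:                      # requires PyAudio
    print("Say something!")
    audio = r.listen(source)

try:
    print("You said: " + r.recognize_google(audio))  # free Google Web Speech API
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print("Recognition request failed; {0}".format(e))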

Project links:

Library Reference

The library reference documents every publicly accessible object in the library. This document is also included under reference/library-reference.rst.

See Notes on using PocketSphinx for information about installing languages, compiling PocketSphinx, and building language packs from online resources. This document is also included under reference/pocketsphinx.rst.


See the examples/ directory in the repository root for usage examples.
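For instance, a minimal file-transcription sketch (the file name audio.wav is hypothetical; WAV, AIFF, and FLAC files are supported):

import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:   # hypothetical file name
    audio = r.record(source)                # read the entire audio file into memory

print("Sphinx thinks you said: " + r.recognize_sphinx(audio))   # offline; needs PocketSphinx-Python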


Installing

First, make sure you have all the requirements listed in the "Requirements" section.

The easiest way to install this is using pip install SpeechRecognition.

Otherwise, download the source distribution from PyPI, and extract the archive.

In the folder, run python setup.py install.


Requirements

To use all of the functionality of the library, you should have:

  • Python 2.6, 2.7, or 3.3+ (required)
  • PyAudio 0.2.11+ (required only if you need to use microphone input, Microphone)
  • PocketSphinx (required only if you need to use the Sphinx recognizer, recognizer_instance.recognize_sphinx)
  • Google API Client Library for Python (required only if you need to use the Google Cloud Speech API, recognizer_instance.recognize_google_cloud)
  • FLAC encoder (required only if the system is not x86-based Windows/Linux/OS X)

The following requirements are optional, but can improve or extend functionality in some situations:

  • On Python 2, and only on Python 2, some functions (like recognizer_instance.recognize_bing) will run slower if you do not have Monotonic for Python 2 installed.
  • If using CMU Sphinx, you may want to install additional language packs to support languages like International French or Mandarin Chinese.

The following sections go over the details of each requirement.


The first software requirement is Python 2.6, 2.7, or Python 3.3+. This is required to use the library.

PyAudio (for microphone users)

PyAudio is required if and only if you want to use microphone input (Microphone). PyAudio version 0.2.11+ is required, as earlier versions have known memory management bugs when recording from microphones in certain situations.

If not installed, everything in the library will still work, except attempting to instantiate a Microphone object will raise an AttributeError.

The installation instructions on the PyAudio website are quite good - for convenience, they are summarized below:

  • On Windows, install PyAudio using Pip: execute pip install pyaudio in a terminal.
  • On Debian-derived Linux distributions (like Ubuntu and Mint), install PyAudio using APT: execute sudo apt-get install python-pyaudio python3-pyaudio in a terminal.

If the version in the repositories is too old, install the latest release using Pip: execute sudo apt-get install portaudio19-dev python-all-dev python3-all-dev && sudo pip install pyaudio (replace pip with pip3 if using Python 3).

  • On OS X, install PortAudio using Homebrew: brew install portaudio. Then, install PyAudio using Pip: pip install pyaudio.
  • On other POSIX-based systems, install the portaudio19-dev and python-all-dev (or python3-all-dev if using Python 3) packages (or their closest equivalents) using a package manager of your choice, and then install PyAudio using Pip: pip install pyaudio (replace pip with pip3 if using Python 3).

PyAudio wheel packages for common 64-bit Python versions on Windows and Linux are included for convenience, under the third-party/ directory in the repository root. To install, simply run pip install wheel followed by pip install ./third-party/WHEEL_FILENAME (replace pip with pip3 if using Python 3) in the repository root directory.

PocketSphinx-Python (for Sphinx users)

PocketSphinx-Python is required if and only if you want to use the Sphinx recognizer (recognizer_instance.recognize_sphinx).

PocketSphinx-Python wheel packages for 64-bit Python 2.7, 3.4, and 3.5 on Windows are included for convenience, under the third-party/ directory. To install, simply run pip install wheel followed by pip install ./third-party/WHEEL_FILENAME (replace pip with pip3 if using Python 3) in the SpeechRecognition folder.

On Linux and other POSIX systems (such as OS X), follow the instructions under "Building PocketSphinx-Python from source" in Notes on using PocketSphinx for installation instructions.

Note that the versions available in most package repositories are outdated and will not work with the bundled language data. Using the bundled wheel packages or building from source is recommended.

See Notes on using PocketSphinx for information about installing languages, compiling PocketSphinx, and building language packs from online resources. This document is also included under reference/pocketsphinx.rst.

Google Cloud Speech Library for Python (for Google Cloud Speech API users)

Google Cloud Speech library for Python is required if and only if you want to use the Google Cloud Speech API (recognizer_instance.recognize_google_cloud).

If not installed, everything in the library will still work, except calling recognizer_instance.recognize_google_cloud will raise a RequestError.

According to the official installation instructions, the recommended way to install this is using Pip: execute pip install google-cloud-speech (replace pip with pip3 if using Python 3).

FLAC (for some systems)

A FLAC encoder is required to encode the audio data to send to the API. If using Windows (x86 or x86-64), OS X (Intel Macs only, OS X 10.6 or higher), or Linux (x86 or x86-64), this is already bundled with this library - you do not need to install anything.

Otherwise, ensure that you have the flac command line tool, which is often available through the system package manager. For example, this would usually be sudo apt-get install flac on Debian-derivatives, or brew install flac on OS X with Homebrew.

Monotonic for Python 2 (for faster operations in some functions on Python 2)

On Python 2, and only on Python 2, if you do not install the Monotonic for Python 2 library, some functions will run slower than they otherwise could (though everything will still work correctly).

On Python 3, that library's functionality is built into the Python standard library, which makes it unnecessary.

This is because monotonic time is necessary to handle cache expiry properly in the face of system time changes and other time-related issues. If monotonic time functionality is not available, then things like access token requests will not be cached.

To install, use Pip: execute pip install monotonic in a terminal.


Troubleshooting

The recognizer tries to recognize speech even when I'm not speaking, or after I'm done speaking.

Try increasing the recognizer_instance.energy_threshold property. This is basically how sensitive the recognizer is to when recognition should start. Higher values mean that it will be less sensitive, which is useful if you are in a loud room.

This value depends entirely on your microphone or audio data. There is no one-size-fits-all value, but good values typically range from 50 to 4000.

Also, check on your microphone volume settings. If it is too sensitive, the microphone may be picking up a lot of ambient noise. If it is too insensitive, the microphone may be rejecting speech as just noise.
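For example, a small sketch of pinning the threshold manually (the value 4000 is just one of the typical values mentioned above):

import speech_recognition as sr

r = sr.Recognizer()
r.energy_threshold = 4000              # good values typically range from 50 to 4000
r.dynamic_energy_threshold = False     # optionally stop automatic adjustment from overriding it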

The recognizer can't recognize speech right after it starts listening for the first time.

The recognizer_instance.energy_threshold property is probably set to a value that is too high to start off with, and then being adjusted lower automatically by dynamic energy threshold adjustment. Before it is at a good level, the energy threshold is so high that speech is just considered ambient noise.

The solution is to decrease this threshold, or call recognizer_instance.adjust_for_ambient_noise beforehand, which will set the threshold to a good value automatically.
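A minimal sketch of the second approach:

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source)   # listens briefly and sets energy_threshold automatically
    audio = r.listen(source)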

The recognizer doesn't understand my particular language/dialect.

Try setting the recognition language to your language/dialect. To do this, see the documentation for recognizer_instance.recognize_sphinx, recognizer_instance.recognize_google, recognizer_instance.recognize_wit, recognizer_instance.recognize_bing, recognizer_instance.recognize_api, recognizer_instance.recognize_houndify, and recognizer_instance.recognize_ibm.

For example, if your language/dialect is British English, it is better to use "en-GB" as the language rather than "en-US".
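For example (assuming audio was captured earlier with a Recognizer instance r):

text = r.recognize_google(audio, language="en-GB")   # British English instead of the default "en-US"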

The recognizer hangs on recognizer_instance.listen; specifically, when it's calling Microphone.MicrophoneStream.read.

This usually happens when you're using a Raspberry Pi board, which doesn't have audio input capabilities by itself. This causes the default microphone used by PyAudio to simply block when we try to read it. If you happen to be using a Raspberry Pi, you'll need a USB sound card (or USB microphone).

Once you do this, change all instances of Microphone() to Microphone(device_index=MICROPHONE_INDEX), where MICROPHONE_INDEX is the hardware-specific index of the microphone.

To figure out what the value of MICROPHONE_INDEX should be, run the following code:

import speech_recognition as sr
for index, name in enumerate(sr.Microphone.list_microphone_names()):
    print("Microphone with name \"{1}\" found for `Microphone(device_index={0})`".format(index, name))

This will print out something like the following:

Microphone with name "HDA Intel HDMI: 0 (hw:0,3)" found for `Microphone(device_index=0)`
Microphone with name "HDA Intel HDMI: 1 (hw:0,7)" found for `Microphone(device_index=1)`
Microphone with name "HDA Intel HDMI: 2 (hw:0,8)" found for `Microphone(device_index=2)`
Microphone with name "Blue Snowball: USB Audio (hw:1,0)" found for `Microphone(device_index=3)`
Microphone with name "hdmi" found for `Microphone(device_index=4)`
Microphone with name "pulse" found for `Microphone(device_index=5)`
Microphone with name "default" found for `Microphone(device_index=6)`

Now, to use the Snowball microphone, you would change Microphone() to Microphone(device_index=3).
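For example, using index 3 from the sample listing above:

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone(device_index=3) as source:   # the Blue Snowball from the listing above
    audio = r.listen(source)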

Calling Microphone() gives the error IOError: No Default Input Device Available.

As the error says, the program doesn't know which microphone to use.

To proceed, either use Microphone(device_index=MICROPHONE_INDEX, ...) instead of Microphone(...), or set a default microphone in your OS. You can obtain possible values of MICROPHONE_INDEX using the code in the troubleshooting entry right above this one.

The code examples raise UnicodeEncodeError: 'ascii' codec can't encode character when run.

When you're using Python 2, and your language uses non-ASCII characters, and the terminal or file-like object you're printing to only supports ASCII, an error is raised when trying to write non-ASCII characters.

This is because in Python 2, recognizer_instance.recognize_sphinx, recognizer_instance.recognize_google, recognizer_instance.recognize_wit, recognizer_instance.recognize_bing, recognizer_instance.recognize_api, recognizer_instance.recognize_houndify, and recognizer_instance.recognize_ibm return unicode strings (u"something") rather than byte strings ("something"). In Python 3, all strings are unicode strings.

To make printing of unicode strings work in Python 2 as well, replace all print statements in your code of the following form:

print SOME_UNICODE_STRING
With the following:

print SOME_UNICODE_STRING.encode("utf8")

This change, however, will prevent the code from working in Python 3.
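If the same script must run on both Python 2 and Python 3, one option (a sketch, not part of the library) is to branch on the interpreter version:

import sys

def safe_print(text):
    if sys.version_info[0] < 3:
        print(text.encode("utf8"))   # Python 2: encode unicode to UTF-8 bytes before printing
    else:
        print(text)                  # Python 3: strings are already unicode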

The program doesn't run when compiled with PyInstaller.

As of PyInstaller version 3.0, SpeechRecognition is supported out of the box. If you're getting weird issues when compiling your program using PyInstaller, simply update PyInstaller.

You can easily do this by running pip install --upgrade pyinstaller.

On Ubuntu/Debian, I get annoying output in the terminal saying things like "bt_audio_service_open: [...] Connection refused" and various others.

The "bt_audio_service_open" error means that you have a Bluetooth audio device, but as a physical device is not currently connected, we can't actually use it - if you're not using a Bluetooth microphone, then this can be safely ignored. If you are, and audio isn't working, then double check to make sure your microphone is actually connected. There does not seem to be a simple way to disable these messages.

For errors of the form "ALSA lib [...] Unknown PCM", see this StackOverflow answer. Basically, to get rid of an error of the form "Unknown PCM cards.pcm.rear", simply comment out pcm.rear cards.pcm.rear in /usr/share/alsa/alsa.conf, ~/.asoundrc, and /etc/asound.conf.

For "jack server is not running or cannot be started" or "connect(2) call to /dev/shm/jack-1000/default/jack_0 failed (err=No such file or directory)" or "attempt to connect to server failed", these are caused by ALSA trying to connect to JACK, and can be safely ignored. I'm not aware of any simple way to turn those messages off at this time, besides entirely disabling printing while starting the microphone.

On OS X, I get a ChildProcessError saying that it couldn't find the system FLAC converter, even though it's installed.

Installing FLAC for OS X directly from the source code will not work, since it doesn't correctly add the executables to the search path.

Installing FLAC using Homebrew ensures that the search path is correctly updated. First, ensure you have Homebrew, then run brew install flac to install the necessary files.


Developing

To hack on this library, first make sure you have all the requirements listed in the "Requirements" section.

  • Most of the library code lives in speech_recognition/__init__.py.
  • Examples live under the examples/ directory, and the demo script lives in speech_recognition/__main__.py.
  • The FLAC encoder binaries are in the speech_recognition/ directory.
  • Documentation can be found in the reference/ directory.
  • Third-party libraries, utilities, and reference material are in the third-party/ directory.

To install/reinstall the library locally, run python setup.py install in the project root directory.

Before a release, the version number is bumped in README.rst and speech_recognition/__init__.py. Version tags are then created using git config gpg.program gpg2 && git config user.signingkey DB45F6C431DE7C2DCD99FF7904882258A4063489 && git tag -s VERSION_GOES_HERE -m "Version VERSION_GOES_HERE".

Releases are done by running make-release.sh VERSION_GOES_HERE to build the Python source packages, sign them, and upload them to PyPI.


To run all the tests:

python -m unittest discover --verbose

Testing is also done automatically by TravisCI, upon every push. To set up the environment for offline/local Travis-like testing on a Debian-like system:

sudo docker run --volume "$(pwd):/speech_recognition" --interactive --tty quay.io/travisci/travis-python:latest /bin/bash
su - travis && cd /speech_recognition
sudo apt-get update && sudo apt-get install swig libpulse-dev
pip install --user pocketsphinx monotonic && pip install --user flake8 rstcheck && pip install --user -e .
python -m unittest discover --verbose # run unit tests
python -m flake8 --ignore=E501,E701 speech_recognition tests examples setup.py # ignore errors for long lines and multi-statement lines
python -m rstcheck README.rst reference/*.rst # ensure RST is well-formed

FLAC Executables

The included flac-win32 executable is the official FLAC 1.3.2 32-bit Windows binary.

The included flac-linux-x86 and flac-linux-x86_64 executables are built from the FLAC 1.3.2 source code with Manylinux to ensure that they're compatible with a wide variety of distributions.

The built FLAC executables should be bit-for-bit reproducible. To rebuild them, run the following inside the project directory on a Debian-like system:

# download and extract the FLAC source code
cd third-party
sudo apt-get install --yes docker.io

# build FLAC inside the Manylinux i686 Docker image
tar xf flac-1.3.2.tar.xz
sudo docker run --tty --interactive --rm --volume "$(pwd):/root" quay.io/pypa/manylinux1_i686:latest bash
    cd /root/flac-1.3.2
    ./configure LDFLAGS=-static # compiler flags to make a static build
    make # build the flac binary
    exit # leave the container so the copy below runs on the host
cp flac-1.3.2/src/flac/flac ../speech_recognition/flac-linux-x86 && sudo rm -rf flac-1.3.2/

# build FLAC inside the Manylinux x86_64 Docker image
tar xf flac-1.3.2.tar.xz
sudo docker run --tty --interactive --rm --volume "$(pwd):/root" quay.io/pypa/manylinux1_x86_64:latest bash
    cd /root/flac-1.3.2
    ./configure LDFLAGS=-static # compiler flags to make a static build
    make # build the flac binary
    exit # leave the container so the copy below runs on the host
cp flac-1.3.2/src/flac/flac ../speech_recognition/flac-linux-x86_64 && sudo rm -r flac-1.3.2/

The included flac-mac executable is extracted from xACT 2.39, which is a frontend for FLAC 1.3.2 that conveniently includes binaries for all of its encoders. Specifically, it is a copy of xACT 2.39/xACT.app/Contents/Resources/flac in xACT2.39.zip.


Authors

Uberi <me@anthonyz.ca> (Anthony Zhang)
arvindch <achembarpu@gmail.com> (Arvind Chembarpu)
kevinismith <kevin_i_smith@yahoo.com> (Kevin Smith)
DelightRun <changxu.mail@gmail.com>
kamushadenes <kamushadenes@hyadesinc.com> (Kamus Hadenes)
sbraden <braden.sarah@gmail.com> (Sarah Braden)
tb0hdan (Bohdan Turkynewych)
Thynix <steve@asksteved.com> (Steve Dougherty)
beeedy <broderick.carlin@gmail.com> (Broderick Carlin)

Please report bugs and suggestions at the issue tracker!

How to cite this library (APA style):

Zhang, A. (2017). Speech Recognition (Version 3.8) [Software]. Available from https://github.com/Uberi/speech_recognition#readme.

How to cite this library (Chicago style):

Zhang, Anthony. 2017. Speech Recognition (version 3.8).

Also check out the Python Baidu Yuyin API, which is based on an older version of this project, and adds support for Baidu Yuyin. Note that Baidu Yuyin is only available inside China.

Author: Uberi
Source Code: https://github.com/Uberi/speech_recognition 
License: View license

#audio #python #recognition 

Grace Edwards


How to Implement Deep Face Recognition with Redis and Python

Key-value stores offer high speed and performance compared with relational databases. In this video, we are going to build a facial recognition application with the Redis key-value database, because key-value databases perform well on facial verification tasks.

We will use the deepface framework for face recognition models. It wraps several state-of-the-art face recognition models: VGG-Face, Google FaceNet, OpenFace, Facebook DeepFace, DeepID, Dlib, and ArcFace. We will use FaceNet in this video.

We will retrieve the vector representations of faces from Redis and compare them to a target image on the client side in Python.
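A minimal sketch of that workflow with the redis and deepface Python packages (the file names, Redis key, and match threshold are illustrative; DeepFace.represent's return format varies between deepface versions, so a flat embedding vector is assumed here):

import numpy as np
import redis
from deepface import DeepFace

r = redis.Redis(host="localhost", port=6379)      # assumes a local Redis instance

# store the FaceNet embedding of a known face (assumes represent() returns a flat vector)
embedding = np.array(DeepFace.represent("alice.jpg", model_name="Facenet"), dtype=np.float32)
r.set("face:alice", embedding.tobytes())

# client side: compare a target image against the stored vector
target = np.array(DeepFace.represent("target.jpg", model_name="Facenet"), dtype=np.float32)
stored = np.frombuffer(r.get("face:alice"), dtype=np.float32)
distance = np.linalg.norm(stored - target)        # Euclidean distance; smaller means more similar
print("match" if distance < 10 else "no match")   # threshold is illustrative only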

#python #redis #deepface #recognition 

Grace Edwards


How to Implement Deep Face Recognition with Apache Cassandra in 2021

In this video, we are going to show how to use the Apache Cassandra wide-column store for facial recognition tasks. Notice that NoSQL stores such as Redis and Cassandra perform well on face verification tasks. Cassandra is a wide-column store: in contrast to the Redis key-value store, it can store multiple columns for a row.

We will use the deepface library for Python to represent facial images as vectors. Then, we will store the vector representations in Cassandra.
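A minimal sketch of that idea with the cassandra-driver and deepface packages (the keyspace, table, and file names are hypothetical, and the table is assumed to exist already):

from cassandra.cluster import Cluster
from deepface import DeepFace

# assumes:  CREATE TABLE faces.embeddings (name text PRIMARY KEY, vector list<float>);
cluster = Cluster(["127.0.0.1"])                  # local Cassandra node
session = cluster.connect("faces")

# DeepFace.represent's return format varies by version; a flat vector is assumed here
vector = [float(x) for x in DeepFace.represent("alice.jpg", model_name="Facenet")]
session.execute(
    "INSERT INTO embeddings (name, vector) VALUES (%s, %s)",
    ("alice", vector),
)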

#cassandra #apache #recognition 

Nikita Koelpin


Python face recognition in 10 minutes

Face recognition tutorial with Python using OpenCV in only 10 minutes from your live webcam. Thank you for watching.

Requirements.txt: https://github.com/ageitgey/face_recognition/blob/master/requirements.txt

Activate venv with these commands:

  • Windows: mypython\Scripts\activate
  • OSX/Linux: source mypython/bin/activate
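Once the virtual environment is active and the requirements are installed, a minimal webcam-matching sketch with the face_recognition and OpenCV packages might look like this (the reference image me.jpg is hypothetical):

import cv2
import face_recognition

# encoding of one known face to compare against
known = face_recognition.face_encodings(face_recognition.load_image_file("me.jpg"))[0]

cap = cv2.VideoCapture(0)                            # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)     # face_recognition expects RGB images
    for encoding in face_recognition.face_encodings(rgb):
        match = face_recognition.compare_faces([known], encoding)[0]
        print("Known face!" if match else "Unknown face")
    if cv2.waitKey(1) & 0xFF == ord("q"):            # press q to quit
        break
cap.release()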

Learn how to make your own Discord bot: https://www.youtube.com/watch?v=_o8lwjVNJsg

Follow me on my new Twitter account 😊:

#enterflash #face #recognition #python #opencv #agitgey


Murtaza Hassan


Face Recognition and Attendance System using OpenCV

In this tutorial, we are going to learn how to perform facial recognition with high accuracy. We will first briefly go through the theory and learn the basic implementation. Then we will create an attendance project that uses a webcam to detect faces and record attendance live in an Excel sheet.
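The attendance-recording step itself can be as simple as appending a timestamped row per recognized name. A small sketch (using a CSV file rather than an Excel sheet for simplicity; the helper name and path are hypothetical):

import csv
from datetime import datetime

def mark_attendance(name, path="attendance.csv"):
    # append one row per recognized person; a spreadsheet can be built from this file later
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([name, datetime.now().strftime("%Y-%m-%d %H:%M:%S")])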

#opencv #face #recognition #attendance

Agnes Sauer


What is an OCR?? (Optical Character Recognition)

The need for digitisation is rapidly increasing in the modern era. Owing to the growth of information and communication technologies (ICT) and the wide availability of handheld devices, people often prefer digitised content over printed materials such as books and newspapers. It is also easier to organise digitised data and analyse it for various purposes with advanced techniques such as artificial intelligence. So, to keep up with the present technological scenario, it is necessary to convert information that currently exists only in printed form into a digitised format.

Here comes OCR, our saviour 💪💪, which helps us perform the tedious work of digitising this information. OCR stands for **Optical Character Recognition**, and its primary job is to recognise printed text in an image. Once we recognise the printed text with the help of OCR, we can use that information in various ways.
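As a quick illustration of what an OCR engine does (not part of this article's pipeline), the pytesseract wrapper around Tesseract can pull text out of an image in a couple of lines (the file name is hypothetical):

from PIL import Image
import pytesseract            # requires the Tesseract OCR engine to be installed

text = pytesseract.image_to_string(Image.open("scanned_page.png"))
print(text)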



What are you going to learn?

This is a 3-part series of articles that explains various concepts and phases of an OCR system. Let’s have a look at what you are going to learn in each part:

  • part-I (this article): a high-level theoretical overview of the working of an OCR system
  • part-II: the different steps performed in the **Pre-processing stage**, along with code samples
  • part-III: the different types of Segmentation that can be performed on a pre-processed image.

Let’s go…

The image below shows the different phases in the workflow of an OCR system.

#machine-learning #ocr #image-processing #recognition #segmentation #deep learning


Is There A Case Of Regulating Facial Recognition Technology?

Facial recognition is one of the most scrutinised technologies of the current era, and the debate against it has been raging for quite some time. However, the recent killing of George Floyd by a Minneapolis police officer has added urgency to framing strict regulations and guidelines around the use of this technology by law enforcement. Nevertheless, in the current era, this divisive technology has penetrated almost every aspect of human life: smartphones, airports, police stations, advertising, and payments. It has also replaced the dated technology of biometrics amid the COVID pandemic. The growing concerns raised by this incident have urged tech giants to reconsider their decisions to build and offer this technology to police authorities.
Read more: https://analyticsindiamag.com/is-there-a-case-of-regulating-facial-recognition-technology/

#facial #recognition #technology


Everything So Far In CVPR 2020 Conference

The Computer Vision and Pattern Recognition (CVPR) conference is one of the most popular events around the globe, where computer vision experts and researchers gather to share their work and views on trending techniques across various computer vision topics, including object detection, video understanding, and visual recognition, among others.

Read more: https://analyticsindiamag.com/everything-so-far-in-cvpr-2020-conference/

#artificial-intelligence #conference #computervision #techniques #recognition #imageclasification
