Beginner’s Guide to NVIDIA NeMo

This piece gives you a glimpse of the fundamental concepts behind NVIDIA NeMo, a powerful toolkit for building your own state-of-the-art conversational AI models. A typical conversational AI pipeline consists of the following domains:

Automatic Speech Recognition (ASR)
Natural Language Processing (NLP)
Text-to-Speech (TTS)

If you are looking for a full-fledged toolkit to train or fine-tune models for these domains, you might want to have a look at NeMo. It allows researchers and model developers to build their own neural network architectures using reusable components called Neural Modules (NeMo). According to the official documentation, neural modules are

“… conceptual blocks of neural networks that take typed inputs and produce typed outputs. Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations.”
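
To make the idea of "typed inputs and typed outputs" more concrete, here is a minimal sketch in plain Python. It is not NeMo's API: the names PortType, Module and check_connection are hypothetical and invented for illustration only, and the toy lambdas stand in for real neural network layers. The point it tries to show is that each block declares what kind of data it accepts and produces, so a mismatched connection can be caught before anything is run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class PortType:
    """Hypothetical type tag for data flowing between modules (not a NeMo class)."""
    kind: str                 # what the data represents, e.g. "encoded"
    axes: Tuple[str, ...]     # axis layout, e.g. ("batch", "time", "dim")


@dataclass
class Module:
    """Hypothetical stand-in for a neural module with typed ports."""
    name: str
    inputs: Dict[str, PortType]
    outputs: Dict[str, PortType]
    fn: Callable[..., dict]

    def __call__(self, **kwargs):
        # Only the declared input ports may be passed.
        if set(kwargs) != set(self.inputs):
            raise TypeError(f"{self.name} expects ports {sorted(self.inputs)}")
        return self.fn(**kwargs)


def check_connection(producer, out_port, consumer, in_port):
    """Raise TypeError unless the producer's output type matches the consumer's input type."""
    produced = producer.outputs[out_port]
    expected = consumer.inputs[in_port]
    if produced != expected:
        raise TypeError(
            f"{producer.name}.{out_port} is {produced}, "
            f"but {consumer.name}.{in_port} expects {expected}"
        )


AUDIO = PortType("audio_signal", ("batch", "time"))
ENCODED = PortType("encoded", ("batch", "time", "dim"))
LOG_PROBS = PortType("log_probs", ("batch", "time", "vocab"))

# Toy functions stand in for real neural network layers.
encoder = Module("encoder", {"audio": AUDIO}, {"encoded": ENCODED},
                 lambda audio: {"encoded": [[[x] for x in row] for row in audio]})
decoder = Module("decoder", {"encoded": ENCODED}, {"log_probs": LOG_PROBS},
                 lambda encoded: {"log_probs": encoded})

# The encoder's output type matches the decoder's input type, so this passes silently.
check_connection(encoder, "encoded", decoder, "encoded")

encoded = encoder(audio=[[0.1, 0.2]])["encoded"]
log_probs = decoder(encoded=encoded)["log_probs"]
```

Wiring the encoder's "encoded" output into a port that expects raw audio, for example check_connection(encoder, "encoded", encoder, "audio"), raises a TypeError in this sketch. That early failure is the benefit that typed ports are meant to provide.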

One major advantage of NeMo is that it can be used to train new models or to perform transfer learning on existing pre-trained models. On top of that, quite a number of pre-trained models are available on NVIDIA GPU Cloud (NGC). At the time of this writing, the GPU-accelerated cloud platform offers the following pre-trained models:

ASR

Jasper 10x5 — LibriSpeech
Multi-dataset Jasper 10x5 — LibriSpeech, Mozilla Common Voice, WSJ, Fisher, and Switchboard
AI-shell2 Jasper 10x5 — AI-shell2 Mandarin Chinese
Quartznet — LibriSpeech with speed perturbation
QuartzNetLibrispeechMCV — LibriSpeech, Mozilla Common Voice
Multi-dataset Quartznet — LibriSpeech, Mozilla Common Voice, WSJ, Fisher, and Switchboard
WSJ-Quartznet — Wall Street Journal, LibriSpeech, Mozilla Common Voice
AI-shell2 Quartznet — AI-shell2 Mandarin Chinese


towardsdatascience.com
