One of the first attempts to make computers recognize speech focused on recognizing numbers! In 1952, Bell Laboratories designed the Audrey system, which could recognize digits spoken by a single voice. There have been numerous other experiments since then, which are well documented on Wikipedia. Fast forward to today, and we have state-of-the-art Automatic Speech Recognition (ASR) engines like Apple’s Siri, Google Assistant, and Amazon’s Alexa.

For a long time, Google’s Speech-to-Text (STT) API was the de facto choice for any ASR task. This slowly changed when open-source alternatives like Mozilla DeepSpeech came out in late 2017. DeepSpeech is based on Baidu’s original Deep Speech research paper and is trained (mostly) on American English datasets, which results in poor generalization to other English accents.

For a recent internship, I had to integrate an ASR engine into a video-conferencing platform that would be used primarily by Indian users. We preferred open-source alternatives, but most of the general-purpose ones performed poorly in real-time meetings. That’s when I came across DeepSpeech and the Indic TTS project by IITM.

The Indic dataset contains more than 50 GB of speech samples from speakers across 13 Indian states. It comprises 10,000+ spoken English sentences from both male and female native speakers. The audio is provided as .wav files along with the corresponding text transcripts.
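Since DeepSpeech expects 16 kHz, 16-bit, mono PCM WAV input, it’s worth checking the properties of the Indic recordings before training. Here’s a minimal sketch using Python’s built-in wave module; the file path is just a placeholder for wherever you extract the dataset.

```python
import wave

# Placeholder path — point this at any recording from the extracted Indic TTS dataset.
sample_path = "indic_tts/tamil_female/english/wav/sentence_0001.wav"

with wave.open(sample_path, "rb") as w:
    channels = w.getnchannels()           # DeepSpeech expects mono (1 channel)
    sample_rate = w.getframerate()        # DeepSpeech expects 16000 Hz
    bit_depth = w.getsampwidth() * 8      # DeepSpeech expects 16-bit PCM
    duration = w.getnframes() / sample_rate

    print(f"channels={channels}, rate={sample_rate} Hz, "
          f"depth={bit_depth}-bit, duration={duration:.2f} s")
```

If a recording doesn’t match those properties, it will need to be resampled or converted during pre-processing.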

In this article, I’ll walk you through the process of fine-tuning DeepSpeech on the Indic dataset, but you can easily follow the same steps for other English datasets too. You can sign up on the IITM website and request access to the dataset.

Prerequisites: Familiarity with ASR engines and speech processing, and a basic understanding of Recurrent Neural Networks and TensorFlow.

Note: All my training and pre-processing was done on Google Colab with DeepSpeech version 0.7.4.
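If you want to replicate that setup, the following Colab cell is a rough sketch of how I’d pull in the DeepSpeech training code at the matching version. The checkpoint URL is assumed from the naming of the v0.7.4 release assets, so verify it before downloading.

```python
# Colab cell: lines starting with '!' run as shell commands.
# Clone the DeepSpeech training code at the same version as the pre-trained model.
!git clone --branch v0.7.4 --depth 1 https://github.com/mozilla/DeepSpeech
%cd DeepSpeech
!pip install --upgrade pip wheel setuptools
!pip install --upgrade -e .   # installs the DeepSpeech training dependencies

# Pre-trained checkpoint to fine-tune from (URL assumed from the v0.7.4 release page).
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.7.4/deepspeech-0.7.4-checkpoint.tar.gz
!tar xzf deepspeech-0.7.4-checkpoint.tar.gz
```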

