Creating a simple speech to speech translation pipeline

Using microsoft azure services we will create the following pipeline:

ASR (automatic speech recognition)
MT (machine translation)
TTS (Text-to-speech)
(optional) Use neural voices to improve the output speech

My desire to speak other languages came when I was 18 and went to live in Accra, Ghana. I was a missionary for the Church of Jesus Christ of Latter-day Saints, called to serve the people I would meet there. In Ghana over 50 languages are recognized and where I lived (I moved four times) there were 3 or 4 main languages spoken. I wanted to connect with and serve these super great people, so I did what I could to learn their languages.

Almost 6 years now since I got back, I’m studying machine translation and looking to make my first attempt at making MT a thing for languages that I spoke like Twi. This article isn’t those efforts, but a starting point for people interested in machine translation.

This is not an in-depth look, or from scratch, article either. This is showing how to use some very powerful tools in a basic way, just to get you started. Most of this can be found in following along with Microsoft’s own quickstarts for the azure speech service or pyaudio’s documenation.

Honestly, I’m very impressed with Microsoft azure. In a machine translation course that I’m taking we found that for several languages (maybe more, but we haven’t tested them all) it was outpacing Google translate in accuracy (mainly tested with bleu scores). Plus the TTS allows you to improve the voice, which is a ton of fun!

Let’s get started with code and a basic understanding of what is happening.

ASR

Automatic speech recognition takes some pretty serious work. If you want to work with a little more exposed example you can try using pyaudio. They are awesome for experimenting with. Let me show you an example of that and then the Microsoft one and you can decide which you want to proceed with.

_This file above lets you record and listen to .wav files. It’s nice in that you see a little more behind the scenes of what is going on. There are lots of wonderful articles and writings that explain this better than I will take the time to here. One little technicality to note is that if you are using vs code on a mac, and you try to call record your mic won’t turn on. Try _this if that is case.

MT

Machine translation, this is a massive field of research all on its own, but that is a discussion for another time in your life. What is so cool is that the training and fine tuning of a very powerful NMT is accessible to us, and it can help us start our MT journey. So, for now, rather than building an NMT from scratch, we will focus on using azure’s speech MT. This requires a ‘translation’ resource which you set up the same way you set up your ASR. A reminder to wait while it finishes setting up so that you can find it later (I made that mistake and took a long time wondering where it went only to have to make it again).

TTS

Text-to-speech is a very fun part of this, especially if you continue to option four. You may be able to use your original speech api key, but you’ll need to have a specific region region set to use neural voices. You can find your best region at that link and set it when creating this speech resource.

Combined pipeline

Now to put it all together, a simple file helps us easily call each service. Some parts have been commented out that you can play with to see both how pyaudio works and the changes other options make.

#translation #nmt #azure #machine-translation

ASR

MT

TTS

Combined pipeline

medium.com

Creating a simple speech to speech translation pipeline