There are many ways to interact with apps. Obviously the keyboard has been, and still is, one of the most often-used devices for communicating with computers. More recently (in human years, not computer years), the mouse allowed us to move past text-based interfaces. And improving on the mouse, touch has let us use more natural interactions. But when people communicate with each other in person, they don’t type what they are thinking or point at it; they use spoken language. So why should it be any different with a computer?

Google Cloud Speech APIs

There are two sides to a spoken conversation: the speaker and the listener. The speaker must generate the sounds to express ideas, and the listener interprets those sounds to reconstruct the ideas. This might seem like stating the obvious, since people do this every day. But a computer doesn’t understand this and can’t understand it unless these interactions are explained to it in great detail. This is why there are two speech APIs in Google Cloud.

First is the Text-to-Speech (or TTS) API. This service converts written (or typed) text into sounds resembling a human voice. It offers over 200 voices in over 40 languages. It supports Speech Synthesis Markup Language, or SSML, which lets you annotate written text with a set of “stage instructions” that customize the sounds for a more realistic effect. This includes adding pauses to the text and controlling the pronunciation of acronyms and abbreviations.
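
To make this concrete, here is a minimal sketch (not from the original article) of calling the Text-to-Speech API with the Python client library. The SSML snippet shows a pause and an acronym read out letter by letter; the voice name, language code, and output file name are arbitrary choices for the example.

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# SSML "stage instructions": a half-second pause and an acronym spelled out.
ssml = (
    "<speak>"
    "Welcome to Google Cloud.<break time='500ms'/>"
    'The <say-as interpret-as="characters">TTS</say-as> API turns text into speech.'
    "</speak>"
)

synthesis_input = texttospeech.SynthesisInput(ssml=ssml)

# Choose a language and voice; en-US-Wavenet-D is one example of the many available voices.
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Wavenet-D",
)

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

# The API returns raw audio bytes; write them out as an MP3 file.
with open("greeting.mp3", "wb") as out:
    out.write(response.audio_content)
```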

The other API is its complement, Speech-to-Text (or STT). If TTS is the speaker, then STT is the listener. The STT API can transcribe speech in more than 125 languages. And like the TTS API, it can be customized; a common use case is recognizing industry-specific jargon. The STT API can even transcribe streaming audio in real time.
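
For the listening side, here is a similarly minimal sketch (again, not from the original article) that transcribes a short local WAV file with the Python client library and passes phrase hints so the recognizer favors some domain terms. The file name, sample rate, and hint phrases are assumptions made up for the example.

```python
from google.cloud import speech

client = speech.SpeechClient()

# Load a short local recording; longer or live audio would use
# long-running or streaming recognition instead.
with open("support_call.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    # Phrase hints bias the recognizer toward industry-specific jargon.
    speech_contexts=[speech.SpeechContext(phrases=["BigQuery", "Pub/Sub", "Dataflow"])],
)

response = client.recognize(config=config, audio=audio)

# Each result carries one or more alternatives; print the top transcript.
for result in response.results:
    print(result.alternatives[0].transcript)
```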

#data-analysis #machine-learning
