The Web Speech API is one of those web technologies that rarely gets talked or written about. In this blog post, we are going to take a closer look at what the API is capable of, what its strengths and limitations are, and how web developers can use it to enhance the user’s browsing experience.

“The Web Speech API enables you to incorporate voice data into web apps. The Web Speech API has two parts: SpeechSynthesis (Text-to-Speech), and SpeechRecognition (Asynchronous Speech Recognition)” — Mozilla Developer Network

Can the Web Speech API be used to interact with complex web forms? This is our research question for this blog post, and we are going to use both parts of the API to answer it. But let’s not get ahead of ourselves, let’s start by learning how the API actually works.

Making browsers talk with Speech Synthesis

https://vimeo.com/387398084

Web Speech API - Synthesis

Note: Left: Microsoft Edge | Right: Google Chrome

Demo: https://experiment-web-speech.now.sh/pages/blog.html

Controlling the browser’s voice using Speech Utterance

We are going to use the Speech Synthesis API to read out one of our previous blog posts. The speechSynthesis interface is the service controller. This interface can be used to retrieve the available synthesis voices on the user’s device, play and pause the speech, and much more.

The SpeechSynthesisUtterance interface represents a speech request. It contains the text that the speechSynthesis service will read out, and it contains basic information like the voice’s pitch, volume, language, and rate.

```javascript
const speechSynthesis = window.speechSynthesis;
const speechUtterance = new SpeechSynthesisUtterance();

function isPreferredVoice(voice) {
  // Array.prototype.some() checks whether the voice matches at least
  // one of our preferred voice names.
  return ["Google US English", "Microsoft Jessa Online"].some(preferredVoice =>
    voice.name.startsWith(preferredVoice)
  );
}

function onVoiceChange() {
  speechSynthesis.addEventListener("voiceschanged", () => {
    const voices = speechSynthesis.getVoices();
    // find() returns undefined when no preferred voice is available,
    // which makes the utterance fall back to the browser's default voice.
    speechUtterance.voice = voices.find(isPreferredVoice);
    speechUtterance.lang = "en-US";
    speechUtterance.volume = 1; // 0 to 1
    speechUtterance.pitch = 1;  // 0 to 2
    speechUtterance.rate = 1;   // 0.1 to 10
  });
}
```

When a website is fully loaded, the speechSynthesis API will fetch all available voices asynchronously. Once done, it will fire a voiceschanged event letting us know that everything is ready to go. Utterances added to the queue before this event is fired will still work. They will, however, use the browser’s default voice with its default settings.
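Because some browsers populate the voice list synchronously while others only fill it after voiceschanged fires, it can be convenient to hide that difference behind a promise. Here is a minimal sketch of that idea; `loadVoices` and its `synth` parameter are our own names, not part of the API:

```javascript
// Sketch: resolve with the voice list whether the browser loads it
// synchronously or asynchronously. `synth` is whatever implements the
// speechSynthesis interface (window.speechSynthesis in the browser).
function loadVoices(synth) {
  return new Promise(resolve => {
    const voices = synth.getVoices();
    if (voices.length > 0) {
      resolve(voices); // voices were already loaded synchronously
      return;
    }
    synth.addEventListener(
      "voiceschanged",
      () => resolve(synth.getVoices()),
      { once: true }
    );
  });
}

// In the browser:
// loadVoices(window.speechSynthesis).then(voices => { /* pick a voice */ });
```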

The getVoices() method returns every loaded voice. This list contains both native voices and browser-specific ones. Not every browser provides custom voice services, but both Google Chrome and Microsoft Edge do. Generally speaking, these voices sound much better, but you are sacrificing privacy for quality. They also require an internet connection.
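Each SpeechSynthesisVoice exposes a localService boolean, which lets you tell on-device voices apart from the browser-hosted ones, for example when privacy or offline use matters. A small sketch (`partitionVoices` is our own helper name):

```javascript
// Sketch: split a voice list into on-device and remote (browser-hosted)
// voices using the localService flag on SpeechSynthesisVoice.
function partitionVoices(voices) {
  return {
    local: voices.filter(voice => voice.localService),
    remote: voices.filter(voice => !voice.localService)
  };
}

// In the browser: partitionVoices(window.speechSynthesis.getVoices());
```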

Warning: While using Google Chrome’s custom voice service, each utterance instance has a limit of roughly 200 to 300 characters. If the utterance’s text changes, the limit is reset. It is unclear whether this is an intentional limitation or a bug.

The function above sets up our SpeechSynthesisUtterance instance once the voiceschanged event is fired. Using the Array.prototype.find() method, we select our preferred voice, gracefully falling back to the browser’s default when none of the preferred voices is available.
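One way to work around the character limit mentioned in the warning above is to split long text into several shorter utterances before queueing them. The sketch below is our own `chunkText` helper, not part of the API; it prefers sentence ends, then word boundaries, then a hard cut, and uses a conservative default maximum:

```javascript
// Sketch: split text into chunks below the observed utterance limit,
// preferring sentence ends, then word boundaries, then a hard cut.
function chunkText(text, maxLength = 180) {
  const chunks = [];
  let remaining = text.trim();
  while (remaining.length > maxLength) {
    const slice = remaining.slice(0, maxLength);
    let cut = Math.max(slice.lastIndexOf(". "), slice.lastIndexOf(" "));
    if (cut <= 0) cut = maxLength - 1; // no boundary found: hard cut
    chunks.push(remaining.slice(0, cut + 1).trim());
    remaining = remaining.slice(cut + 1).trim();
  }
  if (remaining.length > 0) chunks.push(remaining);
  return chunks;
}
```

Each chunk can then be wrapped in its own SpeechSynthesisUtterance and queued with speechSynthesis.speak(), which queues utterances and plays them back to back.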
