Introduction to Voice User Interfaces (Part 2)

Introduction to concepts like Phonetics, Language Models, Acoustic Models, Deep Neural Networks as Speech Models, Lexical Decoding and Challenges in ASR

Photo by Jonas Leupe on Unsplash

One of the most popular application areas for voice systems today is conversational AI. Graph-based interaction focuses on asking pointed questions in a prescribed order and only accepting specific terms as responses. We’ve all seen this before: we can’t move forward in the system until we provide our user ID, or we can’t specify our destination until we’ve provided our starting location.

Here’s an example:

Alexa, open graph travel agent.

Where are you flying from?

I want to fly from Columbus to San Francisco on October 9th.

You’re flying from Columbus. Got it. Where do you want to go?

San Francisco on Delta flight 722.

Okay. You’re trying to find a flight from Columbus to San Francisco. Which airline do you want to use?

We can see how that can get a little frustrating. The alternative, frame-based interaction, lets the user drive the conversation. We can say the words that make sense to us, make requests in whatever order we prefer, and jump to the part of the menu we want without having to memorize a list of options.

Alexa. Open frame travel agent.

Where are you flying from?

I want to fly from Columbus to San Francisco on October 9th.

Okay. You’re trying to find a flight from Columbus to San Francisco on October 9th. Which airline do you want to use?

That’s exciting, because it means there’s a lot of flexibility in what we can build when we design our own Alexa skills. But how do we know what we want Alexa to listen for? And how does that translate to what we want in response? In the example above, Alexa was very accommodating. One of the powerful things about building a Voice User Interface, or VUI, is how we define interactions with our users. We get to define a series of actions the user can perform, which we call intents. If our voice skill were a DVD player, our intents might be play, pause, stop, and eject. Our DVD player doesn’t have a pizza button because DVD players don’t traditionally make pizzas. We then create a set of sample statements, which we call utterances, that help Alexa understand which intent to use when a user says something. Going back to the DVD player example, someone might say “start the movie” and expect the play intent to fire. But they might also say “play,” “go,” “begin,” or “it’s show time” and expect the same reaction.


The intention is to play the movie. These example utterances help Alexa understand which intent should be invoked when the user says something to a skill. So if we think about the sort of script we want our user to have with Alexa, we can come up with a number of different ways they might converse. More user testing can help us expand our list of utterances as well. How do we model what the user will say? The surprising, or maybe not so surprising, answer is to grab a couple of colleagues, friends, and complete strangers and ask them to naturally request the outcome they need. Take notes, because we’ll quickly discover that every person approaches the request in a slightly different way, and our skill should be able to accommodate all of them.
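As a rough sketch, the DVD-player interaction model above can be represented as a mapping from intents to sample utterances. The intent names and matcher below are illustrative only, not a real Alexa schema (actual skills define this in a JSON interaction model):

```python
# A toy interaction model: each intent lists sample utterances.
INTERACTION_MODEL = {
    "PlayIntent":  ["play", "go", "begin", "start the movie", "it's show time"],
    "PauseIntent": ["pause", "hold on", "wait"],
    "StopIntent":  ["stop", "end the movie"],
    "EjectIntent": ["eject", "give me the disc"],
}

def match_intent(utterance):
    """Return the intent whose sample utterances contain the user's words."""
    text = utterance.lower().strip()
    for intent, samples in INTERACTION_MODEL.items():
        if text in samples:
            return intent
    return None  # unrecognized: a real skill would re-prompt the user
```

Note there is no “make pizza” intent, so that request falls through to the re-prompt path.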

Challenges in ASR

Continuous speech recognition has had a rocky history. In the early 1970s, the United States funded ASR research with a DARPA challenge. The goal was to develop a recognizer for a 1,000-word vocabulary, and it was achieved a few years later by Carnegie Mellon’s Harpy system. But future prospects were disappointing and funding dried up; this was the start of the first big AI winter. Performance improved in the ’80s and ’90s with the refinement of probabilistic models. More recently, computing power has made larger-dimensional neural network modeling a reality. So what makes speech recognition hard?

The first set of problems to solve relates to the audio signal itself: noise, for instance. Cars going by, clocks ticking, other people talking, microphone static. Our ASR has to know which parts of the audio signal matter and which parts to discard.

Variability of pitch and volume

One speaker sounds different from another, even when saying the same word. Pitch and loudness, at least in English, don’t change the ground truth of which word was spoken. If I say “hello” in three different pitches, it’s all the same word and spelling. We can even think of these differences as another kind of noise that needs to be filtered out.

Variability of word speed

Words spoken at different speeds need to be aligned and matched. Whether I say “speech” quickly or stretch it out slowly, it’s still the same word with the same number of letters. It’s up to the ASR to align the sequences of sound correctly.

Word boundaries

When we speak, words run from one to the next without pause; we don’t separate them naturally. Humans understand it because we already know where the word boundaries should be. This brings us to another class of problems that are language- or knowledge-related. The fact is, humans perceive speech with more than just their ears. We have domain knowledge of our language that allows us to automatically sort out ambiguities as we hear them: words that sound the same but have different spellings, word groups that are reasonable in one context but not in another.

Photo by BENCE BOROS on Unsplash

Here’s a classic example. When I say “recognize speech” very fast, it sounds a lot like “wreck a nice beach.” But you know what I mean because you know I’m discussing speech recognition; the context matters. An inference like this is tricky for a computer model. Another aspect to consider: spoken language is different from written language. There are hesitations, repetitions, sentence fragments, and slips of the tongue that a human listener is able to filter out. Imagine a computer that only knows language from audiobooks and newspapers read aloud; such a system may have a hard time decoding unexpected sentence structures.
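The word-boundary problem can be made concrete with a toy segmenter: given unspaced text and a small lexicon, greedy longest-match recovers one plausible segmentation. This is only a sketch; a real recognizer scores many competing hypotheses instead of committing to one.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation of unspaced text.
    Returns a list of words, or None if no segmentation exists."""
    if text == "":
        return []
    # Try the longest dictionary word first, then back off.
    for end in range(len(text), 0, -1):
        if text[:end] in lexicon:
            rest = segment(text[end:], lexicon)
            if rest is not None:
                return [text[:end]] + rest
    return None

LEXICON = {"recognize", "speech", "wreck", "a", "nice", "beach"}
```

Both phrases from the example above segment cleanly, which is exactly why acoustics alone can’t choose between them.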

Okay, we’ve identified lots of problems to solve here: variability of pitch, volume, and speed; ambiguity due to word boundaries, spelling, and context. We’re going to introduce ways to solve these problems with a number of models and technologies. We’ll start at the beginning, with the voice itself.


Phonetics

Phonetics is the study of sound in human speech. Linguistic analysis of languages around the world is used to break human words down into their smallest sound segments. In any given language, some number of phonemes define the distinct sounds of that language; in US English, there are generally 39 to 44 phonemes defined. A grapheme, in contrast, is the smallest distinct unit that can be written in a language. In US English, the smallest grapheme set we can define is the 26 letters of the alphabet plus a space. Unfortunately, we can’t simply map phonemes to graphemes, because some letters map to multiple phoneme sounds and some phonemes map to more than one letter combination. For example, in English the letter “c” sounds different in cat, chat, and circle. Meanwhile, the “ee” phoneme we hear in receive, beat, and beet is represented by different letter combinations.
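This many-to-many relationship shows up immediately in a toy pronunciation table (the entries below use standard ARPAbet symbols from a pronouncing dictionary):

```python
# Toy illustration of why letters and phonemes don't map one-to-one.
pronunciations = {
    "cat":    ["K", "AE", "T"],
    "chat":   ["CH", "AE", "T"],
    "circle": ["S", "ER", "K", "AH", "L"],
}

# The letter 'c' begins all three words but maps to three different phonemes:
first_phonemes = {word: phones[0] for word, phones in pronunciations.items()}
```

One grapheme, three sounds; the reverse direction (one sound, many spellings) is just as common.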

Lexical Decoding

Here’s a sample US English phoneme set called ARPAbet. ARPAbet was developed in 1971 for speech recognition research and contains 39 phonemes, 15 vowel sounds and 24 consonants, each represented as a one- or two-letter symbol. Phonemes are often a useful intermediary between speech and text. If we can successfully produce an acoustic model that decodes a sound signal into phonemes, the remaining task is to map those phonemes to their matching words. This step is called lexical decoding, and is based on a lexicon, or dictionary, of the data set.
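A minimal sketch of lexical decoding, assuming a hypothetical three-word lexicon with ARPAbet pronunciations: invert the lexicon so a phoneme sequence (as an acoustic model might emit) looks up its word.

```python
# Pronunciation lexicon: word -> phoneme sequence (ARPAbet).
LEXICON = {
    "speech": ("S", "P", "IY", "CH"),
    "beach":  ("B", "IY", "CH"),
    "wreck":  ("R", "EH", "K"),
}

# Inverted index for decoding: phoneme sequence -> word.
PHONES_TO_WORD = {phones: word for word, phones in LEXICON.items()}

def decode(phoneme_seq):
    """Map a decoded phoneme sequence to its word, or None if unknown."""
    return PHONES_TO_WORD.get(tuple(phoneme_seq))
```

Real systems search over lattices of uncertain phoneme hypotheses rather than exact lookups, but the lexicon plays the same role.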

Why not just use our acoustic model to translate directly into words? Why take the intermediary step?

That’s a good question, and there are systems that do translate features directly into words. It’s a design choice that depends on the dimensionality of the problem. If we want to train on a limited vocabulary of words, we might just skip the phonemes; but if we have a large vocabulary, converting to smaller units first reduces the number of comparisons that need to be made in the system overall.

Acoustic Models and the Trouble with Time

With feature extraction, we’ve addressed noise problems due to environmental factors as well as variability of speakers. Phonetics gives us a representation for sounds and language that we can map to. That mapping, from the sound representation to the phonetic representation, is the task of our acoustic model. We still haven’t solved the problem of matching variable lengths of the same word.

The dynamic time warping (DTW) algorithm calculates the similarity between two signals even if their time lengths differ. In speech recognition, for instance, it can be used to align the sequence data of a new word to its most similar counterpart in a dictionary of word examples. As we’ll soon see, hidden Markov models are also well suited to solving this type of time-series pattern sequencing within an acoustic model.
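A minimal DTW implementation makes the idea concrete: the same “word” stretched to twice its length still aligns perfectly with the original, so their DTW distance is zero.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.
    Builds the classic cumulative-cost matrix with the three-way
    (match / insert / delete) recurrence."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # skip a sample of `a`
                                 D[i, j - 1],      # skip a sample of `b`
                                 D[i - 1, j - 1])  # match both
    return D[n, m]

fast = [0, 2, 4, 2, 0]
slow = [0, 0, 2, 2, 4, 4, 2, 2, 0, 0]  # same shape, stretched in time
```

Euclidean distance can’t even be computed between these two sequences (different lengths); DTW handles the mismatch by warping the time axis.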

This characteristic explains their popularity in speech recognition solutions for the past 30 years. If we choose to use deep neural networks for our acoustic model, the sequencing problem reappears. We can address the problem with a hybrid HMM/DNN system, or we can solve it another way. We’ll review HMMs and how they’re used in speech recognition.

Language Models

So far, we have tools for addressing noise and speech variability through our feature extraction, and we have HMM models that can convert those features into phonemes and address the sequencing problems for our full acoustic model. We haven’t yet solved the problems of language ambiguity, though. The ASR system can’t tell from the acoustic model alone which combinations of words are most reasonable.

Photo by Jessica Ruscello on Unsplash

That requires knowledge. We either need to provide that knowledge to the model or give it a mechanism to learn this contextual information on its own.
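As a sketch, that knowledge can be as simple as N-gram counts: a bigram model built from a tiny, made-up corpus scores “recognize speech” as plausible and “recognize beach” as not.

```python
from collections import Counter

# Toy training corpus (illustrative only).
corpus = "we recognize speech . we recognize speech every day .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """P(w2 | w1) estimated from raw corpus counts (0 if unseen)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
```

Real language models use far larger corpora plus smoothing for unseen pairs, but the mechanism, preferring word sequences the training text makes likely, is the same.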

Deep Neural Networks as Speech Models

If HMMs work, why do we need a new model? It comes down to potential. Suppose we have all the data we need and all the processing power we want. How far can an HMM take us, and how far could some other model take us?

According to Baidu’s Adam Coates in a recent presentation, additional training of a traditional ASR system levels off in accuracy. Meanwhile, deep neural network solutions are unimpressive with small data sets, but they shine as we increase data and model sizes. Here’s the process we’ve looked at so far: we begin by extracting features from the audio speech signal with MFCC, use an HMM acoustic model to convert them into sound units (phonemes or words), and then use statistical language models such as N-grams to straighten out language ambiguities and produce the final text sequence.

It’s possible to replace the many tuned parts with a multilayer deep neural network. Let’s get a little intuition as to why they can be replaced. In feature extraction, we’ve used models based on human sound production and perception to convert a spectrogram into features. This is intuitively similar to using convolutional neural networks to extract features from image data. Spectrograms are visual representations of speech, so we ought to be able to let a CNN find the relevant features for speech in the same way. An acoustic model implemented with HMMs includes transition probabilities to organize time-series data. Recurrent neural networks can also track time-series data through memory. The traditional model also uses HMMs to sequence sound units into words, but RNNs only produce probability densities over each time slice.
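To see why a spectrogram can be treated like an image, here is a minimal short-time Fourier transform sketch in NumPy; the frame length, hop size, and test tone are arbitrary illustrative choices.

```python
import numpy as np

def spectrogram(signal, frame_len=128, hop=64):
    """Magnitude spectrogram: window overlapping frames, take the FFT
    of each, and stack the spectra into a (frequency x time) array,
    i.e. something a CNN could consume like an image."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T

t = np.arange(1024) / 8000.0                 # 1024 samples at 8 kHz
tone = np.sin(2 * np.pi * 440 * t)           # a 440 Hz test tone
S = spectrogram(tone)                        # shape: (65 freq bins, 15 frames)
```

A pure tone shows up as a single bright horizontal band: with 8000/128 = 62.5 Hz per bin, the 440 Hz energy lands in bin 7 of every frame.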

So we need another way to solve the sequencing issue. A Connectionist Temporal Classification (CTC) layer is used to convert the RNN outputs into words, so we can replace the acoustic portion of the network with a combination of RNN and CTC layers. The end-to-end DNN still makes linguistic errors, especially on words it hasn’t seen in enough examples. It should be possible for the system to learn language probabilities from audio data, but at present there just isn’t enough. The existing technology of N-grams can still be used. Alternatively, a neural language model can be trained on massive amounts of available text. Using an NLM layer, the probabilities of spelling and context can be rescored for the system.
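The core CTC decoding rule is simple to sketch: collapse runs of repeated per-frame labels, then drop the blank token (the blank symbol `_` here is illustrative). This is greedy decoding over an already-chosen best label per frame, not the full CTC loss.

```python
BLANK = "_"  # CTC's special "no output" symbol

def ctc_collapse(frame_labels):
    """Collapse a per-frame label sequence into its output string:
    merge repeats, then remove blanks. A blank between two identical
    labels is what lets CTC emit genuine double letters."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)
```

For example, the frame sequence `c c c _ a a _ t` collapses to `cat`, while `h h e e l _ l l o` keeps both l’s because the blank separates them.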
