Echo, Siri and Cortana are popular VUIs. However, these assistants were designed with different goals in mind. Echo was designed as a voice-first interface, while Siri was designed as just another way of interacting with your iPhone. This is changing.
Speech recognition refers to what the recognition system hears. The recognition engine returns a set of responses for every user query. As this technology improves, the challenge of designing a great VUI lies in how the system responds. Natural language understanding (NLU) is how the VUI interprets those responses. What matters here is how the input is handled, not only how accurately the speech was transcribed.
Handling user responses
When you say “Read me this article”, I can interpret this as “Read me this article” or “Reed me this article.” The second interpretation does not really make sense. As an intelligent person, I am expected to understand the context and the logic of the statement to respond to it. Likewise, an intelligent system needs to know what the user meant and respond to it accordingly. Here are the different types of responses that the system needs to handle:
For many of the questions that a system asks, there is only a finite set of responses that are logically valid. For example, when the system asks “What is your favorite animal?” or “Do you want to go outside?”, saying “orange” is not a logical response. Providing a mechanism to capture and interpret responses and discard irrelevant ones is especially challenging for a voice-based system, where the user can choose from a seemingly infinite response set. In the first case, the VUI will have a list of accepted animal names, while in the second we only need to look at variations of “yes” and “no.” If the user says something outside the finite set of responses, it can be handled with a null response like “I’m sorry, I don’t understand.”
Here are some examples that require constrained responses:
- What is your favorite fruit? : Mango, Apple, Banana, etc.
- What country are you from? : India, USA, China, etc.
- What song do you want to hear? : Fix You, Lasya, Roja, etc.
- Would you like to book the tickets? : Yes, No, Naah, Oh yeah, etc.
The response sets might be lengthy, but they are still finite.
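A minimal sketch of constrained-response handling, assuming the recognizer hands us a plain-text transcript. The accepted variants, the `match_response` helper, and the reply wording are illustrative assumptions, not part of any particular VUI toolkit.

```python
import string
from typing import Dict, Optional, Set

# Hypothetical finite response set for a yes/no prompt.
ACCEPTED: Dict[str, Set[str]] = {
    "yes": {"yes", "yeah", "oh yeah", "sure"},
    "no": {"no", "naah", "nope"},
}

NULL_RESPONSE = "I'm sorry, I don't understand."


def normalize(utterance: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    stripped = utterance.lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(stripped.split())


def match_response(
    utterance: str, accepted: Dict[str, Set[str]] = ACCEPTED
) -> Optional[str]:
    """Return the label whose variants contain the utterance, else None."""
    text = normalize(utterance)
    for label, variants in accepted.items():
        if text in variants:
            return label
    return None


def reply(utterance: str) -> str:
    """Fall back to the null response for anything outside the set."""
    label = match_response(utterance)
    return NULL_RESPONSE if label is None else f"Got it: {label}."
```

The same lookup works for longer sets such as fruit or country names; only the `accepted` dictionary changes.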
One case is when you want the conversation to have a natural flow but do not explicitly want to handle the input. For example, the assistant might say “Hey there! Long time no see!” to which the user might reply “Nothing much” or “Gotta go work now.” In that case, the assistant can give a generic reply like “Hmm.. I see” or “Ok alright.” The user's response is not critical for the conversation to continue: the user could say anything and the assistant's reply would still make sense, so a generic answer is fine. Generic replies also work well for confirmations like “Alright! Done!” or “I’ll send this information to our customer service team.”
Categorization of input
A good strategy to handle user input is to categorize inputs in broadly defined buckets such as positive/negative, happy/sad/excited, good/bad, etc. The VUI simply looks to map to a category rather than give an exact response.
For example, the VUI can ask “How was your experience at our restaurant?”:
- Good: Good, Amazing, Terrific, Awesome, etc.
- Bad: Depressing, Bad, Irritating, etc.
The assistant can then respond accordingly.
Avoid announcing what the user has already told you they are feeling. For example, when the user says that the experience was bad, do not say “It seems you’ve had a bad experience. Let us know how we can improve.” The user has already indicated the mood, so repeating it sounds unnatural. Instead, try saying something reassuring like “I’m sorry to hear that. Want to tell us more?”
Looking for specific keywords or phrases is a simpler method, but it is important for a voice-based system to allow for more complex queries. For example, the intent for the following queries is the same:
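The bucketing idea can be sketched as a keyword-to-category lookup. The word lists come from the restaurant example; the `categorize` helper and the reply for the "good" bucket are assumptions made for illustration. This simple lookup deliberately ignores negation and word order.

```python
import re
from typing import Optional

# Buckets taken from the restaurant example; extend as needed.
CATEGORIES = {
    "good": {"good", "amazing", "terrific", "awesome"},
    "bad": {"depressing", "bad", "irritating"},
}

# The "bad" reply follows the advice in the text; the "good" reply is an assumption.
REPLIES = {
    "good": "Glad to hear that!",
    "bad": "I'm sorry to hear that. Want to tell us more?",
}


def categorize(utterance: str) -> Optional[str]:
    """Map an utterance to the first bucket that shares a word with it."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    for category, keywords in CATEGORIES.items():
        if words & keywords:
            return category
    return None


def respond(utterance: str) -> str:
    """Reply to the bucket, not to the exact wording the user chose."""
    category = categorize(utterance)
    if category is None:
        return "Thanks for letting us know."  # generic fallback (assumption)
    return REPLIES[category]
```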
- “My computer is really slow”
- “My computer is really really slow”
- “Computer is slow. What to do?”
Booking a cab or ordering food are simple intents, but there might be variations in the way a user asks for information. It would be a huge task to write a condition for each of these variations. Instead, the system could learn to recognize common patterns.
Imagine a VUI asking you “How was your experience at the restaurant?” and you say “Not very good.” The VUI designers have not considered this response, so the system picks up the keyword “good” and responds by saying “Awesome! Thanks!” The VUI already sounds stupid, and the user might become wary of trusting the assistant. Handling negation is a much more difficult task, but the cost of ignoring it is high.
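One lightweight way to recognize common patterns, short of training a model, is a small table of regular expressions per intent. The intent names and patterns below are hypothetical and only cover the examples in the text; a production system would more likely rely on a trained NLU model.

```python
import re
from typing import Optional

# Hypothetical intent patterns; each one tolerates filler words in between.
INTENT_PATTERNS = {
    "slow_computer": re.compile(r"\bcomputer\b.*\bslow\b"),
    "book_cab": re.compile(r"\b(book|get|need)\b.*\b(cab|taxi)\b"),
}


def detect_intent(utterance: str) -> Optional[str]:
    """Return the first intent whose pattern matches, else None."""
    text = utterance.lower()
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(text):
            return intent
    return None


if __name__ == "__main__":
    for query in (
        "My computer is really slow",
        "My computer is really really slow",
        "Computer is slow. What to do?",
    ):
        print(query, "->", detect_intent(query))
```

All three phrasings land on the same intent without a separate condition for each one.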
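A rough sketch of one way to avoid the “Not very good” trap: look a few words back from a sentiment keyword and flip the polarity if a negator appears. The word lists, the window size, and the `sentiment` function are assumptions for illustration; this heuristic is far from complete (for instance, it treats “not bad” as positive).

```python
import re
from typing import Optional

POSITIVE = {"good", "amazing", "terrific", "awesome"}
NEGATIVE = {"bad", "depressing", "irritating"}
# Hypothetical negator list; real systems need a much richer treatment.
NEGATORS = {"not", "never", "hardly", "isn't", "wasn't", "didn't"}


def sentiment(utterance: str, window: int = 3) -> Optional[str]:
    """Classify as 'good' or 'bad', flipping when a negator precedes the keyword."""
    text = utterance.lower().replace("\u2019", "'")  # normalize curly apostrophes
    tokens = re.findall(r"[a-z']+", text)
    for i, token in enumerate(tokens):
        if token in POSITIVE or token in NEGATIVE:
            polarity = "good" if token in POSITIVE else "bad"
            preceding = tokens[max(0, i - window):i]
            if any(word in NEGATORS for word in preceding):
                polarity = "bad" if polarity == "good" else "good"
            return polarity
    return None  # no sentiment keyword found
```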
The word disambiguation literally means removing uncertainty, and it is arguably one of the most important problems that voice interfaces need to tackle. A simple example would be placing a phone call. If you ask Siri to “call John” and there are multiple Johns in your contact list, it would ask you “Which John?” followed by the full names of the matching contacts.
The system also needs to disambiguate in cases where the user provides insufficient or excessive information. For example, if a user says “I’d like a large pizza,” the natural follow-up question could be “What kind of pizza would you like?” This is a case of insufficient information, which the assistant can resolve by asking a leading question. When the user gives excessive information that the system is not built to handle, the system can ask the user to provide only one piece of information at a time. However, it would be more beneficial to account for multiple pieces of information.
For more complex VUIs, you need smarter ways of handling speech input. Take messaging: there are multiple things you can do with a messaging app. You could say “Send a message to mom,” “Read my last message” or “Have I got any messages?” In each case the intent is different, and handling these queries by searching for the keyword “message” might not be the best strategy. Instead, the VUI’s NLU model should be trained to handle each of these queries as a separate intent.
In cases where the user utters multiple pieces of information at once, the NLU model should be able to handle the query and capture the objects to be used for the intent. For example, if a user says “Order me a large cappuccino from Starbucks at home,” the user has already specified the type of coffee, size, restaurant name and place of delivery. The system should be able to pre-fill this information.
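The “call John” case can be sketched as a lookup that either acts, asks a follow-up question, or reports that nothing matched. The `resolve_contact` function, its return shape, and the sample contact names are hypothetical.

```python
from typing import List, Tuple


def resolve_contact(first_name: str, contacts: List[str]) -> Tuple[str, str]:
    """Return ('call', name), ('ask', question) or ('not_found', message)."""
    matches = [
        contact for contact in contacts
        if contact.lower().split()[0] == first_name.lower()
    ]
    if not matches:
        return ("not_found", f"I couldn't find {first_name} in your contacts.")
    if len(matches) == 1:
        return ("call", matches[0])
    # More than one candidate: read out the full names and let the user pick.
    options = ", ".join(matches[:-1]) + " or " + matches[-1]
    return ("ask", f"Which {first_name}? {options}?")


if __name__ == "__main__":
    contacts = ["John Smith", "John Doe", "Mary Jones"]  # sample data
    print(resolve_contact("John", contacts))
    print(resolve_contact("Mary", contacts))
```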
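A toy illustration of pre-filling, assuming a small hand-written vocabulary per slot rather than a trained NLU model. The slot names, vocabularies, and prompts are hypothetical; the point is that anything the user already said is captured, and only the missing pieces trigger a follow-up question.

```python
import re
from typing import Dict, Optional

# Hypothetical slot vocabularies for a coffee-ordering intent.
SLOT_VALUES = {
    "size": ["small", "medium", "large"],
    "item": ["cappuccino", "latte", "espresso"],
    "restaurant": ["starbucks", "costa"],
    "destination": ["home", "work"],
}

PROMPTS = {
    "size": "What size would you like?",
    "item": "What would you like to order?",
    "restaurant": "Where should I order from?",
    "destination": "Where should it be delivered?",
}


def fill_slots(utterance: str) -> Dict[str, str]:
    """Capture every slot value mentioned in a single utterance."""
    tokens = set(re.findall(r"[a-z]+", utterance.lower()))
    slots = {}
    for slot, values in SLOT_VALUES.items():
        for value in values:
            if value in tokens:
                slots[slot] = value
                break
    return slots


def next_prompt(slots: Dict[str, str]) -> Optional[str]:
    """Ask for the first missing slot, or return None when all are filled."""
    for slot in SLOT_VALUES:
        if slot not in slots:
            return PROMPTS[slot]
    return None
```

With the full sentence from the text, every slot is filled and no follow-up is needed; with “I'd like a large cappuccino,” the assistant would only ask where to order from.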
You can use tools like Api.ai, Microsoft LUIS, Nuance Mix, Wit.ai, etc. to build and test these models.
Wake words are often used to invoke a VUI system. For example, “Alexa” is the wake word for Amazon Echo, while “Hey Google” and “Ok Google” are wake words for Google Assistant. Using a wake word is one of the ways to start an interaction with the VUI system without having to touch any device.
Keep the following in mind when designing a wake word:
- It should be easily recognizable. Short words like “Jim” or “Will” are difficult to recognize.
- It should be easy for users to say it.
- Use words with multiple syllables. Take note of the wake words for Alexa and Siri: they all have multiple syllables.
- Don’t choose words that people might say regularly in conversations.
Another important thing to note is that wake words should be handled locally. Your device should always listen for the wake word on the device itself, and only after hearing it start recording the user’s voice to send to the cloud for processing. Always recording and sending data to the cloud is unethical and will lead to serious distrust among users.
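The local-first flow can be sketched as a small loop that drops everything until the wake word is detected and only then buffers audio for upload. Here plain strings stand in for audio frames, and `detect_wake_word`, `send_to_cloud`, and the `<silence>` end marker are hypothetical placeholders for an on-device detector, a cloud client, and end-of-speech detection.

```python
from typing import Callable, Iterable, List


def run_listener(
    frames: Iterable[str],
    detect_wake_word: Callable[[str], bool],
    send_to_cloud: Callable[[List[str]], None],
    end_marker: str = "<silence>",
) -> None:
    """Forward only the frames spoken between a wake word and the next pause."""
    recording = False
    buffer: List[str] = []
    for frame in frames:
        if not recording:
            # Checked on the device; frames heard here are never stored or sent.
            if detect_wake_word(frame):
                recording = True
            continue
        if frame == end_marker:
            send_to_cloud(buffer)
            buffer = []
            recording = False
        else:
            buffer.append(frame)
    if buffer:
        # The stream ended mid-request; send what was captured after the wake word.
        send_to_cloud(buffer)
```

A quick check with `frames = ["private chat", "alexa", "what's", "the weather", "<silence>", "more private chat"]` and `send_to_cloud=sent.append` leaves only the two frames after the wake word in `sent`.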
TTS versus Recorded voice
Another important decision you need to make is whether to use text-to-speech (TTS) or a recorded voice to answer user queries. Although a recorded voice feels more natural, it is expensive and time-consuming to record every answer. TTS, on the other hand, can work in real time but sounds robotic. Although it is improving, TTS still has difficulty pronouncing certain words and conveying emotion.
TTS can be improved by applying Speech Synthesis Markup Language (SSML), which can help add more natural sounding intonations and pronunciations on the fly. Despite this, there are still words and phrases that the TTS engine might have difficulty with and it might be necessary for you to build a pronunciation dictionary.
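As a small illustration, a pronunciation dictionary can be applied by wrapping known tokens in the SSML `<sub alias="...">` element before the text is sent to the TTS engine. The dictionary entries and the `to_ssml` helper are assumptions, and support for individual SSML elements varies between engines, so check your engine's documentation.

```python
import re
from typing import Dict
from xml.sax.saxutils import escape, quoteattr

# Hypothetical pronunciation dictionary: written form -> spoken form.
PRONUNCIATIONS = {
    "VUI": "voice user interface",
    "TTS": "text to speech",
}


def to_ssml(text: str, pronunciations: Dict[str, str] = PRONUNCIATIONS) -> str:
    """Wrap dictionary words in <sub alias="..."> and escape everything else."""
    parts = []
    # Splitting on a captured group keeps the separators in the result.
    for token in re.split(r"(\W+)", text):
        if token in pronunciations:
            alias = quoteattr(pronunciations[token])
            parts.append(f"<sub alias={alias}>{escape(token)}</sub>")
        else:
            parts.append(escape(token))
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    print(to_ssml("Designing a VUI"))
```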
As a rule of thumb, it is generally a good strategy to use a combination of TTS and recorded voice. Recordings can be used for the most common responses, like confirmations and greetings. If you are using a hybrid model, it is also a good idea to build a voice font of your recording artist, so that the synthesized speech matches the recordings.
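The hybrid approach can be sketched as a lookup that prefers a recording when one exists and falls back to TTS otherwise. The file paths and the `choose_output` function are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical mapping from common responses to pre-recorded clips.
RECORDED_CLIPS: Dict[str, str] = {
    "Alright! Done!": "recordings/done.wav",
    "Hey there! Long time no see!": "recordings/greeting.wav",
}


def choose_output(response: str) -> Tuple[str, str]:
    """Return ('recording', path) for known phrases, else ('tts', text)."""
    clip = RECORDED_CLIPS.get(response)
    if clip is not None:
        return ("recording", clip)
    return ("tts", response)
```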