Understanding speech recognition to design better voice interfaces

When designing a good voice user interface, it is always advantageous to know how the technology works.

Knowing what goes on behind the scenes enables you to make design decisions that take into account the current limitations and advantages of the technology.

Speech recognition technology

One key component of a voice user interface (VUI) is automated speech recognition (ASR), which translates users’ speech into text. A number of free and paid services provide ASR engines. When choosing an engine, it is important to keep the following two things in mind:

  1. Robustness and accuracy of data
     As a rule of thumb, the more data a company has, the better its speech recognition will be. Incumbent companies are generally good at amassing large quantities of data.
  2. Good endpoint detection
     Endpoint detection refers to how the ASR engine knows when the user has begun or finished speaking. Prefer a service with the most reliable endpoint detection.

Not all ASR tools offer advanced features such as N-best lists, end-of-speech timeouts, or the ability to incorporate custom vocabularies. It might be quicker to start with the cheapest ASR tool; however, if the recognition accuracy is substandard or the endpoint detection works poorly, your users will become frustrated and eventually give up on the product.

Barge-in

Barge-in is the ability of a VUI to let the user interrupt the system at any time during the conversation. Whether to enable it depends largely on the type of VUI you are planning. It is advantageous if your VUI is going to read out a long list of menu items, tell a story, or generally be verbose, since users might want to interrupt and stop it midway.
When deciding on a barge-in strategy, consider whether anything the user says should trigger barge-in or only the wake word. Most common VUIs use the latter strategy. For example, when Alexa is playing a song, the user can barge in at any time to stop playback. Without barge-in, there would be no way to stop it with a voice command.
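
The two barge-in strategies can be sketched as a small decision function. This is only an illustration under stated assumptions, not any vendor's API: the names BargeInMode and should_interrupt are hypothetical, and the wake word is matched with a simple text comparison on already-recognized speech, whereas real devices typically detect the wake word on the audio itself.

```python
from enum import Enum


class BargeInMode(Enum):
    DISABLED = "disabled"      # the user cannot interrupt system speech
    ANY_SPEECH = "any_speech"  # anything the user says interrupts
    WAKE_WORD = "wake_word"    # only the wake word interrupts


def should_interrupt(mode: BargeInMode, heard: str, wake_word: str = "alexa") -> bool:
    """Decide whether recognized speech should stop the current system output."""
    text = heard.strip().lower()
    if mode is BargeInMode.DISABLED or not text:
        return False
    if mode is BargeInMode.ANY_SPEECH:
        return True
    # WAKE_WORD mode: interrupt only if the utterance starts with the wake word.
    first_word = text.split()[0].strip(",.!?")
    return first_word == wake_word.lower()
```

With the wake-word mode, "Alexa, stop" interrupts playback while background chatter such as "turn it down" does not, which is the trade-off that makes this strategy common.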

Timeouts

VUIs need to know when a user starts speaking as well as when the speech has ended. The time limit the system uses to decide that the user has stopped speaking, or never started, is known as a timeout. Choosing a suitable timeout is critical to a good VUI experience. Think of a video call in which the other person’s voice lagged and the conversation was difficult to follow.
There are several conditions under which the ASR engine can time out:

  1. End of speech timeout
     Knowing when the user has finished talking, i.e. finished their turn in the conversation, is one of the most important characteristics of a good VUI system. This is sometimes referred to as endpoint detection. Responding the instant the user stops speaking feels unnatural. The system needs to pause for some time before continuing the conversation, which is basic conversational etiquette. Some ASR engines allow you to configure this pause, also known as the end-of-speech timeout. A 1.5-second end-of-speech timeout is a good rule of thumb. However, there might be cases where you need a longer one, such as when the user is saying a long string of characters or numbers. For cases where the user only needs to give a one-word response such as ‘yes’ or ‘no’, a shorter timeout works fine.
  2. No speech timeout (NSP)
     As the name indicates, this timeout fires when no speech is detected at all. It differs from an end-of-speech timeout, where the user’s speech has a definite beginning and end. This timeout is usually longer, at around 10 seconds. It can be handled in different ways, ranging from showing the user a list of available actions to doing nothing at all.
  3. Too much speech (TMS)
     This is the rare case in which the user talks for too long without any pauses. In most scenarios, you don’t need to handle too-much-speech timeouts. However, if you want to incorporate one, a good rule of thumb is for the system to time out after 7–10 seconds.

Incorporating timeouts is essential to know when the user has stopped speaking.
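
The three timeouts can be sketched as a single check that runs while the system is listening. This is a minimal sketch, assuming some upstream voice-activity detector reports when speech started and when speech was last heard. TimeoutConfig and check_timeout are hypothetical names, and the defaults simply reuse the rules of thumb from this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimeoutConfig:
    end_of_speech: float = 1.5     # seconds of silence that end the user's turn
    no_speech: float = 10.0        # seconds to wait for the user to start talking
    too_much_speech: float = 10.0  # longest utterance allowed (7-10 s rule of thumb)


def check_timeout(now: float, listen_start: float,
                  speech_start: Optional[float], last_speech: Optional[float],
                  cfg: TimeoutConfig = TimeoutConfig()) -> Optional[str]:
    """Return which timeout fired at time `now`, or None to keep listening.

    All times are seconds on the same clock. `speech_start` and
    `last_speech` stay None until the user has said something.
    """
    if speech_start is None or last_speech is None:
        # Nothing heard yet: only the no-speech timeout can fire.
        if now - listen_start >= cfg.no_speech:
            return "no_speech"
        return None
    if now - last_speech >= cfg.end_of_speech:
        return "end_of_speech"
    if now - speech_start >= cfg.too_much_speech:
        return "too_much_speech"
    return None
```

For a yes/no prompt, passing something like TimeoutConfig(end_of_speech=0.5) shortens the pause, while a prompt for a long number could use a larger value.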

N-best lists

When a user speaks to the VUI, the speech recognition system can return more than one interpretation of what was said. It assigns a confidence value to each result and usually picks the one with the highest value. In simple terms, a confidence value is a percentage that indicates how confident the system is about a particular result. For example, when you say “Read me a book,” the system might interpret it as follows:

  1. Read me a book : 95% confidence
  2. Reed me a book : 70% confidence
  3. Rid mia boo : 30% confidence

If you’ve designed your VUI to read books, the system would pick the first result.
In other words, a recognition engine often does not return a single result. It returns an N-best list, which is a list of what it thinks the user might have said, in order of likelihood. It usually contains the top 5–10 results, each with a confidence score.
N-best lists are useful when you’ve designed the system to answer in a narrow domain. For example, in a VUI that gives information about animals, when you say “Show me a badger,” the ASR tool might interpret it as follows:

  1. Show me a badge her : 92% confidence
  2. Show me a badger : 89% confidence

Since you already know that this VUI provides animal information, it can look for animal names in the results and pick the second one, even though it does not have the highest confidence level.
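
This kind of domain-aware selection can be sketched as follows. It is a minimal illustration, assuming the ASR engine hands back the N-best list as (transcript, confidence) pairs. The ANIMALS vocabulary and the pick_in_domain function are hypothetical names invented for this example.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical domain vocabulary for the animal-information VUI.
ANIMALS: Set[str] = {"badger", "otter", "fox", "heron"}

NBest = List[Tuple[str, float]]  # (transcript, confidence between 0 and 1)


def pick_in_domain(n_best: NBest, vocabulary: Set[str]) -> Optional[str]:
    """Return the most confident transcript that mentions a domain word.

    Falls back to the overall top result when no hypothesis matches,
    and returns None for an empty list.
    """
    ranked = sorted(n_best, key=lambda item: item[1], reverse=True)
    for transcript, _confidence in ranked:
        words = {w.strip(".,!?").lower() for w in transcript.split()}
        if words & vocabulary:
            return transcript
    return ranked[0][0] if ranked else None


n_best = [("Show me a badge her", 0.92), ("Show me a badger", 0.89)]
print(pick_in_domain(n_best, ANIMALS))  # prints "Show me a badger"
```

A simple word match is enough for the example. A real system would likely need a larger vocabulary and some handling of plurals and multi-word names.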
Another use of an N-best list is error correction: if the first result turns out to be invalid, the system can fall back to the next candidate in the list.
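
Correction with an N-best list can be sketched as walking down the list and skipping anything already ruled out. This is a minimal sketch with hypothetical names (next_candidate, rejected). How the system learns that a result was wrong, for example the user answering "no" to a confirmation, is assumed and not shown.

```python
from typing import Iterable, List, Optional, Tuple


def next_candidate(n_best: List[Tuple[str, float]],
                   rejected: Iterable[str]) -> Optional[str]:
    """Return the most confident transcript that has not been rejected yet."""
    rejected_set = {r.lower() for r in rejected}
    ranked = sorted(n_best, key=lambda item: item[1], reverse=True)
    for transcript, _confidence in ranked:
        if transcript.lower() not in rejected_set:
            return transcript
    return None  # list exhausted: re-prompting the user is the remaining option


n_best = [("Read me a book", 0.95), ("Reed me a book", 0.70), ("Rid mia boo", 0.30)]
first = next_candidate(n_best, rejected=[])
second = next_candidate(n_best, rejected=[first])
```

Here first is the top result, and second is what the system would try if that result were rejected, so the user is not offered the same wrong answer twice.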
