Automatic speech recognition (ASR) is the transformation of spoken language into text. If you’ve ever used a virtual assistant like Siri or Alexa, you’ve used an automatic speech recognition system. The technology is also being built into messaging apps, search engines, in-car systems, and home automation.
And though all these systems rely on slightly different technical processes, the first step for all of them is the same: capturing speech data and transforming it into machine-readable text.
But how does an ASR system work? How does it learn to understand speech?
So let’s get started!
ASR Systems: How do they work?
So we know that on a basic level, automatic speech recognition looks like this:
audio data in, text data out.
Between those two points, a typical ASR system relies on two models. An acoustic model determines the relationship between audio signals and the phonetic units of a language, while a language model estimates which words and word sequences those sounds are most likely to represent.
Together, these two models allow an ASR system to run probability checks on the audio input and predict which words and sentences it contains. From these predictions, the system then selects the one with the highest confidence rating.*
*Sometimes the language model gives priority to certain predictions that are deemed more likely due to other factors.
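To make the "highest confidence" idea a little more concrete, here is a minimal Python sketch of how a system might combine the two models' scores and pick a winner. Everything in it is an assumption made for illustration: the candidate transcripts, the probabilities, and the `lm_weight` parameter are invented, and real systems search over far more candidates than three.

```python
import math

# Hypothetical candidate transcripts with made-up probabilities.
# "acoustic": how well the words match the audio (log-probability).
# "language": how plausible the word sequence is (log-probability).
candidates = {
    "what time is it": {"acoustic": math.log(0.40), "language": math.log(0.30)},
    "what thyme is it": {"acoustic": math.log(0.45), "language": math.log(0.01)},
    "watt time is it": {"acoustic": math.log(0.15), "language": math.log(0.02)},
}


def combined_score(scores, lm_weight=1.0):
    """Add the two log-probabilities; lm_weight scales the language model's say."""
    return scores["acoustic"] + lm_weight * scores["language"]


def best_transcript(candidates, lm_weight=1.0):
    """Return the candidate text with the highest combined score."""
    return max(candidates, key=lambda text: combined_score(candidates[text], lm_weight))


print(best_transcript(candidates))                 # acoustic + language model
print(best_transcript(candidates, lm_weight=0.0))  # acoustic model only
```

With these invented numbers, the acoustic model alone slightly prefers "what thyme is it", but once the language model's score is added, "what time is it" wins. That is the kind of priority shift the footnote above describes.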
So if we run a phrase through an ASR system, it will do the following:
- Take vocal input: “Hey Siri, what time is it?”
- Run the voice data through an acoustic model, breaking it up into phonetic parts.
- Run that data through a language model.
- Output text data: “Hey Siri, what time is it?”
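The steps above can be sketched as a toy pipeline in Python. This is only an illustration of the data flow, not a working recognizer: the "audio chunks" are placeholder strings rather than real audio, the phoneme symbols are rough approximations, and both lookup tables are hypothetical stand-ins for trained models. In particular, a real language model scores whole word sequences rather than looking words up one by one.

```python
from typing import Dict, List, Tuple

Phonemes = Tuple[str, ...]

# Hypothetical stand-in for a trained acoustic model:
# maps a fake audio chunk ID to phonetic units.
TOY_ACOUSTIC_TABLE: Dict[str, Phonemes] = {
    "chunk_what": ("W", "AH", "T"),
    "chunk_time": ("T", "AY", "M"),
    "chunk_is": ("IH", "Z"),
    "chunk_it": ("IH", "T"),
}

# Hypothetical stand-in for a language model:
# maps each group of phonetic units to a word.
TOY_LEXICON: Dict[Phonemes, str] = {
    ("W", "AH", "T"): "what",
    ("T", "AY", "M"): "time",
    ("IH", "Z"): "is",
    ("IH", "T"): "it",
}


def acoustic_model(audio_chunks: List[str]) -> List[Phonemes]:
    """Step 2: break the voice data up into phonetic parts."""
    return [TOY_ACOUSTIC_TABLE[chunk] for chunk in audio_chunks]


def language_model(phoneme_groups: List[Phonemes]) -> str:
    """Step 3: turn phonetic parts into words and join them into text."""
    return " ".join(TOY_LEXICON[group] for group in phoneme_groups)


def transcribe(audio_chunks: List[str]) -> str:
    """Steps 1 to 4: audio data in, text data out."""
    return language_model(acoustic_model(audio_chunks))


print(transcribe(["chunk_what", "chunk_time", "chunk_is", "chunk_it"]))
```

The point of the sketch is the shape of the pipeline: each stage hands a more structured representation to the next one until plain text comes out.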
It’s worth mentioning here that if an automatic speech recognition system is part of a voice user interface, the ASR model won’t be the only machine learning model at work. Many automatic speech recognition systems are paired with natural language processing (NLP) and text-to-speech (TTS) systems to perform their given roles.
So now we know how ASR systems work, but what do you need to build one?
The key is data.
Building an ASR System: The Importance of Data
A good ASR system is expected to be flexible. It needs to understand a wide variety of audio input (speech samples) and create accurate text output from that data so it can react accordingly.
Oh, and let’s not forget that speech differs due to age and gender, too!
With this in mind, the more speech samples you feed an ASR system, the better it gets at identifying and classifying new speech input. The more samples you have from a broad range of voices and environments, the better the system gets at identifying voices within those environments. With dedicated fine-tuning and maintenance, automatic speech recognition systems will improve as they are used.
So at the most basic level, the more data, the better. It’s true that there is ongoing research into optimizing smaller datasets, but at present most models still require large amounts of data to perform well.
Fortunately, audio data collection is getting simpler thanks to dataset repositories and dedicated data collection services. This in turn is increasing the rate of technological development, so to finish things off, let’s take a brief look at where automatic speech recognition can play a role in the future.
The Future of ASR Technology
ASR technology is already embedding itself in our society. Virtual assistants, in-car systems, and home automation are all creating convenience in everyday life. It’s likely that the scope of their abilities will expand too: as more people adopt these services, the technology will develop further.
Outside of the above examples, automatic speech recognition is playing a role in a variety of interesting fields and industries:
- Communication: With the adoption of cell phones worldwide, ASR systems can make messaging, online searches, and text-based services available even to communities with low levels of reading and writing literacy.
- Improving Accessibility: Automatic speech recognition systems can also help people with disabilities or injuries by providing hands-free access to applications, and auto-captioning for television, movies, and business meetings.
- Military Technology: In the US, France, and the UK, military programs have been testing and evaluating ASR systems for fighter jets. This includes tasks such as setting radio frequencies, commanding autopilot systems, and controlling flight displays.
These are just a few examples of how ASR can support and improve lives, and it’s likely that the next decade will see even more improvements alongside novel adaptations.
In any case, I hope this article has been a good introduction to how ASR systems work, how to build them, and what to look forward to in the future. If you have any comments or thoughts, feel free to leave them below and I’ll get to them as soon as I can.