Dictation has come a long way since its earliest days, when stenographers furiously scribbled down everything a person said in the quickest shorthand they could manage. Speech-to-text technology has advanced greatly over the past few decades, to the point where we now have automated speech recognition.
But what exactly is automated speech recognition, and why is it important? This article explains what it is, how it works, and where it is used.
An overview of automated speech recognition
Automated speech recognition (ASR) is software that parses spoken language and turns it into understandable text. You will find examples of it everywhere, from automated voice response systems (“Please state the reason for your call”) to your smart appliances.
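To give a quick sense of what that looks like in practice, here is a minimal sketch using the open-source SpeechRecognition package for Python. The file name meeting.wav is hypothetical, and the recognize_google call sends the audio to a web API, so an internet connection is assumed.

```python
# Minimal sketch: transcribe a short audio file with the SpeechRecognition
# package (pip install SpeechRecognition). "meeting.wav" is a hypothetical file.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("meeting.wav") as source:
    audio = recognizer.record(source)          # load the whole clip into memory

try:
    text = recognizer.recognize_google(audio)  # send the audio to a web API for recognition
    print(text)
except sr.UnknownValueError:
    print("Speech was unintelligible.")
```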
The applications for automated speech recognition are numerous. The most obvious one, transcribing audio into written text, makes it far easier to caption both live and pre-recorded videos. That pays off in business (streamlining the production of training videos) and in personal life (closed captioning at the touch of a button). For an editor, a transcript that updates in real time also makes adding timestamps to videos much easier, as sketched below.
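For instance, if an ASR service returns rough segment timings, a few lines of Python can format them as SRT caption entries. The segments data here is invented purely for illustration.

```python
# Turn (start, end, text) segments from a transcript into SRT caption entries.
# The timings and text below are made up for illustration.
segments = [
    (0.0, 2.4, "Welcome to the training video."),
    (2.4, 5.1, "Today we cover the new checkout workflow."),
]

def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:00:02,400."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

for i, (start, end, text) in enumerate(segments, start=1):
    print(i)                                                    # caption index
    print(f"{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}")
    print(text)
    print()
```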
Advances in automated speech recognition have carried the technology far beyond early milestones such as IBM’s Shoebox machine, demonstrated at the 1962 World’s Fair. Modern ASR recognizes context as well as words. A good system can tell when a sentence ends and, from cues in tone and word choice, choose the appropriate punctuation. It can also determine when to capitalize a proper noun.
How these advances in ASR came about
ASR has advanced largely on the back of progress in other areas of machine learning, in particular deep learning and big data.
Deep learning is a form of machine learning in which a neural network (loosely modeled on the human brain) “learns” how to process unsorted data in a logical way. Big data, meanwhile, is a term for the very large data sets that accumulate over time; Walmart, for example, builds up big data through the enormous number of sales it processes every day and uses that data to better understand how to market to its customers.
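To make the learning-from-examples idea concrete, here is a toy sketch that has nothing to do with speech in particular: a tiny two-layer network taught the XOR pattern by gradient descent. The network size, learning rate, and data are arbitrary choices for illustration.

```python
import numpy as np

# Toy illustration (not a real ASR system): a tiny two-layer network "learns"
# the XOR pattern from examples, the same learn-from-data idea that, at vastly
# larger scale, powers modern speech recognition.
rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # targets (XOR)

W1 = rng.normal(size=(2, 8))   # input -> hidden weights
b1 = np.zeros((1, 8))
W2 = rng.normal(size=(8, 1))   # hidden -> output weights
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(10_000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass (gradient of squared error through the sigmoids)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent update
    W2 -= 0.5 * h.T @ d_out
    b2 -= 0.5 * d_out.sum(axis=0, keepdims=True)
    W1 -= 0.5 * X.T @ d_h
    b1 -= 0.5 * d_h.sum(axis=0, keepdims=True)

print(out.round())  # should approach [[0], [1], [1], [0]] after training
```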
Through deep learning, machines have developed a better understanding of human language and gained the ability to apply appropriate context when transcribing. Big data, in turn, provides the raw material those models keep learning from.
Types of ASR
Multiple varieties of ASR exist; some are mostly obsolete, while others work better depending on the context. The most widely used at the moment is based on Hidden Markov Models (HMMs). Essentially, an HMM-based ASR system recognizes small fragments of speech, called phonemes, and strings them together into the most likely sequence of words, with deep learning steadily improving its sense of how words fit together.
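Here is a minimal sketch of the core HMM idea, using made-up probabilities: the hidden states stand for phonemes, the observations stand for acoustic frames, and the Viterbi algorithm picks the most likely phoneme sequence.

```python
import numpy as np

# Toy HMM decoding sketch with invented numbers: hidden states are phonemes,
# observations are acoustic frame labels, and Viterbi finds the best path.
phonemes = ["k", "ae", "t"]                      # hidden states (toy set)
start_p = np.log([0.8, 0.1, 0.1])                # initial state probabilities
trans_p = np.log([[0.5, 0.4, 0.1],               # phoneme-to-phoneme transitions
                  [0.1, 0.5, 0.4],
                  [0.1, 0.1, 0.8]])
emit_p = np.log([[0.7, 0.2, 0.1],                # P(acoustic frame | phoneme)
                 [0.2, 0.6, 0.2],
                 [0.1, 0.2, 0.7]])

frames = [0, 0, 1, 1, 2]                         # observed acoustic frame labels

def viterbi(obs, start, trans, emit):
    n_states, n_obs = trans.shape[0], len(obs)
    score = np.full((n_obs, n_states), -np.inf)  # best log-probability so far
    back = np.zeros((n_obs, n_states), dtype=int)
    score[0] = start + emit[:, obs[0]]
    for t in range(1, n_obs):
        for s in range(n_states):
            cand = score[t - 1] + trans[:, s]    # arrive at s from each previous state
            back[t, s] = np.argmax(cand)
            score[t, s] = cand[back[t, s]] + emit[s, obs[t]]
    # Trace the best path backwards
    path = [int(np.argmax(score[-1]))]
    for t in range(n_obs - 1, 0, -1):
        path.append(back[t, path[-1]])
    return list(reversed(path))

best = viterbi(frames, start_p, trans_p, emit_p)
print([phonemes[s] for s in best])               # e.g. ['k', 'k', 'ae', 'ae', 't']
```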
End-to-end ASR instead maps audio directly to fully formed words. Most phone assistants rely on it, but because end-to-end models require a huge amount of storage and computing power, the average cell phone cannot hold the full model itself; that is why you usually must be connected to a wireless network to use the speech recognition tools on your phone.
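For a feel of what end-to-end recognition looks like in practice, here is a minimal sketch using the open-source Hugging Face transformers library with the pretrained facebook/wav2vec2-base-960h checkpoint; the file clip.wav (16 kHz, mono) is a hypothetical input.

```python
# Minimal end-to-end sketch: a pretrained wav2vec 2.0 model maps raw audio
# straight to characters. "clip.wav" (16 kHz, mono) is a hypothetical file.
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, sample_rate = sf.read("clip.wav")                      # raw waveform
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits                 # per-frame character scores

predicted_ids = torch.argmax(logits, dim=-1)                   # greedy CTC decoding
print(processor.batch_decode(predicted_ids)[0])                # the transcript
```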
Conclusion
While the computer terminology may be complicated, the applications for ASR are easy to understand. With constant consumer demand for easy-to-use technology, ASR enables applications as disparate as medical dictation, the guidance of incoming airplanes, home automation, and gaming. The convenience, utility, and growing necessity of ASR cannot be overstated.