Artificially intelligent machines are becoming smarter every day. Deep learning and machine learning techniques enable machines to perform many tasks at a human level, and in some cases they even surpass human abilities. Machine intelligence can analyze big data faster and more accurately than a human possibly could. Even though machines cannot think yet, they can see, sometimes better than humans (read our computer vision and machine vision articles), they can speak, and they are also good listeners. The technology known as “automatic speech recognition” (ASR), “computer speech recognition”, or simply “speech to text” (STT) enables computers to understand spoken human language.
Speech recognition and speaker recognition are different terms. Speech recognition is about understanding what is said, while speaker recognition is about identifying who is speaking rather than the content of the speech, which makes it useful for security measures. The two terms are easily confused, and “voice recognition” is often used for both.
In the 1950s, three Bell Labs researchers developed a single-speaker digit recognition system with a vocabulary of ten words.
Raj Reddy, a graduate student at Stanford University, tried to develop a system that could recognize continuous speech, unlike earlier systems that required pauses between each word. Reddy designed the system to accept spoken commands for playing chess.
Around the same era, Soviet researchers developed the dynamic time warping (DTW) algorithm, with a vocabulary of about 200 words. The speech was processed by dividing it into short frames, and the DTW algorithm treated each frame as a single unit.
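Dynamic time warping aligns two sequences that may vary in speed. A minimal sketch of the idea, using the classic textbook dynamic-programming recurrence rather than any historical implementation, might look like:

```python
# Minimal dynamic time warping (DTW) sketch: computes the cost of the
# best alignment between two 1-D feature sequences. This is the classic
# textbook recurrence, shown purely to illustrate the idea.

def dtw_distance(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best alignment cost of a[:i] vs b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between frames
            # A frame may match one-to-one, be stretched, or be compressed.
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# The same "word" spoken at two speeds still aligns with zero cost:
slow = [0, 1, 1, 2, 3, 3, 2, 1]
fast = [0, 1, 2, 3, 2, 1]
print(dtw_distance(slow, fast))  # 0.0 — identical shape, different tempo
```

This tolerance to tempo differences is exactly why DTW worked for matching spoken words frame by frame.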
In the 1980s, IBM created a voice-activated typewriter called Tangora, which could handle a 20,000-word vocabulary.
These developments were correlated with hardware. In the 1970s, the best computers had 4 MB of RAM, and decoding 30 seconds of speech took almost two hours on such machines. As hardware limitations were overcome, researchers could tackle harder problems such as larger vocabularies, speaker independence, noisy environments, and conversational speech.
How it works
Computers need digitized data to process and analyze. An analog-to-digital converter (ADC) translates the vibrations and analog waves into digital data that a computer can analyze. Voice recognition software then separates the signal into small segments and matches these segments to known phonemes in the relevant language. The program compares the phoneme sequences against the words in its built-in dictionary to determine what was said.
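As a rough sketch of that first step, the digitized samples are cut into short, overlapping frames before any phoneme matching happens. The frame and hop sizes below are illustrative assumptions, not fixed requirements:

```python
# Illustrative sketch: split a digitized signal into short overlapping
# frames, the unit a recognizer later matches against phoneme models.
# Frame and hop lengths below are typical choices, not requirements.

def frame_signal(samples, frame_len=400, hop_len=160):
    """Return successive frames of `samples`, e.g. 25 ms frames with a
    10 ms hop when the sampling rate is 16 kHz."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames

one_second = [0.0] * 16000          # 1 second of silence at 16 kHz
frames = frame_signal(one_second)
print(len(frames))                  # 98 overlapping frames
```

Each of these frames would then be converted into acoustic features and scored against phoneme models.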
Nowadays, voice recognition systems use statistical models that rely on probability and mathematical functions to determine the most likely outcome. The most common methods are hidden Markov models (HMMs) and neural networks.
An HMM makes it possible to combine information such as acoustics, language, and syntax in a unified probabilistic model. In this model, the program scores each phoneme that may follow another to predict the most likely next phoneme. The process gets harder for phrases and sentences, because the system must also work out where words start and end. When speech gets faster, voice recognition programs can misunderstand it. As an example, here is a phoneme breakdown of two similar-sounding phrases:
- “recognize speech”: r eh k ao g n ay z s p iy ch
- “wreck a nice beach”: r eh k ay n ay s b iy ch
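The phoneme-scoring idea can be sketched as a toy Viterbi search over a hidden Markov model. All states and probabilities below are invented purely for illustration; real systems use thousands of context-dependent states trained from data:

```python
# Toy Viterbi decoding over a hidden Markov model: given per-frame
# acoustic scores, find the most likely phoneme sequence. States and
# probabilities are made up for illustration only.
import math

# Transition log-probabilities: which phoneme tends to follow which.
trans = {
    ("s", "p"): math.log(0.9), ("s", "s"): math.log(0.1),
    ("p", "iy"): math.log(0.9), ("p", "p"): math.log(0.1),
    ("iy", "ch"): math.log(0.9), ("iy", "iy"): math.log(0.1),
    ("ch", "ch"): math.log(1.0),
}

def viterbi(obs_logprobs, start="s"):
    """obs_logprobs: one dict per frame mapping state -> log P(frame | state)."""
    best = {start: obs_logprobs[0].get(start, -math.inf)}
    paths = {start: [start]}
    for frame in obs_logprobs[1:]:
        new_best, new_paths = {}, {}
        for (prev, cur), t in trans.items():
            if prev not in best or cur not in frame:
                continue
            score = best[prev] + t + frame[cur]
            if score > new_best.get(cur, -math.inf):
                new_best[cur] = score
                new_paths[cur] = paths[prev] + [cur]
        best, paths = new_best, new_paths
    winner = max(best, key=best.get)
    return paths[winner]

# Four frames whose acoustic scores clearly favor s -> p -> iy -> ch:
frames = [{"s": -0.1, "p": -3.0},
          {"s": -3.0, "p": -0.1},
          {"p": -3.0, "iy": -0.1},
          {"iy": -3.0, "ch": -0.1}]
print(viterbi(frames))  # ['s', 'p', 'iy', 'ch']
```

The decoder picks the phoneme path whose combined acoustic and transition scores are highest, which is how the model chooses between similar-sounding alternatives.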
No voice recognition system is perfect. Performance is evaluated on two criteria: speed and accuracy. Accuracy is measured with the word error rate (WER), and it may decrease due to several factors. Some are hardware problems, such as low-quality sound cards, low-quality microphones, or an insufficient processor; speech recognition requires clean audio and processing power to run the statistical models.
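Word error rate is conventionally computed as the word-level edit distance between the recognized text and a reference transcript, divided by the reference length. A minimal sketch:

```python
# Word error rate (WER) sketch: Levenshtein edit distance over words
# (substitutions + insertions + deletions) divided by reference length.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)

# 2 substitutions + 2 insertions against a 2-word reference:
print(wer("recognize speech", "wreck a nice beach"))  # 2.0
```

Note that WER can exceed 1.0 when the recognizer inserts many extra words, as in the example above.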
Beyond hardware problems, overlapping speech can reduce accuracy. In meetings with multiple speakers who constantly interrupt each other, voice recognition programs often fail to perform well.
Homophones are also a challenge for voice recognition systems. Word pairs like “there – their”, “air – heir”, and “be – bee” are pronounced the same but have different meanings, so they are hard to distinguish from sound alone. To increase accuracy, extensive training data and statistical language models are used.
Everybody knows Siri, the smart assistant for iPhone users. Siri is the most familiar example of a voice recognition application. Other assistants, such as Microsoft’s Cortana or Amazon’s Alexa, are also prime examples of voice recognition-powered programs. Or maybe some of you recall Jarvis from Iron Man.
I guess many of you have used Google Translate’s voice feature to learn the correct pronunciation of a word. In that case, natural language processing is used alongside voice recognition.
YouTube also uses speech recognition to automatically generate subtitles for videos. When you upload a video that includes speech, YouTube detects it and provides a transcription. You can also view the timestamped text of the transcribed speech.
Voice recognition is implemented in many applications, even in health care: doctors can assess a person’s mental state, such as whether he or she is depressed or suicidal, by analyzing his or her voice. Other applications include:
- Automatic subtitling with speech recognition (YouTube)
- Automatic translation
- Court reporting (real-time speech writing)
- eDiscovery (legal discovery)
- Hands-free computing: Speech recognition computer user interface
- Home automation
- Interactive voice response
- Mobile telephony, including mobile email
- Multimodal interaction
- Pronunciation evaluation in computer-aided language learning applications